A quick look at the GDELT Mentions schema

A previous post introduced the GDELT Project and discussed the origins and high-level information included in the project’s datasets. This post explores the version 2.0 features of the Mentions dataset, which records where GDELT Events are initially discovered or later referenced, along with statistics about each mention of the event in question.

The GDELT Mentions dataset is compiled by the TABARI system from headlines and other reporting services at 15-minute intervals. The Mentions dataset drives the Event dataset: new events detected in extracted Mentions are used to construct new Event records. If the event can instead be matched to a previously recorded Event, the Mention references that event, which provides a mechanism for tracking the event’s impact based on frequency, volume, and how the event is referenced. Since the Mentions dataset is strictly used to report interest in an Event, the schema contains only 16 fields, which describe the relevance of the Event as it was referenced in the Mention. Here is the tl;dr based on the official GDELT 2.0 codebook.

Schema in a nutshell

Below is a very cursory overview of the GDELT 2.0 Mention schema. It is not intended to be authoritative, but rather to give the reader an idea of what kind of data is available in the Mentions dataset.

Base Event record

Every Mention is tied to a single event in GDELT; however, a single Mention source might reference multiple events (think of a single report on a G20 summit, where several events related to leaders, countries, and activists might be referenced). The event related to this specific Mention is referenced by the following fields:

GlobalEventID

This is the unique identifier that maps to the Event dataset record.

EventTimeDate

This is the same 15-minute interval timestamp recorded on the Event record, marking when the event first appeared in a Mention. It is useful for quickly determining how old the event record is without requiring a separate lookup in the Events dataset.

Mention provenance

The following fields are used to uniquely identify the document the Mention was extracted from, and provide some classification and analysis of the document as a whole. Note that the statistics reported are based on the English translation of the document.

MentionTimeDate

The 15-minute interval timestamp within which this Mention was discovered. Comparing this with the EventTimeDate field gives an idea of how long after the event was first recorded this Mention appeared.
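To make the comparison concrete, here is a minimal sketch that computes the lag between the two timestamps. It assumes both fields are 14-digit `YYYYMMDDHHMMSS` values; the helper names and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: EventTimeDate and MentionTimeDate are 14-digit
# YYYYMMDDHHMMSS strings. Check the codebook before relying on this.
GDELT_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_gdelt_timestamp(value: str) -> datetime:
    """Parse a timestamp string into a naive datetime."""
    return datetime.strptime(value, GDELT_TS_FORMAT)


def mention_lag(event_time_date: str, mention_time_date: str) -> timedelta:
    """Time between the event's first recording and this Mention."""
    return parse_gdelt_timestamp(mention_time_date) - parse_gdelt_timestamp(event_time_date)


# Hypothetical values, not taken from a real GDELT file.
lag = mention_lag("20160701120000", "20160701134500")
print(lag)  # 1:45:00
```

A lag of zero means the Mention falls in the same interval in which the event was first recorded.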

MentionType

This categorical field records what type of mention this is, e.g. from the web, JSTOR, or a non-textual source such as voice or video:

1 = WEB
2 = CITATIONONLY
3 = CORE
4 = DTIC
5 = JSTOR
6 = NONTEXTUALSOURCE
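When loading the data, the numeric codes above map naturally onto an enumeration. The sketch below uses the six values listed here; the class and function names are my own choice, not anything GDELT defines.

```python
from enum import IntEnum


class MentionType(IntEnum):
    """MentionType codes as listed above."""
    WEB = 1
    CITATIONONLY = 2
    CORE = 3
    DTIC = 4
    JSTOR = 5
    NONTEXTUALSOURCE = 6


def parse_mention_type(raw: str) -> MentionType:
    """Convert the raw column value to a MentionType.

    Raises ValueError for codes outside the list above.
    """
    return MentionType(int(raw))


print(parse_mention_type("5").name)  # JSTOR
```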

MentionSourceName

The “name” of the source for this Mention. For a website, the top-level domain is recorded; otherwise, the news aggregator service name is given.

MentionIdentifier

This is a unique identifier for the Mention as defined by its type. For a web source, it will be the URL; for a paper or journal article, the DOI; for other MentionType documents, whatever unique identifier can be determined for the Mention is used.

MentionDocLen

The length in English characters of the source document.

MentionDocTone

The overall tone of the Mention document. The Event AvgTone field is calculated from this value across all Mentions in the original 15-minute interval in which the Event record was first discovered.
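As a rough illustration of that relationship, the sketch below averages MentionDocTone over the Mentions of a single event whose MentionTimeDate equals the EventTimeDate. Treating AvgTone as a plain arithmetic mean over those rows is an assumption on my part, and the function name and sample rows are hypothetical.

```python
from statistics import mean


def first_interval_avg_tone(mentions: list[dict]) -> float:
    """Mean tone of the Mentions seen in the event's first interval.

    `mentions` is assumed to hold rows for one event only, as dicts
    keyed by field name with string values.
    """
    tones = [
        float(m["MentionDocTone"])
        for m in mentions
        if m["MentionTimeDate"] == m["EventTimeDate"]
    ]
    if not tones:
        raise ValueError("no mentions in the event's first interval")
    return mean(tones)


# Hypothetical rows for one event: two first-interval mentions, one later.
rows = [
    {"EventTimeDate": "20160701120000", "MentionTimeDate": "20160701120000", "MentionDocTone": "-2.0"},
    {"EventTimeDate": "20160701120000", "MentionTimeDate": "20160701120000", "MentionDocTone": "1.0"},
    {"EventTimeDate": "20160701120000", "MentionTimeDate": "20160701134500", "MentionDocTone": "5.0"},
]
print(first_interval_avg_tone(rows))  # -0.5
```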

MentionDocTranslationInfo

This field is internally delimited by semicolons and is used to record provenance information for machine translated documents.
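A minimal sketch of splitting this field follows. The semicolon delimiter comes from the description above; the `key:value` shape of each entry and the sample string are assumptions for illustration, so check the codebook for the actual keys.

```python
def parse_translation_info(raw: str) -> dict[str, str]:
    """Split a MentionDocTranslationInfo value into key/value pairs.

    Assumption: each semicolon-delimited entry looks like "key:value".
    Entries without a colon are kept as keys with an empty value.
    """
    info: dict[str, str] = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue  # skip empty entries, e.g. from a trailing semicolon
        key, sep, value = part.partition(":")
        info[key.strip()] = value.strip() if sep else ""
    return info


# Hypothetical value, shown only to illustrate the parsing.
sample = "srclc:fra; eng:Moses 2.1.1"
print(parse_translation_info(sample))
```

An empty field, which I would expect for documents that were not machine translated, yields an empty dictionary.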

Event mention information

The following fields relate the event to the document regarding how it was used, where it was found, etc.

SentenceID

The sentence within the article where the event was mentioned (1-based).

Actor1CharOffset

The location within the article (in terms of English characters) where Actor1 for the Event was found.

Actor2CharOffset

The location within the article (in terms of English characters) where Actor2 for the event was found.

ActionCharOffset

The location within the article (in terms of English characters) where the core Action description of the event was found.
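One simple use of the three offsets is to see the order in which the parts of the event appear in the document. The sketch below is illustrative only: the function name and the offset values are made up.

```python
def order_of_appearance(actor1_offset: int, actor2_offset: int, action_offset: int) -> list[str]:
    """Return the event components sorted by their character offset."""
    offsets = {
        "Actor1": actor1_offset,
        "Actor2": actor2_offset,
        "Action": action_offset,
    }
    # Sort the component names by where they occur in the document.
    return sorted(offsets, key=offsets.get)


# Hypothetical offsets: Actor1 at 120, Actor2 at 310, Action at 205.
print(order_of_appearance(120, 310, 205))  # ['Actor1', 'Action', 'Actor2']
```

Dividing an offset by MentionDocLen gives a relative position, which hints at whether the event appears near the top of the document or deep in the body.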

InRawText

This field is a simple boolean flag recording whether the event was explicitly referenced in the raw text of the document (1) or was discovered in a rewrite by the TABARI system resolving coreferences and disambiguations (0).

Confidence

Percent confidence (0-100) in the extraction of this event from this article. The score tends to be tied to the InRawText field: an extraction from rewritten text will have varying confidence scores, while a raw-text extraction will often score highly.
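These two fields make a convenient filter for dropping weak extractions. The following sketch assumes each Mention is a dict keyed by field name with string values; the function name, the default threshold of 50, and the sample rows are arbitrary choices for illustration.

```python
from typing import Iterable, Iterator


def confident_mentions(
    mentions: Iterable[dict],
    min_confidence: int = 50,
    raw_text_only: bool = False,
) -> Iterator[dict]:
    """Yield Mentions at or above a confidence threshold.

    With raw_text_only=True, also require InRawText == 1.
    """
    for mention in mentions:
        if int(mention["Confidence"]) < min_confidence:
            continue
        if raw_text_only and int(mention["InRawText"]) != 1:
            continue
        yield mention


# Hypothetical rows, not taken from a real GDELT file.
rows = [
    {"GlobalEventID": "1", "InRawText": "1", "Confidence": "100"},
    {"GlobalEventID": "2", "InRawText": "0", "Confidence": "40"},
    {"GlobalEventID": "3", "InRawText": "0", "Confidence": "70"},
]
kept = [m["GlobalEventID"] for m in confident_mentions(rows, min_confidence=50)]
print(kept)  # ['1', '3']
```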

Additional fields

Extras

This field is currently blank, but is reserved for future use by the GDELT system.
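Putting the 16 fields together, a single row can be split into named values. This is a sketch under two assumptions that should be checked against the codebook: that the file is tab-delimited, and that the columns appear in the order listed in `MENTION_FIELDS`. The sample row is invented.

```python
# Assumption: column order of the Mentions file; verify against the codebook.
MENTION_FIELDS = [
    "GlobalEventID", "EventTimeDate", "MentionTimeDate", "MentionType",
    "MentionSourceName", "MentionIdentifier", "SentenceID",
    "Actor1CharOffset", "Actor2CharOffset", "ActionCharOffset",
    "InRawText", "Confidence", "MentionDocLen", "MentionDocTone",
    "MentionDocTranslationInfo", "Extras",
]


def parse_mention_row(line: str) -> dict[str, str]:
    """Split one tab-delimited line into a dict keyed by field name."""
    # Strip only the line ending so trailing empty columns survive.
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(MENTION_FIELDS):
        raise ValueError(f"expected {len(MENTION_FIELDS)} columns, got {len(values)}")
    return dict(zip(MENTION_FIELDS, values))


# Hypothetical row, built here only to exercise the parser.
sample_line = "\t".join([
    "123456789", "20160701120000", "20160701134500", "1",
    "example.com", "https://example.com/story", "3",
    "120", "310", "205", "1", "100", "2400", "-1.5", "", "",
]) + "\n"

row = parse_mention_row(sample_line)
print(row["MentionSourceName"], row["Confidence"])  # example.com 100
```

All values are kept as strings here; numeric fields can be converted as needed once the column order has been confirmed.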

Conclusion

As can be seen, the GDELT 2.0 Mention dataset contains a rich set of fields to describe an event’s mention in media, papers, or other sources. The data can be used to track the relative importance of an event via frequency, volume, and relative document location as well as the impact of historical events.
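As a small illustration of the frequency idea, the sketch below counts Mentions per event. It assumes the Mentions have already been loaded as dicts keyed by field name; the function name and the event IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable


def mention_counts(mentions: Iterable[dict]) -> Counter:
    """Count how many Mentions reference each GlobalEventID."""
    return Counter(m["GlobalEventID"] for m in mentions)


# Hypothetical Mentions referencing two events.
mentions = [
    {"GlobalEventID": "42"},
    {"GlobalEventID": "7"},
    {"GlobalEventID": "42"},
    {"GlobalEventID": "42"},
]
counts = mention_counts(mentions)
print(counts.most_common(1))  # [('42', 3)]
```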
