A previous post introduced the GDELT Project and discussed its origins and the high-level information included in the project’s datasets. This post explores the version 2.0 Mentions dataset, which records where GDELT Events are initially discovered or later referenced, along with statistics about each mention of the event in question.
The GDELT Mentions dataset is compiled by the TABARI system from headlines and other reporting services at 15-minute intervals. The Mentions dataset drives the Event dataset: new events detected in extracted Mentions are used to construct new Event records. If an event can be matched to a previously recorded Event, the Mention references that Event instead, which provides a mechanism for tracking the impact of the event based on frequency, volume, and how the event is referenced. Since the Mentions dataset is strictly used to report interest in an Event, the schema contains only 16 fields, which describe the relevancy of the Event as it was referenced in the Mention. Here is the tl;dr based on the official GDELT 2.0 codebook.
Schema in a nutshell
Below is a very cursory overview of the GDELT 2.0 Mention schema. It is not intended to be authoritative, but rather to give the reader an idea of what kind of data is available in the Mentions dataset.
Base Event record
Every Mention is tied to a single event in GDELT; however, a single Mention source might reference multiple events (think of a single report on a G20 summit, where several events related to leaders, countries, and activists might be referenced). The event related to this specific Mention is referenced by the following fields:
GlobalEventID: This is the unique identifier that maps to the Event dataset record.
EventTimeDate: This is the same 15-minute interval timestamp that is recorded on the Event record for when the event first appeared in a Mention. It is useful for quickly determining how old the event record is without requiring a separate lookup in the Events dataset.
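Because one document can produce several Mention rows, grouping rows by document shows how many distinct events a single source references. The following is a minimal sketch, assuming the rows are already parsed into dictionaries keyed by the codebook field names GlobalEventID and MentionIdentifier (the document identifier described further down). The sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical, hand-written sample rows; real rows come from a Mentions file.
mentions = [
    {"GlobalEventID": 1001, "MentionIdentifier": "https://example.com/g20-report"},
    {"GlobalEventID": 1002, "MentionIdentifier": "https://example.com/g20-report"},
    {"GlobalEventID": 1001, "MentionIdentifier": "https://example.org/follow-up"},
]


def events_by_document(rows):
    """Map each document identifier to the set of event IDs it mentions."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["MentionIdentifier"]].add(row["GlobalEventID"])
    return dict(grouped)


docs = events_by_document(mentions)
print(sorted(docs["https://example.com/g20-report"]))  # [1001, 1002]
```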
The following fields are used to uniquely identify the document the Mention was extracted from, and provide some classification and analysis of the document as a whole. Note that the statistics reported are based on the English translation of the document.
MentionTimeDate: The 15-minute interval timestamp within which this Mention was discovered. Comparing it with the EventTimeDate field gives an idea of how recently the referenced event occurred.
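As a rough illustration of that comparison, the sketch below computes the lag between an event's first appearance and a later mention. It assumes both timestamps use the YYYYMMDDHHMMSS layout; the helper names and sample values are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout: YYYYMMDDHHMMSS.
GDELT_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_gdelt_timestamp(value):
    """Parse a 14-digit GDELT timestamp (int or str) into a datetime."""
    return datetime.strptime(str(value), GDELT_TS_FORMAT)


def mention_lag(event_time_date, mention_time_date):
    """Return how long after the event's first appearance this mention was found."""
    return parse_gdelt_timestamp(mention_time_date) - parse_gdelt_timestamp(event_time_date)


# Made-up values: event first seen at 23:00, mentioned again at 00:15 the next day.
lag = mention_lag(20150218230000, 20150219001500)
print(lag)  # 1:15:00
```

A lag of zero means the mention falls in the same 15-minute interval in which the event was first recorded.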
MentionType: This categorical field records what type of source the mention came from, e.g. the web, JSTOR, or a non-textual source such as voice or video:
- 1 = WEB
- 2 = CITATIONONLY
- 3 = CORE
- 4 = DTIC
- 5 = JSTOR
- 6 = NONTEXTUALSOURCE
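For readers working with the raw numeric codes, the mapping above can be captured in a small enumeration. This is only a convenience sketch; the class and helper names are hypothetical, and the fallback label for unrecognised codes is an assumption rather than anything defined by GDELT.

```python
from enum import IntEnum


class MentionType(IntEnum):
    """Numeric MentionType codes as listed above."""
    WEB = 1
    CITATIONONLY = 2
    CORE = 3
    DTIC = 4
    JSTOR = 5
    NONTEXTUALSOURCE = 6


def mention_type_name(raw_value):
    """Return the label for a raw code (int or numeric string), or 'UNKNOWN'."""
    try:
        return MentionType(int(raw_value)).name
    except ValueError:
        # Raised for non-numeric strings and for codes outside 1-6.
        return "UNKNOWN"


print(mention_type_name("5"))  # JSTOR
```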
MentionSourceName: The “name” of the source for this Mention. For a website, the top-level domain is recorded; otherwise the name of the news aggregator service is given.
MentionIdentifier: This is a unique identifier for the Mention as defined by its type. For a web source it is the URL; for a paper or journal article, the DOI; for other MentionType values, whatever unique identifier can be determined for the Mention is used.
MentionDocLen: The length of the source document in English characters.
MentionDocTone: The overall tone of the Mention document. This value feeds the Event AvgTone field, which is calculated over all Mentions in the 15-minute interval in which the Event record was first discovered.
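To make the relationship between document tone and the Event AvgTone field concrete, the sketch below averages the tone of the mentions that fall in an event's first 15-minute interval. It is an approximation under stated assumptions: rows are dictionaries keyed by the codebook names GlobalEventID, EventTimeDate, MentionTimeDate, and MentionDocTone, and a plain arithmetic mean is used. How GDELT itself computes AvgTone is not reproduced here.

```python
def first_interval_avg_tone(rows, event_id):
    """Mean document tone over mentions found in the event's first interval.

    Returns None when the event has no mentions in that interval.
    """
    tones = [
        float(row["MentionDocTone"])
        for row in rows
        if row["GlobalEventID"] == event_id
        and row["MentionTimeDate"] == row["EventTimeDate"]
    ]
    return sum(tones) / len(tones) if tones else None


# Made-up rows: two mentions in the first interval, one in a later interval.
sample_rows = [
    {"GlobalEventID": 1001, "EventTimeDate": "20150218230000",
     "MentionTimeDate": "20150218230000", "MentionDocTone": "-2.0"},
    {"GlobalEventID": 1001, "EventTimeDate": "20150218230000",
     "MentionTimeDate": "20150218230000", "MentionDocTone": "-4.0"},
    {"GlobalEventID": 1001, "EventTimeDate": "20150218230000",
     "MentionTimeDate": "20150219001500", "MentionDocTone": "1.0"},
]

print(first_interval_avg_tone(sample_rows, 1001))  # -3.0
```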
MentionDocTranslationInfo: This field is internally delimited by semicolons and records provenance information for machine-translated documents.
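A small parser makes the semicolon-delimited layout easier to work with. The sketch below assumes each semicolon-separated part has a `key:value` shape, which goes beyond what is stated above, and the sample string is an illustrative value rather than one taken from a real record. Consult the codebook for the actual subfields.

```python
def parse_translation_info(value):
    """Split a semicolon-delimited provenance string into a dict.

    Assumes each part looks like 'key:value'; parts without a colon are
    stored with an empty value. An empty input yields an empty dict.
    """
    info = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition(":")
        info[key.strip()] = val.strip() if sep else ""
    return info


# Illustrative value only; the keys shown here are assumptions.
sample = "srclc:fra; eng:Moses 2.1.1 / MosesCore Europarl fr-en / GT-FRA 1.0"
print(parse_translation_info(sample)["srclc"])  # fra
```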
Event mention information
The following fields relate the event to the document regarding how it was used, where it was found, etc.
SentenceID: The sentence within the article where the event was mentioned (1-based).
Actor1CharOffset: The location within the article (in English characters) where Actor1 for the event was found.
Actor2CharOffset: The location within the article (in English characters) where Actor2 for the event was found.
ActionCharOffset: The location within the article (in English characters) where the core Action description of the event was found.
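Dividing a character offset by the document length gives a rough measure of how far into the article the event appears. The sketch below is a hypothetical helper; treating a negative offset as "not found" is an assumption made here, not something stated above.

```python
def relative_position(char_offset, doc_len):
    """Return the offset as a fraction of document length (0.0 to 1.0).

    Returns None when the document length is not positive or the offset is
    negative (assumed here to mean the element was not located).
    """
    if doc_len <= 0 or char_offset < 0:
        return None
    return min(char_offset / doc_len, 1.0)


# An action found 250 characters into a 1,000-character document.
print(relative_position(250, 1000))  # 0.25
```

A value near 0 suggests the event is mentioned near the top of the article, while a value near 1 suggests it is mentioned in passing near the end.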
InRawText: This field is a simple boolean flag recording whether the event was explicitly referenced in the raw text of the document (1) or was discovered in a rewrite produced by the TABARI system while resolving coreferences and disambiguations (0).
Confidence: Percent confidence (0-100) in the extraction of this event from this article. The score tends to be tied to the InRawText field: an extraction from rewritten text can have widely varying confidence, whereas a raw-text extraction will often score highly.
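These two fields are natural filters when only trustworthy extractions are wanted. The following sketch assumes rows are dictionaries keyed by Confidence and InRawText; the default threshold of 50 is an arbitrary illustration, not a recommended value.

```python
def reliable_mentions(rows, min_confidence=50, require_raw_text=False):
    """Keep rows at or above a confidence threshold.

    When require_raw_text is True, also drop rows whose event was only
    found in rewritten text (InRawText == 0).
    """
    kept = []
    for row in rows:
        if int(row["Confidence"]) < min_confidence:
            continue
        if require_raw_text and int(row["InRawText"]) != 1:
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration.
candidate_rows = [
    {"GlobalEventID": 1001, "InRawText": "1", "Confidence": "100"},
    {"GlobalEventID": 1002, "InRawText": "0", "Confidence": "60"},
    {"GlobalEventID": 1003, "InRawText": "0", "Confidence": "20"},
]

print(len(reliable_mentions(candidate_rows)))  # 2
print(len(reliable_mentions(candidate_rows, require_raw_text=True)))  # 1
```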
Extras: This field is currently blank, but is reserved for future use by the GDELT system.
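Putting the 16 fields together, a Mentions file can be read with nothing more than the standard library. The sketch below assumes the files are tab-delimited with no header row and that the columns appear in the order given by the codebook. The column list, the helper name, and the sample line are written for this example and should be checked against the codebook before use.

```python
import csv
import io

# Column names in assumed codebook order (16 fields).
MENTION_COLUMNS = [
    "GlobalEventID", "EventTimeDate", "MentionTimeDate", "MentionType",
    "MentionSourceName", "MentionIdentifier", "SentenceID",
    "Actor1CharOffset", "Actor2CharOffset", "ActionCharOffset",
    "InRawText", "Confidence", "MentionDocLen", "MentionDocTone",
    "MentionDocTranslationInfo", "Extras",
]


def read_mentions(handle):
    """Yield one dict per well-formed row from a tab-delimited file object.

    Values are left as strings; rows with an unexpected field count are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(MENTION_COLUMNS):
            continue
        yield dict(zip(MENTION_COLUMNS, fields))


# A made-up record with the last two fields left blank.
sample_line = "\t".join([
    "1001", "20150218230000", "20150218230000", "1", "example.com",
    "https://example.com/story", "3", "120", "-1", "140", "1", "60",
    "2500", "-2.5", "", "",
]) + "\n"

rows = list(read_mentions(io.StringIO(sample_line)))
print(rows[0]["MentionSourceName"])  # example.com
```

With a real file, the same function can be given a handle opened via `open(path, newline="", encoding="utf-8")`.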
As can be seen, the GDELT 2.0 Mention dataset contains a rich set of fields to describe an event’s mention in media, papers, or other sources. The data can be used to track the relative importance of an event via frequency, volume, and relative document location as well as the impact of historical events.
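As a closing illustration of tracking an event by frequency, the sketch below counts Mention rows per event and 15-minute interval. It assumes rows are dictionaries keyed by the codebook names GlobalEventID and MentionTimeDate; the sample data is invented.

```python
from collections import Counter


def mention_counts(rows):
    """Count Mention rows per (GlobalEventID, MentionTimeDate) pair."""
    return Counter((row["GlobalEventID"], row["MentionTimeDate"]) for row in rows)


# Made-up rows: event 1001 is mentioned twice in one interval and once later.
timeline_rows = [
    {"GlobalEventID": 1001, "MentionTimeDate": "20150218230000"},
    {"GlobalEventID": 1001, "MentionTimeDate": "20150218230000"},
    {"GlobalEventID": 1001, "MentionTimeDate": "20150219001500"},
    {"GlobalEventID": 1002, "MentionTimeDate": "20150219001500"},
]

counts = mention_counts(timeline_rows)
print(counts[(1001, "20150218230000")])  # 2
```

Plotting these counts over successive intervals shows how quickly interest in an event rises and fades.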