A previous post introduced the GDELT Project and discussed its origins and the high-level information included in the project’s datasets. This post explores the version 2.0 features of the Event dataset, which is the core data that everything else is keyed off.
The GDELT Events dataset is compiled by the TABARI system from headlines and other reporting services at 15-minute intervals. The initial event record reports when and where the event was first discovered, how many media mentions were detected in the 15-minute update window, and various automatically extracted pieces of information. The event is structured as an (Actor1, Event, Actor2) triple with additional extended geospatial data to help pinpoint exactly what happened and where. Additional data reports on the original language of the media, tone, and other quantitative scales regarding the recorded event. All told, there are 61 features in the GDELT Events 2.0 schema. Here is the tl;dr based on the official GDELT 2.0 codebook.
Schema in a nutshell
Below is a very cursory overview of the GDELT 2.0 Event schema. It is not intended to be authoritative, but rather to give the reader an idea of what kind of data is available in the Event dataset.
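To give a feel for the overall shape of the schema, here is a minimal sketch that rebuilds the 61 column names from the field groups covered in this post. The spellings and ordering are my reading of the GDELT 2.0 codebook and should be treated as assumptions (some distributions name the first date column SQLDATE rather than Day), so check them against the codebook before relying on them. The helper name `event_columns` is purely illustrative.

```python
# Sketch: rebuild the GDELT 2.0 Event column names from their logical groups.
# Column spellings and order are assumed from the codebook; verify before use.

ID_AND_DATES = ["GlobalEventID", "Day", "MonthYear", "Year", "FractionDate"]

# Ten fields per actor: the composite code, a name, and the component codes.
ACTOR_SUFFIXES = [
    "Code", "Name", "CountryCode", "KnownGroupCode", "EthnicCode",
    "Religion1Code", "Religion2Code", "Type1Code", "Type2Code", "Type3Code",
]

EVENT_FIELDS = [
    "IsRootEvent", "EventCode", "EventBaseCode", "EventRootCode", "QuadClass",
    "GoldsteinScale", "NumMentions", "NumSources", "NumArticles", "AvgTone",
]

# Eight geolocation fields for each of Actor1Geo, Actor2Geo and ActionGeo.
GEO_SUFFIXES = [
    "Type", "FullName", "CountryCode", "ADM1Code", "ADM2Code",
    "Lat", "Long", "FeatureID",
]

FIRST_APPEARANCE = ["DATEADDED", "SOURCEURL"]


def event_columns():
    """Return the assumed Event 2.0 column names in file order."""
    columns = list(ID_AND_DATES)
    for actor in ("Actor1", "Actor2"):
        columns += [f"{actor}{suffix}" for suffix in ACTOR_SUFFIXES]
    columns += EVENT_FIELDS
    for prefix in ("Actor1Geo", "Actor2Geo", "ActionGeo"):
        columns += [f"{prefix}_{suffix}" for suffix in GEO_SUFFIXES]
    columns += FIRST_APPEARANCE
    return columns


if __name__ == "__main__":
    print(len(event_columns()))
```

The group sizes line up with the counts quoted in this post: 5 ID and date fields, 20 actor fields, 10 event fields, 24 geolocation fields, and 2 first-appearance fields.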
Event ID and dates
The Event schema starts off with the following indexable fields:
The GlobalEventID field is the unique identifier for the event. It is effectively the primary key of the GDELT Event schema and the key that all other datasets relate to.
Day, MonthYear, Year, FractionDate
These fields all present the event discovery date in various formats, down to day granularity for the Day and FractionDate fields.
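As a small illustration of how these date fields relate to each other, the sketch below parses a Day value and derives the matching MonthYear and Year values. It assumes Day is stored as a YYYYMMDD integer and MonthYear as YYYYMM, which is my understanding of the codebook; the function names are hypothetical. FractionDate is left out on purpose because I don't want to guess at its exact formula.

```python
from datetime import date


def parse_day(day):
    """Convert a Day value such as 20180612 (assumed YYYYMMDD) into a date."""
    text = str(day)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {day!r}")
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))


def month_year(d):
    """Build the MonthYear value (assumed YYYYMM) for a date."""
    return d.year * 100 + d.month


if __name__ == "__main__":
    discovered = parse_day(20180612)
    print(discovered, month_year(discovered), discovered.year)
```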
Next up are the CAMEO actor codes that reflect who (Actor1) did what (Event) to whom (Actor2). Each actor code is compiled from up to five three-letter CAMEO codes from the following categories, with an additional name if one was identified. Between Actor1 and Actor2, these codes account for 20 of the fields in the Event 2.0 schema.
The country codes reflect as closely as possible the nationality of the actors involved in the event. There are 261 available country codes, but even with that rich set of options there might be no identifiable or relevant nationality for one or both actors in the (Actor1, Event, Actor2) triple.
The known group codes currently consist of 117 identified groups that could have been involved in the event; again, there might not be an identifiable group, and even if a group is identified there might not be a related nationality for the country code field.
If possible, the TABARI system will attempt to tag the actors with one of 646 identified ethnicities, which can help narrow a search down to more localized events and to events related specifically to ethnic interactions.
Many events are related to religions in some way or another, and the TABARI system is able to identify up to 31 religions and sub-religions via a composite two-field feature in the Event schema.
To accommodate additional classifications of actors that don’t necessarily fall into one of the above (or possibly better elaborate the actor) there are 40 CAMEO type codes, from which up to three values can be selected to complete the actor composite field.
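Because the composite actor code is a concatenation of three-letter CAMEO codes, it can be broken back into its parts with simple slicing. The sketch below assumes every component is exactly three characters long; the function name `split_actor_code` is hypothetical. Note that position alone does not say which category a chunk belongs to, so the dedicated country, group, ethnic, religion, and type fields remain the reliable way to get that information.

```python
def split_actor_code(actor_code):
    """Split a composite CAMEO actor code into its three-letter parts.

    Assumes every component code is exactly three characters long.
    An empty string (no actor identified) yields an empty list.
    """
    if len(actor_code) % 3 != 0:
        raise ValueError(f"unexpected actor code length: {actor_code!r}")
    return [actor_code[i:i + 3] for i in range(0, len(actor_code), 3)]


if __name__ == "__main__":
    # "USAGOV" combines a country code (USA) with a type code (GOV).
    print(split_actor_code("USAGOV"))
```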
In the (Actor1, Event, Actor2) triple, we now encounter the event-specific fields. These fields describe what kind of event it is, where it falls in the CAMEO event hierarchy, and the magnitude of the negative or positive impact for the type of event. There are 10 fields allocated to quantifying and categorizing the event.
IsRootEvent, EventCode, EventBaseCode, EventRootCode
These fields provide the CAMEO event classification, which is a 4-tier hierarchy of 310 event classes, ranging from strongly positive (e.g. “Retreat or surrender militarily”) to strongly negative (e.g. “Engage in ethnic cleansing”).
The QuadClass field provides a broader categorization of an event, classifying it as one of the following:
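The hierarchy is encoded in the event code itself: as I read the codebook, the root code is the first two digits and the base code is the first three, so the parent levels can be derived from EventCode by prefix. The sketch below assumes exactly that; the function name `event_hierarchy` is hypothetical. Codes are kept as strings because leading zeros are significant.

```python
def event_hierarchy(event_code):
    """Derive the root and base codes from a CAMEO event code by prefix.

    Assumes the root code is the first two digits and the base code is
    the first three; a two-digit code is its own base code.
    """
    code = str(event_code)
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"unexpected CAMEO event code: {event_code!r}")
    return {
        "EventRootCode": code[:2],
        "EventBaseCode": code[:3],
        "EventCode": code,
    }


if __name__ == "__main__":
    print(event_hierarchy("0251"))
```

If the event codes are loaded as integers, "0251" turns into 251 and the prefixes no longer line up, so it is worth reading these columns as text.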
- Verbal Cooperation
- Material Cooperation
- Verbal Conflict
- Material Conflict
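In the data, this classification is stored as an integer from 1 to 4, in the order listed above according to the codebook. A minimal lookup sketch, with hypothetical helper names, might look like this:

```python
# Quad class values as I understand them from the GDELT 2.0 codebook.
QUAD_CLASSES = {
    1: "Verbal Cooperation",
    2: "Material Cooperation",
    3: "Verbal Conflict",
    4: "Material Conflict",
}


def describe_quad_class(value):
    """Return the label for a quad class value (raises KeyError if unknown)."""
    return QUAD_CLASSES[int(value)]


def is_conflict(value):
    """True for the two conflict classes, False for the cooperation classes."""
    return int(value) in (3, 4)


if __name__ == "__main__":
    print(describe_quad_class(4), is_conflict(4))
```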
The Goldstein Scale is an intensity scale for measuring the impact of events, with positive, negative and neutral scores. It can be used to compare events by their perceived impact on the target of the event (Actor2) as well as on the global scene. The values range from -10.0 (highly negative impact) to +10.0 (highly positive impact).
NumMentions, NumSources, NumArticles
These are counters for the number of total mentions, unique sources, and unique articles respectively for this event in the 15-minute window in which it was first identified. These are static values, and should only be used to identify the immediate reaction to an event. Even then, since the event could have been identified towards the end of a 15-minute window, these counts can be skewed. A better measure is processing the Mentions dataset (another day, another post) to map media mention volume to events.
AvgTone is the average tone of the initial articles in which the event was mentioned. It can range from -100 (very negative tone) to +100 (very positive), with the average usually falling somewhere between -10 and +10; a tone score of zero indicates neutrality.
Additional geolocation fields
In addition to the previously identified location fields, the Event schema provides additional geolocation data that might provide more specific actor and event related information. For example, if the President of the United States (Actor1CountryCode) had a meeting with the Supreme Leader of North Korea (Actor2CountryCode) in Singapore, the Actor1Geo_* and Actor2Geo_* fields would reflect where Actor1 and Actor2 were when the Event occurred, as well as geolocating the Event via the ActionGeo_* fields.
First appearance fields
The last two fields in the Event 2.0 schema are the DATEADDED and SOURCEURL fields. DATEADDED provides the timestamp of the 15-minute interval in which the Event first appeared, and SOURCEURL provides the URI of the media mention where the Event was first discovered.
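For working with DATEADDED in code, the sketch below turns the raw value into a timezone-aware datetime and checks that it sits on a 15-minute boundary. It assumes the field is a YYYYMMDDHHMMSS integer in UTC, which is how I read the 2.0 codebook; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_date_added(value):
    """Parse a DATEADDED value (assumed YYYYMMDDHHMMSS, UTC) into a datetime."""
    parsed = datetime.strptime(str(value), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def on_update_boundary(moment):
    """True if the timestamp falls exactly on a 15-minute update boundary."""
    return moment.minute % 15 == 0 and moment.second == 0


if __name__ == "__main__":
    added = parse_date_added(20180612034500)
    print(added.isoformat(), on_update_boundary(added))
```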
As can be seen, the GDELT 2.0 Event dataset contains a rich set of fields to describe an event that has caught media attention. Some of the fields can be incredibly specific (latitude-longitude for geolocation) or provide a broader view (quad class of an event), as well as the perceived impact of the event (Goldstein score, average tone). My next post will explore the Mentions dataset which references the Event dataset for ongoing media attention to events, even older ones that might still pop up in the news from time to time, and gives a better picture of the media impact of an event.