Exploring the GDELT Project


The Global Database of Events, Language, and Tone (GDELT) Project is a massive database of global events, related media mentions, and extracted/inferred information, with coverage going back to 1979.  It provides a rich set of data that can be explored for macroeconomics, humanitarian relief, geopolitical movements, and much more.

Background

The GDELT Project is the brainchild of Dr. Kalev Leetaru, who has spent a large part of his academic career exploring the web, how it impacts society, and how society in turn leverages the web.  GDELT was first reported in the literature in 2013 and, according to Google Scholar, has appeared in over 800 publications since.  It utilizes the TABARI processing engine with the CAMEO coding system to process published material from all over the web, in almost any language and in multiple media formats.  The data is available via visualization engines, Google's BigQuery, and raw tab-delimited files.  It has also been loaded into a graph database that supports more advanced queries for finding causal links and related actors.
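To make the raw-files option concrete, here's a minimal sketch of pulling one 15-minute event batch into pandas.  The URL and file-name pattern are my assumptions from reading the GDELT documentation (check the site for the authoritative pattern); the 2.0 export files themselves are plain tab-separated values with no header row.

```python
import io
import zipfile

import pandas as pd
import requests

# Hypothetical batch file name; the 15-minute exports follow a
# timestamped naming pattern documented on the GDELT site.
EXPORT_URL = "http://data.gdeltproject.org/gdeltv2/20180601000000.export.CSV.zip"

resp = requests.get(EXPORT_URL, timeout=60)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    member = zf.namelist()[0]
    with zf.open(member) as fh:
        # Tab-delimited, no header row; one row per event record.
        events = pd.read_csv(fh, sep="\t", header=None, low_memory=False)

print(events.shape)  # expect 61 columns for the event table
```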

Events are identified and tagged with 61 fields, ranging from basic timestamps and the source of the first report of the event to extracted information about the involved actors, the type of activity, geospatial information, the Goldstein score for the event, and the tone of the initial report.  Follow-on media mentions are reported in a separate collection in which the original event is referenced and 15 other fields of data are recorded, covering (among other things) when and where the mention appeared, where within the media item it appeared, and the overall tone of the article itself.
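The practical upshot of that layout is that the mentions collection points back to its originating event through a shared global event ID, so the two tables can be joined.  A small sketch, with column names taken from my reading of the GDELT 2.0 codebook (treat them as assumptions to verify against the codebook itself):

```python
import pandas as pd

# A handful of the 61 event fields that matter most for my purposes.
# Names follow the GDELT 2.0 codebook as I understand it.
KEY_EVENT_FIELDS = [
    "GlobalEventID", "Day", "Actor1CountryCode", "Actor2CountryCode",
    "EventRootCode", "QuadClass", "GoldsteinScale", "AvgTone",
    "ActionGeo_CountryCode",
]

def link_mentions(events: pd.DataFrame, mentions: pd.DataFrame) -> pd.DataFrame:
    """Join follow-on media mentions back to the events they reference."""
    slim_events = events[KEY_EVENT_FIELDS]
    return mentions.merge(slim_events, on="GlobalEventID", how="left")
```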

The value of the data is apparent to even the most casual observer, and the fact that so many papers in the past five years have cited it speaks volumes about its acceptance by the academic research community.

My Interest

I’ve been active in software engineering for many years and am about to complete my MSCS with a specialization in Machine Learning.  I also happen to hold an MBA in Finance, where I focused on macro-level analysis of financial systems and global economics.

It’s no secret that the Holy Grail of economics and finance is the ability to spot trends in global markets, for a multitude of reasons.  I have seen several academic papers attempting to use supervised learning models (mostly MLP/ANN approaches) to predict markets, but many appear to rely on the technical indicators used in quantitative trading.  In my opinion, these are a bit of a self-fulfilling prophecy: all they are doing is predicting the aftermath of automated quant traders reacting to the same technical indicators, which is not much of a prediction, and usually very short-term to boot.

My hypothesis is that technical trading indicators for a single market are pretty much worthless unless you’re trading with automated systems and can react in under 100 ms.  Further, I propose that financial markets are no longer strictly impacted by their respective domestic events and politics, as many market makers have a global presence.  For example, this week we have seen that what happens in Italy can have a regional and even global impact, and that the GDPR might have serious consequences for the global web-based economy; neither of those would show up in a moving average or a Sharpe ratio.
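For context on why those indicators can’t capture such events: both are computed entirely from a single market’s own price history, so an external shock only shows up in them after prices have already moved.  A quick sketch of the standard textbook definitions (not tied to any particular trading system):

```python
import numpy as np
import pandas as pd

def moving_average(prices: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average: a trailing summary of past prices only."""
    return prices.rolling(window).mean()

def sharpe_ratio(returns: pd.Series, risk_free_annual: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio under the usual i.i.d.-returns assumption."""
    excess = returns - risk_free_annual / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std())
```

Both functions see nothing but the price series itself; an Italian political crisis or a new regulation is invisible to them until it has already moved prices.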

The GDELT dataset is massive: something like 400 million records and counting, with CAMEO codes in the hundreds and cross-indexing matrices several layers deep across the 61 event fields and 15 mention fields, all updated in batches every 15 minutes.  My exploration of GDELT will consist of trying to reduce the feature set to a manageable size while maintaining as much of the semantics as possible (meaning PCA, ICA, and other purely statistical dimensionality-reduction algorithms are not really applicable).  In doing so, I hope to produce a condensed time-series dataset that can be used more effectively to predict macroeconomic shifts in global markets with simple(r) supervised learning models, and that may also be more easily leveraged by unsupervised learning approaches such as clustering.
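As a rough illustration of what that condensed time series might look like, the sketch below collapses raw events into one row per day, averaging the Goldstein score and tone per CAMEO root code.  The column names follow the GDELT 2.0 codebook as I understand it, and the aggregation choices (mean, count) are purely illustrative, not the final design.

```python
import pandas as pd

def condense_daily(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse raw GDELT events into one feature row per day.

    Each CAMEO root code contributes an average Goldstein score, an
    average tone, and an event count, pivoted into a flat vector that
    simple(r) supervised models or clustering can work with.
    """
    daily = (
        events
        .assign(date=pd.to_datetime(events["Day"], format="%Y%m%d"))
        .groupby(["date", "EventRootCode"])
        .agg(goldstein=("GoldsteinScale", "mean"),
             tone=("AvgTone", "mean"),
             n_events=("GlobalEventID", "count"))
        .unstack("EventRootCode")
    )
    # Flatten MultiIndex columns: ("goldstein", "14") -> "goldstein_14"
    daily.columns = [f"{stat}_{code}" for stat, code in daily.columns]
    return daily
```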