There’s probably an error code for every situation; for this post, 538 seems well suited. It’s a Windows error that returns a dialog about the ABIOS (Advanced Basic I/O System) subsystem, indicating invalid entries and corrupted drivers. Despite their obscurity to most of us, these are actually common and analogous issues in developing data projects for journalism: corrupted, dated, or invalid info is problematic in both cases. This is a post about one of those cases.
If you’ve been following journalistic tracking of the Nigeria kidnappings, then you might have come across 538, a collective of hackers and journalists that has been reporting on the topic and recently posted this set of maps using GDELT (Global Database of Events, Language, and Tone) data. This garnered a series of pretty solid rebuttals about the integrity of its assertions; see @charlie_simpson’s Storify feed and Daniel Solomon on Source. The problem with the piece in question (to summarize the previous links) is that it provides time-series and mapped analysis of kidnapping in Nigeria but skews representation of the actual data plotted.
As someone who works with journo orgs, crowdsourced crisis-mapping projects, data, and Africa, I thought I’d comment briefly on some of the fallibilities. The particular fumbles I see in the 538 representation of kidnapping incidents in Nigeria can be bundled under three issues that are persistently problematic in all data journalism projects.
ISSUE 1: REPRESENTATIONAL INTEGRITY
A lot of issues with data mapping/graphing projects boil down to human representational error: what is your map actually showing, and what are you saying it’s showing? In this case, equating GDELT media data with actual incident data is a superfail, but not only in the (mis)representation of the source used. The failure to buttress that representation with clear disclaimers and other data is also unfortunate and worth commenting on here. The quotes below are taken from the 538 article in question.
“Official kidnapping statistics for Nigeria aren’t available, and our numbers do provide a good relative picture; we can see where kidnappings in Nigeria are most prevalent.“
This points to data paucity, which is fair, definitely a speedbump, but not entirely excusable. We’ve been spoiled perhaps by the assumption that everything should have a .csv download or an API endpoint, or that you can get all of the things from one aggregation feed, but some more context here would help.
What about showing why/how your query was unsatisfactory? If you search in Prognoz (the Nigerian Statistical Open Data Portal that the 538 author used to search), you do find data under “Public Order and Safety” as a data category, and indicators (search terms) like “kidnapping” return graphs from 2006 onward.
Likewise, if a trend in one data set is notable, particularly a geographic density of “events” on the map, it’s worth looking at other data to supplement your assumptions.
“One possible explanation is the region’s oil wealth, otherwise known as the curse of the black gold. The United Nations news service has also highlighted how oil extraction in the south of Nigeria has been accompanied by violence and criminality.”
If a relationship to oil by region is of interest, Prognoz has data for that (Macro-Economic Data > Petroleum), or maybe there’s another relationship to geography worth exploring: topography, environmental influences. Perhaps an analysis against other mapping projects devoted to those data, like Oil Spill Monitor – Nigeria or flood tracking and standing water in the regions where 538 notes a density of kidnappings, would be of comparative interest. Are there other geographic factors that might affect crises worth exploring?
There’s a value in layering data sets and comparisons across mainstream and social media, and the real value of journalism’s take on these data is the comparative perspective it can provide, recognizing the weaknesses between data sets and using them to crosscheck each other rather than only “normalizing” to control for error in one set.
“This is a somewhat crude calculation. We’re counting all geolocated kidnappings in the GDELT database since 1982 and dividing that by each state’s current population.“
So, does that mean that the current population in a region was the denominator for that division across all decades (because at the time of this post, the population link provided in the post doesn’t load)? Where is the data? How can people access it? Can I get a tooltip with counts and calcs in the time series (pretty sure CartoDB supports this; I mean, really, man)?
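To make the denominator question concrete, here is a minimal Python sketch of the two readings of that calculation. Every number in it is an invented placeholder (not GDELT or census data), and `rate_per_100k` is a hypothetical helper name, so treat it as an illustration of the arithmetic rather than a reconstruction of 538's method.

```python
# Illustrative only: every number below is a made-up placeholder,
# not real GDELT or census data.

def rate_per_100k(count, population):
    """Events per 100,000 residents."""
    return count / population * 100_000

# Hypothetical yearly report counts and populations for one state.
counts_by_year = {1990: 2, 2000: 5, 2010: 20}
population_by_year = {1990: 1_000_000, 2000: 1_500_000, 2010: 2_000_000}
current_population = 2_000_000

# Reading 1 (as the quote describes it): pool every year's count
# and divide by today's population.
pooled_rate = rate_per_100k(sum(counts_by_year.values()), current_population)

# Reading 2: divide each year's count by that year's population.
yearly_rates = {
    year: rate_per_100k(count, population_by_year[year])
    for year, count in counts_by_year.items()
}

print(f"pooled over all years: {pooled_rate:.2f} per 100k")
for year, rate in sorted(yearly_rates.items()):
    print(f"{year}: {rate:.2f} per 100k")
```

The pooled figure folds three decades into one number, so it can't be read as an annual rate and hides when the counts actually rose. Publishing the counts, the denominators, and the calculation side by side would let readers check which reading applies.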
ISSUE 2: DATA BULLETPROOFING
This is the predominant criticism in both rebuttals, the refrain of all journo projects being a pretty neat alliterative philosophy: check, compare, contextualize.
As this has been well-covered by the other critics and is a pretty well-documented challenge in journalism (see: “verification by replication,” scientific method-style), I won’t belabor it here. Qualified outfits have written impressive how-tos (like this awesome one from ProPublica) though the process for bullet-proofing each piece is usually custom. There are also papers and projects like the Data Verification Handbook, and applications like Twittcred and Storyful aimed at affirming social media.
Early in this bullet-proofing process, it’s also helpful to take a look at comparative projects and use them to illustrate why your analysis is distinct, and how it contributes to a gap. Nigeria Security Tracker also has mapped violence and fatalities in a time series; Nigeria Watch provides a database of violence trends as well, and there are other authoritative and georeferenceable event data with downloadable datasets worth querying against to better verify GDELT.
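As a rough sketch of what querying one dataset against another might look like, the Python below flags state/year cells where two sources disagree sharply or where one has no entry at all. The counts, the state labels, the `flag_disagreements` helper, and the 50% tolerance are all assumptions made up for illustration; none of it comes from GDELT, Nigeria Security Tracker, or Nigeria Watch.

```python
# Illustrative only: counts are invented placeholders, not figures from
# GDELT, Nigeria Security Tracker, or Nigeria Watch.

def flag_disagreements(source_a, source_b, tolerance=0.5):
    """Return keys where two sources differ by more than `tolerance`
    (as a fraction of the larger count), or where one source has no entry."""
    flagged = {}
    for key in sorted(set(source_a) | set(source_b)):
        a = source_a.get(key)
        b = source_b.get(key)
        if a is None or b is None:
            # One source is silent on this cell: worth a manual look.
            flagged[key] = (a, b)
            continue
        larger = max(a, b)
        if larger and abs(a - b) / larger > tolerance:
            flagged[key] = (a, b)
    return flagged

# Hypothetical (state, year) -> event counts from two sources.
media_derived = {("State A", 2013): 40, ("State B", 2013): 12, ("State C", 2013): 7}
curated = {("State A", 2013): 9, ("State B", 2013): 10}

for key, (a, b) in flag_disagreements(media_derived, curated).items():
    print(key, "media-derived:", a, "curated:", b)
```

A flagged cell isn't proof that either source is wrong; it's a prompt to go read the underlying records before putting the number on a map.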
ISSUE 3: SOCIAL OR SECONDARY MEDIA AS SOURCE
Lastly, and predictably, there are always hiccups when plotting social and secondary media accounts as events.
Analytics on postings and general media circulation can be valuable for viewing the conversation around a topic online, but they can also be speciously spun to represent the density of actual crises or activity in an area. Counting the tweets related to #nigeria isn’t entirely useful for modeling a threat without filters or ways to validate those postings. Even GDELT, in its ambitious program “to provide the global research community with its first open global multi-decade quantitative database of human society” is still researching how to best verify social data.
Let’s look at a more general example of mapping data. GDELT represents media activity around topics, like how Google Trends represents search activity on topics, but both can be confused with representing incidents. In the latter case, examples of secondary source and interpretive fumble abound.
Take Flu Trends:
or this Google Trends graph of a few JS libs (note the rise of Angular JS in recent times):
Just as there’s a tendency to consider a social media campaign as solely-sufficient involvement in a crisis situation, there’s a tendency to tap a feed aggregation or media API as an authoritative representation of actual events. The distinction between social and mainstream media fuzzes when mainstream relies on social or secondary media as data, a problem in the 538 case, as they provide analysis of an aggregation feed of secondary media accounts of events.
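One way media volume can masquerade as incident volume is simple repetition: several outlets covering the same event. The Python sketch below contrasts counting reports with counting distinct events. The outlets, dates, and towns are hypothetical, and matching on exact date and location is a deliberate simplification, not how GDELT or any real pipeline resolves duplicates.

```python
from collections import Counter

# Illustrative only: hypothetical media reports, not real GDELT records.
# Each tuple is (outlet, event_date, location) for one published article.
reports = [
    ("Outlet 1", "2014-01-10", "Town X"),
    ("Outlet 2", "2014-01-10", "Town X"),
    ("Outlet 3", "2014-01-10", "Town X"),
    ("Outlet 1", "2014-02-03", "Town Y"),
]

# Counting articles: every report becomes a dot on the map.
reports_per_location = Counter(location for _, _, location in reports)

# Counting events: reports sharing a date and location collapse into one.
# Real deduplication is much harder (dates and place names vary between
# outlets); exact matching is a simplifying assumption.
distinct_events = {(date, location) for _, date, location in reports}
events_per_location = Counter(location for _, location in distinct_events)

print("reports:", dict(reports_per_location))
print("events: ", dict(events_per_location))
```

In this toy data Town X looks three times as dangerous as Town Y by report count, while both have a single event. A map built on the first counter is a map of coverage.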
Often, social media is incredibly powerful for plotting the general conversation about a topic (I’m looking at you, Westgate twitter tracking). Some of the most positive reactions to this crisis have been piloted by social media (#BringBackOurGirls), whose impact can be limited practically but potentially epic as an indictment of what the government and mainstream media are doing comparatively. There’s little that’s more shameful in our digital world than having your government and formal press upstaged by hipster hashtag advocacy. That’s not to say, certainly, that these campaigns aren’t subject to their own epic blunders of failed verification (see: #yikes).
No one is perfect all of the time, or capable of pleasing all the people, certainly. GDELT is an imperfect source of most things beyond tracking media reaction, so it fails in this effort to echo its output back as event data (see Source). However, media reaction is still interesting for other analyses, hence the media reaction to these maps; the integrity of a news organization and its output of (even aggregated) content is still worth indexing.
EOD, the ethics of data journalism and best practices haven’t been adequately codified for these kinds of stories. At last year’s Highway Africa conference, Peter Horrocks (BBC) talked about the best indices of quality media covering Africa being somewhere at the intersection of how an organization covers domestic events and how it covers its mistakes (see his full talk here). In this latter case, media reaction is important, if for a different reason. We’ll see how 538 reacts, and maybe learn something about how to manage future code-fumbles. I’m looking forward to more verification protocols: representational integrity, data bulletproofing, and secondary sourc-ery 😉
* Thanks to J. Morgan, E. Constantaras, and J. Rotich for contributing data, time, and thoughts to this post.