538: Errors, Plotting Crises, and the Protocol of Re:Processing Data

There’s probably an HTTP error code for every situation; for this post, 538 seems to suit well. It’s a Windows error that returns a dialog about ABIOS (Basic I/O Subsystem), indicating invalid entries and corrupted drivers. Despite their obscurity to most of us, these are actually common and analogous issues in developing data projects for journalism…corrupted, dated, or invalid info being problematic in both cases. This is a post about one of those cases.

[Image: ABIOS error code dialog]

If you’ve been following journalistic tracking of the Nigeria kidnappings, then you might have come across 538, a collective of hackers and journalists that has been reporting on the topic and recently posted this set of maps using GDELT (Global Database of Events, Language, and Tone) data. The piece garnered a series of pretty solid rebuttals about the integrity of its assertions; see @charlie_simpson’s Storify feed and Daniel Solomon on Source. The problem with the piece in question (to summarize the previous links) is that it provides time-series and mapped analysis of kidnappings in Nigeria but skews the representation of the actual data plotted.

As someone who works with journo orgs, crowdsourced crisis-mapping projects, data, and Africa, I thought I’d comment briefly on some of the fallibilities. The particular fumbles I see in the 538 representation of kidnapping incidents in Nigeria can be bundled under three issues that are persistently problematic in all data journalism projects.

ISSUE 1: REPRESENTATIONAL INTEGRITY

A lot of issues with data mapping/graphing projects boil down to human representational error: what is your map actually showing, and what are you saying it’s showing? In this case, equating GDELT media data with actual incident data is a superfail, but not only in the (mis)representation of the source used. The failure to buttress that representation with clear disclaimers and other data is also unfortunate, and worth commenting on here. Quotes below are taken from the 538 article in question.

Official kidnapping statistics for Nigeria aren’t available, and our numbers do provide a good relative picture; we can see where kidnappings in Nigeria are most prevalent.

This points to data paucity, which is fair and definitely a speedbump, but not entirely excusable. We’ve been spoiled, perhaps, by the assumption that everything should have a .csv download or an API endpoint, or that you can get all of the things from one aggregation feed, but some more context here would help.

The link in this quote, for example, should be bracketed in context; linking to a 404 (“aren’t available”) like this is unhelpful when you don’t know the query that led to it.

What about showing why/how your query was unsatisfactory? If you search in Prognoz (the Nigerian Statistical Open Data Portal the 538 author used), you do find “Public Order and Safety” as a data category, and indicators (search terms) like “kidnapping” return graphs from 2006 onward.
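
A documented query also makes a dead link reproducible. As a minimal, hedged sketch (the file name and column names below are hypothetical assumptions, not the portal’s actual schema), suppose you export that category to CSV and filter it yourself:

```python
# Hypothetical sketch: filter an exported "Public Order and Safety" CSV
# for kidnapping indicators. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("public_order_and_safety.csv")  # hypothetical export

kidnap = df[
    df["indicator"].str.contains("kidnap", case=False, na=False)
    & (df["year"] >= 2006)  # the portal's series appear to start around 2006
]
print(kidnap.groupby("year")["value"].sum())
```

Even a snippet like this, published alongside the claim, lets readers see exactly which query came up empty.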

Likewise, if a trend in one data set is notable, particularly a geographic density of “events” on the map, it’s worth looking at other data to supplement your assumptions.

One possible explanation is the region’s oil wealth, otherwise known as the curse of the black gold. The United Nations news service has also highlighted how oil extraction in the south of Nigeria has been accompanied by violence and criminality.

If a relationship to oil by region is of interest, Prognoz has data for that (Macro-Economic Data > Petroleum), or maybe there’s another relationship to geography worth exploring: topography, environmental influences. A comparative analysis with other mapping projects devoted to those data, like Oil Spill Monitor – Nigeria, or with flood tracking and standing water in the regions where 538 notes a density of kidnappings, would be of interest. Are there other geographic factors affecting crises worth exploring?

There’s value in layering data sets and comparisons across mainstream and social media; the real value of journalism’s take on these data is the comparative perspective it can provide, recognizing the weaknesses between data sets and using them to crosscheck each other rather than only “normalizing” to control for error in one set.

This is a somewhat crude calculation. We’re counting all geolocated kidnappings in the GDELT database since 1982 and dividing that by each state’s current population.

So, does that mean that the current population in a region was the denominator for that division across all decades (because at the time of this post, the population link provided in the post doesn’t load)? Where is the data? How can people access it? Can I get a tooltip with counts and calcs in the time series? (Pretty sure CartoDB supports this; I mean, really, man.)
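
To make the objection concrete, here’s roughly the shape of that calculation, as a hedged sketch (hypothetical file and column names, not 538’s actual pipeline), and why a static denominator is crude:

```python
# Hedged sketch of the "crude calculation" 538 describes; file and
# column names are hypothetical assumptions, not their actual code.
import pandas as pd

events = pd.read_csv("gdelt_kidnappings_geolocated.csv")  # one row per event, 1982+
pop = pd.read_csv("nigeria_state_population.csv")         # *current* population per state

counts = events.groupby("state").size().rename("kidnappings_since_1982")
per_capita = counts / pop.set_index("state")["population"]

# The crudeness: dividing three decades of counts by today's population
# inflates rates in slow-growing states and deflates them in fast-growing
# ones. A per-period denominator (decade counts / decade population)
# would be less crude.
print(per_capita.sort_values(ascending=False).head())
```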

ISSUE 2: DATA BULLETPROOFING

This is the predominant criticism in both rebuttals; the refrain for all journo projects is a pretty neat alliterative philosophy: check, compare, contextualize.

Validate Your Data

As this has been well-covered by the other critics and is a pretty well-documented challenge in journalism (see: “verification by replication,” scientific-method-style), I won’t belabor it here. Qualified outfits have written impressive how-tos (like this awesome one from ProPublica), though the process for bullet-proofing each piece is usually custom. There are also papers and projects like the Data Verification Handbook, and applications like Twittcred and Storyful aimed at affirming social media.

Early in this bullet-proofing process, it’s also helpful to take a look at comparable projects and use them to illustrate why your analysis is distinct and how it fills a gap. Nigeria Security Tracker has also mapped violence and fatalities in a time series; Nigeria Watch provides a database of violence trends as well; and there are other authoritative, georeferenced event datasets with downloads worth querying against to better verify GDELT.
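
What might that querying-against look like? A minimal sketch, assuming both sources can be exported as CSVs of dated incidents (file and column names are assumptions):

```python
# Hedged sketch: compare monthly GDELT mention counts against an
# independent tracker's incident counts. File/column names are assumed.
import pandas as pd

gdelt = pd.read_csv("gdelt_nigeria_kidnappings.csv", parse_dates=["date"])
tracker = pd.read_csv("security_tracker_incidents.csv", parse_dates=["date"])

g = gdelt.set_index("date").resample("M").size().rename("gdelt_mentions")
t = tracker.set_index("date").resample("M").size().rename("tracker_incidents")

both = pd.concat([g, t], axis=1).fillna(0)
print(both.corr())  # a weak correlation is a red flag for "mentions = events"
```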

ISSUE 3: SOCIAL OR SECONDARY MEDIA AS SOURCE

Lastly, and predictably, there are always hiccups when plotting social and secondary media accounts as events.

what GDELT *will* tell you

Analytics on postings and general media circulation can be valuable for viewing the conversation around a topic online, but they can also be speciously spun to represent the density of actual crises or activity in an area. Counting the tweets related to #nigeria isn’t entirely useful for modeling a threat without filters or ways to validate those postings. Even GDELT, in its ambitious program “to provide the global research community with its first open global multi-decade quantitative database of human society,” is still researching how to best verify social data.
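
As a toy illustration of why raw counts mislead (the postings and the boolean validation flag here are invented for the example, not a real verification method):

```python
# Toy example: raw hashtag volume vs. a (placeholder) validation filter.
posts = [
    {"text": "Kidnapping reported near Chibok #nigeria", "has_source_link": True},
    {"text": "Thoughts and prayers #nigeria", "has_source_link": False},
    {"text": "RT so everyone sees this #nigeria", "has_source_link": False},
]

naive = sum("#nigeria" in p["text"] for p in posts)
filtered = sum("#nigeria" in p["text"] and p["has_source_link"] for p in posts)
print(naive, filtered)  # 3 vs. 1: volume overstates verifiable "events"
```

Real verification is far harder than a boolean flag, which is exactly the research problem GDELT describes.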

Let’s look at a more general example of mapping data. GDELT represents media activity around topics, much as Google Trends represents search activity on topics, but both can be confused with representing incidents. In the latter case, examples of secondary-source and interpretive fumbles abound.

Take Flu Trends:

[Image: Google Flu Trends graph]

or this Google Trends graph of a few JS libraries (note the rise of AngularJS in recent times):

[Chart: Angular vs. all other JS]

What these graphs illustrate is not an actual density of flu incidents or a spike in public interest in AngularJS, but rather the number of searches related to incidents, and perhaps public confusion about AngularJS. People who have the flu might also go straight to the doctor and not google it; people who understand and appreciate Angular are perhaps unlikely to google for Stack Overflow. Media discussion or focus on a topic does not always/often equate with actual activity, though the two are sometimes conflated.

Just as there’s a tendency to consider a social media campaign as solely-sufficient involvement in a crisis situation, there’s a tendency to tap a feed aggregation or media API as an authoritative representation of actual events. The distinction between social and mainstream media fuzzes when mainstream relies on social or secondary media as data, which is the problem in the 538 case: the piece analyzes an aggregation feed of secondary media accounts of events.

Often, social media is incredibly powerful for plotting the general conversation about a topic (I’m looking at you, Westgate twitter tracking). Some of the most positive reactions to this crisis have been piloted by social media (#BringBackOurGirls), whose practical impact can be limited but which can be epic as an indictment of what the government and mainstream media are doing comparatively. There’s little that’s more shameful in our digital world than having your government and formal press upstaged by hipster hashtag advocacy. That’s not to say, certainly, that these campaigns aren’t subject to their own epic blunders of failed verification (see: #yikes).

But beyond press campaigns and historical analyses of population/kidnapping trends, projects that pull in crowdsourced data are impressively valuable for soliciting first-person information and sparking citizen-driven initiatives; Reuters’ blog just covered a bunch of them as relevant to the plight of Nigeria’s current victims. Ushahidi, for example, uses crowdsourced first-person reports that have been subcategorized and mapped by the admins of each instance’s deployment. It’s not a perfect representation of conflict, and it certainly has its limitations, but it is a distributed first-person reporting mechanism that can track violence relative to a geographic location, depending on how the instance is customized. Secondary processors of this information can add a layer of interpretive error that weakens the integrity of the sources, if only by failing to admit their fallibilities. There are several Ushahidi projects that track violence in Nigeria, each with its own foci and categorization schema (distinguishing between “trusted”/“verified” reports and public feeds), like Niger Delta Watch, Extrajudicial Killings – Nigeria, and Stop the Bribes, all of which provide first-person accounts of violence mapped to regions in Nigeria.
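
For the curious, classic Ushahidi deployments expose their reports over a simple HTTP API; a hedged sketch follows (the deployment URL is hypothetical, and the endpoint and JSON field names vary by platform version, so treat both as assumptions):

```python
# Hedged sketch: pull reports from a classic (2.x-era) Ushahidi deployment.
# The deployment URL is hypothetical; endpoint and field names vary by version.
import json
from urllib.request import urlopen

DEPLOY = "https://example-deployment.crowdmap.com"  # hypothetical instance
url = f"{DEPLOY}/api?task=incidents&by=all&resp=json"

with urlopen(url) as resp:
    payload = json.load(resp)

for item in payload.get("payload", {}).get("incidents", []):
    inc = item.get("incident", {})
    print(inc.get("incidenttitle"), inc.get("incidentdate"), inc.get("locationname"))
```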

No one is perfect all of the time, or capable of pleasing all the people, certainly. GDELT is an imperfect source for most things beyond tracking media reaction, so it fails in this effort to echo its output back as event data (see Source). However, media reaction is still interesting for other analyses, hence the media reaction to these maps; the integrity of a news organization and its output of (even aggregated) content is still worth indexing.

EOD, the ethics and best practices of data journalism haven’t been adequately codified for these kinds of stories. At last year’s Highway Africa conference, Peter Horrock (BBC) talked about the best indices of quality media covering Africa being somewhere at the intersection of how an organization covers domestic events and how it covers its mistakes (see his full talk here). In this latter case, media reaction is important, if for a different reason. We’ll see how 538 reacts, and maybe learn something about how to manage future code-fumbles. I’m looking forward to more verification protocols: representational integrity, data bulletproofing, and secondary sourc-ery 😉 </ERROR>

* Thanks to J. Morgan, E. Constantaras, and J. Rotich for contributing data, time, and thoughts to this post.
