A Crash Course in Data Journalism

Some very rough notes about data journalism, with lots of links to get you started.

Data journalism

Data is more than just numbers. In its broadest definition, it’s anything you can store in a computer. This could be the traditional tables or spreadsheets, but also documents, class schedules, the text of laws… any type of information you could use a computer to analyze.

So what is data journalism? One definition that people seem to quote comes from a post I wrote a couple of years ago:

“Data journalism is obtaining, reporting on, curating and publishing data in the public interest.”

Recommended resources:

Data journalism also dovetails closely with the Open Government movement.

Some Examples

History
Data journalism draws on many fields, such as social science, statistics, computer programming, and visualization. And though there has been an explosion of work in the last few years, it’s actually been around for quite a long time. There are two big strands of work within the journalism community that are worth knowing about in particular:

Visualization
Why visualize data at all? It’s so we can use our eyes to see patterns that might not be visible in the numbers. A famous example of the power of visualization is Anscombe’s quartet. But not all data journalism involves a visualization. Sometimes a simple table, or even a single number or an annotated document, is all you need to communicate your story.
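To see why Anscombe’s quartet makes the case for plotting, here is a minimal Python sketch that computes the usual summary statistics for its four datasets. The helper names (pearson, summarize, SUMMARIES) are my own for illustration; the values are the ones published in Anscombe’s 1973 paper. The summaries come out essentially identical, even though the four scatterplots look nothing alike.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summarize(xs, ys):
    """Summary statistics rounded to two decimals, as a table would show them."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, stats)
```

Rounded to two decimals, all four rows of output match. Only a chart shows that one dataset is a noisy line, one is a curve, and two are dominated by a single outlier.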

Narrative
The point of data journalism is a story. Why am I looking at this data? What does it mean for me? There has to be a reason. Often the “story” is a combination of large-scale information, like the US unemployment rate, and the ability to look up personalized information, as in the New York Times’ Jobless Rate for People Like You.

Also, data isn’t “objective” any more than any other source. What the story “says” depends on the choices made. Consider how simple changes in color palette can lead to different interpretations of the same data.

Data sourcing
Data doesn’t appear magically. It was collected or created by some person or organization for some purpose, using some method. You have to understand where your data comes from if you want to understand the stories it can and cannot tell. A famous example is the collection (or non-collection!) of crime statistics.

Getting Started: Google Fusion Tables
This is a free and very powerful tool that can be used to create all sorts of visualizations and maps. It’s used by many newsrooms.

Example data set: U.S. oil production. I downloaded the Excel file on that page, re-saved the data as a .CSV file, and then opened the CSV in Fusion Tables. Here’s the resulting table, and a simple visualization created using Fusion Tables’ embed feature:

But Fusion Tables can do much more complex things, such as visualize multiple variables, filter and aggregate data, and merge (“fuse”) multiple data sets. And it has a built-in zoomable world map, for visualizing geographic data! Here’s how to make maps with Fusion Tables:

My other data journalism power tool is the humble spreadsheet, for basic calculations and manipulation of tabular data. Excel works fine, or you can use Google Spreadsheets. There are lots of sexy tools out there, but probably 80% of data journalism work can be done with a spreadsheet.
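If you would rather script this kind of tabular work than click through it, here is a rough Python sketch of the basic operations: filtering rows, totaling by group, merging two tables on a shared column, and computing a percent change. Everything in it is an assumption for illustration. The function names are mine, and the rows are invented toy values, not real figures.

```python
from collections import defaultdict

# Invented toy tables (not real data): one row per region per year.
OUTPUT = [
    {"region": "A", "year": 2010, "units": 120},
    {"region": "A", "year": 2011, "units": 150},
    {"region": "B", "year": 2010, "units": 80},
    {"region": "B", "year": 2011, "units": 60},
]
PEOPLE = [
    {"region": "A", "people": 30},
    {"region": "B", "people": 20},
]


def filter_rows(rows, **conditions):
    """Keep rows whose columns equal the given values (like a sheet filter)."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]


def total_by(rows, key, value):
    """Sum one column within each group (like a simple pivot table)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def merge(left, right, key):
    """Join two tables on a shared column, dropping unmatched left rows."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def percent_change(old, new):
    """Percent change from old to new."""
    return (new - old) / old * 100


if __name__ == "__main__":
    print(filter_rows(OUTPUT, year=2011))
    print(total_by(OUTPUT, "region", "units"))
    print(merge(OUTPUT, PEOPLE, "region"))
    print(percent_change(120, 150))
```

Each function maps onto something you would do by hand in a spreadsheet, which is the point: the tool matters less than knowing which of these operations your story needs.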

Some interesting data sources 

The Overview Project
This is my project at the AP to build an open-source system to analyze very large sets of documents — like the 90,000 Iraq war reports released by Wikileaks, or 4,500 pages of declassified documents from the U.S. State Department. We can use text visualization techniques to report on these kinds of large document sets, even if we can’t read every page. See this video about how the system works.