Some very rough notes about data journalism, with lots of links to get you started.
Data journalism
Data is more than just numbers. In its broadest definition, it’s anything you can store in a computer. This could be the traditional tables or spreadsheets, but also documents, class schedules, the text of laws… any type of information you could use a computer to analyze.
So what is data journalism? One definition that people seem to quote, from a post I wrote a couple years ago:
“Data journalism is obtaining, reporting on, curating and publishing data in the public interest.”
Recommended resources:
- The Data Journalism Handbook - the first comprehensive textbook. Free, online, and excellent!
- Data journalism at the Guardian: what is it and how do we do it? - great intro article by an outstanding data journalist
- Journalism in the Age of Data – an hour-long documentary
- In the age of big data, data journalism has profound importance for society - excellent discussion, including history of the field
Data journalism also dovetails closely with the Open Government movement.
Some Examples
- Spending review - visualization of UK budget cuts/increases from the BBC
- Four ways to Slice Obama’s 2013 Budget – US budget visualization by the New York Times
- Wikileaks War Logs, Every death mapped – Guardian
- US Public Schools: the opportunity gap - a comparison of all US schools, by ProPublica
- Gay Marriage laws by state – history of legislation, visualized, by AP Interactive
- ManyBills - Legislative text analysis and visualization by IBM
History
Data journalism draws on many fields, such as social science, statistics, computer programming, and visualization. And though there has been an explosion of work in the last few years, it’s actually been around for quite a long time. There are two big strands of work within the journalism community that are worth knowing about in particular:
- Precision journalism - a combination of journalism and quantitative social science, from the 1960s and 70s
- Computer-assisted Reporting - the community of data-savvy investigative journalists that formed in the 1980s and 90s.
Visualization
Why visualize data at all? It’s so we can use our eyes to see patterns that might not be visible in the numbers. A famous example of the power of visualization is Anscombe’s quartet: four small data sets with nearly identical summary statistics that look completely different when plotted. But not all data journalism involves a visualization. Sometimes a simple table, or even a single number or an annotated document, is all you need to communicate your story.
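To see why Anscombe’s quartet is such a good argument for plotting, here is a minimal sketch in Python that computes the usual summary numbers for each of the four data sets. The values are the standard published quartet; the names `ANSCOMBE`, `pearson` and `summarize` are my own, made up for this illustration.

```python
from math import sqrt
from statistics import mean

# The four (x, y) data sets from Anscombe's quartet.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Return (mean of x, mean of y, correlation), rounded to 2 decimals."""
    return round(mean(xs), 2), round(mean(ys), 2), round(pearson(xs, ys), 2)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, summary in summaries.items():
    print(name, summary)
```

All four rows come out essentially the same, which is exactly the trap: the numbers alone give no hint that one set is a curve and another is dominated by a single outlier. You only see that when you draw the scatter plots.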
Narrative
The point of data journalism is a story. Why am I looking at this data? What does it mean for me? There has to be a reason. Often the “story” is a combination of large-scale information, like the US unemployment rate, and the ability to look up personalized information, as in the New York Times’ Jobless Rate for People Like You.
Also, data isn’t “objective” any more than any other source. What the story “says” depends on the choices made. Consider how simple changes in color palette can lead to different interpretations of the same data.
Data sourcing
Data doesn’t appear magically. It was collected or created by some person or organization for some purpose, using some method. You have to understand where your data comes from if you want to understand the stories it can and cannot tell. A famous example is the collection (or non-collection!) of crime statistics.
Getting Started: Google Fusion Tables
This is a free and very powerful tool that can be used to create all sorts of visualizations and maps. It’s used by many newsrooms.
Example data set: U.S. oil production. I downloaded the Excel file on that page, re-saved the data as a .CSV file, and then opened the CSV in Fusion Tables. Here’s the resulting table, and a simple visualization created using Fusion Tables’ embed feature:
But Fusion Tables can do much more complex things, such as visualize multiple variables, filter and aggregate data, and merge (“fuse”) multiple data sets. And it has a built-in zoomable world map, for visualizing geographic data! Here’s how to make maps with Fusion Tables:
- Video tutorial
- detailed walkthrough, with sample files
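If "filter, aggregate, and merge" sounds abstract, here is a rough sketch of what those three operations mean, written in plain Python rather than Fusion Tables. This is not how Fusion Tables works internally, just the same ideas in miniature. The rows, column names and helper functions below are all invented placeholders for illustration, not real statistics.

```python
from collections import defaultdict

# Made-up placeholder rows, purely for illustration (not real data).
production = [
    {"region": "A", "year": 2010, "barrels": 120},
    {"region": "A", "year": 2011, "barrels": 130},
    {"region": "B", "year": 2010, "barrels": 80},
    {"region": "B", "year": 2011, "barrels": 70},
]
population = [
    {"region": "A", "people": 50},
    {"region": "B", "people": 25},
]


def filter_rows(rows, year):
    """Filter: keep only the rows for one year."""
    return [row for row in rows if row["year"] == year]


def total_by_region(rows):
    """Aggregate: sum barrels across all years for each region."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["barrels"]
    return dict(totals)


def merge_on_region(totals, lookup_rows):
    """Merge ("fuse"): attach population to each region's total,
    matching the two tables on their shared "region" column."""
    people = {row["region"]: row["people"] for row in lookup_rows}
    return [
        {"region": region, "barrels": barrels, "people": people[region]}
        for region, barrels in totals.items()
        if region in people
    ]


only_2010 = filter_rows(production, 2010)
totals = total_by_region(production)
merged = merge_on_region(totals, population)
print(merged)
```

The important part is the shared column. A merge only works if both tables describe the same things (here, regions) using the same names, which is why so much data work is really cleaning up keys before you can combine anything.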
Some interesting data sources
The Overview Project
This is my project at the AP to build an open-source system to analyze very large sets of documents — like the 90,000 Iraq war reports released by Wikileaks, or 4,500 pages of declassified documents from the U.S. State Department. We can use text visualization techniques to report on these kinds of large document sets, even if we can’t read every page. See this video about how the system works.