A Crash Course in Data Journalism

Some very rough notes about data journalism, with lots of links to get you started.

Data journalism

Data is more than just numbers. In its broadest definition, it’s anything you can store in a computer. This could be the traditional tables or spreadsheets, but also documents, class schedules, the text of laws… any type of information you could use a computer to analyze.

So what is data journalism? One definition that people seem to quote, from a post I wrote a couple years ago:

“Data journalism is obtaining, reporting on, curating and publishing data in the public interest.”

Recommended resources:

Data journalism also dovetails closely with the Open Government movement.

Some Examples

History
Data journalism draws on many fields, such as social science, statistics, computer programming, and visualization. And though there has been an explosion of work in the last few years, it’s actually been around for quite a long time. There are two big strands of work within the journalism community that are worth knowing about in particular:

Visualization
Why visualize data at all? It’s so we can use our eyes to see patterns that might not be visible in the numbers. A famous example of the power of visualization is Anscombe’s quartet: four small data sets with nearly identical summary statistics that look completely different when plotted. But not all data journalism involves a visualization. Sometimes a simple table, or even a single number or an annotated document, is all you need to communicate your story.
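To see why Anscombe’s quartet is such a good argument for plotting, here is a minimal Python sketch that computes the mean and correlation for each of the four data sets. The values are the ones commonly published for the quartet; the function names (`mean`, `pearson_r`, `summarize`) are my own and only for illustration.

```python
from math import sqrt

# Anscombe's quartet, values as commonly published (Anscombe, 1973).
# Sets I-III share the same x values; set IV has its own.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def summarize(quartet):
    """Return {name: (mean of x, mean of y, correlation)} for each data set."""
    return {
        name: (mean(xs), mean(ys), pearson_r(xs, ys))
        for name, (xs, ys) in quartet.items()
    }


if __name__ == "__main__":
    for name, (mx, my, r) in summarize(QUARTET).items():
        print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four lines of output are almost indistinguishable, which is the whole point: the differences only show up once you draw the scatter plots.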

Narrative
The point of data journalism is a story. Why am I looking at this data? What does it mean for me? There has to be a reason. Often the “story” is a combination of large-scale information, like the US unemployment rate, and the ability to look up personalized information, as in the New York Times’ Jobless Rate for People Like You.

Also, data isn’t “objective” any more than any other source. What the story “says” depends on the choices made. Consider how simple changes in color palette can lead to different interpretations of the same data.

Data sourcing
Data doesn’t appear magically. It was collected or created by some person or organization for some purpose, using some method. You have to understand where your data comes from if you want to understand the stories it can and cannot tell. A famous example is the collection (or non-collection!) of crime statistics.

Getting Started: Google Fusion Tables
This is a free and very powerful tool that can be used to create all sorts of visualizations and maps. It’s used by many newsrooms.

Example data set: U.S. oil production. I downloaded the Excel file on that page, re-saved the data as a .CSV file, and then opened the CSV in Fusion Tables. Here’s the resulting table, and a simple visualization created using Fusion Tables’ embed feature:

But Fusion Tables can do much more complex things, such as visualize multiple variables, filter and aggregate data, and merge (“fuse”) multiple data sets. And it has a built-in zoomable world map, for visualizing geographic data! Here’s how to make maps with Fusion Tables:
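If it helps to see what “filter”, “aggregate” and “merge” actually mean, here is a rough Python sketch of the same three operations on CSV data. This is not how Fusion Tables works internally and it does not use any Fusion Tables API; it is only an illustration using the standard library. The column names and every number in the two tiny tables are placeholders I made up, not real production or population figures.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows, invented purely for illustration (not real figures).
PRODUCTION_CSV = """state,year,barrels
A,2009,100
A,2010,120
B,2009,50
B,2010,40
"""

POPULATION_CSV = """state,population
A,1000
B,500
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, year):
    """Filter: keep only the rows for one year."""
    return [row for row in rows if row["year"] == year]


def total_by_state(rows):
    """Aggregate: sum barrels for each state."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["state"]] += int(row["barrels"])
    return dict(totals)


def merge_on_state(totals, population_rows):
    """Merge ("fuse"): attach population to each state's total, matching on state."""
    population = {row["state"]: int(row["population"]) for row in population_rows}
    return [
        {"state": state, "barrels": barrels, "population": population[state]}
        for state, barrels in sorted(totals.items())
        if state in population
    ]


if __name__ == "__main__":
    rows = read_rows(PRODUCTION_CSV)
    print(filter_rows(rows, "2010"))
    print(merge_on_state(total_by_state(rows), read_rows(POPULATION_CSV)))
```

The merge step is the one that trips people up: it only works if both tables share a column (here `state`) whose values match exactly.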

My other data journalism power tool is the humble spreadsheet, for basic calculations and manipulation of tabular data. Excel works fine, or you can use Google Spreadsheets. There are lots of sexy tools out there, but probably 80% of data journalism work can be done with a spreadsheet.

Some interesting data sources 

The Overview Project
This is my project at the AP to build an open-source system to analyze very large sets of documents — like the 90,000 Iraq war reports released by Wikileaks, or 4,500 pages of declassified documents from the U.S. State Department. We can use text visualization techniques to report on these kinds of large document sets, even if we can’t read every page. See this video about how the system works.

Does State Income Affect NFL Revenue?

Looking at various sets of data on The Guardian Data Blog, we found a table portraying information about NFL and MLB revenue. We looked at a table that showed average annual player pay and started to discuss which teams make the most money, and which factors play a role in the success of a team. We had several ideas and eventually ended up asking whether average state income affects NFL revenue and ticket sales for various teams, and, if it does, how big that effect is.

Ultimately we found that varying levels of income per state do not affect ticket sales or revenue of teams for the most part. Our “Y” axis portrays the money that various teams make per year (in millions) and the “X” axis shows average income rates per state (in thousands) of the states that have football teams.
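One way to put a number on “how much does income matter” is to fit a straight line through the scatter plot and look at its slope. Here is a minimal sketch of an ordinary least-squares fit in plain Python. The function name `fit_line` is hypothetical, and the sample values in the usage section are placeholders I invented, not the figures from the Guardian table.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    xs: values for the X axis (e.g. average state income, in thousands)
    ys: values for the Y axis (e.g. team revenue, in millions)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross-deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Placeholder values for illustration only, not real NFL or income data.
    incomes = [40, 45, 50, 55]
    revenues = [230, 228, 231, 229]
    slope, intercept = fit_line(incomes, revenues)
    print(f"revenue changes by about {slope:.2f} million per extra thousand of income")
```

A slope near zero is what “income does not affect revenue” looks like in numbers. A fitted line describes association only, so it can’t tell you why the relationship is flat.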

Since this conclusion isn’t extremely shocking, we started to ask why the NFL shares revenue and MLB doesn’t. The NFL has shared revenue (meaning that the money is spread out across the league), while MLB has unshared revenue, so its richer teams have the capability of outspending other teams (to get better players, pay them more, etc.). This also sparked a discussion about how certain MLB teams (like the Yankees) can continue to be great teams with the best players because they have the capability of basically buying quality. On the other end of the spectrum, bad teams will continue to be bad as the cycle of money making (or lack thereof) continues.
The more interesting thing to look at would have been what teams have won national championships, how many times they’ve won, and how much money they make per year. Ultimately it would show that the same teams keep succeeding or failing because of monetary reasons – again not incredibly surprising but interesting to say the least.

Bicycle Commuting Vs. Reported Bicycle Thefts in Oregon

This is interesting. Look. Between 2005 and 2009, the number of Oregon residents biking to work increased nearly 60 percent, as reported by the US Census Bureau. During this same period, bike thefts in Oregon, as reported by the state government, went down. They went down a lot actually, by nearly 40 percent.
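For anyone checking figures like these, percent change is just the difference divided by the starting value. A quick sketch follows; the function name is my own, and the counts in the usage lines are hypothetical round numbers chosen only to show the arithmetic, not the actual Census or State of Oregon figures.

```python
def percent_change(old, new):
    """Percent change from old to new; positive means an increase."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(f"commuters: {percent_change(1000, 1600):+.0f}%")
    print(f"thefts:    {percent_change(1000, 600):+.0f}%")
```

Always divide by the earlier value, not the later one. Mixing the two up is one of the most common errors in reporting on changes over time.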

So you have more people riding their bikes to work more regularly in Oregon, but less bike theft. Now this is just over time. If we look at a scatter plot of commuters compared to reported bike thefts, then we see that the more commuters there are, the fewer bike thefts are reported.

What?

We’re not sure why this trend exists in the data, but it does beg an explanation.

One of the possible flaws in this analysis is that we’re using data from two different sources, the State of Oregon and U.S. Census Bureau.

Another weakness is that both of these sources are conservative in their reporting. The Census data on bicycle commuting underreports the number of commuters because the Bureau classifies bicycle commuters as people who ride their bike three or more days per week. That means if someone rides their bicycle to work twice a week, they are not considered a bicycle commuter. As for bicycle thefts, many stolen bicycles are never reported to the police.

To recap, what we have are hard data as reported by government sources on the number of working people commuting by bike per year and the number of reported bike thefts per year. What’s unexpected is that these data seem to have an inverse relationship. We would expect bike thefts to increase with bike use, but apparently not in Oregon.

 

References:

oil

This is a bit of Google Fusion and iframe practice. CSV is our friend. WordPress takes a little coaxing.

 

I love oil!