Often the data that you get needs cleaning. Misspellings and data entry errors need to be fixed. Open Refine is a powerful tool that can clean datasets quickly. It also allows you to look at your data in different ways, filtering the categories. These different filters are known as “facets” in Open Refine.
If you need a refresher on Refine, you can walk back through the exercise we led in class (below), or try one of these tutorials/resources:
- Propublica’s primer on Open Refine: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
- Provincial Hosplitals in Zimbabwe from Africa Open Data
Good examples of the kinds of stories you can write with this data, from Gothamist Shady Groups Spend “Unprecedented” Amount Of Cash In Mayoral Election and from The New York Times, Loophole in a Rule on Ad Spending, Group Financed by Business Leaders Has Put Nearly $7 Million Into Council Races. The Times also built a nice interactive guide toHow Much the N.Y.C. Mayoral Candidates Have Raised and Spent — they haven’t updated it since September 2, and it just goes to show that it takes more than money to win an NYC primary race, but you’ve got the data.
Let’s walk through our class exercise using the NYC Mayoral campaign contributions. First, download the data from the NYC Campaign Finance Board.
- Choose 2013 Citywide elections
- Choose Bill deBlasio, candidate for Mayor
- Click Search
- Click Download Excel file button
- Launch Google Refine, which is a separate application. Refine opens in a browser. You can use the URL (127.0.0.1:3333) in another browser, if you wish. Choose Create Project > This Computer if you have your dataset on your hard drive, or you can choose Create Project > Google Data if you uploaded your dataset in a Google Spreadsheet (Refine will ask permission to access your Google Drive account). Click Next.
- Refine loads a preview of your dataset. The default settings for most cases should be fine. Click Create Project on the upper right corner.
- This dataset, from Bill DeBlasio’s campaign contributions, has 4510 entries. Those represent 4510 rows of data. Refine is only showing the first 10 rows, but you can choose to display more or less. On the far right side of the interface, you can display more rows.
- Use Facets to filter your data. At the top of the column for “Occupation”, click on the arrow and choose Facet > Text Facet.
- The Occupation facet appears in a window on the left. This lists all the unique entries in the Occupation column. Note that you can sort the entries by name (which lists the occupations alphabetically), or by count (which lists the occupations from highest to lowest). The most donors for deBlasio identified themselves as “Attorney” (there are 455 of them).
- Click on Cluster. Refine opens a window that shows similar entries that could potentially refer to the same value. The Method and Keying function are different options for the kind of algorithm that Refine uses to discover similarities. Notice that there are three similar entries–“Attorney”, “attorney”, and “ATTORNEY”. All three are the same, so you’ll merge them, and the other similar entries. Review all the similarly clustered entries. You can select each group separately, or choose Select All at the bottom of the window. Choose Merge Selected & Re-Cluster.
- Close the Cluster dialog box. You can try clustering with the other methods to see if you can catch other errors. For now, look at the results of your clustering. There are now more Attorneys listed in the Occupation facet.
8. See how Attorney and Lawyer are two separate categories? Those should be merged as well. You can manually merge a category with another by moving your mouse over the “Lawyer” category, and then clicking the Edit link next to it in the Facet box.
Replace Lawyer with Attorney, and click Apply. All the Lawyer entries will be merged with the Attorney entries.
- Now click on the Include link next to Attorney. Google Refine filters the dataset in the spreadsheet so you’re just seeing all the attorneys.
- Try other facets. You can choose Facet > Numeric facet for number data, or Facet > Timeline facet for date data. You’ll get another Facet window on the left where you can move the minimum and maximum brackets to filter the histogram (the distribution of values). For example, you can filter your spreadsheet to just see contributions over a certain amount.
- If a Facet doesn’t work, it may be that your column isn’t formatted correctly. For example, Refine may not know that the Date column holds date data. From the top of the column, choose Edit Cells > Common Transforms > To Date to change the values in that column to date data.
- To get rid of a Facet on the left, just click on the “X”, which closes the box.
- There are other options manipulate your spreadsheet (for example, split a column based on a delimiter), but you can also do that in Excel or Google Spreadsheets. I suggest you use Google Refine for what it does best–cleaning messy data and filtering data with Facets.
- When you want to export the filtered data as a new spreadsheet, choose Export > Comma Separated Value.