Introducing a new source for California elections data
How we took CAL-ACCESS to the cleaners
After years of work, the Coalition is excited to release nearly two decades’ worth of data on California elections. The information, blocked from public release by state officials, is now published daily here on this site in open formats and according to a new open-source standard.
This marks a major milestone in the Coalition’s effort to make it easier for reporters and researchers to explore the role of money in California politics. The new data files catalog every candidate, ballot measure and election found in the jumbled, dirty and difficult government database tracking money in state politics.
You can find the new data on our revamped download page, where it will be joined by a second, expanded series of files in the coming months.
How we got the data
Our original source is CAL-ACCESS, the California state government’s system for tracking the money political campaigns raise and spend on elections.
While containing some useful information, the bulk export of CAL-ACCESS data released by Secretary of State Alex Padilla does not include coherent and complete lists of elections, races, public offices, candidates or ballot measures.
To be clear, this information does reside in CAL-ACCESS. It is collected by the Secretary of State’s office, displayed on its website and outlined in its official database schema.
But when we asked Padilla’s office to include it in their bulk data release, they said “no.”
That left us with only one option: scraping it off the state’s site.
The Coalition’s student developer, Sahil Chinoy, was up to the task. He expanded on earlier contributions from an enterprising group of OpenNews fellows to train a computer script to navigate through the CAL-ACCESS website and parse out the essential data.
Chinoy’s work is now integrated into our open-source data pipeline and also available as a stand-alone application for the Django web framework. Anyone can download the package from PyPI, plug it into their project, read our docs and scrape away.
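For readers who have never written a scraper, here is a minimal sketch of the general idea: read a page's HTML and pull the rows of a table into structured records. It is not the code in our pipeline. The sample markup, the TableScraper class and the scrape_candidates function are all invented for illustration, and the real CAL-ACCESS pages are laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this example. It does not mirror
# the structure of any real CAL-ACCESS page.
SAMPLE_HTML = """
<table id="candidates">
  <tr><th>Candidate</th><th>Office</th></tr>
  <tr><td>Jane Doe</td><td>Governor</td></tr>
  <tr><td>John Roe</td><td>State Treasurer</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_candidates(html):
    """Return one dict per data row, keyed by the lowercased header cells."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *records = parser.rows
    keys = [name.lower() for name in header]
    return [dict(zip(keys, record)) for record in records]


candidates = scrape_candidates(SAMPLE_HTML)
for candidate in candidates:
    print(candidate["candidate"], "is running for", candidate["office"])
```

A production scraper also has to fetch pages over the network, follow links from one page to the next and cope with inconsistent markup, all of which this sketch leaves out.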
How we improved the data
Look, CAL-ACCESS is a mess. And you don’t have to take our word for it.
In a recent public filing, Padilla’s office described it as an “old,” “fragile” and “not well documented” system that “cannot be patched or modified” and is at risk of collapse.
Rather than force users to wade through its arcane data structures, we’ve modified our files to meet a new standard we authored with Open Civic Data, a community of leaders in our field aiming to define common schemas for consolidating public data.
OCD’s ranks include Forest Gregg of DataMade, James McKinney of Popolo Project and Rachel Shorey of The New York Times.
With their guidance, the Coalition’s James Gordon — that’s me — drafted a proposal outlining a new data schema for elections and related data types like candidates, contests and ballot measures. We then implemented those specs in Open Civic Data’s Django application for use in any project, including yours.
After many months of back-and-forth — and comments from our peers at Google, Socrata and elsewhere — python-opencivicdata version 2.0 was packaged and released on PyPI.
Our hope is that this work can help power other open-source projects working with similar data sets in other states and countries.
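To give a rough feel for what a schema covering elections, contests, candidates and ballot measures involves, here is a simplified sketch using plain Python dataclasses. It is an illustration only, not the Open Civic Data specification or the models in python-opencivicdata. Every class name, field name and sample value below is an assumption made up for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


# All names and fields here are hypothetical. They are not the
# actual Open Civic Data models.
@dataclass
class Candidate:
    name: str
    party: str = ""


@dataclass
class CandidateContest:
    office: str
    candidates: List[Candidate] = field(default_factory=list)


@dataclass
class BallotMeasure:
    title: str


@dataclass
class Election:
    name: str
    day: date
    contests: List[CandidateContest] = field(default_factory=list)
    measures: List[BallotMeasure] = field(default_factory=list)

    def candidate_names(self):
        """List every candidate name across all contests in this election."""
        return [
            candidate.name
            for contest in self.contests
            for candidate in contest.candidates
        ]


# Made-up sample data, for illustration only.
election = Election(
    name="Example General Election",
    day=date(2020, 1, 1),
    contests=[
        CandidateContest(
            office="Governor",
            candidates=[Candidate("Jane Doe"), Candidate("John Roe")],
        ),
    ],
    measures=[BallotMeasure(title="Example Proposition 1")],
)

print(election.candidate_names())
```

The point of a shared structure like this is that an election, its contests and its ballot measures hang together in one place, rather than being scattered across loosely related tables.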
Who is already using this data?
Early versions of these files have been put to work by reporters at the Los Angeles Times.
Maloy Moore and Ryan Menezes have used the experimental release of our software (available to everyone on GitHub) to generate a series of pieces on the millions of dollars flooding the race to be California’s next governor.
Lieutenant Governor Gavin Newsom leads the pack with more campaign contributions than all competitors combined, according to the tally in their graphic, seen above.
Their reporting has uncovered Newsom’s connection to California’s burgeoning cannabis industry, as well as his heavy support from Hollywood.
Contrary to the candidate’s environmentalist image, Times reporters have also documented how Newsom has curried favor from controversial real estate developers in San Francisco.
What we’re doing next
As a companion to our work, Abraham Epton of Socrata has submitted an OCD proposal focused on standardizing campaign finance filings across states.
Our next mission is to implement Abe’s ideas so we can churn out cleaned-up files containing the valuable data on campaign committees, contributions and expenditures now locked inside CAL-ACCESS and its Form 460 filings.
What you can do
Download our files. Play with them. See something you don’t like? Tell us about it.
Whatever addition or change to our new processed data files would make your life easier – no matter how small – we want to hear it. File a ticket, shoot us an email or find us anytime in News Nerdery’s #california-civic-data.