Package data like software, and the stories will flow like wine
A humble suggestion from your friends at the California Civic Data Coalition
The melodrama is so familiar it’s mundane. The government is asked to release an important dataset. They dither. We moan. We groan. Sometimes we sue, or even (gasp) organize. More and more, our pushback works. They make good and release the raw data, maybe even posting it online.
Next comes the tedious downloading, extracting, transforming, cleaning and exploring necessary before the creative work can begin. By the time we write the story, build the application or design the graphic, we’re mentally spent, eager to move on to the next project.
The pathetic result: All of the valuable work that prepared the data for analysis is discarded or kept locked away. Every newcomer must reinvent virtually identical tools simply to get started.
This must change. It’s a wasteful exercise so behind the times that even broadsheet newspaper reporters, a faction with revanchist delusions on par with the Putin administration, see the problem.
Open-data hackers should heed the words of Dave Guarino and work together to improve the pipeline that prepares data for meaningful analysis, crucial but unglamorous “plumbing” often overlooked in the rush to build the latest flashy user interface.
We’re here to demonstrate a way to make it happen. We call it “pluggable data.”
What we mean
If you have any experience as a developer, you’ve probably bumped into packaged software. Thousands of free and open-source libraries, typically unique to each programming language, are available for installation over the web from centralized servers. Command-line tools like pip (Python), gem (Ruby), CPAN (Perl) and npm (NodeJS) make it easy to do.
For instance, if you are a Python developer interested in trying out the requests library, installing it on your laptop is as easy as:
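```bash
# Install the requests library from the Python Package Index
$ pip install requests
```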
And now using it is only an import away.
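```python
# Fetch a page and check the response using the requests API
import requests

response = requests.get("https://www.example.com")
print(response.status_code)
```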
The concept has been expanded by web frameworks like Django to package not just freestanding utilities like requests, but entire applications that can be dropped into the framework’s rigid system and “just work.”
This approach, championed eloquently by Django leaders like James Bennett, is sometimes called “pluggable” or “reusable”, because its modular design makes it portable to a wide range of sites.
A good example is the Pinax project, which provides Django-ready components that furnish common features like comments, badges, phone confirmations and user accounts. Each contains code that builds database tables, configures administration panels and spells out application logic that can interact with users.
Our proposal is to bring the same approach to packaging data. If a series of simple installation commands can provide a Django application with everything necessary to build a social networking site, why can’t it also provide U.S. Census statistics, the massive federal database that tracks our country’s chemical polluters or something as simple as a list of every U.S. county?
How we do
With that idea in mind, a small group of programmers from the Los Angeles Times Data Desk, The Center for Investigative Reporting and Stanford’s new Computational Journalism Program met for two days last month at Mozilla offices in San Francisco.
Under the newly minted banner of the California Civic Data Coalition, we set out to package and refine raw data from CAL-ACCESS, the state of California’s campaign finance and lobbying activity database.
Thanks to a successful organizing effort last year, Secretary of State Debra Bowen committed to posting a nearly complete dump of the data online, updating it on a regular basis.
Weighing in at more than 650 megabytes, it contains 76 database tables and nearly 35 million records.
In the past, slices were only released on demand, for a small fee, via compact disc. Analysts, including one of the authors of this post, have spent months learning how to negotiate the system’s contours, overcome its quirks and grind out a story—only to abandon all of the code that made it happen when they moved on to the next topic.
Now that the dump is freely available and open to all, it offers an opportunity to pool efforts. Even though we represent rival media outlets, we’d rather compete at analyzing the data than downloading and parsing it. And we believe that the nature of open development will encourage us to write better code and documentation.
Today we’re ready to announce the release of django-calaccess-raw-data, our first pluggable Django dataset, hosted on GitHub and distributed via the Python Package Index. With a few simple commands, you can download the data, transform it into clean CSV files and then load it into a MySQL database.
Assuming you have a basic Django project already configured, here’s all it takes. First, install the pluggable application from the Python package repository.
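```bash
# The package is published under the same name as the GitHub repository
$ pip install django-calaccess-raw-data
```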
Add it to the INSTALLED_APPS list in Django’s standard settings.py file, as you would any other application.
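The app’s Python module is named calaccess_raw in our repository, so the entry looks roughly like this (if the label ever changes, the project README will have the current one):

```python
INSTALLED_APPS = (
    # ... Django's built-in apps and your own apps ...
    "calaccess_raw",  # the application shipped by django-calaccess-raw-data
)
```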
Make sure that your MySQL installation can use the brutally effective, and tragically underused, LOAD DATA INFILE command by adding the following to the database configuration also found in settings.py.
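A sketch of the relevant configuration, assuming the MySQL backend; the local_infile flag is what the MySQL driver uses to permit LOAD DATA LOCAL INFILE, and the connection details are placeholders for your own:

```python
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "calaccess",            # placeholder database name
        "USER": "your_mysql_user",      # placeholder credentials
        "PASSWORD": "your_mysql_password",
        "HOST": "localhost",
        "PORT": "3306",
        "OPTIONS": {
            # Allow the driver to issue LOAD DATA LOCAL INFILE
            "local_infile": 1,
        },
    }
}
```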
In the same file, shut off Django’s safety valve so LOAD DATA INFILE can run wild.
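Which setting counts as the safety valve isn’t spelled out here; one plausible reading is Django’s debug mode, which keeps a log of every query in memory and is an unwelcome drag on a bulk load of this size:

```python
# An assumption about the setting in question: turn off debug mode so Django
# doesn't hold on to query logs while millions of records are loaded.
DEBUG = False
```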
Sync your database to create the CAL-ACCESS tables.
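On the Django releases current as of this writing, that’s the syncdb command (later versions use migrate for the same job):

```bash
$ python manage.py syncdb
```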
And, finally, run the custom management command that will download, parse, clean and load all of the data.
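The command name below is our reading of the repository at the time of writing; run python manage.py help to confirm the current spelling:

```bash
# Downloads the CAL-ACCESS dump, cleans it into CSVs and loads the database
$ python manage.py downloadcalaccessrawdata
```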
That’s it. You now have the state’s full database, including a set of administration panels.
You could use it to track the millions of dollars flowing into this November’s governor’s race, investigate what lobbyists are up to this session at the statehouse or impress everyone by designing a sophisticated analysis that stretches back over the nearly 15 years of data in the system to quantify the influence of money in California politics.
Of course, to do any of that, you’ll need to further regroup, filter and refine the data. But at least the initial headaches are out of the way, and any work you build on top of our application could be packaged and distributed in the exact same way.
In that scheme, our raw data application is simply one of your new package’s dependencies, much in the same way that the requests library we installed earlier depends on components of urllib3.
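As a sketch of what that looks like, a downstream analysis package could declare our application in its install_requires, just as requests pins urllib3. The project name and version here are purely illustrative:

```python
# setup.py for a hypothetical package built on top of the raw data app
from setuptools import setup, find_packages

setup(
    name="calaccess-campaign-analysis",   # illustrative name, not a real package
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "django-calaccess-raw-data",      # our raw data app becomes a dependency
    ],
)
```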
An example is already taking shape in the django-calaccess-campaign-browser repository, where our team is experimenting with a set of further-refined tables and a simple web application for exploring campaign filings. It is aimed at power users, like statehouse reporters and newsroom analysts, who want a more flexible interface than the helpful but fundamentally limited closed-source sites that are now the only way to interact with this database online.
Where you come in
If you are interested in this effort and would like to contribute, here’s how you can help today.
- Download and install django-calaccess-raw-data or django-calaccess-campaign-browser. Report bugs.
- Fork our code and try to close one of the many tickets we’ve filed. If you’re knowledgeable about how CAL-ACCESS works, we need your help!
- Try to package and distribute an open data set you’ve worked with. If you write Python, the Django documentation and Scott Torborg have excellent tutorials that can introduce you to the process.
If nothing else, watch James Bennett’s excellent 2008 talk on designing reusable applications and ask yourself, every time you start a new project, how you could package it for future reuse. It’s a simple but powerful approach that has multiplied the reach and reuse of open source, and we hope can do the same for open data.