How to analyze California campaign cash in the cloud with Apache Spark and Databricks

Use the Coalition’s API to run SQL queries across millions of records from the comfort of your browser

October 24, 2016

Editor’s note: This post was written (and submitted without pay via pull request) by a volunteer to our open-source project who is also a Databricks employee.

As a past contributor to the California Civic Data Coalition, I’ve been really pleased with how much the project has progressed. It’s amazing to see the state’s campaign finance and lobbying activity data in a simple, clean format that users can immediately work with.

However, with some tables stacking up millions of rows, the data are often too big for standard tools like Microsoft Excel spreadsheets.

The result is that most beginners will need to install a series of complicated computer programming tools before they can begin any analysis.

In my experience teaching at UC Berkeley and on Udemy, I’ve observed that students often get caught up installing that kind of software on their machines, which hinders their introduction to our field.

The Databricks fix

Now I’m working for Databricks, which is aiming to solve this problem by making it easier to start and share data science work.

We’re doing it by harnessing the power of Apache Spark, a free and open-source tool for high-speed data processing created by our founders that, for the most part, is only used by super nerds.

Over the last eight months, the Databricks team has been hard at work integrating Spark into a free community edition where, thanks to the power of cloud computing, even beginners can quickly import, transform and analyze gigantic data files using only their web browser.

Take a test drive today

When the California Civic Data Coalition announced last month it was launching an API for accessing the latest data from the state’s campaign finance database, I realized that our new Databricks notebook could make it easy for users to get started querying the data without having to install any software.

The Databricks notebook is an interactive workspace — similar to the Jupyter Notebook — where you can use your browser to code and collaborate in an easy-to-use environment that leverages our powerful Spark backend. You can see numerous examples on our site, or watch the following introductory video.

What’s amazing about Spark is that it can turn any structured files into an immediately queryable database. And the CAL-ACCESS database that is cleaned and served by the California Civic Data Coalition is no different.
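As a rough illustration of that idea, here is a minimal Python sketch of loading a CSV file into Spark, registering it as a view and querying it with SQL. It is not the code from my notebook. The sample rows, the column names (payer, firm, amount) and the helper functions are all hypothetical and do not reflect the real CAL-ACCESS schema. The sketch assumes a Spark 2.0 or later session object is passed in as `spark`.

```python
import csv
import os
import tempfile

# Hypothetical sample rows. The column names are illustrative only and are
# NOT the real CAL-ACCESS schema.
SAMPLE_ROWS = [
    {"payer": "Acme Corp", "firm": "Capitol Advisors", "amount": "125000.00"},
    {"payer": "Widget Inc", "firm": "Capitol Advisors", "amount": "80000.50"},
    {"payer": "Acme Corp", "firm": "Golden State Strategies", "amount": "40000.00"},
]


def write_sample_csv(directory=None):
    """Write the sample rows to a CSV file and return the file's path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "payments.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["payer", "firm", "amount"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path


def top_payers_sql(view_name, limit=10):
    """Build a SQL query that totals payments by payer, largest first."""
    return (
        f"SELECT payer, SUM(amount) AS total "
        f"FROM {view_name} "
        f"GROUP BY payer "
        f"ORDER BY total DESC "
        f"LIMIT {int(limit)}"
    )


def query_csv_with_spark(spark, csv_path, view_name="payments"):
    """Load a CSV into Spark, register it as a temporary view and query it.

    `spark` is assumed to be an existing SparkSession (Spark 2.0 or later).
    """
    # Read the file, using the first row as column names and letting Spark
    # guess each column's type.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Expose the DataFrame to SQL under the given name.
    df.createOrReplaceTempView(view_name)
    # Run the query and bring the resulting rows back as a Python list.
    return spark.sql(top_payers_sql(view_name)).collect()
```

With PySpark installed locally, one way to try this would be to create a session with `SparkSession.builder.getOrCreate()` and call `query_csv_with_spark(spark, write_sample_csv())`. In a hosted notebook the file path would need to point to wherever the data is actually stored.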

To see it in action, all you have to do is make an account and clone my Databricks notebook that downloads data from the Coalition’s API, racks it up with Spark and starts in on analysis. Instructions on how to make your copy are here.

Then push the “run” button and the notebook will automatically download and import the state’s sprawling database. In mere minutes it will be available for exploration using SQL and other tools.

I’ve already used it to take a rough cut at the biggest lobbying payments in the data. What can you find?