Say hello to our new open-source software for loading bulk data into PostgreSQL
Take advantage of the powerful COPY command to load big CSVs into Django faster than ever
This morning we’re releasing a new open-source software library that allows users of the Django web framework to more quickly load comma-delimited data into PostgreSQL databases.
django-postgres-copy, a new wrapper on the database’s powerful COPY command, is now published on GitHub and packaged for installation in the Python Package Index. Technical documentation has been published via ReadTheDocs.
Why and what for?
Our work as data journalists often calls on us to download, clean and analyze new data sets. That means we write a load of loaders.
And since we code with Django, we write a lot of Python scripts that loop through each row in a spreadsheet and drop the data into a database.
Scripts that look something like this:
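A minimal sketch of that pattern, assuming a hypothetical Expense model with name and amount fields and a file called expenses.csv:

    import csv

    from myapp.models import Expense  # a hypothetical model for this example


    def load():
        # Step through the CSV and create one database record per row
        with open('expenses.csv') as f:
            reader = csv.DictReader(f)
            for row in reader:
                Expense.objects.create(
                    name=row['NAME'],
                    amount=row['AMOUNT'],
                )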
But if you have a big CSV, Django will rack up database queries and it can take a long time to finish.
Lucky for us, PostgreSQL has a built-in tool called COPY that can hammer data into the database with one quick query. The only problem: Django doesn’t support it.
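Under the hood, a COPY load is a single statement handed straight to the database. Here is a rough sketch of what that looks like with the psycopg2 driver, using placeholder connection settings and table and column names:

    import psycopg2

    # Placeholder connection settings; swap in your own database credentials
    conn = psycopg2.connect(dbname='mydb', user='postgres')

    with conn, conn.cursor() as cursor, open('expenses.csv') as f:
        # Stream the whole file into the table with one COPY statement
        cursor.copy_expert(
            "COPY myapp_expense (name, amount) FROM STDIN WITH CSV HEADER",
            f,
        )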
This package tries to make using COPY as easy as any other database routine accessible with Django. It is largely based on the design of the LayerMapping utility for importing geospatial data.
Once you have our code hooked up, your loader can look more like this.
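A rough sketch of the same loader rewritten around the package's CopyMapping class, again with the hypothetical Expense model; the documentation has the exact interface:

    from postgres_copy import CopyMapping

    from myapp.models import Expense  # the same hypothetical model as above


    def load():
        # Map model fields to CSV headers and let COPY do the rest
        c = CopyMapping(
            Expense,
            'expenses.csv',
            dict(name='NAME', amount='AMOUNT'),
        )
        c.save()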
That may not seem like much of a difference, but the performance gains can be huge.
How huge?
As an experiment, we tested two common CSV loading patterns against our new tool using the political campaign expenditure file published as part of the California Secretary of State’s CAL-ACCESS database, which is the main focus of our coalition.
With 4.5 million records spread across dozens of columns, the file fills 1.3 gigabytes of hard drive space. That’s not “big data” by any formal definition, but it’s big enough to bring the laptop of your average hack journalist grinding to a halt.
(This size falls into a zone we think of as “medium data”: too big and complex to be manageable in Excel, but not so big that it requires heavy-duty tools like Hadoop. In our opinion, this area is in critical need of better software to help analysts get a foothold.)
In the first test, we timed a simple loop like the one above that stepped through each row in the data file and loaded records into the database one by one. By design, Django will ferry each row into the database individually. This adds up quickly.
On our test machine, a three-year-old Lenovo ThinkPad x220, it took 1 hour and 23 minutes to complete.
In the second test, we reorganized our script to lump the rows into groups of 1,000 records. Along the way we used Django’s bulk_create command to drop those groups into the database as large batches.
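That pattern looks roughly like this sketch, again assuming the hypothetical Expense model:

    import csv

    from myapp.models import Expense  # the same hypothetical model as above


    def load():
        batch = []
        with open('expenses.csv') as f:
            reader = csv.DictReader(f)
            for row in reader:
                batch.append(Expense(name=row['NAME'], amount=row['AMOUNT']))
                # Hand the records to the database in groups of 1,000
                if len(batch) == 1000:
                    Expense.objects.bulk_create(batch)
                    batch = []
        # Load whatever is left over at the end
        if batch:
            Expense.objects.bulk_create(batch)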
By reducing the number of database queries, this second test finished in 18 minutes.
In our final test, we loaded the file using our wrapper on the COPY command. It finished in just 4 minutes and 45 seconds. Now imagine how much faster it would be if it weren’t running on a three-year-old ThinkPad.
What now?
If you’d like to try our tool out for yourself, you can install it with Python’s pip like so:
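    pip install django-postgres-copy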
To learn more about how it works, visit the technical documentation. There you’ll find a more complete explanation and information about some fancier tricks not covered here, like the ability to transform and clean data on the fly as it’s loaded into the database.
We’ve already integrated the tool into django-calaccess-raw-data, a Django app to download, extract and load campaign finance and lobbying activity data from the CAL-ACCESS database.
This project is experimental and in the early stages of development. The aim is to make it easier for others to load “medium data.” But we know our code isn’t perfect, and we want feedback from hackers like you to make it better. So break it. Please. Then let us know.