# Configuring Sumatra for Postgres

For the last few months I’ve been using Sumatra to log provenance data for simulations. It’s a really promising tool, and I’ve been hacking on it from time to time as I proceed with real research. Sumatra even has a web interface driven by Django, and it uses SQLite as the default back-end database.

One of the issues with SQLite is concurrency. With Sumatra, the issue shows up when dozens of jobs are launched simultaneously and each job has a similar lifetime. In that case most of the jobs are not recorded, and the unrecorded jobs sign off with the dreaded django.db.utils.DatabaseError: database is locked. For further discussion of this issue see the Sumatra mailing list thread on this topic.
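The locking behaviour can be reproduced without Sumatra or Django at all. The following is a minimal sketch using only Python's built-in `sqlite3` module. The table name, labels and file path are made up for illustration, and the two connections stand in for two jobs trying to record themselves at the same moment.

```python
import os
import sqlite3
import tempfile

# Throwaway database file; the name is arbitrary.
db_path = os.path.join(tempfile.mkdtemp(), "records.db")

# First "job": opens a write transaction and holds it open.
writer_a = sqlite3.connect(db_path, isolation_level=None)
writer_a.execute("CREATE TABLE record (label TEXT)")
writer_a.execute("BEGIN IMMEDIATE")
writer_a.execute("INSERT INTO record VALUES ('job-a')")

# Second "job": tries to write while the first still holds the write lock.
# timeout=0 means it gives up immediately instead of waiting.
writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
try:
    writer_b.execute("INSERT INTO record VALUES ('job-b')")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

writer_a.execute("COMMIT")
writer_a.close()
writer_b.close()
print(error)
```

SQLite allows only one writer at a time, so with dozens of jobs finishing together most of them hit this path.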

### Quick and Dirty Solution

In the short term, I decided to use a quick and dirty solution to work around this issue using file locks. I couldn’t face learning to configure Postgres as I have little to no experience with databases. The file lock solution allowed me to proceed with my work without having to learn the ins and outs of configuring Django and Postgres.

The solution required creating a decorator class (available on GitHub) that encapsulates the Python function to be logged. The two Sumatra commands that need to be protected from concurrency issues are project.add_record and project.save. Here the with statement is used to set and release the lock:

where SMTLock is defined as

The above class requires the lockfile module. The file locking mechanism worked well enough and might be a solution for those who wish to maintain a lightweight database solution with Sumatra. However, it does require making changes to the script or program for which the provenance data is being logged. This goes against the grain of the “don’t change the existing workflow” approach of Sumatra.
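The locking pattern described above can be sketched roughly as follows. This is not the SMTLock class from GitHub: it swaps the lockfile module for the standard library's `fcntl.flock` (Unix only) so that it runs without extra packages. `FileLockSketch`, `FakeProject` and the lock file path are hypothetical names; only the method names `add_record` and `save` come from the text.

```python
import fcntl
import os
import tempfile


class FileLockSketch:
    """Hypothetical stand-in for SMTLock: an exclusive lock held via a lock file."""

    def __init__(self, path):
        self.path = path
        self._handle = None

    def __enter__(self):
        self._handle = open(self.path, "a")
        # Blocks until no other process holds the lock (Unix only).
        fcntl.flock(self._handle, fcntl.LOCK_EX)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        fcntl.flock(self._handle, fcntl.LOCK_UN)
        self._handle.close()
        self._handle = None
        return False  # do not swallow exceptions raised inside the with-body


class FakeProject:
    """Placeholder with the two method names from the text; not Sumatra's class."""

    def __init__(self):
        self.records = []
        self.save_count = 0

    def add_record(self, record):
        self.records.append(record)

    def save(self):
        self.save_count += 1


project = FakeProject()
lock_path = os.path.join(tempfile.gettempdir(), "smt_sketch.lock")

with FileLockSketch(lock_path) as lock:
    # Only the two database-touching calls sit inside the lock.
    project.add_record("run-001")
    project.save()
    held_inside = lock._handle is not None
```

Keeping the locked region this small means the simulation itself still runs in parallel; only the bookkeeping is serialised.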

### Postgres

Given that the above solution is unsatisfactory, another alternative is to use a database that properly handles concurrency. To install Postgres on Ubuntu use

and then use

to set the password. Then create a Sumatra user for Postgres using

To create a database do

Exit the Postgres shell prompts and edit /etc/postgresql/9.1/main/pg_hba.conf by adding

and relaunch Postgres

If Sumatra gives errors during configuration and the database has the wrong field sizes, then you’ll need to repeat the process above to create a new database. You can delete the old database with

These instructions were snatched from iiilx’s blog. Now that Postgres is working we can move on to setting up Sumatra.

### Configuring Sumatra

Edit the Django database configuration in src/recordstore/django_store/__init__.py to swap SQLite for Postgres.
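As a rough guide, a Django database setting for Postgres generally has the shape below. The database name, user, password and host are placeholders I've made up, and how this dictionary is wired into Sumatra's `__init__.py` at this commit is not shown here. The engine string is the one Django versions of that era used; newer Django releases spell it `django.db.backends.postgresql`.

```python
# Sketch of a Django DATABASES setting pointing at Postgres.
# All values except ENGINE are hypothetical placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "sumatra_db",        # placeholder database name
        "USER": "sumatra_user",      # placeholder Postgres role
        "PASSWORD": "change-me",     # placeholder password
        "HOST": "localhost",
        "PORT": "",                  # empty string selects the default port
    }
}
```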

Ideally this would be the only change required. However, there is an issue with the field sizes in Sumatra. Using Sumatra with the above configuration results in Postgres errors of the type DatabaseError: value too long for type character varying(100). This error occurs because the field sizes have never been checked with anything but SQLite, and SQLite does not enforce size limits the way Postgres does (see this Stack Overflow thread for more details). Anyway, fixing the field size problem simply requires making a number of changes like this

in src/recordstore/django_store/models.py. There are about 10 of these altogether (see the full changeset for a complete list). The above instructions are valid as of commit d65bb4fa1f83.
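The reason the oversized values went unnoticed under SQLite can be shown with a small sketch using Python's built-in `sqlite3` module. The table and column names are made up. SQLite accepts the declared length but does not enforce it, whereas Postgres would reject the same insert with a "value too long" error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declare a column that nominally holds at most 5 characters.
conn.execute("CREATE TABLE record (label VARCHAR(5))")

# Insert a value 40 times longer than the declared limit.
long_label = "x" * 200
conn.execute("INSERT INTO record (label) VALUES (?)", (long_label,))

# SQLite stored the whole string without complaint.
stored = conn.execute("SELECT label FROM record").fetchone()[0]
print(len(stored))
conn.close()
```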

### Testing Sumatra for Concurrency

The following is a test to see that it works. Set up a trivial script.py,

and a parameter file (default.param) with

Set up a Git repository.

Set up a Sumatra repository.

Check the repository with smtweb. There should be one record. Now to test the concurrency use

Check the repository. 101 records. No concurrency issues!
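A concurrent launch like the one above can be sketched in Python as follows. The helper name `launch_concurrently` is hypothetical, and the placeholder command does nothing so that the sketch runs anywhere. For the real test it would be swapped for the Sumatra invocation, which I'm assuming looks something like `["smt", "run", "default.param"]`.

```python
import subprocess
import sys


def launch_concurrently(command, count):
    """Start `count` copies of `command` at once, then wait for all of them."""
    processes = [subprocess.Popen(command) for _ in range(count)]
    return [process.wait() for process in processes]


# Harmless placeholder command; replace it with the Sumatra run command
# (assumed, not taken from the original post) to exercise the database.
exit_codes = launch_concurrently([sys.executable, "-c", "pass"], 10)
print(exit_codes.count(0), "of", len(exit_codes), "jobs exited cleanly")
```

All processes are started before any of them is waited on, which is what makes their database writes overlap.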

### What next?

In the near future I hope to submit a patch for this and include some kind of command line configuration for Sumatra to allow easy setup. Something like this