Configuring Sumatra for Postgres

For the last few months I’ve been using Sumatra to log provenance data for simulations. It’s a really promising tool, and I’ve been hacking on it from time to time as I proceed with real research. Sumatra even has a web interface driven by Django, and it uses SQLite as the default back-end database.

One of the issues with SQLite is concurrency. The issue manifests itself with Sumatra when dozens of jobs are launched simultaneously, each with a similar lifetime. When this happens, most of the jobs are not recorded, and the unrecorded jobs sign off with the dreaded django.db.utils.DatabaseError: database is locked. For further discussion of this issue see the Sumatra mailing list thread on this topic.
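The locking behaviour is easy to reproduce with nothing but the sqlite3 module from the Python standard library. The snippet below is a standalone illustration, not Sumatra code: one connection plays the role of a running job holding the write lock, while a second job tries to record itself.

```python
import os
import sqlite3
import tempfile

# Standalone illustration of SQLite's single-writer lock; not Sumatra code.
db = os.path.join(tempfile.mkdtemp(), 'smt.db')

writer = sqlite3.connect(db, isolation_level=None)  # autocommit mode
writer.execute('CREATE TABLE records (label TEXT)')
writer.execute('BEGIN IMMEDIATE')  # take the write lock and hold it
writer.execute("INSERT INTO records VALUES ('job-1')")

# A second "job" tries to write while the lock is held; with timeout=0
# it gives up immediately instead of retrying.
other = sqlite3.connect(db, timeout=0)
try:
    other.execute("INSERT INTO records VALUES ('job-2')")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(error)  # database is locked
writer.execute('COMMIT')
```

With the default five-second timeout the second connection would retry for a while, which is why the error only shows up when many similar-length jobs all finish at once.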

Quick and Dirty Solution

In the short term, I decided to use a quick and dirty solution to work around this issue using file locks. I couldn’t face learning to configure Postgres as I have little to no experience with databases. The file lock solution allowed me to proceed with my work without having to learn the ins and outs of configuring Django and Postgres.

The solution required creating a context manager class, available on GitHub, that wraps the Sumatra calls to be protected. The two Sumatra commands that need to be protected from concurrency issues are project.add_record and project.save. Here the with statement is used to acquire and release the lock:

with SMTLock(project):
    project.add_record(record)
    project.save()

where SMTLock is defined as

import os

import lockfile  # third-party module: pip install lockfile
from sumatra.projects import _get_project_file

class SMTLock:
    def __init__(self, project):
        # Lock on the directory containing the Sumatra project file.
        self.lock = lockfile.FileLock(
            os.path.split(_get_project_file(project.path))[0])

    def __enter__(self):
        self.lock.acquire()

    def __exit__(self, exc_type, exc_value, traceback):
        self.lock.release()

The above class requires the lockfile module. The file locking mechanism worked well enough and might be a solution for those who wish to maintain a lightweight database with Sumatra. However, it requires making changes to the script or program whose provenance data is being logged, which goes against the grain of Sumatra’s “don’t change the existing workflow” philosophy.
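For reference, the same pattern can be sketched without any third-party dependency using fcntl from the standard library (Unix only). This is a simplified stand-in for illustration, not the lockfile-based class above:

```python
import fcntl
import os
import tempfile

class SimpleFileLock:
    """Minimal advisory file lock (Unix only); a simplified stand-in
    for lockfile.FileLock, shown here for illustration."""
    def __init__(self, path):
        self.path = path
        self.fd = None

    def __enter__(self):
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        fcntl.flock(self.fd, fcntl.LOCK_EX)  # block until the lock is free
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)

lock_path = os.path.join(tempfile.mkdtemp(), '.smt_lock')
with SimpleFileLock(lock_path):
    # project.add_record(record) and project.save() would go here
    held = True
```

Because the lock is advisory, every cooperating job must go through the same context manager for the protection to hold.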

Postgres

Given that the above solution is unsatisfactory, another alternative is to use a database that properly handles concurrency. To install Postgres on Ubuntu use

$ sudo apt-get install postgresql
$ sudo apt-get install python-psycopg2

and then use

$ sudo passwd postgres

to set the password. Then create a Sumatra user for Postgres using

$ sudo -u postgres createuser -P sumatra_user

To create a database do

$ su postgres
postgres$ psql template1
template1=# CREATE DATABASE sumatra_db OWNER sumatra_user ENCODING 'UTF8';

Exit the Postgres shell prompts and edit /etc/postgresql/9.1/main/pg_hba.conf by adding

local      sumatra_db   sumatra_user   trust

and relaunch Postgres

$ sudo /etc/init.d/postgresql restart

If Sumatra gives errors during configuration and the database has the wrong field sizes, then you’ll need to repeat the process above to create a new database. You can delete the old database with

template1=# DROP DATABASE sumatra_db;

These instructions were snatched from iiilx’s blog. Now that Postgres is working we can move on to setting up Sumatra.

Configuring Sumatra

Edit the Django database configuration in src/recordstore/django_store/__init__.py to swap SQLite for Postgres.

             self._settings['DATABASES'][label] = {
-                'ENGINE': 'django.db.backends.sqlite3',
-                'NAME': os.path.abspath(db_file)
+                'ENGINE':
+                'django.db.backends.postgresql_psycopg2',
+                'NAME': 'sumatra_db',
+                'USER': 'sumatra_user',
+                'PASSWORD': 'password',
+                'HOST': 'localhost'
             }
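For anyone applying the change by hand, the resulting entry mirrors a standard Django DATABASES setting. The sketch below uses the placeholder values chosen in the Postgres section above (sumatra_db, sumatra_user, password); substitute your own:

```python
# Hedged sketch of the Django settings entry after the swap; the keys are
# standard Django, the values are the placeholders from the Postgres setup.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'sumatra_db',       # database created with CREATE DATABASE
        'USER': 'sumatra_user',     # user created with createuser -P
        'PASSWORD': 'password',     # whatever was given to createuser -P
        'HOST': 'localhost',
    }
}
```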

Ideally this would be the only change required; however, there is an issue with the field sizes in Sumatra. Using Sumatra with the above configuration results in Postgres errors of the form DatabaseError: value too long for type character varying(100). The error occurs because the field sizes have never been tested against anything but SQLite, and SQLite does not enforce column size limits the way Postgres does (see this Stack Overflow thread for more details). Anyway, fixing the field size problem simply requires making a number of changes like this

-    type = models.CharField(max_length=20)
+    type = models.CharField(max_length=100)

in src/recordstore/django_store/models.py. There are about 10 of these altogether (see the full changeset for a complete list). The above instructions are valid as of commit d65bb4fa1f83.
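The underlying difference is easy to demonstrate with the standard library’s sqlite3 module: SQLite treats VARCHAR(n) as plain text and never enforces the declared length, so oversized values only fail once Postgres is in play.

```python
import sqlite3

# SQLite ignores the declared length of VARCHAR columns, so a value far
# longer than 20 characters is stored without complaint.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE record (type VARCHAR(20))')
conn.execute('INSERT INTO record VALUES (?)', ('x' * 100,))
stored, = conn.execute('SELECT type FROM record').fetchone()
print(len(stored))  # 100: the 20-character limit was never enforced
```

Running the same INSERT against a real Postgres varchar(20) column would raise the "value too long" error shown above.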

Testing Sumatra for Concurrency

The following is a test to see that it works. Set up a trivial script.py,

import time
import sys

param_file = sys.argv[1]
f = open(param_file, 'r')
exec f.read()

print 'waiting for ' + str(wait) + '(s)'
time.sleep(wait)
print 'finished'

and a parameter file (default.param) with

wait=3

Set up a Git repository.

$ git init
$ git add script.py default.param
$ git commit -m "First commit."

Set up a Sumatra repository.

$ smt init postgres_test
$ smt configure --executable=python --main=script.py
$ smt configure --addlabel=cmdline
$ smt configure -g uuid
$ smt configure -c store-diff
$ smt run default.param wait=5

Check the repository with smtweb. There should be one record. Now, to test concurrency, use

$ for i in $(seq 100); do smt run default.param wait=3 &> /dev/null & done

Check the repository. 101 records. No concurrency issues!

What next?

In the near future I hope to submit a patch for this and include some kind of command line configuration for Sumatra to allow easy set up. Something like this

$ smt configure --database=postgres --name=sumatra_db --user=sumatra_user --password=password
