For the last few months I’ve been using Sumatra to log the provenance data for simulations. It’s a really promising tool, and I’ve been hacking on it from time to time as I proceed with real research. Sumatra even has a web interface driven by Django and uses SQLite as the default back end database.
One of the issues with SQLite is concurrency. This issue manifests itself with Sumatra when dozens of jobs are launched simultaneously, with each job having a similar lifetime. In this event most of the jobs are not recorded, and the unrecorded jobs fail with django.db.utils.DatabaseError: database is locked. For further discussion of this issue see the Sumatra mailing list thread on this topic.
Quick and Dirty Solution
In the short term, I decided to use a quick and dirty solution to work around this issue using file locks. I couldn’t face learning to configure Postgres as I have little to no experience with databases. The file lock solution allowed me to proceed with my work without having to learn the ins and outs of configuring Django and Postgres.
The solution required creating a decorator class, available at Github, that encapsulates the Python function to be logged. The Sumatra calls that need to be protected from concurrency issues include project.save. Here the with statement is used to set and release the lock:
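In sketch form (project stands for the Sumatra Project instance; the statements inside the block are assumptions about which calls need protecting):

```python
with SMTLock(project):
    project.save()  # the call that raises "database is locked" under load
```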
SMTLock is defined as
The above class requires the lockfile module. The file locking mechanism worked well enough and might be a solution for those who wish to maintain a lightweight database with Sumatra. However, it does require making changes to the script or program for which the provenance data is being logged, which goes against the grain of Sumatra’s “don’t change the existing workflow” approach.
Given that the above solution is unsatisfactory, another alternative is to use a database that properly handles concurrency. To install Postgres on Ubuntu use
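the stock package (the package name below is the standard one on Ubuntu of this era):

```shell
sudo apt-get install postgresql
```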
and then use
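something along these lines (my reconstruction of the usual psql session; the \password meta-command prompts for the new password):

```shell
$ sudo -u postgres psql postgres
postgres=# \password postgres
```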
to set the password. Then create a Sumatra user for Postgres using
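something like this at the psql prompt (the user name and password are placeholders; choose your own):

```shell
postgres=# CREATE USER sumatra_user WITH PASSWORD 'mypassword';
```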
To create a database do
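the same at the psql prompt (the database and owner names follow the placeholders above):

```shell
postgres=# CREATE DATABASE sumatra_db OWNER sumatra_user;
```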
Exit the Postgres shell prompt and edit /etc/postgresql/9.1/main/pg_hba.conf by adding
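a line that switches local connections over to password (md5) authentication, something like (the column layout should match the rest of the file):

```
local   all   all   md5
```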
and relaunch Postgres
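with, for example:

```shell
sudo service postgresql restart
```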
If Sumatra gives errors during configuration and the database has the wrong field sizes then you’ll need to repeat the process above to create a new database. You can delete the old database with
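a DROP at the psql prompt (the database name matches the one created above):

```shell
postgres=# DROP DATABASE sumatra_db;
```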
These instructions were snatched from iiilx’s blog. Now that Postgres is working we can move on to setting up Sumatra.
Edit the Django database configuration in
src/recordstore/django_store/__init__.py to swap SQLite for
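a Postgres configuration, roughly like this (a sketch: the engine string is standard Django, and NAME, USER and PASSWORD must match the database created above):

```python
# Sketch of the Postgres counterpart to the default SQLite settings.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'sumatra_db',        # placeholder names from the Postgres
        'USER': 'sumatra_user',      # setup steps above
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
    }
}
```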
Ideally this would be the only change required; however, there is an issue with the field sizes in Sumatra. Using Sumatra with the above configuration will result in Postgres errors of the type DatabaseError: value too long for type character varying(100). This error occurs because the field sizes have never been checked with anything but SQLite, and SQLite has no size limits in the way that Postgres does (see this discussion for more details). Anyway, fixing the field size problem simply requires making a number of changes like this
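(the field below is illustrative; the real fix touches a number of CharFields in Sumatra's Django models):

```python
# before: parameters = models.CharField(max_length=100)
# after, with a limit Postgres will not trip over:
parameters = models.CharField(max_length=1000)
```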
Testing Sumatra for Concurrency
The following is a test to see that it works. Set up a trivial script:
```python
import time
import sys

# the parameter file is passed as the first command line argument
param_file = sys.argv[1]
f = open(param_file, 'r')
exec f.read()

print 'waiting for ' + str(wait) + '(s)'
time.sleep(wait)
print 'finished'
```
and a parameter file.
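Since the script execs the parameter file, the file only needs to define the wait variable, for example:

```python
# parameter file: wait time in seconds, picked up by the script via exec
wait = 10
```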
Set up a Git repository.
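Assuming the script and parameter file are saved as script.py and params (both names are placeholders):

```shell
git init
git add script.py params
git commit -m "files for the Sumatra concurrency test"
```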
Set up a Sumatra repository.
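With Sumatra this amounts to smt init and smt configure, followed by a single run to confirm that recording works (the project and file names are placeholders):

```shell
smt init MyProject
smt configure --executable=python --main=script.py
smt run params
```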
Check the repository with smtweb. There should be one record. Now to test the concurrency use
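a shell loop that launches 100 background jobs (with the configuration above, smt run only needs the parameter file):

```shell
for i in $(seq 100); do
    smt run params &
done
wait    # block until all the jobs have finished
```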
Check the repository. 101 records. No concurrency issues!
In the near future I hope to submit a patch for this and include some kind of command line configuration for Sumatra to allow easy set up. Something like this
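hypothetical invocation (no such option existed at the time of writing; the URL-style store argument is only a guess at a sensible interface):

```shell
smt init --store=postgres://sumatra_user:mypassword@localhost/sumatra_db MyProject
```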