Configuring Sumatra for Postgres
For the last few months I’ve been using Sumatra to log the provenance data for simulations. It’s a really promising tool, and I’ve been hacking on it from time to time as I proceed with real research. Sumatra even has a web interface driven by Django and uses SQLite as the default back end database.
One of the issues with SQLite is concurrency. This issue manifests
itself with Sumatra when dozens of jobs are launched simultaneously
with each job having a similar life time. In this event most of the
jobs are not recorded and the unrecorded jobs will sign off with the
dreaded django.db.utils.DatabaseError: database is locked
. For
further discussion of this issue see the
Sumatra mailing list thread on this topic.
Quick and Dirty Solution
In the short term, I decided to use a quick and dirty solution to work around this issue using file locks. I couldn’t face learning to configure Postgres as I have little to no experience with databases. The file lock solution allowed me to proceed with my work without having to learn the ins and outs of configuring Django and Postgres.
The solution required creating a decorator class
available at Github
that encapsulated the Python function to be logged. The two Sumatra
commands that need to be protected from concurrency issues are
project.add_record
and project.save
. Here the with
statement is
used to set and release the lock:
with SMTLock(project):
project.add_record(record)
project.save()
where SMTLock
is defined as
class SMTLock:
def __init__(self, project):
self.lock = lockfile.FileLock(
os.path.split(_get_project_file(project.path))[0])
def __enter__(self):
self.lock.acquire()
def __exit__(self, type, value, tb):
self.lock.release()
The above class requires the lockfile module. The file locking mechanism worked well enough and might be a solution for those that wish to maintain a lightweight database solution with Sumatra. However, it does require making changes to the script or program for which the provenance data is being logged. This goes against the grain of the “don’t change the existing workflow” approach of Sumatra.
Postgres
Given that the above solution is unsatisfactory, another alternative is to use a database that properly handles concurrency. To install Postgres on Ubuntu use
$ sudo apt-get install postgresql
$ sudo apt-get install python-psycopg2
and then use
$ sudo passwd postgres
to set the password. Then create a Sumatra user for Postgres using
$ sudo -u postgres createuser -P sumatra_user
To create a database do
$ su postgres
postgres$ psql template1
template1=# CREATE DATABASE sumatra_db OWNER sumatra_user ENCODING 'UTF8';
Exit the Postgres shell prompts and edit
/etc/postgresql/9.1/main/pg_hba.conf
by adding
local sumatra_db sumatra_user trust
and relaunch Postgres
$ sudo /etc/init.d/postgresql restart
If Sumatra gives errors during configuration and the database has the wrong field sizes then you’ll need to repeat the process above to create a new database. You can delete the old database with
template1=# DROP DATABASE sumatra_db
These instructions were snatched from iiilx’s blog. Now that Postgres is working we can move on to setting up Sumatra.
Configuring Sumatra
Edit the Django database configuration in
src/recordstore/django_store/__init__.py
to swap SQLite for
Postgres.
self._settings['DATABASES'][label] = {
- 'ENGINE': 'django.db.backends.sqlite3',
- 'NAME': os.path.abspath(db_file)
+ 'ENGINE':
+ 'django.db.backends.postgresql_psycopg2',
+ 'NAME': 'sumatra_db',
+ 'USER': 'sumatra_user',
+ 'PASSWORD': 'password',
+ 'HOST': 'localhost'
}
Ideally this would be the only change required, however, there is an
issue with the field sizes in Sumatra. Using Sumatra with the above
configuration will result in Postgres errors of the type
DatabaseError: value too long for type character varying(100)
. This
error is caused because the field sizes have never been checked with
anything but SQLite and SQLite has no size limits in the way that
Postgres does (see this
stackoverflow thread
for more details). Anyway, fixing the field size problem simply
requires making a number of changes like this
- type = models.CharField(max_length=20)
+ type = models.CharField(max_length=100)
in src/recordstore/django_store/models.py
. There are about 10 of
these altogether (see the
full changeset
for a complete list). The above instructions are valid as of commit
d65bb4fa1f83.
Testing Sumatra for Concurrency
The following is a test to see that it works. Set up a trivial script.py
,
1
2
3
4
5
6
7
8
9
10
import time
import sys
param_file = sys.argv[1]
f = open(param_file, 'r')
exec f.read()
print 'waiting for ' + str(wait) + '(s)'
time.sleep(wait)
print 'finished'
and a parameter file (default.param
) with
1
wait=3
Set up a Git repository.
$ git init
$ git add script.py default.param
$ git ci -m "First commit."
Set up a Sumatra repository.
$ smt init postgres_test
$ smt configure --executable=python --main=script.py
$ smt configure --addlabel=cmdline
$ smt configure -g uuid
$ smt configure -c store-diff
$ smt run default.param wait=5
Check the repository with smtweb
. There should be one record. Now to
test the concurrency use
$ for i in $(seq 100); do smt run default.param wait=3 &> /dev/null & done
Check the repository. 101 records. No concurrency issues!
What next?
In the near future I hope to submit a patch for this and include some kind of command line configuration for Sumatra to allow easy set up. Something like this
$ smt configure --database=postgres --name=sumatra_db --user=sumatra_user --password=password
comments powered by Disqus