Getting Started

This is a brief guide to setting up dbprocessing to support a new project.

Dependencies

Currently dbprocessing runs on Linux systems; Mac and Windows support is in testing.

Python is required, either 2.7 or 3.2+.

Other dependencies are automatically installed if you install dbprocessing using pip; these include SQLAlchemy and dateutil.

If you wish to use a PostgreSQL database, PostgreSQL is required, with appropriate permissions set up (an sqlite database, by contrast, requires no database manager setup). psycopg2 is also required for PostgreSQL and will not be installed automatically; one of the following lines will likely be appropriate, depending on your environment:

sudo apt-get install python-psycopg2
sudo apt-get install python3-psycopg2
conda install psycopg2
pip install psycopg2

It is recommended to use the same method (system package, conda, or pip) for psycopg2 as for SQLAlchemy.

Manual dependency installation

SQLAlchemy is required. This is available in most distributions; in Ubuntu, you can usually install it with:

sudo apt-get install python-sqlalchemy

or

sudo apt-get install python3-sqlalchemy

It is also usually available via pip:

pip install sqlalchemy

Finally, dateutil is required. In Ubuntu this can be installed with:

sudo apt-get install python-dateutil

or

sudo apt-get install python3-dateutil

or via pip:

pip install python-dateutil

Installation

dbprocessing itself is a Python package and must be installed.

This can usually be done with:

pip install dbprocessing

which will also install necessary dependencies.

But it can also be installed by downloading the distribution and running:

python setup.py install --user

The --user option is recommended; it installs for the current user only, without requiring administrative privileges.

Scripts needed to run dbprocessing are installed into a default location which is usually on your PATH. Specify a different location (e.g. a directory devoted just to dbprocessing scripts) with --install-scripts=DIRECTORY.
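For example, to install for the current user with the scripts in a dedicated directory (the directory name here is arbitrary):

python setup.py install --user --install-scripts=$HOME/dbprocessing_scripts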

Directory layout

There are several directories that should be reserved: usually one as a temporary location for incoming data files, one for data files once they have been brought into the database, and one for processing codes.
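A minimal sketch of such a layout (the names and locations are arbitrary and project-specific):

mkdir -p ~/mymission/incoming   # temporary drop point for new data files
mkdir -p ~/mymission/data       # permanent home for ingested files
mkdir -p ~/mymission/codes      # processing codes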

See also

Missions

Processing Codes

A processing code or script is specific to your project and transforms less-processed data into a more-processed form. dbprocessing calls these codes, but they do not need to be aware of dbprocessing or interact with it. This is the first of the interfaces between the generic dbprocessing and your specific project.
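As a purely illustrative sketch (the script name and calling convention here are hypothetical; the arguments actually passed depend on how the code is registered in the database), a processing code can be an ordinary standalone script:

#!/usr/bin/env python
# l1_to_l2.py -- hypothetical processing code: read one input file,
# write one more-processed output file. Knows nothing about dbprocessing.
import sys

def main(infile, outfile):
    with open(infile) as f_in, open(outfile, 'w') as f_out:
        for line in f_in:
            # Project-specific transformation goes here; this sketch
            # just passes the data through unchanged.
            f_out.write(line)

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])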

See also

Codes

Inspectors

An inspector is a small piece of Python code which can identify certain metadata about your data files and provide it to dbprocessing. This is the second interface between dbprocessing and your project.

Examples are forthcoming.
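Until those examples are available, the following schematic sketch may help. It assumes the pattern described in the Inspectors documentation: a class named Inspector deriving from inspector.inspector, whose inspect method returns None for files that do not belong to the product and fills in file parameters for those that do. The filename pattern, attributes, and parameter keys shown are assumptions; check the Inspectors page for the authoritative interface.

# example_inspector.py -- schematic inspector sketch; details are assumptions.
import datetime
import os

from dbprocessing import inspector
from dbprocessing import Version

class Inspector(inspector.inspector):
    code_name = "example_inspector.py"

    def inspect(self, kwargs):
        # Hypothetical product: files named mymission_level0_YYYYMMDD.dat
        fname = os.path.basename(self.filename)
        if not (fname.startswith('mymission_level0_')
                and fname.endswith('.dat')):
            return None  # not a file for this product
        datestr = fname.split('_')[2][:8]
        self.diskfile.params['utc_file_date'] = datetime.datetime.strptime(
            datestr, '%Y%m%d').date()
        self.diskfile.params['version'] = Version.Version(1, 0, 0)
        return True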

See also

Inspectors

Configuration file

The dbprocessing configuration file is a human-readable description of your project’s data files, processing codes, and the interactions between them. This description is parsed into the database structure. In principle these relationships can be defined directly in the database; in practice it is much easier to describe them with this file.

This is the third and final interface between dbprocessing and your project.
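As a schematic sketch only (the section and key names below are illustrative, not authoritative; see the addFromConfig.py documentation for the actual schema), the file is INI-style text along these lines:

; Hypothetical fragment of a dbprocessing configuration file
[mission]
mission_name = mymission
rootdir      = /home/me/mymission
incoming_dir = incoming

[satellite]
satellite_name = mysat

[instrument]
instrument_name = myinst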

See also

addFromConfig.py

Database creation

If using PostgreSQL, the database itself must first be created without any tables. This step is skipped for an sqlite database.

Then the tables and relations are created with CreateDB.py. This creates all dbprocessing structures, with no information specific to a project.

Finally, addFromConfig.py adds project-specific information from the configuration file.
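Putting the steps together for an sqlite database (the filenames are arbitrary, and the exact argument forms are assumptions; check CreateDB.py --help and addFromConfig.py --help). For PostgreSQL, the empty database would first be created, e.g. with the createdb utility.

CreateDB.py ~/mymission/mymission.sqlite
addFromConfig.py -m ~/mymission/mymission.sqlite myconfig.txt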

Initial ingest

The first set of files to bring into dbprocessing should be placed in the incoming directory, and ProcessQueue.py -i used to ingest them into the database.
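For instance, continuing the hypothetical sqlite example above (assuming the database is selected with -m; check ProcessQueue.py --help):

cp /path/to/first/files/* ~/mymission/incoming/
ProcessQueue.py -i -m ~/mymission/mymission.sqlite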

See also

Ingestion

Processing

Run ProcessQueue.py -p to produce all possible output files from the initial set of inputs.
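Continuing the same hypothetical example:

ProcessQueue.py -p -m ~/mymission/mymission.sqlite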

See also

Processing

Automation

Although dbprocessing can be run “by hand” as above, it is normally recommended to perform the following sequence on an automated basis (e.g. from cron or from a daemon that calls these scripts regularly):

  1. Place new files in the incoming directory (or link them).

  2. Call ProcessQueue.py -i.

  3. Call ProcessQueue.py -p.

Examples are pending.
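In the meantime, here is a minimal sketch of steps 2 and 3 as a wrapper script for cron (the paths, the -m argument form, and the schedule are assumptions; note the considerations below before picking a schedule):

#!/bin/sh
# run_dbprocessing.sh -- hypothetical cron wrapper; assumes step 1
# (placing files in incoming) is handled elsewhere.
DB=$HOME/mymission/mymission.sqlite
ProcessQueue.py -i -m "$DB" && ProcessQueue.py -p -m "$DB"

with a crontab entry such as:

0 * * * * $HOME/mymission/run_dbprocessing.sh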

A few considerations relating to automation:

  1. ProcessQueue.py should not be run while partially-copied files are in the incoming directory; it does not check whether files are still being written. There are two ways to address this:

    1. Ensure that the code which populates incoming never runs at the same time as ProcessQueue.py.

    2. Copy files into incoming under a name starting with ., so they are ignored on ingest; then rename once the copy is complete. The rename is atomic, so ingest sees either no file or a complete file (see the sketch after this list).

  2. Two instances of ProcessQueue.py cannot run on the same database at the same time. This means ingest must complete before processing begins; it also means that if, for instance, a processing run takes 90 minutes, processing should not be launched every hour. One approach is a script that waits a predefined time between the end of one run and the start of the next, rather than always starting at a fixed interval. A lock on the database ensures no data corruption if two instances are started at once: the second ProcessQueue.py will simply return with an error. Handling this error gracefully and trying again later is also a reasonable approach.
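A sketch of the copy-then-rename approach from option 2 above (the paths are hypothetical):

# Copy under a dotted name so a concurrent ingest ignores the partial file...
cp /data/new/file_20240101.dat ~/mymission/incoming/.file_20240101.dat
# ...then rename into place; mv within a single filesystem is an atomic rename.
mv ~/mymission/incoming/.file_20240101.dat ~/mymission/incoming/file_20240101.dat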

