***************
Getting Started
***************

This is a brief guide to setting up dbprocessing to support a new
project.

.. contents::
   :depth: 2
   :local:

Dependencies
============

Currently dbprocessing runs on Linux systems (Mac and Windows are in
testing). Python is required, either 2.7 or 3.2+. Other dependencies
are installed automatically if you install ``dbprocessing`` using
``pip``; these include SQLAlchemy and dateutil.

If you wish to use a PostgreSQL database, PostgreSQL is required, with
appropriate permissions set up (an sqlite database, by contrast,
requires no database manager setup). ``psycopg2`` is also required for
PostgreSQL and will not be installed automatically; *one* of the
following lines will likely be appropriate, depending on your
environment:

.. code-block:: sh

   sudo apt-get install python-psycopg2
   sudo apt-get install python3-psycopg2
   conda install psycopg2
   pip install psycopg2

It is recommended to use the same method (system package, conda, or
pip) for psycopg2 as for SQLAlchemy.

Manual dependency installation
------------------------------

`SQLAlchemy <https://www.sqlalchemy.org/>`_ is required. This is
available in most distributions; in Ubuntu, you can usually install it
with:

.. code-block:: sh

   sudo apt-get install python-sqlalchemy

or

.. code-block:: sh

   sudo apt-get install python3-sqlalchemy

It is also usually available via pip:

.. code-block:: sh

   pip install sqlalchemy

Finally, `dateutil <https://dateutil.readthedocs.io/>`_ is required.
In Ubuntu this can be installed with:

.. code-block:: sh

   sudo apt-get install python-dateutil

or

.. code-block:: sh

   sudo apt-get install python3-dateutil

or via pip:

.. code-block:: sh

   pip install python-dateutil

Installation
============

dbprocessing itself is a Python package and must be installed. This
can usually be done with:

.. code-block:: sh

   pip install dbprocessing

which will also install the necessary dependencies. It can also be
installed by downloading the distribution and running:
.. code-block:: sh

   python setup.py install --user

``--user`` is recommended to install for a particular user. Scripts
needed to run dbprocessing are installed into a default location which
is usually on the path. Specify a different location (e.g. a directory
devoted just to dbprocessing scripts) with
``--install-scripts=DIRECTORY``.

Directory layout
================

Several directories should be reserved: usually one as a temporary
location for incoming data files, one for data files once they have
been brought into the database, and one for processing codes.

.. seealso:: :ref:`concepts_missions`

Processing Codes
================

A processing code or script is specific to your project and takes
less-processed data into a more-processed form. dbprocessing calls
these codes, but they do not need to be aware of dbprocessing or
interact with it. This is one of the interfaces between the generic
dbprocessing and your specific project.

.. seealso:: :ref:`concepts_codes`

Inspectors
==========

An inspector is a small piece of Python code which can identify
certain metadata about your data files and provide it to
dbprocessing. This is the second interface between dbprocessing and
your project. Examples are forthcoming.

.. seealso:: :ref:`concepts_inspectors`

Configuration file
==================

The dbprocessing configuration file is a human-readable description of
your project's data files, processing codes, and the interactions
between them. This description is parsed into the database structure.
In principle these relationships can be defined directly in the
database; in practice it is much easier to describe them with this
file. This is the third and final interface between dbprocessing and
your project.

.. seealso:: :ref:`configurationfiles_addFromConfig`

Database creation
=================

If using PostgreSQL, the database itself must first be created,
without any tables. This step is skipped for an sqlite database.
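For PostgreSQL, creating the empty database can be done with the
standard ``createdb`` utility. The database name below is only a
placeholder, and depending on how your PostgreSQL installation is set
up you may need additional connection options:

.. code-block:: sh

   # PostgreSQL only -- skip entirely for sqlite.  "myproject" is a
   # placeholder name; add connection options (e.g. -U user, -h host)
   # as your installation requires.
   createdb myproject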
Then the tables and relations are created with
:ref:`scripts_CreateDB_py`. This creates all dbprocessing structures,
with no information specific to a project. Finally,
:ref:`scripts_addFromConfig_py` adds project-specific information from
the configuration file.

Initial ingest
==============

The first set of files to bring into dbprocessing should be placed in
the incoming directory, and :option:`ProcessQueue.py -i` used to
ingest them into the database.

.. seealso:: :ref:`concepts_ingest`

Processing
==========

Run :option:`ProcessQueue.py -p` to produce all possible output files
from the initial set of inputs.

.. seealso:: :ref:`concepts_processing`

Automation
==========

Although dbprocessing can be run "by hand" as above, normally it is
recommended to perform the following sequence on an automated basis
(e.g. from cron, or from a daemon that calls the scripts regularly):

1. Place new files in the incoming directory (or link them).
2. Call :option:`ProcessQueue.py -i`.
3. Call :option:`ProcessQueue.py -p`.

Examples are pending.

A few considerations relating to automation:

1. :ref:`ProcessQueue.py <scripts_ProcessQueue_py>` should not be run
   with partially-copied files in the incoming directory; it does not
   check whether they are still being written. There are two ways to
   address this:

   a. Ensure that the code which populates incoming never runs at the
      same time as ``ProcessQueue.py``.
   b. Copy files to incoming with a name starting with ``.``, so they
      will be ignored on ingest, then rename each file once its copy
      is complete. This rename is atomic.

2. Two instances of ``ProcessQueue.py`` cannot run on the same
   database at the same time. This means ingest must complete before
   processing, but it also means that if, for instance, a processing
   run takes 90 minutes, the process should not be run hourly. This
   suggests using a script that waits a predefined time between the
   end of one processing run and the start of the next, rather than
   always starting processing at a fixed interval.
   A lock on the database ensures no data corruption if two instances
   are run at once; ``ProcessQueue.py`` will simply return with an
   error. Handling this error gracefully and retrying later is also a
   reasonable approach.
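The dot-file hand-off described in consideration 1b can be
demonstrated in isolation. Here throwaway temporary directories stand
in for the real production and incoming directories, and the file name
is invented for the example:

.. code-block:: sh

   # Stand-ins for the real directories (created fresh for this demo).
   SOURCE=$(mktemp -d)      # where a processing chain deposits new files
   INCOMING=$(mktemp -d)    # the dbprocessing incoming directory

   printf 'example data\n' > "$SOURCE/datafile.dat"

   # Copy under a leading-dot name: ingest ignores dot-files, so a
   # partially written copy can never be picked up by ProcessQueue.py -i.
   cp "$SOURCE/datafile.dat" "$INCOMING/.datafile.dat"

   # Renaming within one directory is atomic: the file appears under its
   # final name, complete, in a single step.
   mv "$INCOMING/.datafile.dat" "$INCOMING/datafile.dat"

Once files are handed off this way, a wrapper run from cron can call
``ProcessQueue.py -i`` followed by ``ProcessQueue.py -p``, sleeping a
fixed time after ``-p`` finishes before starting the next cycle.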