Scripts¶
Specifying a database¶
Most scripts require a mission database to be specified. If the specified name is an existing file, it is usually interpreted as an sqlite database; otherwise it is assumed to be the name of a Postgresql database, and the following environment variables are used. Unless otherwise noted, these variables are optional and the Postgresql defaults are used if they are not set (i.e., there is no special dbprocessing-based handling).
Note
This can result in unusual behavior when a filename that doesn’t exist is specified as a mission database, as the “fall through” assumes Postgresql and might raise unexpected errors if these environment variables are not defined.
- PGUSER¶
Username to use to connect to the database. This is required when using Postgresql databases.
- PGHOST¶
Hostname of the database. If not specified, will use '', which is usually treated as a domain socket connection on localhost.
- PGPORT¶
Port to connect to.
- PGPASSWORD¶
User’s database password. If not specified, no password is provided.
Postgresql support is not as heavily tested and argument handling is not yet normalized across all scripts.
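For illustration only, a Postgresql-backed invocation might look like the following; the connection values and the database name mymission are placeholders, and the Postgresql database must already exist:
export PGUSER=dbp_user           # placeholder username
export PGHOST=db.example.com     # placeholder host
export PGPORT=5432               # default Postgresql port
export PGPASSWORD=secret         # placeholder password
ProcessQueue.py -i -m mymission  # mymission is a placeholder database name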
Maintained scripts¶
These scripts are of general use in dbprocessing and either are fully
tested and verified to work, or are moving to that status. They are
maintained as part of dbprocessing. They are in the scripts
directory.
All scripts take the -h option to provide brief usage help.
The most commonly used scripts are:
- CreateDB.py: Create dbprocessing tables in a database
- addFromConfig.py: Add project-specific relationships to db
- ProcessQueue.py: Ingest input files; process to new files
- clearProcessingFlag.py: Reset the lock if processing crashes
addFromConfig.py¶
Adds data to a database from a config file. This is the second step in setting up a new processing chain.
See the configuration file documentation for a full description of the config file format and capability.
This can be run multiple times against a database to populate information from several config files; this is a means of, for instance, having multiple satellites or instruments in a single database. Existing entries in the database are left as-is; entries which do not exist are added.
- config_file¶
The name of the config file to ingest
- -m <dbname>, --mission <dbname>¶
The database to apply the config file to
- -v, --verify¶
Verify the config file then stop (do not apply to database)
Example usage:
addFromConfig.py -m mychain.sqlite setup.config
addProductProcessLink.py¶
Add a new entry to the product-process link table, so that an existing product is added as a new input to an existing process.
- -n <dbname>, --name <dbname>¶
The mission database to update.
- -c <process>, --proc <process>¶
Name or ID of the process to update.
changeProductDir.py¶
Change the directory storing a product, and move all files of that product to the new directory.
- -m <dbname>, --mission <dbname>¶
The mission database to update.
- product¶
Name or ID of the product to change.
- newdir¶
New directory to move the file to.
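Example usage (illustrative; the product name and destination directory are placeholders):
changeProductDir.py -m mychain.sqlite my_product /n/space_data/new_dir  # placeholder product and directory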
clearProcessingFlag.py¶
Clear a processing flag (lock) on a database that has crashed.
The DButils.startLogging() method locks the database to avoid conflicts from simultaneous processing. This is currently only used by ProcessQueue.py; if it crashes before completion, the lock will still be set and must be cleared before running ProcessQueue.py again.
- database¶
Filename of the database to unlock
- message¶
Log message to insert into the database, noting reason for the unlock.
Example usage:
clearProcessingFlag.py mychain.sqlite "crash fix"
compareDB.py¶
Compares two databases to check whether they have the same products, processes, codes, and files; matching is done by name, not ID, as IDs may differ. The input files for each file, and the codes used to make each file, are also compared by filename. Output is printed to the screen.
- -m <dbname>, --mission <dbname>¶
Mission database. Specify twice, for the two missions to compare.
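Example usage (illustrative; the two database filenames are placeholders):
compareDB.py -m mychain_a.sqlite -m mychain_b.sqlite  # placeholder database filenames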
configFromDB.py¶
Build a config file from an existing database.
Warning
This is untested and not fully useful yet.
- filename¶
The filename to save the config
- -m <dbname>, --mission <dbname>¶
The database to connect to
- -f, --force¶
Force the creation of the config file, allows overwrite
- -s <satellite>, --satellite <satellite>¶
The name of the satellite for the config file
- -i <instrument>, --instrument <instrument>¶
The name of the instrument for the config file
- -c, --nocomments¶
Make the config file without a comment header block on top
coveragePlot.py¶
Creates a coverage plot based on config file input. This script is useful for determining which files may be missing from a processing chain. Either this or htmlCoverage.py works (probably this).
- configfile¶
The config file to read. See the configuration file documentation.
Warning
Has some bugs, possibly not catching most recent files reliably.
CreateDB.py¶
Create an empty database with all dbprocessing tables.
This is the first step in the setup of a new processing chain.
- -d <dialect>, --dialect <dialect>¶
sqlalchemy dialect to use: sqlite (default) or postgresql. If postgresql, the database must already exist; this script will set up the tables.
- dbname¶
The name of the database to create (filename if using sqlite).
Example usage:
CreateDB.py mychain.sqlite
dbOnlyFiles.py¶
Show the file IDs of files which are recorded in the database as being on disk but which are not actually present on disk. Optionally mark these missing files in the database as not being on disk.
- -s <date>, --startDate <date>¶
First date to check (e.g. 2012-10-02)
- -e <date>, --endDate <date>¶
Last date to check, inclusive (e.g. 2012-10-25)
- -f, --fix¶
Update database exists_on_disk to False for files which are not present.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --echo¶
echo sql queries for debugging
- -n, --newest¶
Only check the newest files
- --startID <file_id>¶
The File id to start on
- -v, --verbose¶
Print out each file as it is checked
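Example usage, checking a date range and marking missing files (the dates are illustrative):
dbOnlyFiles.py -m mychain.sqlite -s 2012-10-02 -e 2012-10-25 -f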
DBRunner.py¶
Directly execute codes in the database. Although primarily used in testing, this can also be used to reprocess files as needed, or to execute codes with no input products.
As is typical, processes for which there are no input files for a date will not be run. However, if a process has no input products, dates specified will be run, depending on the values of --force and --update. This is unlike ProcessQueue.py, which has no way of triggering such processing.
- process_id¶
Process ID or process name of process to run.
- -d, --dryrun¶
Only print what would be done (not currently working).
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --echo¶
Start sqlalchemy with echo in place for debugging
- -s <date>, --startDate <date>¶
First date to run code for (e.g. 2012-10-02 or 20121002)
- -e <date>, --endDate <date>¶
Last date to run code, inclusive (e.g. 2012-10-25 or 20121025)
- --nooptional¶
Do not include optional inputs
- -n <count>, --num-proc <count>¶
Number of processes to run in parallel
- -i, --ingest¶
Ingest created files into the database. This will also add them to the process queue, to be built into further products by ProcessQueue.py -p. (Default: create in current directory and do not add to database.)
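Example usage, running a process over a date range and ingesting the results (the process name is a placeholder):
DBRunner.py -m mychain.sqlite -s 20121002 -e 20121025 -i my_process  # my_process is a placeholder process name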
deleteAllDBFiles.py¶
Deletes all file entries in the database. Removes all references in other tables; does not remove files from disk.
- -m <dbname>, --mission <dbname>¶
Selected mission database
fast_data.py¶
Delete old versions of files, by date. Used for files that may be rapidly reprocessed, and thus old versions may not be of interest. The assumption is that files before a certain cutoff date have potentially been referenced and should be retained, and only files after that cutoff date are subject to removal.
Removes all Level0 files, and all of their children, that are not the newest version and are newer than the cutoff date. The records of the files are kept in the dbprocessing database, but exists_on_disk is set to False.
The newest version of a file is never deleted. Files which are in the release table are also not deleted.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --cutoff <date>¶
Specify the cutoff date; only delete files newer than this date. This is specified by the file date, i.e. the date of the data in the file, not the timestamp of the file on disk. Required, in form YYYY-MM-DD.
- -a <directory>, --archive <directory>¶
If specified, move files to this archive directory rather than deleting.
- --reap-files¶
Remove all matching files from disk (or archives if using -a). Files remain in the database but are marked as not existing on disk.
- --reap-records¶
Remove matching files from the database if they are marked as not existing on disk. Will also remove all references to the file from other tables.
- --verbose¶
Print the name of files as they are deleted (from disk or database).
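Example usage (illustrative; the cutoff date uses the YYYY-MM-DD form described above):
fast_data.py -m mychain.sqlite --cutoff 2012-10-02 --reap-files --verbose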
flushProcessQueue.py¶
Clears the ProcessQueue of a database.
- database¶
The name of the database to wipe the ProcessQueue of.
histogramCodes.py¶
Reads log files to find how long codes took to run; creates a histogram (PNG output) for each code, showing the number of runs for each runtime.
- logfile¶
Log file to read, specify multiple times to read many log files.
htmlCoverage.py¶
Create HTML file with table showing the versions of products present in the database by date.
Note
Either this or coveragePlot.py works, not both.
- -m <dbname>, --mission <dbname>¶
Desired mission database
- -d <deltadays>, --deltadays <deltadays>¶
Provide output this many days past the last file in the database. (Default: 3)
- outbase¶
String to use at the beginning of each html output file.
linkUningested.py¶
Find all files that are in a directory associated with a product and match the product’s file format, but are not in the database. Make a symbolic link to the incoming directory for each file (so they will be ingested on next run).
- -m <dbname>, --mission <dbname>¶
Selected mission database.
- -p <product>, --product <product>¶
Product name or product ID to check. Optional (the default checks all products), but highly recommended: in particular, ingesting files that are normally created by processing rather than ingested as first-order inputs might lead to odd results. Multiple products can be specified by giving this option more than once.
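Example usage with an explicit product, as recommended above (the product name is a placeholder):
linkUningested.py -m mychain.sqlite -p my_product  # my_product is a placeholder product name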
makeLatestSymlinks.py¶
Create symbolic links to the latest version of files.
For a directory containing files (“source”), creates symlinks in a different directory (“destination”). For each file in source, only the latest version will be linked in the destination. Useful for having one directory with all versions of files and a different directory with just the latest versions for each product and date.
Note
Operates strictly on the basis of filenames; does not access the database.
- config¶
The config file. See the configuration file documentation for details.
- --verbose¶
Print out verbose information
- -l, --list¶
Instead of syncing, list the sections of the conf file.
- -f <filter_list>, --filter <filter_list>¶
Comma-separated list of strings that must be in the sync conf name (e.g. -f hope,rbspa).
MigrateDB.py¶
Migrate a database to the latest structure.
Right now this only adds a Unix time table that stores the UTC start/end time as seconds since the Unix epoch, but it is planned to extend this to support all other database changes to date.
Will display all possible changes and prompt for confirmation.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- -y, --yes¶
Process possible changes without asking for confirmation.
missingFilesByProduct.py¶
Find files which appear to be missing (based on gaps in the sequence) and, optionally, attempt to reprocess them.
Note
90% solution, not used much, but did work
- -m <dbname>, --mission <dbname>¶
Selected mission database
- product_id¶
ID of product to check for gaps.
- -s <date>, --startDate <date>¶
First date to check (e.g. 2012-10-02). Default 2021-08-30.
- -e <date>, --endDate <date>¶
Last date to check, inclusive (e.g. 2012-10-25). Default today.
- -p, --process¶
Add missing dates to the queue for processing. Files added are from the parent product of the missing product, so --parent is required.
- --parent <parent_id>¶
Product ID of the parent product, i.e. the product which is used as input to product_id.
- --echo¶
echo sql queries for debugging
- -f <filter>, --filter <filter>¶
Unused. Intended to be space-separated globs to filter filenames.
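Example usage, checking one product for gaps and queueing its parent for reprocessing (the product IDs and dates are placeholders):
missingFilesByProduct.py -m mychain.sqlite -s 2012-10-02 -e 2012-10-25 -p --parent 3 4  # 3 and 4 are placeholder product IDs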
missingFiles.py¶
Reprocesses all missing files, based on noncontiguous date ranges. Implemented as multiple calls to missingFilesByProduct.py.
Warning
Maybe works, maybe not
- -m <dbname>, --mission <dbname>¶
Selected mission database
- -s <date>, --startDate <date>¶
First date to check (e.g. 2012-10-02). Default 2021-08-30.
- -e <date>, --endDate <date>¶
Last date to check, inclusive (e.g. 2012-10-25). Default today.
possibleProblemDates.py¶
Check for various possible database inconsistencies. See also scrubber.py.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --fix¶
Fix the issues. No backups are made, and not all problems are fixable.
- --echo¶
Echo sql queries for debugging
Warning
Worth looking into and cleaning up a bit; may have sharp edges.
printInfo.py¶
Print summary information about entries in the database.
- database¶
The name of the database to print a table from
- field¶
Table for which to print information: Code, File, Mission, Process, or Product.
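Example usage, printing a summary of the Product table:
printInfo.py mychain.sqlite Product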
printProcessQueue.py¶
Prints the process queue, i.e., the list of files to consider as potential inputs for processing.
- database¶
The name of the database to print the queue of
- -c, --count¶
Set the return code to the number of files in the queue. If there are more than 255 files, set to 255. With this option, it is impossible to differentiate between an error and a single-element process queue based on return code. Mutually exclusive with -e, --exist.
- -e, --exist¶
Set the return code based on whether there are any files in the process queue: 0 (shell True) if there are, 1 (shell False) if there are no files. With this option, it is impossible to differentiate between an error and an empty process queue based on return code. Mutually exclusive with -c, --count.
- --html¶
Provide output in HTML (default text).
- -o <filename>, --output <filename>¶
The name of the file to output to (if not specified, output to stdout).
- -p <product> [<product> ...], --product <product> [<product> ...]¶
Product IDs or names to include in output. May specify multiple products; all other products will be ignored (not included in output or in the -c and -e counts). Because this option may take multiple (space-separated) values, use -- to end the list of products before specifying the database (or use -p as the last option). Adds a table of included products to the output, before the queue output itself.
- -q, --quiet¶
Quiet mode: produce no output. Mutually exclusive with --html, -o, --output, -s, --sort.
- -s, --sort¶
Sort the output. Primary sort by UTC file date, secondary by product name. Default is to output by the order in the process queue, i.e., the order in which files are considered for processing.
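Example usage in a shell conditional, using the exist-based return code to run processing only when the queue is non-empty (a sketch; the chained call is not part of the script itself):
printProcessQueue.py -e -q mychain.sqlite && ProcessQueue.py -p -m mychain.sqlite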
printRequired.py¶
Print all required input products for one or more processes. For each process, will print the product ID and product name of all required input files; ends with a summary of all unique product IDs on one line. Handy for use with reprocessByProduct.py.
- -m <dbname>, --mission <dbname>¶
The database to read.
- process¶
Process names or IDs for which to print inputs.
ProcessQueue.py¶
The main script of dbprocessing. Operates in one of two modes. If -i is specified, attempts to ingest new files from the incoming directory into the database. As files are ingested, they are added to the process queue. If -p is specified, processes the process queue. For each file on the queue, all possible files that can be made from it are considered. If those files are not up to date (i.e., not newer than the codes that make them and all of their input files), the relevant codes are run to make those new files. These new files are ingested, added to the process queue, and similarly evaluated; the script does not return until the process queue is empty.
See also
The normal use of dbprocessing is regular calls to ProcessQueue.py -i followed by ProcessQueue.py -p.
- -i, --ingest¶
Ingest files: evaluate all files in the incoming directory, attempt to add them to the database, move them to the appropriate directory for their identified product, and add them to the process queue.
- -p, --process¶
Process files: make all possible out-of-date outputs of all of the inputs on the process queue, and add these new files to the process queue. Repeat until the queue is empty.
Common options¶
These options are used with ProcessQueue.py -i and ProcessQueue.py -p.
- -m <dbname>, --mission <dbname>¶
The mission database to connect to
- -l <loglevel>, --log-level <loglevel>¶
Set the logging level; messages of at least this priority are written to the log. Default debug. See setLevel() for valid levels.
- --echo¶
echo sql queries for debugging
- -d, --dryrun¶
Only perform a dry run, do not perform ingest/process.
Warning
This is implemented via the dryrun kwarg to ProcessQueue and has not been fully tested (there may be side effects).
- -r, --report¶
Make an HTML report
Note
Not implemented.
Ingest mode options¶
These options are only used with ProcessQueue.py -i.
Process mode options¶
These options are only used with ProcessQueue.py -p.
- -n <numproc>, --num-proc <numproc>¶
Number of processes to run at once. This is the number of processing codes to launch at a given time to create new files; each may itself use multiple processors. Default 2.
- -o <process>, --only <process>¶
Comma-separated list of processes (IDs or names) to run. Other processes will not be run, as if they did not exist. This does not affect the removal of files from the process queue: a file is still removed from the queue and evaluated for possible processing, but processing only proceeds for processes on the provided list. The file is not returned to the queue if other processes are skipped.
- -s¶
Skip processes with a RUN timebase. Because these processes do not create an output file, they are never “up to date” and it may be useful to skip them to avoid extra processing time.
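Example usage of the typical ingest-then-process cycle described above:
ProcessQueue.py -i -m mychain.sqlite
ProcessQueue.py -p -m mychain.sqlite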
purgeFileFromDB.py¶
Deletes individual files from the database. Also removes all references to each deleted file from the database. Does not remove files from disk.
- filename¶
Name of the file to remove; specify multiple files to remove them all.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- -r, --recursive¶
Recursive removal: remove not only this file, but all files for which it is an input.
- -v, --verbose¶
Verbose: print all files removed.
replaceArgsWithRootdir.py¶
Replace all references to the root directory of a mission in code arguments with {ROOTDIR}, so that future changes to the mission's root directory will propagate to the arguments; i.e., replace explicit hardcoded references with a reference that always expands to the current value.
Note
Currently only works on sqlite databases.
- mission¶
Mission database to update
reprocessByCode.py¶
Add all files made by a given code to the process queue, so they will be evaluated as inputs on the next run of ProcessQueue.py -p.
Warning
Should work, probably doesn’t
- code¶
Name or ID of code to reprocess. Files made by this code will be added to the process queue to be considered as inputs; this is not the code which will be run when those files are reprocessed.
- -s <date>, --startDate <date>¶
Date to start reprocessing (e.g. 2012-10-02)
- -e <date>, --endDate <date>¶
Date to end reprocessing (e.g. 2012-10-25)
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --force {0,1,2}¶
Force the reprocessing. Specify which version number to increment (0,1,2)
reprocessByDate.py¶
Goes through the database and adds all the files that are in a date range to the process queue so that the next ProcessQueue.py -p will run them.
This script works and is likely the one that should be used most of the time for reprocessing files. (It is the default choice for “do everything on a date range.”)
- -s <date>, --startDate <date>¶
Date to start reprocessing (e.g. 2012-10-02)
- -e <date>, --endDate <date>¶
Date to end reprocessing (e.g. 2012-10-25)
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --echo¶
Echo sql queries for debugging
- --force {0,1,2}¶
Force the reprocessing. Specify which version number to increment (0,1,2)
- --level <level>¶
Only reprocess files of this level.
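Example usage (the dates are illustrative):
reprocessByDate.py -m mychain.sqlite -s 2012-10-02 -e 2012-10-25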
reprocessByInstrument.py¶
Adds all database files of a particular instrument to the process queue so that the next ProcessQueue.py -p will run them.
- instrument¶
The instrument to reprocess; only products of this instrument are added to the process queue. Name or ID.
- -s <date>, --startDate <date>¶
Date to start reprocessing (e.g. 2012-10-02)
- -e <date>, --endDate <date>¶
Date to end reprocessing (e.g. 2012-10-25)
- -m <dbname>, --mission <dbname>¶
Selected mission database
- -l <level>, --level <level>¶
The level to reprocess for the given instrument
- --echo¶
Echo sql queries for debugging
- --force {0,1,2}¶
Force the reprocessing. Specify which version number to increment (0,1,2)
reprocessByProduct.py¶
Adds all database files of a particular product to the process queue so that the next ProcessQueue.py -p will run them.
This reprocessing script works and has been tested much more heavily than the others; it is used regularly for reprocessing individual products.
- product¶
Add files of this product, ID or name.
- -s <date>, --startDate <date>¶
Date to start reprocessing (e.g. 2012-10-02)
- -e <date>, --endDate <date>¶
Date to end reprocessing (e.g. 2012-10-25)
- -m <dbname>, --mission <dbname>¶
Selected mission database
- --echo¶
Echo sql queries for debugging
- --force {0,1,2}¶
Force the reprocessing. Specify which version number to increment (0,1,2)
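Example usage (illustrative; the product name is a placeholder):
reprocessByProduct.py -m mychain.sqlite -s 2012-10-02 -e 2012-10-25 my_product  # my_product is a placeholder product name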
testInspector.py¶
Run an inspector against a specific product in a database and a specific file. Prints the contents of the Diskfile if the file matches the product.
- -m <dbname>, --mission <dbname>¶
Selected mission database.
- -f <file>, --file <file>¶
Path to data file to test inspector on.
- -i <inspector>, --inspector <inspector>¶
Path to inspector source file.
- -p <product>, --product <product>¶
Product ID of the product the file belongs to, i.e. test if inspector considers the file to be a match to this product.
- -a <args>, --args <args>¶
Keyword arguments to pass to inspector (optional), space-separated list of key=value pairs, as in inspector.arguments.
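Example usage (illustrative; the file path, inspector path, product ID, and keyword argument are all placeholders):
testInspector.py -m mychain.sqlite -f /tmp/incoming_file.dat -i my_inspector.py -p 1 -a key=value  # all values are placeholders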
scrubber.py¶
Checks a database for possible inconsistencies or problems. See also possibleProblemDates.py.
- -m <dbname>, --mission <dbname>¶
Mission database to check
updateSHAsum.py¶
Update the stored shasum for a file; useful if the file was changed after ingestion.
- infile¶
File to update the shasum of
- -m <dbname>, --mission <dbname>¶
Selected mission database
updateUnixTime.py¶
Rewrites all Unix timestamps in a file, recalculating them from the UTC start/stop time. This is not needed if adding a Unix timestamp table to an existing database (see MigrateDB.py); it is only required if the algorithm for populating the Unix timestamps changes and a database has been created with the older algorithm.
- -m <dbname>, --mission <dbname>¶
Selected mission database
Examples¶
These scripts are meant as reference for specific tasks that might be
required for a particular mission. They may not be fully tested or may
be mission-specific. They are not generally maintained; some are
candidates for eventually transferring to maintained scripts. They are
in the directory examples/scripts
.
addVerboseProvenance.py¶
Go into the database, get the verbose provenance for a file, and add it to the global attributes of the CDF file. Output can go to the same file or to a different file.
Warning
This code has not been fully tested or used; never worked.
- infile¶
Input CDF file
- outfile¶
Output CDF file; input is copied to this file with the provenance added.
- -m <dbname>, --mission <dbname>¶
Selected mission database
- -i, --inplace¶
Edit the existing CDF file in place instead of making a new output file.
CreateDBsabrs.py¶
Variant of CreateDB.py that was used for a project with Postgresql before that functionality was integrated; it also used slightly different table definitions.
dataToIncoming.py¶
Concept, never actually used. Intended as a single script which would be used (in conjunction with a configuration file) to handle all incoming data for RBSP-ECT, ingesting all new files into the database. In practice, separate scripts were used for each sensor in the suite.
hopeCoverageHTML.py¶
Produce a table with days that had coverage of HOPE data. See coveragePlot.py and htmlCoverage.py for more generic implementation.
hope_query.py¶
Print information on HOPE files for particular days, and particular spacecraft. See printInfo.py for similar generic output.
link_missing_ql_mag_l2_mag.py¶
RBSP-ECT had some inputs available initially in a quicklook format and later in a definitive level 2 format. The database treated QL as “required” and L2 as “optional”; dbprocessing doesn’t support “either, but prefer this one”, so this script links them together and the wrapper handles the actual selection of the file according to priority.
magephem-pre-CoverageHTML.py¶
Produce a table with days that had coverage of predictive magnetic ephemeris data. See coveragePlot.py and htmlCoverage.py for more generic implementation.
newestVersionProblemFinder.py¶
Untested script to check for cases where the newest version of a file is not consistent with version numbering and creation dates.
updateCode.py¶
Helper to deploy a new version of a code. Designed to copy an existing code entry and increment its version.
Ideally would also add all files that are inputs to the code to the process queue, but this was not implemented.
updateProducts.py¶
Intended to update products based on an updated configuration file. Probably broken.
weeklyReport.py¶
Reads dbprocessing log files to produce an HTML report of activity over a given period of time. Unused and probably broken.
writeDBhtml.py¶
Produces an HTML summary of a mission's products and processes. Unused and probably broken.
writeProcessConf.py¶
Write the configuration file fragment for a particular process. Not used. See configFromDB.py.
writeProductsConf.py¶
Write the configuration file fragment for a particular product. Not used. See configFromDB.py.