A Python application for automating the archiving of sequencing data.
Dataflow Archive monitors sequencing run directories and orchestrates the encryption and archiving of sequencing data. It supports multiple sequencer types (Illumina, Oxford Nanopore and Element) and logs archiving completion in a CouchDB-based status database.
- Illumina: NextSeq, MiSeqi100, NovaSeqXPlus, MiSeq
- Oxford Nanopore (ONT): PromethION, MinION
- Element: AVITI
- Python 3.14+
- Dependencies listed in pyproject.toml:
  - PyYAML
  - click
  - ibmcloudant
  - run-one
- Clone the repository:
git clone <repository-url>
cd dataflow_archive
- Install the package:
pip install -e .
Or with development dependencies:
pip install -e ".[dev]"

dataflow_archive [OPTIONS] COMMAND

- -c, --config-file PATH: Path to the configuration YAML file. Defaults to ~/.df_archive/df_archive.yaml. Can also be set via the ARCHIVE_CONFIG environment variable.
- -r, --run RUN_ID: Archive a specific run (e.g., 20250528_LH00217_0219_A22TT52LT4).
- --version: Show version and exit.

- encrypt: Tar and encrypt run directories based on the provided configuration.
- upload: Upload encrypted runs to PDC.
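The lookup of the configuration file described for --config-file can be illustrated with a small sketch. The default path and the ARCHIVE_CONFIG variable name come from the option description; the precedence order (command-line value, then environment variable, then default) and the function name resolve_config_path are assumptions made for illustration, not the package's actual code.

```python
import os
from pathlib import Path

# Default path and variable name are taken from the option description.
# The precedence order below is an assumption, not confirmed behaviour.
DEFAULT_CONFIG = "~/.df_archive/df_archive.yaml"
ENV_VAR = "ARCHIVE_CONFIG"


def resolve_config_path(cli_value=None, environ=None):
    """Pick the config path: CLI value, then ARCHIVE_CONFIG, then the default."""
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get(ENV_VAR) or DEFAULT_CONFIG
    # Expand a leading "~" so the default resolves to the user's home directory.
    return Path(chosen).expanduser()


if __name__ == "__main__":
    print(resolve_config_path("/path/to/config.yaml"))
```

Passing the environment as an argument keeps the helper easy to test without modifying the real process environment.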
# Encrypt all runs (uses configuration for sequencing directories)
dataflow_archive encrypt
# Encrypt a specific run
dataflow_archive --run 20250528_LH00217_0219_A22TT52LT4 encrypt
# Upload all encrypted runs
dataflow_archive upload
# Upload a specific run
dataflow_archive --run 20250528_LH00217_0219_A22TT52LT4 upload
# Use a custom config file
dataflow_archive --config-file /path/to/config.yaml encrypt

Create a YAML configuration file with the following structure:
log:
  file: /path/to/dataflow_archive.log
statusdb:
  username: couchdb_user
  password: couchdb_password
  url: couchdb.host.com
  database: sequencing_runs
data_dirs:
  - /sequencing/MiSeqi100
  - /sequencing/NextSeq/Runs
  - /sequencing/PromethION
ignore_folders:
  - nosync
archive_dir: /sequencing/archiving
sequencer_specific_settings:
  Illumina:
    tar_exclude:
      - Demultiplex*
    final_file:
      - CopyComplete.txt
  ONT:
    tar_exclude:
      - Pod5*
    final_file:
      - final_summary.txt
  # ... additional sequencer configurations

- Run directories are named according to sequencer-specific ID formats (defined in RUN_TYPES)
- Final completion is indicated by the presence of a sequencer-specific final file (e.g. CopyComplete.txt for Illumina)
- CouchDB is accessible and the database exists
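To make the first assumption concrete, here is a minimal sketch of how run directory names could be mapped to a sequencer family with regular expressions. The patterns, the dictionary keys, the classify_run helper, and the ONT-style run name used in the comments are all hypothetical; the real RUN_TYPES dictionary in the package may look quite different.

```python
import re

# Hypothetical patterns for illustration only; the real RUN_TYPES
# dictionary in the package may use different keys and expressions.
RUN_TYPES = {
    # e.g. 20250528_LH00217_0219_A22TT52LT4
    "Illumina": re.compile(r"^\d{8}_[A-Z0-9]+_\d{4}_[A-Z0-9-]+$"),
    # e.g. 20240101_1234_1A_PAQ12345_abcdef12 (made-up example name)
    "ONT": re.compile(r"^\d{8}_\d{4}_[A-Z0-9-]+_[A-Z0-9]+_[0-9a-f]{8}$"),
}


def classify_run(run_id):
    """Return the first sequencer family whose pattern matches, else None."""
    for sequencer, pattern in RUN_TYPES.items():
        if pattern.match(run_id):
            return sequencer
    return None


if __name__ == "__main__":
    print(classify_run("20250528_LH00217_0219_A22TT52LT4"))
```

Directories that match no pattern, such as an ignored folder like nosync, return None in this sketch and would simply be skipped.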
The logic of the script relies on the following status files:
- run.final_file: The final file written by each sequencing machine. Used to indicate when the sequencing has completed.
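A completion check based on the final file might look like the sketch below. The function name is_sequencing_complete is hypothetical, and the sketch assumes that every file listed under final_file for the sequencer must be present and that an empty list means "not complete"; the package itself may apply different rules.

```python
from pathlib import Path


def is_sequencing_complete(run_dir, final_files):
    """Return True if every configured final file exists in run_dir.

    Assumption: all listed files are required, and an empty list
    is treated as "not complete" rather than trivially complete.
    """
    run_dir = Path(run_dir)
    if not final_files:
        return False
    return all((run_dir / name).is_file() for name in final_files)


if __name__ == "__main__":
    # Hypothetical run directory under one of the configured data_dirs.
    run = "/sequencing/NextSeq/Runs/20250528_LH00217_0219_A22TT52LT4"
    print(is_sequencing_complete(run, ["CopyComplete.txt"]))
```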
pytest
With coverage:
pytest --cov --cov-branch
Run linting and formatting checks:
ruff check .
ruff format --check .

dataflow_archive/
├── cli.py # Command-line interface
├── dataflow_archive.py # Main transfer orchestration
├── log/ # Logging utilities
├── utils/ # Utility modules (filesystem, statusdb)
└── tests/ # Unit tests
To add support for a new sequencer, make the following changes to dataflow_archive:
- Update the RUN_TYPES dictionary to cover the format of the new run folder name
- Add entries for the sequencer in the config file (data_dirs and any necessary changes/additions to sequencer_specific_settings)
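When choosing tar_exclude patterns for a new sequencer, it can help to check which entries of a run directory they would match before editing the config. The sketch below assumes the patterns are shell-style globs compared case-sensitively against top-level entry names; how the package actually applies them (for example by passing them to tar) is not stated here, and the helper excluded_entries and the entry names are hypothetical.

```python
from fnmatch import fnmatchcase


def excluded_entries(entry_names, tar_exclude):
    """Return the entry names matched by at least one exclude pattern.

    Assumption: patterns are shell-style globs, matched case-sensitively
    against top-level names inside the run directory.
    """
    return [
        name
        for name in entry_names
        if any(fnmatchcase(name, pattern) for pattern in tar_exclude)
    ]


if __name__ == "__main__":
    # Hypothetical contents of an Illumina run directory.
    entries = ["Demultiplexing", "Data", "CopyComplete.txt"]
    print(excluded_entries(entries, ["Demultiplex*"]))
```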