Dataflow Archive

A Python application for automating the archiving of sequencing data.

Overview

Dataflow Archive monitors sequencing run directories and orchestrates the encryption and archiving of sequencing data. It supports multiple sequencer types (Illumina, Oxford Nanopore and Element) and logs archiving completion in a CouchDB-based status database.

Supported Sequencers

Illumina: NextSeq, MiSeqi100, NovaSeqXPlus, MiSeq
Oxford Nanopore (ONT): PromethION, MinION
Element: AVITI

Installation

Requirements

Python 3.14+
Dependencies listed in pyproject.toml:
- PyYAML
- click
- ibmcloudant
run-one

Setup suggestions

Clone the repository:

git clone <repository-url>
cd dataflow_archive

Install the package:

pip install -e .

Or with development dependencies:

pip install -e ".[dev]"

Usage

Command Line Interface

dataflow_archive [OPTIONS] COMMAND

Options

-c, --config-file PATH: Path to the configuration YAML file. Defaults to ~/.df_archive/df_archive.yaml. Can also be set via the ARCHIVE_CONFIG environment variable.
-r, --run RUN_ID: Archive a specific run (e.g., 20250528_LH00217_0219_A22TT52LT4).
--version: Show version and exit.
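
The order in which the option, the environment variable and the default are consulted is not spelled out here. A minimal sketch, assuming the common convention that an explicit option wins over the environment variable, which in turn wins over the default. `resolve_config_path` is a hypothetical helper for illustration, not a function of the package:

```python
import os
from pathlib import Path

# Default location taken from the --config-file description.
DEFAULT_CONFIG = Path("~/.df_archive/df_archive.yaml")


def resolve_config_path(cli_value=None, environ=None):
    """Pick the config path: CLI option, then ARCHIVE_CONFIG, then the default.

    Hypothetical helper; the real CLI may resolve this differently.
    """
    environ = os.environ if environ is None else environ
    chosen = cli_value or environ.get("ARCHIVE_CONFIG") or DEFAULT_CONFIG
    # Expand a leading "~" so the default points into the user's home directory.
    return Path(chosen).expanduser()
```

Passing the environment as an argument keeps the function easy to exercise without touching the real process environment.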

Commands

encrypt: Tar and encrypt run directories based on the provided configuration.
upload: Upload encrypted runs to PDC.

Examples

# Encrypt all runs (uses configuration for sequencing directories)
dataflow_archive encrypt

# Encrypt a specific run
dataflow_archive --run 20250528_LH00217_0219_A22TT52LT4 encrypt

# Upload all encrypted runs
dataflow_archive upload

# Upload a specific run
dataflow_archive --run 20250528_LH00217_0219_A22TT52LT4 upload

# Use a custom config file
dataflow_archive --config-file /path/to/config.yaml encrypt

Configuration

Create a YAML configuration file with the following structure:

log:
  file: /path/to/dataflow_archive.log

statusdb:
  username: couchdb_user
  password: couchdb_password
  url: couchdb.host.com
  database: sequencing_runs

data_dirs:
  - /sequencing/MiSeqi100
  - /sequencing/NextSeq/Runs
  - /sequencing/PromethION
ignore_folders:
  - nosync
archive_dir: /sequencing/archiving
sequencer_specific_settings:
  Illumina:
    tar_exclude:
      - Demultiplex*
    final_file:
      - CopyComplete.txt
  ONT:
    tar_exclude:
      - Pod5*
    final_file:
      - final_summary.txt
  # ... additional sequencer configurations
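
How the tar_exclude patterns are applied is not described in this README. One possible reading, sketched with the standard library only, is that each pattern is a shell-style glob matched against every path component inside the run directory. The function name `tar_run` and the matching rule are assumptions, and the encryption step is left out:

```python
import fnmatch
import tarfile
from pathlib import Path


def tar_run(run_dir, archive_path, exclude_patterns):
    """Tar run_dir into archive_path, skipping members that match a glob pattern."""
    run_dir = Path(run_dir)

    def skip_excluded(member):
        # member.name is the path inside the archive, e.g. "run1/Demultiplex_Stats".
        parts = Path(member.name).parts
        for pattern in exclude_patterns:
            if any(fnmatch.fnmatch(part, pattern) for part in parts):
                # Returning None drops the member; a dropped directory is not descended into.
                return None
        return member

    with tarfile.open(archive_path, "w") as tar:
        tar.add(run_dir, arcname=run_dir.name, filter=skip_excluded)
```

With the Illumina settings from the example configuration, a call such as `tar_run(run_dir, "run.tar", ["Demultiplex*"])` would leave any `Demultiplex*` folder out of the archive.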

Assumptions

Run directories are named according to sequencer-specific ID formats (defined in RUN_TYPES)
Final completion is indicated by the presence of a sequencer-specific final file (e.g. CopyComplete.txt for Illumina)
CouchDB is accessible and the database exists

Status Files

The logic of the script relies on the following status files:

run.final_file - The final file written by each sequencing machine. Used to indicate when the sequencing has completed.
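
A minimal sketch of such a completion check, assuming the final file sits directly inside the run directory and that any one of the configured names is sufficient. `is_run_complete` is a hypothetical name, and the file names passed in mirror the final_file entries of the example configuration:

```python
from pathlib import Path


def is_run_complete(run_dir, final_files):
    """Return True if any configured final file exists at the top level of run_dir."""
    run_dir = Path(run_dir)
    return any((run_dir / name).is_file() for name in final_files)
```

For an Illumina run this would be called as `is_run_complete(run_dir, ["CopyComplete.txt"])`.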

Development

Running Tests

pytest

With coverage:

pytest --cov --cov-branch

Code Quality

Run linting and formatting checks:

ruff check .
ruff format --check .

Project Structure

dataflow_archive/
├── cli.py                 # Command-line interface
├── dataflow_archive.py    # Main archive orchestration
├── log/                   # Logging utilities
├── utils/                 # Utility modules (filesystem, statusdb)
└── tests/                 # Unit tests

Adding a new sequencer

To add support for a new sequencer, make the following changes to dataflow_archive:

Update the RUN_TYPES dictionary to cover the format of the new run folder name
Add entries for the sequencer in the config file (data_dirs and any necessary changes/additions to sequencer_specific_settings)
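
The shape of RUN_TYPES is not shown in this README. As an illustration only, suppose it maps a sequencer name to a regular expression for the run folder name. The patterns, the "NewSequencer" entry and the `detect_run_type` helper below are hypothetical and would need to be checked against the real dictionary:

```python
import re

# Hypothetical structure: sequencer name -> pattern for the run folder name.
RUN_TYPES = {
    # Assumed Illumina-style ID: date_instrument_counter_flowcell,
    # as in 20250528_LH00217_0219_A22TT52LT4.
    "Illumina": re.compile(r"^\d{8}_LH\d+_\d{4}_[AB][A-Z0-9]+$"),
}

# Adding a new sequencer then amounts to one more entry (made-up format).
RUN_TYPES["NewSequencer"] = re.compile(r"^\d{8}_NS\d+_\d{4}$")


def detect_run_type(run_id):
    """Return the first sequencer name whose pattern matches run_id, else None."""
    for name, pattern in RUN_TYPES.items():
        if pattern.match(run_id):
            return name
    return None
```

Anchoring each pattern with `^` and `$` keeps folders such as nosync from being picked up as runs by accident.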
