Pipeline

The MIMOSA pipeline fetches sample data, runs allele-based clustering with ReporTree, and uploads the results to the database so the frontend can display them.

Data sources

The pipeline supports two data sources that can be used independently or together.

Bonsai (default)

Samples and their cgMLST allele profiles are fetched from a Bonsai LIMS instance. Authentication uses a credentials file or environment variables (see Credentials below).

chewBBACA import

Allele profiles are imported directly into MIMOSA, either via the web UI (see chewBBACA Import) or by passing TSV files on the command line with --chewbbaca. When Bonsai is disabled (--bonsai false), the pipeline clusters only from locally imported data.

When both sources are active, Bonsai data and locally imported chewBBACA profiles are merged before clustering. If any Bonsai-sourced features already exist in the database, Bonsai is always included regardless of settings to avoid incomplete clustering.
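The source-selection rule above can be summarised in a few lines. This is only an illustrative sketch: the function name and its three boolean inputs are hypothetical and are not taken from the MIMOSA code base.

```python
def resolve_sources(bonsai_enabled, has_local_profiles, db_has_bonsai_features):
    """Return the set of data sources that take part in clustering.

    Hypothetical helper: models the documented rule that Bonsai is kept
    whenever Bonsai-sourced features already exist in the database.
    """
    sources = set()
    # Bonsai is used when enabled, or forced when the database already
    # holds Bonsai-sourced features (to avoid incomplete clustering).
    if bonsai_enabled or db_has_bonsai_features:
        sources.add("bonsai")
    # Locally imported chewBBACA profiles are merged in when present.
    if has_local_profiles:
        sources.add("chewbbaca")
    return sources
```

With Bonsai disabled and no Bonsai-sourced features stored, only the local profiles remain; if such features exist, both sources are returned.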

Credentials

When running the pipeline manually you must supply credentials for both Bonsai and the MIMOSA backend. Provide them as a JSON file passed with --credentials:

{
  "bonsai_username": "your_bonsai_user",
  "bonsai_password": "your_bonsai_password",
  "mimosa_username": "your_mimosa_user",
  "mimosa_password": "your_mimosa_password"
}

Alternatively, when running inside Docker or in CI, the pipeline reads credentials from environment variables (see Automation).
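Before a manual run it can be useful to confirm that the credentials file contains all four keys. The following is a minimal sketch, assuming the JSON layout shown above; load_credentials is a hypothetical helper, not part of the pipeline.

```python
import json
from pathlib import Path

# The four keys listed in the credentials file format above.
REQUIRED_KEYS = (
    "bonsai_username",
    "bonsai_password",
    "mimosa_username",
    "mimosa_password",
)


def load_credentials(path):
    """Read a credentials JSON file and fail early if a key is missing or empty."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return {key: data[key] for key in REQUIRED_KEYS}
```

Calling load_credentials("credentials.json") returns a dictionary with the four values, or raises ValueError naming the absent keys.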

chewBBACA input formats

The --chewbbaca option accepts three input types:

Single TSV file:

python scripts/main.py \
  --credentials credentials.json \
  --chewbbaca /data/results_alleles.tsv \
  --chewbbaca_profile staphylococcus_aureus

Directory — all .tsv files in the directory are loaded:

python scripts/main.py \
  --credentials credentials.json \
  --chewbbaca /data/chewbbaca_results/ \
  --chewbbaca_profile staphylococcus_aureus

CSV manifest — a file listing paths and optional per-row profile and sample ID overrides:

ID,file_path,profile
MRSA_SAMPLE_001,/data/run1/results_alleles.tsv,staphylococcus_aureus
KPN_SAMPLE_001,/data/run2/results_alleles.tsv,klebsiella_pneumoniae

Accepted column names: ID or sample_id (optional — overrides the sample name in the TSV for single-sample files); file_path, path, or file (required); profile or analysis_profile (optional — overrides --chewbbaca_profile for that row).
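The column aliases above can be resolved with the standard csv module. This sketch assumes exact, case-sensitive header matching and treats an empty cell as "not provided"; the function names are hypothetical and the pipeline's own parser may behave differently.

```python
import csv
import io

# Aliases taken from the accepted column names listed above.
ID_COLUMNS = ("ID", "sample_id")
PATH_COLUMNS = ("file_path", "path", "file")
PROFILE_COLUMNS = ("profile", "analysis_profile")


def first_value(row, names):
    """Return the first non-empty cell among the given column aliases."""
    for name in names:
        value = (row.get(name) or "").strip()
        if value:
            return value
    return None


def read_manifest(handle, default_profile=None):
    """Parse a manifest into dicts with sample_id, path and profile keys."""
    entries = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        path = first_value(row, PATH_COLUMNS)
        if path is None:
            raise ValueError(f"manifest line {line_no}: no file path given")
        entries.append(
            {
                "sample_id": first_value(row, ID_COLUMNS),
                "path": path,
                # A per-row profile overrides the command-line default.
                "profile": first_value(row, PROFILE_COLUMNS) or default_profile,
            }
        )
    return entries
```

Here default_profile stands in for the value of --chewbbaca_profile, which applies only to rows that do not name a profile themselves.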

--chewbbaca can be repeated to supply multiple inputs:

python scripts/main.py \
  --credentials credentials.json \
  --chewbbaca /data/staph.tsv   --chewbbaca_profile staphylococcus_aureus \
  --chewbbaca /data/kpn.tsv     --chewbbaca_profile klebsiella_pneumoniae
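One way to read repeated options is argparse's append action, pairing each input with a profile by position. Positional pairing is an assumption here, as is the choice to leave unmatched inputs without a profile; scripts/main.py may handle this differently.

```python
import argparse


def parse_inputs(argv):
    """Pair each --chewbbaca input with the --chewbbaca_profile at the same position."""
    parser = argparse.ArgumentParser()
    # action="append" collects every occurrence of a repeated option.
    parser.add_argument("--chewbbaca", action="append", default=[])
    parser.add_argument("--chewbbaca_profile", action="append", default=[])
    args = parser.parse_args(argv)

    inputs, profiles = args.chewbbaca, args.chewbbaca_profile
    if len(profiles) > len(inputs):
        parser.error("more --chewbbaca_profile values than --chewbbaca inputs")
    # Assumption: inputs without a matching profile get None.
    padded = profiles + [None] * (len(inputs) - len(profiles))
    return list(zip(inputs, padded))
```

Passing the four options from the example above yields two (path, profile) pairs in command-line order.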

Conflict and priority behaviour

When running in mixed mode (Bonsai + chewBBACA), two types of conflict can occur.

Bonsai conflicts

A local sample ID matches a sample already present in Bonsai. Set CHEWBBACA_CONFLICT to control the outcome:

CHEWBBACA_CONFLICT=use_bonsai    # keep the sample using Bonsai alleles (default)
CHEWBBACA_CONFLICT=use_chewbbaca # use the locally imported allele profile instead
CHEWBBACA_CONFLICT=skip          # exclude the conflicting sample from this run

In interactive mode (running the script directly, not via automation), you are prompted for each conflicting sample if this variable is not set.

The conflict check strips common local filename suffixes — for example a local file named SAMPLE_001_chewbbaca.tsv is compared against the Bonsai sample ID SAMPLE_001.
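The check and the three CHEWBBACA_CONFLICT outcomes can be sketched as follows. Only the _chewbbaca suffix is documented above, so the suffix tuple contains just that one entry; the function names and return values are hypothetical.

```python
from pathlib import PurePath

# Only "_chewbbaca" appears in the documentation; any other suffix the
# pipeline strips is unknown here and would have to be added.
LOCAL_SUFFIXES = ("_chewbbaca",)
POLICIES = ("use_bonsai", "use_chewbbaca", "skip")


def normalise_sample_id(filename):
    """Drop the file extension and a known local suffix from a file name."""
    stem = PurePath(filename).stem
    for suffix in LOCAL_SUFFIXES:
        if stem.endswith(suffix):
            return stem[: -len(suffix)]
    return stem


def resolve_conflict(filename, bonsai_ids, policy="use_bonsai"):
    """Return which allele profile to use: "bonsai", "chewbbaca", or None to skip."""
    if policy not in POLICIES:
        raise ValueError(f"unknown CHEWBBACA_CONFLICT value: {policy!r}")
    if normalise_sample_id(filename) not in bonsai_ids:
        return "chewbbaca"  # no conflict, so the local profile is used as-is
    return {"use_bonsai": "bonsai", "use_chewbbaca": "chewbbaca", "skip": None}[policy]
```

A file named SAMPLE_001_chewbbaca.tsv is therefore compared as SAMPLE_001, and the policy applies only when that ID is present in Bonsai.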

Local store duplicates

When importing a chewBBACA sample that already exists in the local MongoDB store, CHEWBBACA_DUPLICATE_ACTION controls the behaviour:

CHEWBBACA_DUPLICATE_ACTION=skip     # keep the existing record (default)
CHEWBBACA_DUPLICATE_ACTION=replace  # overwrite and record replacement history

These variables can be set in .env or passed as environment variables before running the pipeline.
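The two duplicate actions can be illustrated with a plain dictionary standing in for the local MongoDB store. The record layout, in particular the history list, is a hypothetical simplification and not the pipeline's actual schema.

```python
def apply_duplicate_action(store, sample_id, alleles, action="skip"):
    """Insert a sample, or handle an existing one according to the action.

    `store` is an in-memory dict used here in place of the MongoDB
    collection; the "history" field is an illustrative assumption.
    """
    if action not in ("skip", "replace"):
        raise ValueError(f"unknown CHEWBBACA_DUPLICATE_ACTION value: {action!r}")
    existing = store.get(sample_id)
    if existing is None:
        store[sample_id] = {"alleles": alleles, "history": []}
        return "inserted"
    if action == "skip":
        return "skipped"  # keep the existing record untouched
    # replace: overwrite, but remember what was there before
    history = existing["history"] + [existing["alleles"]]
    store[sample_id] = {"alleles": alleles, "history": history}
    return "replaced"
```

In a real run the action would come from the environment, for example os.environ.get("CHEWBBACA_DUPLICATE_ACTION", "skip").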

Supplementary metadata

Supplementary metadata (postcode, hospital, date, LIMS ID, etc.) can be attached to samples at pipeline time via a CSV file:

python scripts/main.py \
  --credentials credentials.json \
  --supplementary_metadata supplementary_metadata.csv \
  --profile staphylococcus_aureus

The CSV must contain a sample column (matching the sample ID from Bonsai or the TSV) and one or more metadata columns:

sample,lims_id,PostCode,Hospital,Date,Latitude,Longitude
SAMPLE_001,lims_001,71131,Örebro Universitetssjukhus,2025-03-05,,
SAMPLE_002,lims_002,,,,52.2053,0.1218

Supported metadata columns: PostCode, Hospital, Date, Latitude, Longitude. All are optional — include only the columns you have data for. If Latitude is provided, Longitude must also be provided (and vice versa); samples with only one coordinate are skipped with a warning. Samples with valid coordinates are plotted on the map even when no postcode or hospital is available.
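The coordinate rule can be checked before the file is handed to the pipeline. This is a sketch of that one rule only, assuming the column names shown above; split_by_coordinates is a hypothetical helper and does not validate coordinate ranges or the other columns.

```python
import csv
import io
import logging

logger = logging.getLogger("supplementary_metadata")


def split_by_coordinates(handle):
    """Separate rows into kept rows and IDs skipped for a lone coordinate."""
    kept, skipped = [], []
    for row in csv.DictReader(handle):
        lat = (row.get("Latitude") or "").strip()
        lon = (row.get("Longitude") or "").strip()
        # Exactly one of the two coordinates present: skip with a warning.
        if bool(lat) != bool(lon):
            logger.warning("sample %s has only one coordinate", row["sample"])
            skipped.append(row["sample"])
        else:
            kept.append(row)
    return kept, skipped
```

Rows with both coordinates or with neither are kept; only rows with a single coordinate end up in the skipped list.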

To prepare the file, generate a template with sample IDs pre-populated from Bonsai:

python scripts/prepare_supplementary_metadata.py \
  --credentials credentials.json \
  --output ./metadata_templates/ \
  --profile staphylococcus_aureus

The template contains sample and lims_id columns filled in from Bonsai, plus empty columns for PostCode, Hospital, Date, Latitude, and Longitude. Fill in the relevant fields manually before passing the file to the pipeline.

Supplementary metadata can also be added or corrected after import via the Samples page in the dashboard (see Pending Samples for pre-registering metadata before samples arrive).

Command-line reference

All commands are run from the repository root.

Option
    Description

--credentials PATH
    Path to a JSON file containing bonsai_username, bonsai_password, mimosa_username, mimosa_password. Required when not using environment-variable credentials.

--supplementary_metadata PATH
    Path to a CSV file with additional metadata (PostCode, Hospital, Date, etc.) keyed on sample ID. See Supplementary metadata above.

--profile PROFILE [...]
    One or more profiles to process. Defaults to all profiles. Pass all to process every profile explicitly.

--groups GROUP [...]
    Restrict processing to specific Bonsai groups.

--bonsai true|false
    Whether to fetch from Bonsai. Default: true.

--chewbbaca PATH
    Path to a chewBBACA TSV file, directory, or CSV manifest. Repeatable.

--chewbbaca_profile PROFILE
    Analysis profile for the corresponding --chewbbaca input. Repeatable.

--update-only
    Sync metadata only; skip clustering.

--re-cluster
    Force clustering even when no new samples are detected.

--run-similarity
    Run similarity analysis after clustering.

--exclude-samples ID [...]
    Sample IDs to exclude for this run. A single value may be a file path with one ID per line. In interactive mode you will be offered the option to persist new exclusions to the database. See Exclusion List.

--exclude-groups ID [...]
    Group IDs to exclude for this run (file path also accepted). Same persistence prompt as --exclude-samples.

--output DIR
    Directory to save intermediate files. Requires --save_files.

--save_files
    Save intermediate files to --output.

--debug
    Enable verbose debug logging.

--email [ADDRESS]
    Send a failure-alert email. Without an address, sends to the authenticated user.
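The "single value may be a file path" rule for --exclude-samples and --exclude-groups could be implemented along these lines. This is a sketch under the assumption that a lone value is treated as a file only when such a file exists; expand_exclusions is a hypothetical name.

```python
from pathlib import Path


def expand_exclusions(values):
    """Return the IDs to exclude, reading them from a file when one is given."""
    # Assumption: a single argument that names an existing file is read
    # as one ID per line; anything else is taken as literal IDs.
    if len(values) == 1 and Path(values[0]).is_file():
        lines = Path(values[0]).read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return list(values)
```

Blank lines in the file are ignored, and several values on the command line are always treated as IDs.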

Examples

Process all profiles using Bonsai:

python scripts/main.py --credentials credentials.json --profile all

Process a single profile without Bonsai (chewBBACA-only):

python scripts/main.py \
  --credentials credentials.json \
  --profile staphylococcus_aureus \
  --bonsai false

Import a local TSV and cluster alongside Bonsai:

python scripts/main.py \
  --credentials credentials.json \
  --chewbbaca /data/results_alleles.tsv \
  --chewbbaca_profile staphylococcus_aureus

Force re-clustering for all profiles:

python scripts/main.py --credentials credentials.json --re-cluster