Classification approach
See Analysis Results for classification metrics, per-sample plots, and feature effect sizes from the chr21 cohort.
Problem framing
The task is binary classification of plasma cell-free DNA (cfDNA) samples as ALS or healthy control, using features extracted from bisulfite-sequenced paired-end BAM files. The cohort is 22 samples (12 ALS, 10 CTRL), with features extracted from chromosome 21 only. This is a high-dimensional, small-n setting: the feature space (256 motif frequencies + 10 scalar statistics = 266 features at default k=4) is an order of magnitude larger than the sample count, and chr21 represents a small fraction of the genome. L2 regularisation shrinks uninformative features toward zero, making the full vocabulary tractable without explicit feature selection.
Feature set
Three feature classes are extracted, each motivated by established cfDNA biology.
End-motif frequencies
cfDNA fragments are generated by nuclease cleavage at preferred sequence contexts. The 5′ end of each fragment (the "end motif") reflects the nuclease responsible: DNase I, caspase-activated DNase (CAD), and others have distinct k-mer preferences. Disease states may alter cfDNA nuclease activity or cell-death processes, which can shift end-motif spectra. All 4^k possible k-mers are counted per sample, normalised to frequencies, and the full vocabulary enters the feature matrix. With default k=4 this gives 256 features; k=8 gives 65,536 (each sample ~500 KB). The full 4^k vocabulary is generated deterministically from k alone via itertools.product('ACGT', repeat=k), with no dependency on training data. L2 regularisation suppresses uninformative motifs by shrinking their weights toward zero, replacing the need for explicit selection.
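A minimal sketch of this scheme, assuming the fragment 5′-end sequences are already available as strings (the helper names here are illustrative, not the pipeline's API; the actual counting happens in the extraction core):

```python
from collections import Counter
from itertools import product


def motif_vocabulary(k: int) -> list[str]:
    """Full 4^k end-motif vocabulary, deterministic from k alone (no training data)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def motif_frequencies(five_prime_ends: list[str], k: int = 4) -> dict[str, float]:
    """Normalised k-mer frequencies over one sample's fragment 5' ends."""
    counts = Counter(s[:k].upper() for s in five_prime_ends if len(s) >= k)
    total = sum(counts.values()) or 1          # guard against an empty sample
    return {m: counts.get(m, 0) / total for m in motif_vocabulary(k)}
```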
Fragment length features
These features constitute the standard cfDNA liquid biopsy feature set and were included as specified for this task.
Plasma cfDNA fragments reflect the chromatin packaging state of their cell-of-origin tissue. Nucleosomes protect ~147 bp of DNA; the canonical mono-nucleosomal fragment runs 147–200 bp (peak at ~167 bp). Fragments shorter than ~120 bp are consistent with nucleosome-free regions (NFRs) occupied by transcription factors or other DNA-binding proteins. The di-nucleosomal peak falls around 280–400 bp.
Disease alters the tissue-of-origin mixture in plasma cfDNA, shifting the relative abundances of these populations. Five features are computed from the length histogram:
| Feature | Definition | Rationale |
|---|---|---|
| `fl_mean` | Mean fragment length | Shifts with nucleosomal occupancy |
| `fl_std` | Standard deviation of fragment length | Captures distribution breadth |
| `fl_frac_subnucleosomal` | Fraction with length < 120 bp | Subnucleosomal fragments can be enriched near open chromatin or TF-bound regions |
| `fl_frac_nucleosomal` | Fraction with length 120–200 bp | Mono-nucleosomal occupancy |
| `fl_ratio_mono_di` | Count(120–200 bp) / Count(280–400 bp) | Nucleosomal spacing signal; sensitive to chromatin remodelling |
fl_ratio_mono_di is a well-established discriminating feature in cfDNA liquid biopsy work; it is the strongest single discriminator in this dataset (effect size 0.84, see Results).
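A minimal sketch of these five statistics, assuming absolute fragment lengths are already collected into a NumPy array; the inclusive/exclusive boundary conventions below are assumptions, not necessarily those of the extraction code:

```python
import numpy as np


def fragment_length_features(lengths: np.ndarray) -> dict[str, float]:
    """Five length-distribution summaries from one sample's absolute fragment lengths (bp)."""
    lengths = lengths[lengths > 0]
    mono = np.sum((lengths >= 120) & (lengths <= 200))      # mono-nucleosomal window
    di = np.sum((lengths >= 280) & (lengths <= 400))        # di-nucleosomal window
    return {
        "fl_mean": float(lengths.mean()),
        "fl_std": float(lengths.std()),
        "fl_frac_subnucleosomal": float(np.mean(lengths < 120)),
        "fl_frac_nucleosomal": float(np.mean((lengths >= 120) & (lengths <= 200))),
        "fl_ratio_mono_di": float(mono / di) if di else float("nan"),
    }
```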
CpG methylation features
Bisulfite treatment converts unmethylated cytosines to uracil, read as thymine after amplification; Bismark records the per-base methylation call in the XM auxiliary tag. The C++ core accumulates per-genomic-position counts (methylated reads / unmethylated reads) for every covered CpG site. Five summary statistics are derived from this per-site distribution:
| Feature | Definition | Rationale |
|---|---|---|
| `methylation_mean` | Coverage-weighted mean | Weights high-confidence sites proportionally; more robust than an equal-weight mean across sites with heterogeneous depth |
| `methylation_entropy` | Mean binary entropy across sites: mean(−p log2(p) − (1−p) log2(1−p)) | Epipolymorphism: high entropy at a site means a mix of methylated and unmethylated reads, indicating epigenetic heterogeneity. Elevated epipolymorphism is associated with cancer and neurodegeneration |
| `methylation_frac_high` | Fraction of sites with rate > 0.8 | Strongly methylated loci |
| `methylation_frac_low` | Fraction of sites with rate < 0.1 | Strongly unmethylated loci |
| `methylation_median` | 50th percentile of per-site rates | Less sensitive to outlier sites than the mean |
The coverage-weighted mean is used because a site supported by 2 reads contributes the same to an equal-weight mean as a site supported by 50 reads, yet the former has far higher binomial noise. Coverage weighting is equivalent to estimating the methylation rate across all sequenced CpG positions, which is the more natural estimator.
Entropy is preferred to variance: the variance of per-site rates treats sites near 0 and near 1 symmetrically, whereas entropy peaks for sites at 0.5 methylation and falls to zero for sites that are uniformly methylated or unmethylated. This makes it a more targeted measure of epipolymorphism.
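A minimal sketch of the five summaries, assuming the per-CpG methylated/unmethylated read counts produced by the C++ core are available as parallel arrays (the function name is illustrative, not the pipeline's API):

```python
import numpy as np


def methylation_features(meth: np.ndarray, unmeth: np.ndarray) -> dict[str, float]:
    """Five summaries over per-site methylation rates, as described above."""
    cov = meth + unmeth
    meth, cov = meth[cov > 0], cov[cov > 0]           # covered CpG sites only
    p = meth / cov                                     # per-site methylation rate
    q = np.clip(p, 1e-12, 1 - 1e-12)                   # avoid log(0) at fully (un)methylated sites
    entropy = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    return {
        "methylation_mean": float(np.average(p, weights=cov)),   # coverage-weighted
        "methylation_entropy": float(entropy.mean()),
        "methylation_frac_high": float(np.mean(p > 0.8)),
        "methylation_frac_low": float(np.mean(p < 0.1)),
        "methylation_median": float(np.median(p)),
    }
```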
Classifier
Algorithm: L2-penalised logistic regression (sklearn.linear_model.LogisticRegression).
Justification:
- With n=22 samples and 266 features (at default k=4), non-linear methods (SVM-RBF, random forest) have more free parameters and are prone to overfitting. L2 logistic regression is the natural choice for a linearly separable, high-dimensional, small-n problem: it has a single hyperparameter (C), is convex, and produces calibrated probabilities.
- The L2 penalty shrinks all feature weights uniformly toward zero, which is appropriate when the features are z-score standardised and there is no strong prior reason to zero out a subset (as L1 would). In the absence of external feature selection, uniform shrinkage is conservative.
- Interpretability: the learned weight vector directly indicates which features (motifs, fragment length fractions, methylation statistics) discriminate the two classes, which is useful for biological interpretation.
Regularisation strength (C): selected per fold via inner 5-fold GridSearchCV over C ∈ {0.01, 0.1, 1.0, 10.0}. A fixed C chosen from the full dataset would constitute hyperparameter leakage; tuning inside each fold prevents the test sample from influencing model selection.
Feature standardisation: StandardScaler (zero mean, unit variance) is fit on training samples only and applied to both train and test within each fold. This is required because end-motif frequencies (~0.01–0.05) and fragment length ratios (~2–10) are on very different scales; standardisation ensures the L2 penalty treats all features comparably.
Missing value handling: np.nan_to_num replaces any NaN (which arises if a feature was absent for a sample, e.g. no CpG coverage) with zero post-scaling. This is equivalent to imputing the training-set mean for the affected sample.
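A toy illustration of the scaling/imputation order and of why zero after z-scoring equals training-mean imputation (the arrays are fabricated; only the order of operations matters):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.01, 2.0], [0.03, 8.0], [0.02, 5.0]])   # toy training fold (2 features)
X_test = np.array([[0.02, np.nan]])                            # held-out sample missing feature 2

scaler = StandardScaler().fit(X_train)                 # fit on the training fold only
X_train_s = np.nan_to_num(scaler.transform(X_train))
X_test_s = np.nan_to_num(scaler.transform(X_test))     # NaN -> 0, i.e. the training-fold mean
```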
Cross-validation methodology
Leave-one-out CV (LOO-CV): with n=22, standard k-fold CV would leave only ~4 samples per fold for testing, producing noisy estimates. LOO-CV uses n−1 = 21 samples for training in each fold and evaluates on the single held-out sample, yielding 22 test predictions that collectively cover the entire dataset. This is the standard approach for cohorts of this size in cfDNA literature.
Leakage prevention: two sources of leakage are guarded against:
1. Feature scaling: StandardScaler fit on training fold only (not whole dataset).
2. Hyperparameter selection: C chosen by inner cross-validation on the training fold. The held-out sample is invisible to both scaling and model selection.
End-motif vocabulary is the full theoretical 4^k set — it does not depend on training data, so there is no leakage from this source.
Inner CV configuration: StratifiedKFold with n_splits = max(2, min(5, n_train // 2)) preserves the ALS/CTRL ratio in each inner fold and degrades gracefully when the update command is called with small caches.
Model persistence and deployment
The train subcommand fits a final L2 logistic regression on all labelled samples (not LOO-CV) and serialises a ModelBundle to <out-dir>/model.pkl:
```python
from typing import TypedDict

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


class ModelBundle(TypedDict):
    model: LogisticRegression   # fitted classifier
    scaler: StandardScaler      # fitted scaler (applied at predict time)
    k: int                      # motif length used during training
    flat_keys: list[str]        # non-motif feature names, for alignment
    contig: str                 # genomic region used for extraction
    n_training_samples: int     # audit field
```
The predict subcommand loads this bundle and classifies a single new BAM. train always runs LOO-CV and saves the model in one step, so the evaluation report and the deployable model are produced together. Feature extraction uses the stored k and contig, and the stored flat_keys list ensures the feature vector is aligned with the training matrix regardless of which scalar features are present in the new sample.
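A sketch of the predict flow, under the assumption that the feature matrix is laid out as the 4^k motif columns followed by flat_keys (the column order, helper name, and class encoding are illustrative):

```python
import pickle
from itertools import product

import numpy as np


def predict_sample(bundle_path: str, features: dict[str, float]) -> float:
    """Align one extracted feature dict to the training layout and return the positive-class probability."""
    with open(bundle_path, "rb") as fh:
        bundle = pickle.load(fh)
    motif_cols = ["".join(p) for p in product("ACGT", repeat=bundle["k"])]
    columns = motif_cols + bundle["flat_keys"]               # assumed training column order
    x = np.array([[features.get(c, np.nan) for c in columns]])
    x = np.nan_to_num(bundle["scaler"].transform(x))         # stored scaler, then NaN -> 0
    return float(bundle["model"].predict_proba(x)[0, 1])     # assumes ALS encoded as class 1
```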
The update subcommand enables incremental training: it appends a new labelled sample to the companion feature cache (model.pkl.features.json) and retrains the model from scratch on all cached data. This is full retraining from stored feature vectors — LogisticRegression does not support partial update — but retraining is fast at the cohort sizes encountered in cfDNA research (seconds on CPU). Extracted feature vectors are cached so BAMs do not need to be re-processed when new samples are added.
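A sketch of the update flow, assuming the companion cache is a JSON list of {"features": {...}, "label": 0/1} records (the cache schema and helper name are assumptions, not the tool's documented format):

```python
import json
import pickle
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def update_model(bundle_path: str, new_features: dict[str, float], label: int) -> None:
    """Append one labelled sample to the feature cache, then retrain from all cached vectors."""
    cache_path = bundle_path + ".features.json"          # companion cache next to model.pkl
    with open(cache_path) as fh:
        cache = json.load(fh)                            # assumed: list of {"features", "label"}
    cache.append({"features": new_features, "label": label})
    with open(cache_path, "w") as fh:
        json.dump(cache, fh)

    with open(bundle_path, "rb") as fh:
        bundle = pickle.load(fh)
    columns = ["".join(p) for p in product("ACGT", repeat=bundle["k"])] + bundle["flat_keys"]
    X = np.array([[s["features"].get(c, np.nan) for c in columns] for s in cache])
    y = np.array([s["label"] for s in cache])

    scaler = StandardScaler().fit(X)                     # full retrain: no partial update for LogisticRegression
    X_s = np.nan_to_num(scaler.transform(X))
    model = LogisticRegression(penalty="l2", max_iter=1000).fit(X_s, y)
    bundle.update(model=model, scaler=scaler, n_training_samples=len(y))
    with open(bundle_path, "wb") as fh:
        pickle.dump(bundle, fh)
```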
Known limitations
- Chr21 only: chromosome 21 is ~1.5% of the autosomal genome. Extending the analysis to more regions may substantially increase discriminative power.
- Sample size: n=22 is too small to estimate generalisation error reliably regardless of CV strategy. LOO-CV is approximately unbiased but has high variance. Results should be interpreted as a proof-of-concept. No held-out test set was available.
- Bisulfite-specific caveats: bisulfite conversion (C->T) collapses the cytosine channel, making end-motif distributions T/A-rich and relatively uniform across samples. This reduces the effective information content of the end-motif features compared to non-bisulfite cfDNA assays.
- Fragment length via TLEN: fragment length is derived from the `TLEN`/`isize` field set by the aligner. This field may not accurately reflect the true insert size and its use is discouraged by the htslib team. A more precise estimate would derive insert size from `PNEXT + cigar_ref_length(MC tag) − POS`, but this relies on the `MC` tag being present, which Bismark-aligned BAMs may not populate. `TLEN` is used here as a pragmatic approximation that is consistent across all samples in the cohort; a sketch of the MC-based alternative follows this list.
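For reference, a minimal sketch of the MC-based alternative (a hypothetical helper using pysam, not part of this pipeline), valid for the leftmost read of an FR-oriented pair:

```python
import re

import pysam

_REF_CONSUMING = re.compile(r"(\d+)([MDN=X])")   # CIGAR operations that consume reference bases


def insert_size_from_mc(read: pysam.AlignedSegment) -> int | None:
    """PNEXT + cigar_ref_length(MC) - POS; returns None when the MC tag is absent."""
    if not read.has_tag("MC"):
        return None
    mate_ref_len = sum(int(n) for n, _ in _REF_CONSUMING.findall(read.get_tag("MC")))
    return read.next_reference_start + mate_ref_len - read.reference_start
```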
Implementation
The full LOO-CV implementation, annotated:
- All scaling and hyperparameter tuning happens inside this function; the caller never sees individual folds.
- The full 4^k vocabulary is generated once before the outer loop and shared across all folds. No training data is involved; the vocabulary is deterministic from `motif_k` alone.
- Outer LOO loop: 22 iterations, each holding out exactly one sample. `LeaveOneOut().split()` yields index arrays, not data.
- Leakage guard 1: `fit_transform()` on training data; `transform()` only on the test sample. The held-out mean and variance do not influence scaling parameters.
- Leakage guard 2: `GridSearchCV` selects `C` by inner CV on training data only. The held-out sample is invisible to model selection and scaling.
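A minimal sketch consistent with these annotations (feature extraction is omitted; the function name, solver settings, and random seed are illustrative, not necessarily the pipeline's exact code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut, StratifiedKFold
from sklearn.preprocessing import StandardScaler


def loo_cv_predictions(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One held-out prediction per sample; scaling and C selection happen inside each fold."""
    preds = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):              # 22 folds for n = 22
        scaler = StandardScaler()
        X_tr = np.nan_to_num(scaler.fit_transform(X[train_idx]))    # guard 1: fit on train only
        X_te = np.nan_to_num(scaler.transform(X[test_idx]))
        n_splits = max(2, min(5, len(train_idx) // 2))
        inner = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        search = GridSearchCV(                                       # guard 2: C tuned on train only
            LogisticRegression(penalty="l2", max_iter=1000),
            param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
            cv=inner,
        )
        search.fit(X_tr, y[train_idx])
        preds[test_idx] = search.predict(X_te)
    return preds
```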