Architecture
The pipeline has two layers connected by the RegionMetrics boundary type. Everything upstream of the boundary is C++; everything downstream is pure Python.
cfextract (C++ / pybind11)
│ src/core/core.{cpp,hpp} — RegionMetrics struct; extract_metrics() entry point
│ src/core/access.{cpp,hpp} — safe BAM/index file handle
│ src/core/stats.{cpp,hpp} — binned data -> summary stats
│ src/core/extract.{cpp,hpp} — per-read motif and methylation accumulators
│ src/bindings/python/ — pybind11 bindings for Python use
│
│ ── RegionMetrics boundary ──────────────────────────────────────────────────
│
cfclassify (Python)
cfclassify/types.py — Data schemas (Features, Sample, ModelBundle)
cfclassify/features.py — metrics_to_features()
cfclassify/classify.py — LOO-CV, final model training, inference
cfclassify/model.py — model bundle and feature cache I/O
cfclassify/plots.py — matplotlib figures
cfclassify/__main__.py — CLI entry point (train/predict/update)
The boundary is deliberately narrow. RegionMetrics carries only compact, pre-reduced data: histograms (fixed-size arrays and small maps) and 5-element stats structs. Raw read data does not cross the C++/Python interface.
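As a reference for what crosses the boundary, here is a minimal Python-side sketch of the `RegionMetrics` shape described above. Field names follow this document; the attributes actually exposed by the pybind11 bindings may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical Python-side view of the RegionMetrics boundary type.
# Sizes in comments follow the text; this is a sketch, not the binding.
@dataclass
class RegionMetricsSketch:
    end_motifs: Dict[str, int]   # <= 256 4-mer counts
    frag_len_hist: List[int]     # fixed 1001 bins
    fl_stats: List[float]        # 5-element fragment-length summary
    methylation: List[float]     # 5-element methylation summary
    start_pos_hist: List[int]    # one bin per 100 kbp of contig
    end_pos_hist: List[int]
```

Everything here is pre-reduced: no per-read data appears in any field.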
Python bindings are provided so that the extracted features can be fed into common frameworks for downstream analysis, but the C++ core could equally be driven from R, Julia, or any other language with a C FFI. When processing whole-genome BAMs, I/O throughput will likely dominate total runtime. Processing alignment data directly in C++ with htslib is significantly faster than using e.g. pysam, which incurs unavoidable Python overhead per record. This has not been benchmarked explicitly, but the speedup is likely in the tens, i.e. roughly an order of magnitude.
Core Extractor Scalability
Time complexity
Per extract_metrics() call: O(R × L), where R is the number of reads in the region and L the read length. The inner per-read cost is O(1) for the end-motif lookup and O(L) for walking the CIGAR to parse the Bismark XM methylation tag. Post-extraction computation is O(N log N) for the median over the N covered CpG sites and O(1) for fragment-length statistics over the fixed 1001-bin histogram. Both are negligible relative to the BAM streaming loop.
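As a sanity check on the dominant term, a tiny cost model with illustrative (not measured) numbers:

```python
# Rough operation count for one extract_metrics() call: the per-read
# work is dominated by the O(L) CIGAR/XM walk, so total inner-loop
# work is ~R * L. The numbers below are illustrative, not measured.
def inner_loop_ops(n_reads: int, read_len: int) -> int:
    return n_reads * read_len

# e.g. one million 150 bp reads -> 1.5e8 per-base steps, dwarfing the
# O(N log N) median and the fixed-size histogram passes that follow.
```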
Memory model
Peak runtime memory (inside extract_metrics()):
The cpg_sites accumulation map is the dominant runtime allocation. The payload stored per CpG site is 24 bytes; accounting for the overhead of the map itself, that grows to approximately 56 bytes per CpG. The expected memory usage for e.g. ~5 million covered CpG sites is therefore 5M × 56 B ≈ 280 MB. On the subsampled Li data used to build this pipeline, memory usage peaks at about 140 KB for a typical chr21 alignment. This memory is freed before the function returns; the Python side never has to handle it.
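The estimate above is simple enough to keep as a reusable back-of-envelope check; the constant is the ~56 bytes/CpG figure from the text (24 B payload plus map overhead).

```python
# Back-of-envelope estimate of the cpg_sites map footprint,
# using the ~56 bytes/CpG figure from the text.
BYTES_PER_CPG = 56  # 24 B payload + ~32 B map overhead per entry

def cpg_map_bytes(n_covered_sites: int) -> int:
    return n_covered_sites * BYTES_PER_CPG

# ~5 million covered CpGs -> 280 MB, matching the estimate above.
```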
Memory at the C++/Python boundary (returned RegionMetrics):
| Structure | Size |
|---|---|
| `end_motifs` | ≤ 256 entries (all 4⁴ = 256 4-mers), ~8 KB |
| `frag_len_hist` | 1001 × 8 B ≈ 8 KB, fixed |
| `fl_stats` | 5 doubles = 40 B |
| `methylation` | 5 doubles = 40 B |
| `start_pos_hist` / `end_pos_hist` | O(contig_len / 100 kbp): ~4 KB for chr21 |
Python receives only the compact RegionMetrics struct regardless of coverage depth, i.e. only ~20 KB is serialised across the C++/Python boundary per call.
Measured performance
End-to-end cfextract.extract_features() across 22 chr21 BAMs on an Apple M-series machine, using a Release build of this pipeline, gave the following results:
| Metric | Mean ± std | Median |
|---|---|---|
| Wall time | 0.10 ± 0.00 s | 0.09 s |
| Peak RSS increase | 5.9 ± 0.3 MB | 5.8 MB |
Reproduce with `python scripts/bench_extract.py --repeat 3`.
Extending the tool
Adding a new feature to the pipeline requires changes in exactly two places:
- C++ side (`src/core/core.hpp` and `src/core/core.cpp`): add the new field to `RegionMetrics` and compute its final value inside `extract_metrics()`. If the feature requires per-read accumulation, add a helper in `src/core/extract.{hpp,cpp}` (alongside `accumulate_motifs` and `accumulate_methylation`) and call it from the per-read loop in `extract_metrics()`.
- Python side (`cfclassify/features.py`, `metrics_to_features()`): read the new attribute from the `RegionMetrics` object and write the derived value(s) into the `Features` dict. If the result is a flat numeric scalar (like `fl_mean` or `methylation_entropy`), `build_feature_matrix()` in `classify.py` picks it up automatically; no further changes are needed. If it is a histogram or dict (like `end_motifs`), add explicit handling to `build_feature_matrix()` as well.
No changes are needed to the classification, model I/O, or plotting infrastructure.
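A hypothetical sketch of the Python half of the recipe above, reading a new scalar off the `RegionMetrics` object inside `metrics_to_features()`. The field name `gc_content` is illustrative only, not part of the codebase:

```python
# Sketch only: `gc_content` is a hypothetical new RegionMetrics field,
# and the Features dict layout here is simplified.
def metrics_to_features(metrics) -> dict:
    features = {
        "fl_mean": metrics.fl_stats[0],  # existing scalar feature (example)
    }
    # New flat scalar: build_feature_matrix() picks it up automatically,
    # so this single line is the entire Python-side change.
    features["gc_content"] = metrics.gc_content
    return features
```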
Engineering limitations
- Model storage: `ModelBundle` is serialised with `joblib` (pickle). There is no versioning, schema migration, or compatibility guarantee across Python or sklearn releases; a bundle saved with one sklearn version may not load cleanly after a major upgrade.
- Model scalability: the extractor scales to large cohorts (I/O-bound, memory bounded by region size), but the classifier does not. `LogisticRegression` has no `partial_fit`, so `update` always retrains from scratch on all cached feature vectors. This is fast at cfDNA cohort sizes (seconds on CPU) but would not scale to thousands of samples without switching to a solver that supports incremental updates.
- Single-threaded extraction: `extract_metrics()` processes one BAM serially. The CLI uses `joblib` to parallelise across samples at the process level, but per-BAM throughput is single-threaded. This is sufficient for chr21 BAMs (~0.1 s each) but would be a bottleneck for whole-genome runs without intra-file parallelism.
- Single target region: feature extraction currently takes a single target region, and the Python CLI simply defaults to chr21. Targeting multiple regions or the whole genome requires multiple `extract_metrics` calls from the Python side. A future implementation could accumulate data from multiple target regions efficiently using htslib's existing multi-region iterator infrastructure.
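Until multi-region support exists, the workaround is a simple loop on the Python side. In the sketch below, `extract_fn` stands in for `cfextract.extract_metrics`, whose exact signature (BAM path plus region string) is assumed here, not verified against the bindings:

```python
# Sketch of the single-region workaround: call the extractor once per
# region and collect the results keyed by region name. `extract_fn` is
# a stand-in for cfextract.extract_metrics (signature assumed).
def extract_many(extract_fn, bam_path, regions):
    return {region: extract_fn(bam_path, region) for region in regions}

# usage (hypothetical):
#   extract_many(cfextract.extract_metrics, "sample.bam", ["chr21", "chr22"])
```

This keeps each call's memory bounded by one region, at the cost of re-opening and re-seeking the BAM per region; the htslib multi-region iterator would avoid that overhead.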