Python API — cfclassify
All public symbols live in the `cfclassify` package. Import paths are shown for each entry.
Types — cfclassify.types
Three TypedDict classes define the data schema flowing through the Python layer.
```python
from typing import TypedDict

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


class Features(TypedDict, total=False):
    end_motifs: dict[str, float]     # normalised k-mer frequencies
    frag_len_hist: list[int]         # 1001-bin histogram (0–1000 bp)
    fl_mean: float
    fl_std: float
    fl_frac_subnucleosomal: float    # fraction < 120 bp
    fl_frac_nucleosomal: float       # fraction 120–200 bp
    fl_ratio_mono_di: float          # mono / di-nucleosomal count ratio
    methylation_mean: float          # coverage-weighted mean
    methylation_entropy: float       # mean binary entropy across sites
    methylation_frac_high: float     # fraction of sites with rate > 0.8
    methylation_frac_low: float      # fraction of sites with rate < 0.1
    methylation_median: float
    start_pos_hist: dict[int, int]   # 100 kbp bins, for plotting only
    end_pos_hist: dict[int, int]


class Sample(TypedDict):
    sample_id: str
    label: str
    features: Features


class ModelBundle(TypedDict):
    model: LogisticRegression
    scaler: StandardScaler
    k: int                      # motif length used during training
    flat_keys: list[str]        # non-motif feature names, for vector alignment
    contig: str
    n_training_samples: int
```
Features is total=False: every key is optional. Missing data (e.g. no CpG coverage in a region) is represented by the absence of the corresponding key — not by a NaN value. This keeps downstream code honest: a consumer must explicitly handle the absent-key case rather than silently propagating NaN.
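For example, a consumer reading a methylation feature branches on key presence rather than testing for NaN. A minimal sketch (the helper name and default are illustrative, not part of the package):

```python
def mean_methylation_or_default(features: Features, default: float = 0.0) -> float:
    # Absent key means no data (e.g. no CpG coverage), so handle it explicitly.
    if "methylation_mean" not in features:
        return default
    return features["methylation_mean"]
```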
Sample wraps a single processed BAM file.
ModelBundle holds everything needed to classify a new sample: the fitted model, fitted scaler, k-mer length k, the ordered list of non-motif feature names flat_keys (for vector alignment at inference time), and the contig and sample count used during training.
Feature conversion — cfclassify.features
metrics_to_features() is the sole extension point for adding new features to the pipeline. When a new field is added to RegionMetrics on the C++ side:
- Add the conversion logic here.
- If the new field is a flat numeric scalar, `build_feature_matrix()` in `classify.py` picks it up automatically; no other changes are needed (see the sketch below).
- If the new field is a histogram or dict (like `end_motifs`), add explicit handling to both `metrics_to_features()` and `build_feature_matrix()`.
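As an illustration, suppose RegionMetrics gained a scalar field `gc_content` (invented here purely as an example). The conversion follows the absent-key convention from the Types section, and the new key would also be declared in `Features`:

```python
# Hypothetical sketch: converting an invented scalar field `gc_content`.
# A matching `gc_content: float` entry would be added to the Features TypedDict.
def metrics_to_features(metrics) -> Features:
    features: Features = {}
    # ... existing conversions for motifs, fragment lengths, methylation ...
    if metrics.gc_content is not None:   # absent data: omit the key, never store NaN
        features["gc_content"] = metrics.gc_content
    return features
```

Because `gc_content` is a flat scalar, `build_feature_matrix()` then picks it up with no further changes.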
Classification — cfclassify.classify
```python
def generate_vocabulary(k: int) -> list[str]: ...
def run_loo_cv(samples: list[Sample], motif_k: int = 4) -> dict[str, object]: ...
def train_final_model(samples: list[Sample], motif_k: int = 4) -> tuple[...]: ...
def predict_sample(sample_features: Features, model, scaler, flat_keys, motif_k: int = 4) -> tuple[str, float]: ...
def build_feature_matrix(samples: list[Sample], motifs: list[str]) -> tuple[...]: ...
```
generate_vocabulary(k)
Returns the full sorted k-mer vocabulary as a list[str] with 4^k entries over {A, C, G, T}, generated via itertools.product. This is the vocabulary used as classifier features — deterministic from k alone, no training data required.
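The described behaviour reduces to a one-liner; a sketch consistent with that description:

```python
from itertools import product

def generate_vocabulary(k: int) -> list[str]:
    # 4**k motifs over {A, C, G, T}; product() over a sorted alphabet
    # already yields lexicographic order, so no explicit sort is needed.
    return ["".join(p) for p in product("ACGT", repeat=k)]

assert len(generate_vocabulary(4)) == 256  # 4**4
```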
run_loo_cv(samples, motif_k=4)
Runs leave-one-out cross-validation and returns a dict:
```python
{
    "predictions": list[str],  # one label per sample, in input order
    "labels": list[str],       # ground-truth labels in the same order
    "report": str,             # sklearn classification_report string
}
```
The two leakage guards (scaling, hyperparameter search) are applied inside the fold loop. See Classification — Implementation for an annotated walkthrough.
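A minimal sketch of the fold-loop shape implied above (hyperparameter search elided; settings like `max_iter` are illustrative, not the package's actual configuration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler

def loo_predictions(X, y):
    preds = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Leakage guard: fit the scaler on the fold's training rows only.
        scaler = StandardScaler().fit(X[train_idx])
        model = LogisticRegression(max_iter=1000).fit(
            scaler.transform(X[train_idx]), y[train_idx]
        )
        preds.append(model.predict(scaler.transform(X[test_idx]))[0])
    return preds
```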
train_final_model(samples, motif_k=4)
Trains a final L2 logistic regression on all samples (not LOO-CV). Returns (model, scaler, flat_keys) — the three components needed to build a ModelBundle for deployment.
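Hypothetical usage assembling the returned components into a ModelBundle (the contig value is a placeholder for whichever contig the features were extracted from):

```python
model, scaler, flat_keys = train_final_model(samples, motif_k=4)
bundle: ModelBundle = {
    "model": model,
    "scaler": scaler,
    "k": 4,                              # must match the motif_k used above
    "flat_keys": flat_keys,
    "contig": "chr21",                   # placeholder contig
    "n_training_samples": len(samples),
}
```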
predict_sample(sample_features, model, scaler, flat_keys, motif_k=4)
Classifies a single Features dict using a pre-trained model bundle. Returns (label, probability) where probability is the confidence for the predicted class.
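Hypothetical usage with a saved bundle (`model.joblib` is an illustrative path):

```python
bundle = load_model("model.joblib")
label, prob = predict_sample(
    new_sample_features,     # a Features dict from metrics_to_features()
    bundle["model"],
    bundle["scaler"],
    bundle["flat_keys"],
    motif_k=bundle["k"],
)
print(f"predicted {label} (confidence {prob:.2f})")
```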
build_feature_matrix(samples, motifs)
Builds (X, y, feature_names) from a sample list and an explicit motif list. Flat numeric keys in Features (all fields except end_motifs, frag_len_hist, start_pos_hist, end_pos_hist) are appended automatically. Call this directly if you want to inspect the feature matrix outside the LOO loop.
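For example, to inspect the matrix directly (shapes assume motif_k = 4, i.e. 256 motif columns with the flat keys appended after them):

```python
motifs = generate_vocabulary(4)
X, y, feature_names = build_feature_matrix(samples, motifs)
print(X.shape)             # (n_samples, 256 + number of flat keys)
print(feature_names[:3])   # leading motif columns: AAAA, AAAC, AAAG
```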
Plots — cfclassify.plots
All plot functions return a matplotlib.figure.Figure. Pass out_path to save to disk, or omit it to get the figure object for further manipulation.
| Function | Output | Key design notes |
|---|---|---|
| `plot_end_motifs(samples, top_n, out_path)` | Grouped bar chart | Top-N motifs ranked by mean frequency across all samples; SEM error bars |
| `plot_frag_lengths(samples, out_path)` | Line plot | Normalised per-sample before averaging; reference lines at 147 bp and 200 bp |
| `plot_methylation(samples, out_path)` | Bar + scatter | Bars show group means; scatter overlays individual sample points with jitter |
| `plot_position_dist(samples, feature_key, title, out_path)` | Line plot | `feature_key` is `"start_pos_hist"` or `"end_pos_hist"`; x-axis in Mbp |
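Hypothetical usage showing both calling styles (file names are illustrative):

```python
# Save directly to disk via out_path...
plot_frag_lengths(samples, out_path="frag_lengths.png")

# ...or take the returned Figure and adjust it before saving.
fig = plot_methylation(samples)
fig.axes[0].set_title("Methylation by group")   # illustrative tweak
fig.savefig("methylation.png", dpi=200)
```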
Model I/O — cfclassify.model
save_model(bundle, path) / load_model(path)
Serialise/deserialise a ModelBundle using joblib. The bundle is self-describing: it carries the k-mer length, contig, and feature key ordering needed to classify new samples without any external configuration.
save_feature_cache(samples, model_path) / load_feature_cache(model_path)
Write/read the feature cache JSON at <model_path>.features.json. The cache stores extracted feature dicts alongside labels, so the update subcommand can retrain on all historical samples without re-processing BAMs.
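A sketch of the retraining flow this enables, assuming `load_feature_cache()` returns the cached sample list (an assumption; check the actual return type) and `new_samples` holds freshly processed samples:

```python
historical = load_feature_cache("model.joblib")   # assumption: returns list[Sample]
combined = historical + new_samples               # old BAMs are not re-processed
model, scaler, flat_keys = train_final_model(combined)
save_feature_cache(combined, "model.joblib")      # writes model.joblib.features.json
```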
CLI entry point — cfclassify.__main__
Three subcommands: train, predict, update. See Usage → Running for full flag reference.