API Reference

mepylome.dtypes.idat

Contains an IDAT file parser.

class mepylome.dtypes.idat.IdatParser(file, *, intensity_only=False, array_type_only=False)[source]

Reads and parses an IDAT file.

Stores all extracted values from the IDAT file as attributes.

Parameters:

file (str or file-like object) – Path to the IDAT file or a file-like object. Can also be a gzipped IDAT file.
intensity_only (bool, optional) – Whether to read only intensity values, which makes parsing faster. Defaults to False.

Examples

>>> filepath = "/path/to/idat/file_Grn.idat"
>>> idat_data = IdatParser(filepath)
>>> ids = idat_data.illumina_ids
>>> print(idat_data)
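
Because the parser accepts a path, a file-like object, or a gzipped file, it can help to see how such input might be normalized before parsing. The following is a minimal sketch, not the parser's actual implementation: the helper names `open_maybe_gzipped` and `read_idat_header` are hypothetical, and the header layout (a 4-byte "IDAT" magic string followed by an 8-byte little-endian version number) is an assumption based on commonly described unencrypted IDAT files.

```python
import gzip
import io
import os
import struct

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_maybe_gzipped(source):
    """Return a binary file object, decompressing gzip input transparently.

    `source` is a path or a seekable binary file-like object.
    """
    is_path = isinstance(source, (str, bytes, os.PathLike))
    raw = open(source, "rb") if is_path else source
    head = raw.read(2)
    raw.seek(0)
    return gzip.GzipFile(fileobj=raw) if head == GZIP_MAGIC else raw


def read_idat_header(source):
    """Read the magic string and version number (assumed layout, see above)."""
    with open_maybe_gzipped(source) as handle:
        magic = handle.read(4)
        if magic != b"IDAT":
            raise ValueError("Not an IDAT file")
        (version,) = struct.unpack("<q", handle.read(8))
    return magic.decode("ascii"), version


# Synthetic header bytes for illustration, not a real IDAT file.
fake_header = b"IDAT" + struct.pack("<q", 3)
plain_result = read_idat_header(io.BytesIO(fake_header))
gzip_result = read_idat_header(io.BytesIO(gzip.compress(fake_header)))
```

Sniffing the gzip magic bytes rather than trusting a `.gz` extension lets the same code path serve both paths and in-memory file objects.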

mepylome.dtypes.manifests

Module for handling Illumina array manifest files.

This module contains a single class Manifest for reading and processing Illumina array manifest files, which contain information about probes and their characteristics.

class mepylome.dtypes.manifests.Manifest(array_type=None, raw_path=None, proc_path=None, download_proc=True)[source]

Provides an object interface to an Illumina array manifest file.

This class provides functionality for reading and processing Illumina array manifest files. A manifest can be loaded automatically based on the array type or provided as a raw manifest file. On first use, the necessary data is downloaded if needed and transformed, which might take some time. The processed manifest is then saved locally and loaded in its processed form on subsequent uses. During a running session, all loaded manifests are cached in memory.

Parameters:

array_type (str or ArrayType) – The type of array to process. Use either an ArrayType (ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC, ArrayType.ILLUMINA_EPIC_V2) or the corresponding string (‘450k’, ‘epic’, ‘epicv2’, ‘msa48’).
proc_path (str or Path) – The path to the local processed manifest file (default: None).
raw_path (str or Path, optional) – Path to the raw manifest file. Default is None.
download_proc (bool, optional) – If True and there is no locally saved processed manifest file, attempts to download the processed manifest file instead of the raw one. Defaults to True.

Examples

>>> # To initialize a manifest object for Illumina 450k array:
>>> manifest = Manifest("450k")
>>> manifest

>>> # To initialize a manifest object for Illumina EPIC array
>>> manifest = Manifest(ArrayType.ILLUMINA_EPIC)
>>> type_1 = manifest.probe_info(ProbeType.ONE)

>>> # To load all manifests when first used:
>>> Manifest.load()

control_address(control_type=None)[source]

Returns address IDs of all control probes of the specified type.

Return type:: Series

property control_data_frame: DataFrame: Pandas data frame of all manifest control probes.

property data_frame: DataFrame: Pandas data frame of all manifest probes.

static load(array_types=None)[source]

Loads specified manifests into memory.

Parameters:: array_types (list or ArrayType, optional) – List of array types or a single array type to load. Defaults to all available types.
Return type:: None

Examples

>>> # Load all manifests:
>>> Manifest.load()

>>> # Load specific manifests:
>>> Manifest.load(
...     [ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC]
... )
>>> Manifest.load("epicv2")

property methylation_probes: ndarray: All type I and II probes.

probe_info(probe_type, channel=None)[source]

Retrieves information about probes of a specified type and channel.

Parameters:

probe_type (ProbeType) – The type of probe (I, II, SnpI, SnpII, Control).
channel (Channel, optional) – The color channel (RED or GRN). Defaults to None.

Returns:

DataFrame containing information about the specified probes.

Return type:

DataFrame

Raises:

ValueError – If probe_type is not a valid ProbeType or if channel is not a valid Channel.

property snp_data_frame: DataFrame: SNP probes from the manifest data frame.

mepylome.dtypes.beads

Contains classes and functions for processing Illumina methylation arrays.

It includes methods for extracting methylation information, various preprocessing techniques, normalization, and data handling.

class mepylome.dtypes.beads.MethylData(data=None, file=None, prep='illumina', seed=None)[source]

Represents methylated and unmethylated intensity data from RawData.

This class provides methods for preprocessing Illumina methylation data and computing beta values from methylated and unmethylated intensities.

Parameters:

data (RawData) – RawData object containing raw intensity data.
file (str) – Path to a file or directory, or a list of paths, containing raw intensity data.
prep (str) – Preprocessing method. Options: “illumina”, “swan”, “noob”.
seed (int, optional) – Seed value used for random number generation in the SWAN preprocessing method. Default is None.

Note

If ‘data’ is not provided, it will attempt to create a RawData object using the specified ‘file’.

Raises:

ValueError – If neither ‘data’ nor ‘file’ is provided.
ValueError – If ‘data’ is provided but is not of type ‘RawData’.

Examples

>>> methyl_data = MethylData(raw_data)
>>> methyl_data = MethylData(file=file_path, prep="swan")

property betas: DataFrame: Returns beta values.

betas_at(cpgs=None, fill=0.5)[source]

Calculates beta values for specified CpG sites.

Parameters:

cpgs (array-like) – Array of CpG IDs.
fill (float) – Value to fill for CpGs not found in the used manifest or equal to NaN.

Returns:

DataFrame containing beta values for the specified CpGs.

Return type:

pandas.DataFrame

Note

If ‘cpgs’ is None, all CpGs from the used manifest are considered.
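
To make the role of `fill` concrete, here is a small sketch of how beta values are commonly computed from methylated (M) and unmethylated (U) intensities and how missing or NaN entries could be replaced. It is an illustration only: the function names are hypothetical, plain dictionaries stand in for the DataFrame, and the offset of 100 is the conventional Illumina-style stabilizer, assumed here rather than taken from this library.

```python
import math


def beta_value(methylated, unmethylated, offset=100.0):
    """Beta = M / (M + U + offset); NaN when the denominator is zero."""
    denominator = methylated + unmethylated + offset
    return methylated / denominator if denominator else math.nan


def betas_with_fill(betas_by_cpg, cpgs, fill=0.5):
    """Look up betas for `cpgs`, using `fill` for missing or NaN entries."""
    result = {}
    for cpg in cpgs:
        value = betas_by_cpg.get(cpg, math.nan)
        result[cpg] = fill if math.isnan(value) else value
    return result


measured = {
    "cg0001": beta_value(900.0, 0.0),              # 900 / (900 + 0 + 100)
    "cg0002": beta_value(0.0, 0.0, offset=0.0),    # zero denominator gives NaN
}
# "cg9999" is absent from `measured`, so it also receives the fill value.
looked_up = betas_with_fill(measured, ["cg0001", "cg0002", "cg9999"])
```

A fill of 0.5 places unknown sites at the midpoint of the beta scale, so they contribute no methylation signal in either direction.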

detection_p()[source]

Detection p-values for probe signal vs background noise.

Computes whether each probe signal (M+U) is distinguishable from a Gaussian background estimated from negative control probes. The p-value is the right-tail probability under this model.

Low values indicate reliable detection above background. Samples with many high p-values (failed probes) may be low quality.

Returns:

Detection p-values (n_probes × n_samples), indexed by IlmnID.

Return type:

pd.DataFrame

Notes

Background is derived from negative control probes.
Uses robust statistics (median and MAD-like estimator).
Variance is stabilized to avoid degeneracy.

Reference:: Implements the Illumina detectionP method used in minfi.
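
The idea behind this p-value can be sketched in a few lines. The example below is a single-channel simplification under stated assumptions, not the method's actual code: the function names are hypothetical, the background is summarized by the median and a scaled MAD of the negative controls, and a small floor on the spread stands in for the variance stabilization mentioned in the notes.

```python
from statistics import NormalDist, median


def robust_background(negative_controls):
    """Estimate background location and spread via median and scaled MAD."""
    center = median(negative_controls)
    mad = median(abs(value - center) for value in negative_controls)
    # 1.4826 makes the MAD comparable to a standard deviation under
    # normality; the floor avoids a zero-width background distribution.
    spread = max(1.4826 * mad, 1e-6)
    return center, spread


def detection_p_value(intensity, negative_controls):
    """Right-tail probability of `intensity` under the Gaussian background."""
    center, spread = robust_background(negative_controls)
    return 1.0 - NormalDist(center, spread).cdf(intensity)


controls = [90.0, 100.0, 110.0, 95.0, 105.0]
p_at_background = detection_p_value(100.0, controls)  # sits at the median
p_clear_signal = detection_p_value(5000.0, controls)  # far above background
```

A probe whose intensity equals the background median gets a p-value of one half, while a probe far above the controls gets a p-value near zero.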

property grn: DataFrame

Normalized green intensity by probe ID.

Type:: DataFrame

property intensity: DataFrame: Calculates DataFrame intensity values from methylation data.

property intensity_array: ndarray: Calculates numpy intensity values from methylation data.

load_log_intensity()[source]

Calculates log_intensity so this can be saved to disk.

Return type:: None

property log_intensity_fit: ndarray

log2 intensity with appended intercept column for linear regression.

Shape: (n_probes, n_samples + 1) - intercept in last column.

property methylated: DataFrame

Methylated intensity values indexed by IlmnID.

Type:: DataFrame

property mvalues: DataFrame: Returns M-values.

mvalues_at(cpgs=None, fill=0.0)[source]

Calculates m-values for specified CpG sites.

Parameters:

cpgs (array-like) – Array of CpG IDs.
fill (float) – Value to fill for CpGs not found in the used manifest or equal to NaN.

Returns:

DataFrame containing m-values for the specified CpGs.

Return type:

pandas.DataFrame

Note

If ‘cpgs’ is None, all CpGs from the used manifest are considered.
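
The default fill of 0.0 mirrors the beta fill of 0.5 under the widely used logit relationship between the two scales. The sketch below assumes that relationship, M = log2(beta / (1 - beta)); whether this library derives M-values from beta values or directly from intensities is not stated here, and the function names are hypothetical.

```python
import math


def beta_to_mvalue(beta):
    """Base-2 logit of a beta value in the open interval (0, 1)."""
    return math.log2(beta / (1.0 - beta))


def mvalue_to_beta(mvalue):
    """Inverse transform: beta = 2**M / (2**M + 1)."""
    return 2.0**mvalue / (2.0**mvalue + 1.0)


m_neutral = beta_to_mvalue(0.5)   # half methylated maps to M = 0
m_high = beta_to_mvalue(0.8)      # odds of 4:1, so close to log2(4)
beta_back = mvalue_to_beta(m_high)
```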

plot_betas_density(bins=256)[source]

Plot beta-value density distributions.

Return type:: None

plot_intens_vs_betas(sample_id=None, n_cols=3)[source]

Plot total signal intensity vs beta value.

Parameters:

sample_id (str, list, optional) – Sample(s) to plot. Defaults to all samples (one subplot per sample).
n_cols (int) – Number of subplot columns when plotting multiple samples. Defaults to 3.

Return type:

None

plot_red_green_qq(n_points=512)[source]

Red vs Green channel intensity quantile-quantile plot.

Compares sorted intensity distributions of Red and Green channels using quantile downsampling.

Parameters:: n_points (int) – Number of quantile-subsampled points per sample.
Return type:: None

poobah()[source]

Compute pOOBAH detection p-values for all probes.

The p-values are computed from out-of-band (OOB) hybridization signals.

pOOBAH estimates whether probe intensities are distinguishable from empirical background distributions derived from OOB measurements.

The method uses empirical cumulative distribution functions (ECDFs) computed from OOB probe intensities in each channel.

Low p-values indicate reliable detection above background. A probe is considered to have failed detection (unreliable) when its pOOBAH p-value is greater than a threshold (usually 0.05).

Returns:: Detection p-values (0–1), shape (n_probes × n_samples), indexed by IlmnID. NaN indicates missing probes.
Return type:: pandas.DataFrame

Note

In SeSAMe, some probes are filtered using backgroundMask. This step is not implemented here, which may lead to small differences in the resulting p-values compared to SeSAMe.

Reference:: SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Wanding Zhou, Timothy J. Triche Jr., Peter W. Laird, Hui Shen. Nucleic Acids Research, 46(e123), 2018. https://doi.org/10.1093/nar/gky691

Examples

>>> methyl = MethylData(file=idat_basepath)
>>> pvals = methyl.poobah()
>>> mask = pvals >= 0.05
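
The ECDF step can be illustrated without any array data. This is a single-channel sketch under simplifying assumptions (no per-channel combination and no background mask); `ecdf_p_value` is a hypothetical helper, not part of the library.

```python
from bisect import bisect_right


def ecdf_p_value(intensity, oob_intensities):
    """Right-tail p-value: share of OOB values strictly above `intensity`."""
    ordered = sorted(oob_intensities)
    # bisect_right counts background values <= intensity,
    # which equals n * ECDF(intensity).
    at_or_below = bisect_right(ordered, intensity)
    return 1.0 - at_or_below / len(ordered)


oob = [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]
p_strong = ecdf_p_value(500, oob)  # above every background value
p_weak = ecdf_p_value(75, oob)     # inside the background range
failed = [p > 0.05 for p in (p_strong, p_weak)]
```

With only ten background values the p-value resolution is 0.1; real OOB distributions contain many more probes, which gives a much finer scale.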

pred_sex()[source]

Predict sex from X/Y chromosome methylation intensities.

Uses median log2 intensity difference between Y and X probes. Threshold-based classifier trained/validated on ~3k tumor samples, achieving ~95% accuracy. Algorithm needs to be refined in future versions.

Return type:: ndarray
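
The statistic described above is simple to compute by hand. The following sketch shows the median log2 difference and a threshold rule; the threshold value, the direction of the comparison, and the returned labels are illustrative assumptions, since the library's trained threshold and output encoding are not given here.

```python
import math
from statistics import median


def y_minus_x_log2(x_intensities, y_intensities):
    """Median log2 intensity of Y probes minus that of X probes."""
    x_median = median(math.log2(value) for value in x_intensities)
    y_median = median(math.log2(value) for value in y_intensities)
    return y_median - x_median


def classify_sex(x_intensities, y_intensities, threshold):
    """Hypothetical rule: a higher Y signal relative to X suggests male."""
    difference = y_minus_x_log2(x_intensities, y_intensities)
    return "male" if difference > threshold else "female"


x_probes = [4096.0, 4096.0, 4096.0]    # log2 = 12
y_present = [2048.0, 2048.0, 2048.0]   # log2 = 11
y_absent = [256.0, 256.0, 256.0]       # log2 = 8
ARBITRARY_THRESHOLD = -2.0             # illustration only, not the trained value

diff_present = y_minus_x_log2(x_probes, y_present)
label_present = classify_sex(x_probes, y_present, ARBITRARY_THRESHOLD)
label_absent = classify_sex(x_probes, y_absent, ARBITRARY_THRESHOLD)
```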

preprocess_illumina()[source]

Performs preprocessing using Illumina’s method.

This function implements preprocessing for Illumina methylation microarrays as used in Genome Studio, the standard software provided by Illumina.

Return type:: None

Details:: This implementation is adapted from ‘minfi’.

preprocess_noob(offset=15, dye_method='single')[source]

The Noob preprocessing method.

Description:: Noob (normal-exponential out-of-band) is a background correction method with dye-bias normalization.

Parameters:

offset (float) – An offset for the normexp background correction.
dye_method (str) – How dye bias correction is performed: “single” for the single-sample approach, or “reference” for a reference array.

Return type:

None

References

TJ Triche, DJ Weisenberger, D Van Den Berg, PW Laird and KD Siegmund. Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res (2013) 41, e90. doi:10.1093/nar/gkt090.

preprocess_raw()[source]

Calculates methylated/unmethylated arrays without preprocessing.

Converts the Red/Green channel for an Illumina methylation array into methylation signal, without using any normalization.

Return type:: None

preprocess_swan()[source]

Subset-quantile Within Array Normalization (SWAN).

Return type:: None

Details:

The SWAN method has two parts. First, an average quantile distribution is created using a subset of probes defined to be biologically similar based on the number of CpGs underlying the probe body. This is achieved by randomly selecting N Infinium I and II probes that have 1, 2 and 3 underlying CpGs, where N is the minimum number of probes in the 6 sets of Infinium I and II probes with 1, 2 or 3 probe body CpGs. This results in a pool of 3N Infinium I and 3N Infinium II probes. The subset for each probe type is then sorted by increasing intensity. The value of each of the 3N pairs of observations is subsequently assigned to be the mean intensity of the two probe types for that row or “quantile”. This is the standard quantile procedure. The intensities of the remaining probes are then separately adjusted for each probe type using linear interpolation between the subset probes.

Implementation adapted from ‘minfi’

Note

SWAN uses a random subset of probes for between array normalization. To achieve reproducible results, set the seed.

References

J Maksimovic, L Gordon and A Oshlack (2012). SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biology 13, R44.
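
Two ingredients of the description above lend themselves to a short illustration: drawing a reproducible random subset and averaging the sorted subsets row by row. The sketch below covers only those steps (the later interpolation of the remaining probes is omitted), and the function names are hypothetical rather than the library's.

```python
import random


def sample_subset(intensities, n, seed):
    """Draw n values without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(intensities, n)


def averaged_quantiles(type_i_subset, type_ii_subset):
    """Sort both subsets and average them row by row ("quantile" by quantile)."""
    if len(type_i_subset) != len(type_ii_subset):
        raise ValueError("Both subsets must have the same size")
    pairs = zip(sorted(type_i_subset), sorted(type_ii_subset))
    return [(low + high) / 2.0 for low, high in pairs]


type_i = [300.0, 100.0, 200.0]
type_ii = [1200.0, 1000.0, 1100.0]
target = averaged_quantiles(type_i, type_ii)

pool = [float(value) for value in range(100)]
first_draw = sample_subset(pool, 5, seed=42)
second_draw = sample_subset(pool, 5, seed=42)
```

Using a dedicated `random.Random(seed)` instance keeps the draw reproducible without touching the global random state, which is the practical reason for exposing a seed.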

quality_metrics()[source]

Compute per-sample median methylated and unmethylated intensities.

This function reproduces the QC summary used in minfi::getQC, returning log2-transformed median intensities per sample.

These values are commonly used for sample quality assessment. Samples with unusually low median intensities may indicate poor DNA quality or assay failure.

Return type:: DataFrame
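
The summary itself is a pair of numbers per sample. A minimal sketch, assuming intensities are held in plain dictionaries keyed by sample ID (the library returns a DataFrame, and `qc_summary` is a hypothetical name):

```python
import math
from statistics import median


def qc_summary(methylated, unmethylated):
    """Return {sample: (log2 median methylated, log2 median unmethylated)}."""
    return {
        sample: (
            math.log2(median(methylated[sample])),
            math.log2(median(unmethylated[sample])),
        )
        for sample in methylated
    }


meth = {"s1": [1024.0, 2048.0, 4096.0], "s2": [64.0, 128.0, 256.0]}
unmeth = {"s1": [512.0, 1024.0, 2048.0], "s2": [32.0, 64.0, 128.0]}
qc = qc_summary(meth, unmeth)
```

In this toy input, sample "s2" has medians several log2 units below "s1", which is the kind of gap that would flag it for closer inspection.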

property red: DataFrame

Normalized red intensity by probe ID.

Type:: DataFrame

property unmethylated: DataFrame

Unmethylated intensity values indexed by IlmnID.

Type:: DataFrame

class mepylome.dtypes.beads.RawData(basenames, *, manifest=None)[source]

Represents raw intensity data extracted from IDAT files.

This class initializes with a list of basepaths to IDAT files and parses them to extract raw intensity data from the green and red channels.

Parameters:

basenames (list) – List of basepaths to IDAT files.
manifest (Manifest, optional) – The manifest associated with the array. If not provided, it will be determined from the probe count.

array_type

Type of Illumina array.

Type:: str

sample_ids

List of sample IDs corresponding to the IDAT files.

Type:: list

illumina_ids

Array of probe IDs.

Type:: array

_grn

Array of raw intensity values from the green channel.

Type:: array

_red

Array of raw intensity values from the red channel.

Type:: array

Example

>>> idat_basepath0 = directory_path / "200925700125_R07C01"
>>> idat_basepath1 = directory_path / "200925700133_R02C01_Grn.idat"
>>> raw_data = RawData(idat_basepath0)
>>> raw_data = RawData([idat_basepath0, idat_basepath1])

property grn: DataFrame

Green channel raw intensity indexed by probe IDs.

Type:: DataFrame

property red: DataFrame

Red channel raw intensity indexed by probe IDs.

Type:: DataFrame

class mepylome.dtypes.beads.ReferenceMethylData(file, prep='illumina', save_to_disk=False)[source]

Stores and manages reference cases for different array types.

This class categorizes and processes reference IDAT files to create MethylData objects for different array types. It is intended for CNV neutral reference cases used in CNV calculation.

Parameters:

file (list) – List of file paths to IDAT files or directory containing IDAT files.
prep (str) – Preprocessing method. Options: “illumina”, “swan”, “noob”.

_methyl_data

Internal dictionary to cache MethylData objects for each array type.

Type:: dict

Raises:: ValueError – If no reference files are found for the specified array type.

Examples

>>> # 'directory' contains 450k, EPIC and EPICv2 idat files
>>> reference = ReferenceMethylData(file=directory, prep="illumina")
>>> sample_450k = MethylData(file=idat_file_450k)
>>> sample_epic = MethylData(file=idat_file_epic)
>>> sample_epicv2 = MethylData(file=idat_file_epicv2)
>>> # reference can be used for all types
>>> cnv_450k = CNV(sample_450k, reference)
>>> cnv_epic = CNV(sample_epic, reference)
>>> cnv_epicv2 = CNV(sample_epicv2, reference)

mepylome.dtypes.beads.idat_basepaths(files, only_valid=False)[source]

Returns unique basepaths from IDAT files or directory.

This function processes a list of IDAT files or a directory containing IDAT files and returns their basepaths by removing the file endings. The function ensures that there are no duplicate basepaths in the returned list and maintains the order of the files as they appear in the input.

Parameters:

files (path or list) – A file or directory path or a list of file paths.
only_valid (bool) – If True, only returns basepaths that point to valid IDAT file pairs. Defaults to False.

Returns:

A list of unique basepaths corresponding to the provided IDAT files. If a directory is provided, all IDAT files are recursively considered.

Return type:

list

Example

>>> idat_basepaths("/path/to/dir")
[PosixPath('/path/to/dir/file1'), PosixPath('/path/to/dir/file2')]

>>> idat_basepaths(["/path1/file1_Grn.idat", "/path2/file2_Red.idat"])
[PosixPath('/path1/file1'), PosixPath('/path2/file2')]

>>> idat_basepaths("/path/to/idat/file_Grn.idat.gz")
[PosixPath('/path/to/idat/file')]
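
The suffix stripping and order-preserving de-duplication shown in these examples can be approximated as follows. This is a sketch for file lists only (no directory traversal), the helper name is hypothetical, and the suffix pattern is an assumption inferred from the examples above.

```python
import re
from pathlib import PurePosixPath

# Matches the channel suffix of an IDAT file, optionally gzipped.
_IDAT_SUFFIX = re.compile(r"_(Grn|Red)\.idat(\.gz)?$")


def strip_to_basepaths(files):
    """Return unique basepaths in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for name in files:
        base = PurePosixPath(_IDAT_SUFFIX.sub("", str(name)))
        seen.setdefault(base, None)
    return list(seen)


bases = strip_to_basepaths(
    ["/p/a_Grn.idat", "/p/a_Red.idat", "/q/b_Red.idat.gz"]
)
```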

mepylome.dtypes.beads.idat_paths_from_basenames(basenames)[source]

Returns paths to green and red IDAT files.

Parameters:: basenames (list) – List of basepaths for IDAT files.
Returns:: Paths to green and red IDAT files.
Return type:: tuple
Raises:: FileNotFoundError – If any IDAT file is not found.

mepylome.dtypes.beads.is_valid_idat_basepath(basepath)[source]

Checks if the given basepath(s) point to valid IDAT files.

Return type:: bool

mepylome.dtypes.cnv

Provides CNV analysis functionality including segmentation and plotting.

This module provides classes and functions for copy number variation (CNV) analysis.

class mepylome.dtypes.cnv.Annotation(manifest=None, array_type=None, gap=None, detail=None, bin_size=50000, min_probes_per_bin=15)[source]

Genomic annotations for CNV such as binning and gene locations.

Parameters:

manifest (Manifest, optional) – The manifest containing annotation details. Can be determined from array_type.
array_type (str, optional) – The type of array used for annotation. Can be determined from manifest.
gap (pyranges.PyRanges) – The genomic gaps. If unset default values will be used.
detail (pyranges.PyRanges, optional) – Detailed annotation (usually genes).
bin_size (int, optional) – The base-pair size of annotation bins. Defaults to 50000.
min_probes_per_bin (int, optional) – The minimum number of probes per bin. Defaults to 15.

manifest

The manifest to use.

Type:: Manifest

array_type

The array type of the manifest.

Type:: str

probes

The Illumina IDs of the manifest after adjusting the manifest to relevant genomic ranges.

Type:: list

gap

The genomic gaps excluded from the CNV analysis.

Type:: pyranges.PyRanges

detail

Detailed annotation information (usually genes).

Type:: pyranges.PyRanges

bin_size

The base-pair size of the bins.

Type:: int

min_probes_per_bin

The minimum number of probes per bin.

Type:: int

chromsizes

Dictionary containing chromosome sizes.

Type:: dict

static default_gaps()[source]

Default genomic gaps.

Return type:: PyRanges

Details:: The default value of conumee2.

static default_genes()[source]

Default PyRanges object including gene names with coordinates.

Return type:: PyRanges

Details:: Data downloaded from: https://grch37.ensembl.org/biomart/martview

static load(array_types=None)[source]

Loads specified annotation into memory.

Parameters:: array_types (list or ArrayType, optional) – List of array types or a single array type to load. Defaults to all available types.
Return type:: None

Examples

>>> # Load all annotations:
>>> Annotation.load()

>>> # Load specific annotation:
>>> Annotation.load(
...     [ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC]
... )
>>> Annotation.load("epicv2")

make_bins()[source]

Creates equidistant bins and then removes genomic gaps.

Return type:: PyRanges

merge_bins(bins)[source]

Merges adjacent bins until all contain a minimum of probes.

Return type:: PyRanges

static merge_bins_in_chromosome(bin_df, min_probes_per_bin)[source]

Merges adjacent bins until all contain a minimum of probes.

Parameters:

bin_df (DataFrame) – DataFrame containing bin information for a single chromosome.
min_probes_per_bin (int) – Minimum number of probes per bin required for merging.

Returns:

Merged bins in the chromosome.

Return type:

DataFrame
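
One way to picture bin merging is a greedy left-to-right pass over the bins of a single chromosome. The sketch below is a simplified stand-in, not necessarily the strategy this method uses (for example, it never chooses between the left and right neighbor); bins are represented as hypothetical `(start, end, n_probes)` tuples instead of a DataFrame.

```python
def merge_small_bins(bins, min_probes):
    """Greedily merge adjacent bins until each holds at least min_probes."""
    merged = []
    for start, end, n_probes in bins:
        if merged and merged[-1][2] < min_probes:
            # The previous bin is still too small: extend it over this bin.
            prev_start, _, prev_n = merged[-1]
            merged[-1] = (prev_start, end, prev_n + n_probes)
        else:
            merged.append((start, end, n_probes))
    # A trailing bin that is still too small is folded into its left neighbor.
    if len(merged) > 1 and merged[-1][2] < min_probes:
        _, end, n_probes = merged.pop()
        prev_start, _, prev_n = merged[-1]
        merged[-1] = (prev_start, end, prev_n + n_probes)
    return merged


bins = [
    (0, 50000, 20),
    (50000, 100000, 5),
    (100000, 150000, 4),
    (150000, 200000, 30),
    (200000, 250000, 3),
]
result = merge_small_bins(bins, min_probes=15)
```

Merging trades resolution for stability: sparse regions end up with wider bins, while probe-dense regions keep the original bin size.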

class mepylome.dtypes.cnv.CNV(sample, reference, annotation=None)[source]

Class for Copy Number Variation (CNV) analysis.

sample

MethylData object representing the sample.

Type:: MethylData

reference

MethylData object representing the CNV-neutral references.

Type:: MethylData

annotation

Annotation object containing genomic annotation information.

Type:: Annotation

bins

PyRanges object representing genomic bins.

Type:: PyRanges

probes

Index of probe IDs.

Type:: Index

coef: Coefficient of linear regression.

_ratio: Difference between observed sample intensity and expected intensity calculated by linear regression from references.

ratio: The values from _ratio as DataFrame with Illumina ID’s as indices.

noise: Noise level. A quality measure for the sample bead.

detail: Detailed information (usually Genes).

segments: Segments calculated by circular binary segmentation.

Parameters:

sample (MethylData) – MethylData object representing the sample.
reference (MethylData or ReferenceMethylData) – MethylData object representing the reference, or ReferenceMethylData object for multiple references.
annotation (Annotation, optional) – Annotation object containing genomic annotation information. Defaults to annotation associated with the sample array type.

Examples

>>> sample = MethylData(file="path/to/idat/file")
>>> reference = MethylData(file="path/to/idat/reference/dir")
>>> cnv = CNV(sample, reference)
>>> cnv.set_bins()
>>> cnv.set_detail()
>>> cnv.set_segments()
>>> cnv.plot()

Raises:: ValueError – If sample does not contain exactly 1 probe, or if reference is not of type MethylData or ReferenceMethylData.

Reference:: Daenekas, B., Pérez, E., Boniolo, F., Stefan, S., Benfatto, S., Sill, M., Sturm, D., Jones, D. T. W., Capper, D., Zapatka, M., & Hovestadt, V. (2024). Conumee 2.0: enhanced copy-number variation analysis from DNA methylation arrays for humans and mice. In J. Kelso (Ed.), Bioinformatics (Vol. 40, Issue 2). Oxford University Press (OUP). https://doi.org/10.1093/bioinformatics/btae029

fit()[source]

Fits linear regression model to calculate CNV at every CpG site.

This method fits a linear regression model to the intensity data of the sample and reference and calculates the CNV at every CpG site.

Return type:: None
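
The regression idea can be shown with a single reference. The sketch below fits a straight line by ordinary least squares and reports the residuals as log ratios. It is a deliberately reduced illustration: the actual fit may use several references at once, and the function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def log_ratios(sample_log2, reference_log2):
    """Observed sample intensity minus the intensity predicted from the reference."""
    slope, intercept = fit_line(reference_log2, sample_log2)
    return [
        observed - (slope * ref + intercept)
        for observed, ref in zip(sample_log2, reference_log2)
    ]


reference = [10.0, 11.0, 12.0, 13.0]
balanced = log_ratios([10.5, 11.5, 12.5, 13.5], reference)  # pure offset
shifted = log_ratios([10.0, 11.0, 12.0, 14.0], reference)   # last probe elevated
```

A sample that differs from the reference only by a constant offset yields residuals of zero, so only probe-specific deviations show up as copy-number signal.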

plot()[source]

Generates and displays a plot of the CNV data.

Return type:: None

classmethod set_all(sample, reference, annotation=None, *, do_seg=True)[source]

Create a CNV object and perform CNV analysis.

Parameters:

sample (MethylData) – MethylData object representing the sample.
reference (MethylData or ReferenceMethylData) – MethylData object representing the reference, or ReferenceMethylData object for multiple references.
annotation (Annotation, optional) – Annotation object containing genomic annotation information. Defaults to annotation associated with the sample array type.
do_seg (bool, optional) – Indicates whether to perform segmentation, which can be computationally intensive. Defaults to True.

Returns:

CNV object with fitted data and optionally segmented.

Return type:

CNV

Examples

>>> cnv = CNV.set_all(sample, reference, do_seg=do_seg)
>>> # Note: This command is equivalent to:
>>> cnv = CNV(sample, reference)
>>> cnv.set_bins()
>>> cnv.set_detail()
>>> if do_seg:
...     cnv.set_segments()

set_bins()[source]

Calculates CNV within each bin based on the results of ‘fit’.

This method calculates copy number variation (CNV) within each bin by taking the median of the ratios obtained from the linear regression model fit in the ‘fit’ method.

Return type:: None
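
Taking the median ratio per bin can be sketched with sorted bin start positions. In this hedged illustration each bin runs from its start to the next start (the last bin is open-ended), which is a simplification of real bin ranges; `bin_medians` is a hypothetical helper.

```python
from bisect import bisect_right
from statistics import median


def bin_medians(positions, ratios, bin_starts):
    """Group probe ratios by bin and return the median ratio per bin start."""
    grouped = {start: [] for start in bin_starts}
    for position, ratio in zip(positions, ratios):
        index = bisect_right(bin_starts, position) - 1
        if index >= 0:  # probes before the first bin are ignored
            grouped[bin_starts[index]].append(ratio)
    return {start: median(values) for start, values in grouped.items() if values}


positions = [100, 200, 300, 60000, 70000]
ratios = [0.1, -0.1, 0.3, -0.5, -0.7]
per_bin = bin_medians(positions, ratios, bin_starts=[0, 50000])
```

The median is less sensitive than the mean to a few outlying probes within a bin.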

set_detail()[source]

Calculates CNV for the detail object based on the results of ‘fit’.

This method calculates copy number variation (CNV) for the detail object (usually genes) by aggregating the ratios obtained from the linear regression model fit in the ‘fit’ method for each genomic region specified in the detail object. The result includes the median ratio, variance, and count of probes within each region.

Return type:: None

set_segments()[source]

Sets CNV segments based on circular binary segmentation.

This method applies the circular binary segmentation (CBS) algorithm to identify copy number variation (CNV) segments in the dataset. It calculates the CNV segments for each chromosome and stores them in the ‘segments’ attribute of the object.

Return type:: None

write(path, data='all')[source]

Writes CNV data to disk as a zip file.

This method writes the CNV data to disk as a zip file containing CSV files. It allows specifying which data to include in the zip file, such as bins, detail, segments, and metadata.

Parameters:

path (str) – The path to save the zip file.
data (str or list of str, optional) – Specifies which data to include in the zip file. Valid options are “all”, “bins”, “detail”, “segments”, and “metadata”. Defaults to “all”.

Raises:

ValueError – If an invalid data option is specified.

Return type:

None
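
The output format, a zip archive of CSV files, can be reproduced with the standard library. The sketch below is not this method's implementation: the member names such as `bins.csv` are hypothetical, and tables are plain lists of rows instead of DataFrames.

```python
import csv
import io
import zipfile

VALID_OPTIONS = ("bins", "detail", "segments", "metadata")


def write_tables(target, tables, data="all"):
    """Write the selected tables as CSV members of a zip archive.

    `target` is a path or a binary file-like object; `tables` maps a table
    name to a list of rows, with the header as the first row.
    """
    if data == "all":
        wanted = list(VALID_OPTIONS)
    else:
        wanted = [data] if isinstance(data, str) else list(data)
    invalid = [name for name in wanted if name not in VALID_OPTIONS]
    if invalid:
        raise ValueError(f"Invalid data option(s): {invalid}")
    with zipfile.ZipFile(target, "w") as archive:
        for name in wanted:
            text = io.StringIO()
            csv.writer(text).writerows(tables[name])
            archive.writestr(f"{name}.csv", text.getvalue())


tables = {
    "bins": [["start", "end", "ratio"], [0, 50000, 0.1]],
    "metadata": [["key", "value"], ["array_type", "450k"]],
}
buffer = io.BytesIO()
write_tables(buffer, tables, data=["bins", "metadata"])
member_names = zipfile.ZipFile(buffer).namelist()
```

Validating the requested names before opening the archive means an invalid option never leaves a half-written file behind.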

mepylome.analysis.core

Methylation analysis tools including a Dash-based browser application.

This module provides a comprehensive set of tools for conducting methylation analysis. The core functionality is encapsulated in the MethylAnalysis class, which manages the methylation analysis process and executes an interactive web application for the exploration of methylation data.

class mepylome.analysis.core.MethylAnalysis(analysis_dir=None, *, annotation=None, reference_dir=None, output_dir=None, test_dir=None, prep='illumina', cpgs='auto', cpg_blacklist=None, n_cpgs=25000, classifiers=None, cv_default=5, n_jobs_clf=1, n_jobs_cnv=None, precalculate_cnv=False, load_full_betas=True, feature_matrix=None, overlap=False, analysis_ids=None, test_ids=None, cpg_selection='top', do_seg=False, host='localhost', port=8050, debug=False, umap_parms=None, use_gpu=False, verbose=1, balancing_feature=None)[source]

Main class for methylation analysis including a GUI application.

Provides methods for setting up analysis parameters, reading data, and running a Dash-based web application for data visualization.

Parameters:

analysis_dir (str or Path) – Directory containing IDAT files for analysis.
annotation (str or Path) – Path to an annotation spreadsheet used to map sample files located in both analysis_dir and test_dir. One of the columns must contain the ID corresponding to the IDAT files (such as SentrixID or ID from files downloaded from GEO). If not provided, the system will attempt to identify the correct column automatically. If the annotation file is missing, it will search for a spreadsheet within the analysis_dir if available. (default: None)
reference_dir (str or Path) – Directory containing CNV neutral reference IDAT files. Must be provided if you want to generate CNV plots. (default: None)
output_dir (str or Path) – Directory where output files will be saved (default: “/tmp/mepylome/analysis”).
test_dir (Path or None) – Directory for test files, including new cases for analysis or validation. Files uploaded via the GUI will be placed here. If set to None, the application will automatically use a temporary directory. (default: None)
prep (str) – Preprocessing method used for methylation microarrays: ‘illumina’, ‘swan’, or ‘noob’ (default: ‘illumina’).
cpgs (str, np.ndarray, list, set, or Path, optional) –
Specifies the CpG sites to analyze. Possible values:
1. A list, set, or NumPy array of official Illumina CpG site names.
2. A path to a CSV file containing the CpG sites.
3. A string specifying a predefined array type:
  - ’450k’ : The CpG sites from the Illumina 450k array.
  - ’epic’ : The CpG sites from the Illumina EPIC array.
  - ’epicv2’ : The CpG sites from the Illumina EPIC v2 array.
  - ’msa48’ : The CpG sites from the Illumina MSA array.
4. A ‘+’-joined string of the options above combining multiple array types, returning the intersection of their CpG sites. For example:
  - ’450k+epic’ : CpG sites both in the 450k and EPIC arrays.
  - ’epic+epicv2’ : CpG sites both in the EPIC and EPICv2 arrays.
5. ‘auto’ (default): Automatically detects all array types from IDAT files in analysis_dir and returns the intersection of CpG sites. This process may take longer as all files need to be read and, if necessary, decompressed.
cpg_blacklist (set or list, optional) – A list or set of CpG sites to exclude. Default is None.
n_cpgs (int) – Number of CpG sites to select for UMAP (default: 25000).
classifiers (object or list of objects, optional) –
Classifier model(s) (default: None). Each classifier can be provided as:
- A dictionary containing:
  - ’model’ (object): The classifier model object as defined below (required).
  - ’name’ (str, optional): A name for the classifier (default: “Custom_Classifier_<index>”).
  - ’cv’ (int or cross-validation generator, optional): Cross-validation strategy (default: self.cv_default).
- A classifier model object (e.g., RandomForestClassifier() or the string “vtl-kbest-rf”), in which case the ‘name’ and ‘cv’ are automatically generated (see above). A classifier model can be one of:
  - A scikit-learn classifier object (trained or untrained).
  - A string in the format “scaler-selector-classifier”. See the documentation of fit_and_evaluate_clf in mepylome.analysis.classifiers for all valid values.
  - A custom class that inherits from TrainedClassifier.
cv_default (int or cross-validation generator, optional) – Determines the default cross-validation splitting strategy (default: 5).
n_jobs_clf (int) – Number of parallel processes to run for classifying (default: 1). Choose -1 for using all available cores.
n_jobs_cnv (int, optional) – Number of parallel processes to use for CNV precalculation. If None, a reasonable number of cores will be automatically chosen based on the system and workload. (default: None)
precalculate_cnv (bool) – If set to True, CNV data will be precalculated before the main analysis. This process takes approximately 2-5 seconds per case initially, but it will improve performance during runtime by reducing computation time. (default: False)
load_full_betas (bool) –
If True, loads beta values for all CpG sites into memory (when needed), enabling fast random access to the full methylation matrix. This can significantly increase memory usage.

If False, only the specified n_cpgs CpG sites are loaded on demand. For supervised classifier training, the same reduced matrix (betas_sel) used for UMAP visualization is used. This greatly reduces memory consumption and is typically sufficient, though it may be slightly slower (default: True).
feature_matrix (pandas.DataFrame or numpy.ndarray, optional) – A user-provided feature matrix to be used for UMAP dimensionality reduction. If provided, this matrix is used instead of betas_sel. If not provided, betas_sel, which contains the methylation beta values, is used for UMAP. (default: None)
overlap (bool) – Flag to analyze only samples that are both in the analysis directory and within the annotation file (default: False).
analysis_ids (list, optional) – A list of sample IDs. If provided, the analysis will be restricted to these samples only. If None, the analysis will include all available samples. (default: None)
test_ids (list, optional) – A list of sample IDs within test_dir. If provided, only these samples will be used. If None, all available IDAT files in test_dir will be used. (default: None)
cpg_selection (str) –
Method to select CpG sites for UMAP (‘top’, ‘random’, or ‘balanced’) (default: ‘top’).
- ’top’: Selects CpG sites with the highest variance.
- ’random’: Selects CpG sites randomly.
- ’balanced’: Selects the most varying CpG sites while ensuring a balanced distribution across groups based on balancing_feature. This method takes an equal number of sample files from `self.analysis_dir` for each group defined by balancing_feature. It is especially useful when the dataset is imbalanced, where some groups have significantly more samples than others.
balancing_feature (str) – Column in self.annotation used for balancing when cpg_selection=’balanced’. The balancing feature determines the groups/categories used to create a stratified selection of CpG sites.
do_seg (bool) – If set, enables segmentation analysis on CNV data and adds horizontal segmentation lines to the CNV plot. This will take an additional 2-5 seconds per sample. (default: False)
host (str) – Host address for the Dash application (default: ‘localhost’).
port (int) – Port number for the Dash application (default: 8050).
debug (bool) – Flag to enable debug mode for the Dash application (default: False).
umap_parms (dict) – Parameters for UMAP algorithm (default: {‘metric’: ‘manhattan’, ‘min_dist’: 0.1, ‘n_neighbors’: 15, ‘verbose’: True}).
use_gpu (bool) – Whether to use GPU acceleration for UMAP via cuML and CuPy (default: False). Set to True to enable GPU-backed UMAP computations, which can significantly speed up large datasets. This requires the cuml and cupy libraries to be installed, along with appropriate NVIDIA drivers and a working CUDA setup.
verbose (int) –
Sets the (global) logging verbosity level (default: 1).
- 0: Errors and warnings only.
- 1: Info, warnings, and errors.
- 2: Debug, info, warnings, and errors.
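
The three verbosity levels of verbose correspond naturally to levels of Python's standard logging module. The following is a minimal sketch of such a mapping. The helper verbosity_to_level and the table VERBOSITY_LEVELS are hypothetical names used for illustration only; how mepylome translates the value internally may differ.

```python
import logging

# Hypothetical mapping, not taken from mepylome's source code.
VERBOSITY_LEVELS = {
    0: logging.WARNING,  # errors and warnings only
    1: logging.INFO,     # info, warnings, and errors (default)
    2: logging.DEBUG,    # debug, info, warnings, and errors
}


def verbosity_to_level(verbose: int = 1) -> int:
    """Translate a 0-2 verbosity value into a logging level."""
    if verbose not in VERBOSITY_LEVELS:
        raise ValueError(f"verbose must be 0, 1, or 2, got {verbose!r}")
    return VERBOSITY_LEVELS[verbose]


# Apply the level to an example logger.
logging.getLogger("example").setLevel(verbosity_to_level(2))
```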

Note

Many parameters can be modified within the GUI application after initialization, but not all.

analysis_dir

Path to the directory containing IDAT files for analysis.

Type:: Path

annotation

Path to an annotation spreadsheet used to map sample files located in both analysis_dir and test_dir.

Type:: str or Path

overlap

Flag to analyze only samples that are both in the analysis directory and within the annotation file (default: False).

Type:: bool

analysis_ids

A list of sample IDs. The analysis will be restricted to these samples only. If None, the analysis will include all available samples.

Type:: list

test_ids

A list of sample IDs in ‘test_dir’ that will be used.

Type:: list

n_cpgs

Number of CpG sites to select for UMAP (default: 25000).

Type:: int

n_jobs_clf

Number of parallel processes to run for classifying. If equal to -1 all available cores will be used.

Type:: int
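
The convention that -1 means "all available cores" can be sketched with the standard library as follows. The helper resolve_n_jobs is a hypothetical name, and rejecting other non-positive values is an assumption made for this sketch, not documented mepylome behavior.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Return a concrete worker count; -1 means 'use every available core'."""
    if n_jobs == -1:
        return os.cpu_count() or 1  # os.cpu_count() may return None
    if n_jobs < 1:
        # Assumption of this sketch: other non-positive values are invalid.
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs
```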

n_jobs_cnv

Number of parallel processes to use for CNV precalculation. If None, a reasonable number of cores will be automatically chosen based on the system and workload.

Type:: int

reference_dir

Directory containing CNV-neutral reference IDAT files. Must be provided if you want to generate CNV plots.

Type:: str or Path

output_dir

Path to the directory where output files will be saved (default: “/tmp/mepylome/analysis”).

Type:: Path

test_dir

Directory for test files, including new cases for analysis or validation. Files uploaded via the GUI will be placed here. If set to None, the application will automatically use a temporary directory.

Type:: Path or None

prep

Preprocessing method used for methylation microarrays: ‘illumina’, ‘swan’, or ‘noob’ (default: ‘illumina’).

Type:: str

cpg_selection

Method to select CpG sites for UMAP (‘top’, ‘random’, or ‘balanced’) (default: ‘top’).

‘top’: Selects CpG sites with the highest variance.
‘random’: Selects CpG sites randomly.
‘balanced’: Selects the most varying CpG sites while ensuring a balanced distribution across groups based on balancing_feature. This method takes an equal number of sample files from `self.analysis_dir` for each group defined by balancing_feature. It is especially useful when the dataset is imbalanced, where some groups have significantly more samples than others.

Type:: str
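
To make the three selection strategies concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions, not mepylome's implementation: the function names are hypothetical, beta values are held in nested dicts instead of a DataFrame, and the 'balanced' branch assumes that each group is subsampled down to the size of the smallest group before the most variable CpGs are ranked.

```python
import random
from collections import defaultdict
from statistics import pvariance


def top_variance_cpgs(betas, n_cpgs):
    """betas: {sample_id: {cpg_id: beta}}; return the n_cpgs most variable CpGs."""
    cpgs = sorted(next(iter(betas.values())))
    variance = {c: pvariance([row[c] for row in betas.values()]) for c in cpgs}
    return sorted(cpgs, key=lambda c: variance[c], reverse=True)[:n_cpgs]


def balanced_sample_ids(groups, seed=0):
    """groups: {sample_id: group label}; draw equally many samples per group."""
    by_group = defaultdict(list)
    for sample_id, label in sorted(groups.items()):
        by_group[label].append(sample_id)
    # Assumption: every group is reduced to the size of the smallest group.
    size = min(len(ids) for ids in by_group.values())
    rng = random.Random(seed)
    return sorted(s for ids in by_group.values() for s in rng.sample(ids, size))


def select_cpgs(betas, n_cpgs, method="top", groups=None, seed=0):
    """Select CpG IDs with the 'top', 'random', or 'balanced' strategy."""
    if method == "top":
        return top_variance_cpgs(betas, n_cpgs)
    if method == "random":
        all_cpgs = sorted(next(iter(betas.values())))
        return random.Random(seed).sample(all_cpgs, n_cpgs)
    if method == "balanced":
        keep = balanced_sample_ids(groups, seed)
        return top_variance_cpgs({s: betas[s] for s in keep}, n_cpgs)
    raise ValueError("cpg_selection must be 'top', 'balanced', or 'random'")
```

With 'balanced', an over-represented group cannot dominate the variance ranking, which is the motivation given above for imbalanced datasets.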

balancing_feature

Column in annotation used for balancing when cpg_selection=’balanced’. The balancing feature determines the groups/categories used to create a stratified selection of CpG sites.

Type:: str

host

Host address for the Dash application (default: ‘localhost’).

Type:: str

port

Port number for the Dash application (default: 8050).

Type:: int

debug

Flag to enable debug mode for the Dash application (default: False).

Type:: bool

cnv_dir

Directory for CNV (Copy Number Variation) data, initially set to None.

Type:: Path

umap_dir

Directory for UMAP (Uniform Manifold Approximation and Projection) data, initially set to None.

Type:: Path

umap_cpgs

CpG sites for UMAP analysis, initially set to None.

Type:: numpy.array

precalculate_cnv

Flag to precalculate CNV information by invoking ‘precompute_cnvs’ (default: False).

Type:: bool

load_full_betas

If True, loads beta values for all CpG sites into memory (when needed), enabling fast random access to the full methylation matrix. This can significantly increase memory usage.

If False, only the specified n_cpgs CpG sites are loaded on demand. For supervised classifier training, the same reduced matrix (betas_sel) used for UMAP visualization is used. This greatly reduces memory consumption and is typically sufficient, though it may be slightly slower (default: True).

Type:: bool

betas_sel

DataFrame containing a selected subset of beta values used for dimensionality reduction. Initially set to None.

Type:: pandas.DataFrame

betas_all

DataFrame containing beta values for all CpG sites, initially set to None.

Type:: pandas.DataFrame

feature_matrix

A user-provided feature matrix to be used for UMAP dimensionality reduction. If provided, this matrix will be used instead of betas_sel for UMAP plots and instead of betas_all for classifying (default: None).

Type:: pandas.DataFrame or numpy.ndarray, optional

betas_dir

Path to the betas directory, initially set to None.

Type:: Path

umap_plot

Plot for UMAP, initially set to EMPTY_FIGURE.

Type:: plotly.Figure

umap_plot_path

Path to the CSV file containing the UMAP plot data, initially set to None.

Type:: Path

umap_df

DataFrame containing UMAP data, initially set to an empty DataFrame.

Type:: pandas.DataFrame

umap_parms

Parameters for UMAP algorithm (default: {‘metric’: ‘manhattan’, ‘min_dist’: 0.1, ‘n_neighbors’: 15, ‘verbose’: True}).

Type:: dict

use_gpu

Whether to use GPU acceleration for UMAP via cuML and CuPy (default: False). Set to True to enable GPU-backed UMAP computations, which can significantly speed up large datasets. This requires the cuml and cupy libraries to be installed, along with appropriate NVIDIA drivers and a working CUDA setup.

Type:: bool

raw_umap_plot

Raw UMAP plot data, initially set to None.

Type:: plotly.Figure

cnv_plot

Plot for CNV (Copy Number Variation) visualization, initially set to EMPTY_FIGURE.

Type:: plotly.Figure

cnv_id

ID for CNV (Copy Number Variation) sample, initially set to None.

Type:: str

dropdown_id

ID for dropdown selection, initially set to None.

Type:: list

ids

List of IDs, initially empty.

Type:: list

ids_to_highlight

IDs to highlight in the plot, initially set to empty list.

Type:: list

app

Dash application object, initially set to None.

Type:: dash.dash.Dash

Raises:: ValueError – If cpg_selection is not ‘top’, ‘balanced’, or ‘random’.

Examples

>>> # Basic usage
>>> from mepylome import MethylAnalysis
>>> analysis0 = MethylAnalysis()
>>> analysis0.run_app()
>>> # Usage if directories are known in advance
>>> analysis1 = MethylAnalysis(
...     analysis_dir='/path/to/idat/dir',
...     reference_dir='/path/to/reference/idat/dir',
...     annotation='/path/to/annotation/spread/sheet/with/2/cols',
...     output_dir='/path/to/mepylome/output',
... )
>>> analysis1.run_app()
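
For an imbalanced dataset on a memory-constrained machine, several of the documented parameters can be combined. The sketch below only assembles the keyword arguments and checks cpg_selection the same way the documented ValueError suggests. The helper build_analysis_kwargs, the annotation path, and the column name "Diagnosis" are hypothetical, and the final call is shown as a comment because it needs mepylome and real data.

```python
ALLOWED_CPG_SELECTION = {"top", "random", "balanced"}


def build_analysis_kwargs(analysis_dir, annotation, balancing_feature,
                          cpg_selection="balanced"):
    """Collect keyword arguments for a memory-lean, balanced analysis."""
    if cpg_selection not in ALLOWED_CPG_SELECTION:
        raise ValueError("cpg_selection must be 'top', 'balanced', or 'random'")
    return {
        "analysis_dir": analysis_dir,
        "annotation": annotation,
        "cpg_selection": cpg_selection,
        "balancing_feature": balancing_feature,  # column in the annotation
        "load_full_betas": False,  # load only the selected CpGs on demand
        "umap_parms": {"metric": "manhattan", "min_dist": 0.1, "n_neighbors": 15},
    }


kwargs = build_analysis_kwargs(
    "/path/to/idat/dir", "/path/to/annotation.csv", "Diagnosis"
)
# With mepylome installed, the arguments could then be passed on:
#     from mepylome import MethylAnalysis
#     analysis = MethylAnalysis(**kwargs)
```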

property classifiers: list[dict[str, Any]]

Retrieves the configuration for classifiers.

This property returns a list of dictionaries, where each dictionary includes:

‘name’ (str): A human-readable name for the classifier (e.g., ‘Random Forest’).
‘model’ (object): The classifier model instance.
‘cv’ (int or cross-validation generator): Determines the cross-validation splitting strategy.

Returns:: Classifier configurations.
Return type:: list of dict

classify(*, ids=None, values=None, clf_list)[source]

Classify samples using specified classifiers.

This method performs classification on given samples, defined either by ids or by values, using one or more supervised classifiers. The labels for classification are derived from the selected_columns. Classification can either use a provided feature_matrix (custom features), or default to CpG methylation data (betas_all). All samples in analysis_dir (or, if analysis_ids is set, only those in analysis_ids) that have a valid label are used for training.

Classifiers are applied to the data, and the method returns their predictions and performance reports.

Parameters:

ids (str, list, tuple, or None) – Sample IDs for prediction/classification. If values is provided, ids must be None.
values (pd.DataFrame, np.ndarray, or None) – Feature matrix for prediction/classification. If ids is provided, values must be None.
clf_list (object or list of objects) – A classifier model or a list of classifier models and configurations. This argument is handled the same way as self.classifiers. For full details on the format and options, refer to the docstring for self.classifiers.

Returns:

A list of ClassifierResult objects, each containing the following attributes:

prediction (pd.DataFrame): A DataFrame containing the predicted labels with their associated probabilities.

model (sklearn.base.BaseEstimator or TrainedClassifier): The trained classifier object used for prediction.

metrics (dict): A dictionary of evaluation metrics for the classifier, such as accuracy, precision, recall, etc.

reports (dict): A dictionary containing textual and HTML reports of the classifier’s performance. The keys are:

”txt”: A plain-text report (e.g., classification report).

”html”: An HTML-formatted report for richer visualization.

Return type:

list[ClassifierResult]

Outputs:

Log file: Contains training time, classifier performance metrics, and evaluation results for each classifier.

Raises:: ValueError – If not exactly one of ids or values is set.
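
The rule that exactly one of ids and values must be given can be checked as in the sketch below. The helper resolve_prediction_input is hypothetical and only illustrates the argument contract described above, including the fact that ids may be a single string, a list, or a tuple.

```python
def resolve_prediction_input(ids=None, values=None):
    """Return ("ids", list_of_ids) or ("values", values); reject anything else."""
    # Both None, or both set, violates the "exactly one" rule.
    if (ids is None) == (values is None):
        raise ValueError("Exactly one of 'ids' or 'values' must be set.")
    if values is not None:
        return "values", values
    if isinstance(ids, str):
        ids = [ids]  # a single ID is treated as a one-element list
    return "ids", list(ids)
```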

cn_summary(ids)[source]

Create a copy number summary plot for the given samples.

This method generates an overview of CNV gain and loss patterns across chromosomes for a list of sample IDs. It returns both the visual plot and the data used to generate it.

Parameters:

ids (list of str) – A list of sample IDs to include in the CNV summary.

Returns:

plot (plotly.graph_objects.Figure): A Plotly figure showing CNV summary results.
df_cn_summary (pd.DataFrame): A DataFrame containing the data behind the plot.

Return type:

tuple

Raises:

ValueError – If do_seg is not True. CNV summary plots require segmentation to be enabled.

compute_umap()[source]

Applies the UMAP algorithm on ‘betas_sel’.

Saves the 2D embedding in ‘umap_df’ and on disk.

Raises:: AttributeError – If a dimension mismatch occurs, or if ‘betas_sel’ is not set.
Return type:: None

property cpgs: ndarray

Array of CpG sites to analyze, in sorted order.

When setting, the input should be the same as the cpgs argument in the constructor (__init__).

Raises:

ValueError – If the provided cpgs value is not a valid type or format.

get_app()[source]

Returns a Dash application object for methylation analysis.

Return type:: Dash

get_cnv(sample_id, extract=None)[source]

Retrieves the CNV information for a specified sample.

This method locates the IDAT file corresponding to the provided sample_id, processes it to generate CNV data if not already available, and reads the resulting CNV information from disk.

Parameters:

sample_id (str) – The identifier for the sample whose CNV data is to be retrieved.
extract (list) – Specifies the data to extract from the CNV analysis. Available options include:
- “bins”: Raw CNV data at the bin level.
- “detail”: Detailed CNV information (generally genes).
- “segments”: Segmented CNV regions.
- “metadata”: CNV analysis metadata.

Returns:

A tuple containing the following elements:

bins (DataFrame): Data representing CNV bins.
detail (DataFrame): Gene CNV information.
segments (DataFrame): Segmented CNV data.

If CNV data is not found or cannot be generated, returns None for each extract value.

Return type:

tuple
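
The relationship between extract and the returned tuple can be pictured as one tuple entry per requested option, with None standing in for anything that is unavailable. The following sketch is hypothetical: pick_cnv_parts is not a mepylome function, the stored dict stands in for CNV results read from disk, and the default of bins, detail, and segments is an assumption based on the return description above.

```python
CNV_PARTS = ("bins", "detail", "segments", "metadata")


def pick_cnv_parts(stored, extract=None):
    """Return one entry per requested part, or None where data is missing."""
    if extract is None:
        extract = ["bins", "detail", "segments"]  # assumed default
    unknown = [name for name in extract if name not in CNV_PARTS]
    if unknown:
        raise ValueError(f"Unknown extract option(s): {unknown}")
    stored = stored or {}  # no CNV data at all -> every entry is None
    return tuple(stored.get(name) for name in extract)
```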

property idat_handler: IdatHandler

Handles the management of IDAT files and associated metadata.

Returns:: An instance of IdatHandler configured with current settings.
Return type:: IdatHandler

make_cnv_plot(sample_id, genes_sel=None)[source]

Generates a copy number variation (CNV) plot for a specific sample.

This method generates a CNV plot for the specified sample and optionally highlights specific genes within the plot.

Parameters:

sample_id (str) – ID of the sample for which CNV plot is generated.
genes_sel (list or None, optional) – List of specific genes to highlight in the plot.

Raises:

FileNotFoundError – If the specified sample ID is not found in the analysis directory or if the reference directory does not exist.

Return type:

None

make_umap()[source]

Generates the UMAP plot.

This method extracts the beta values required for UMAP computation, computes the UMAP 2D embedding, and creates and displays the UMAP plot based on the computed embedding.

Return type:: None

make_umap_plot()[source]

Generates a UMAP plot from the given 2D embedding.

Generates the UMAP plot from the data provided in ‘umap_df’. The scatter plot color is based on selected columns in ‘idat_handler.selected_columns’.

Raises:: AttributeError – If a dimension mismatch occurs, or if ‘umap_df’ is not set.
Return type:: None

mlh1_report_pages(ids)[source]

Generate MLH1 promoter methylation report HTML pages.

Parameters:: ids (list of str) – Sample IDs.
Returns:: HTML reports, one per sample.
Return type:: list of str

precompute_cnvs(ids=None)[source]

Precalculates CNVs for all samples and saves them to disk.

This method performs CNV analysis, and writes the output to the configured CNV directory. If ids is not provided, the method will compute CNVs for all samples found in the analysis_dir.

Parameters:: ids (list, optional) – A list of sample IDs to process. If None, the function will compute CNVs for all samples in the analysis_dir. Default is None.
Return type:: None

Note

Precalculating CNVs improves performance but requires additional disk space in the output directory.

read_umap_plot_from_disk()[source]

Reads UMAP plot from disk if available from previous analysis.

Return type:: None

run_app(*, open_tab=False)[source]

Runs the mepylome Dash application.

Parameters:: open_tab (bool, optional) – Whether to automatically open a new browser tab with the application URL. Defaults to False.
Return type:: None

set_betas()[source]

Sets the beta values DataFrame (‘betas_sel’) for further analysis.

This method reads the IDAT files located in ‘analysis_dir’, extracts the beta values, and saves them locally in ‘output_dir’. Depending on the configuration (‘cpg_selection’ and ‘load_full_betas’ flags), it either extracts a subset of CpGs for UMAP computation or loads all CpGs for subsequent processing into memory.

Raises:: ValueError – If no valid samples are found.
Return type:: None

visualize_gene(gene, array_type=None, ids=None, show=True)[source]

Visualizes methylation across a gene.

Plots a heatmap of beta values (samples x CpGs) for all CpG probes located within the genomic region of gene (according to the bundled gene annotation), with CpGs ordered by genomic position. A gene-body track is drawn above the heatmap, with thin connector lines linking each CpG’s true genomic position to its evenly-spaced column in the heatmap below.

Parameters:

gene (str) – Gene symbol (e.g. “EGFR”, “MLH1”, “CDKN2A”). Must match a gene name in the bundled gene annotation (case-sensitive).
array_type (str or ArrayType, optional) – Array type used to look up the gene’s CpGs. Defaults to the type detected from the selected samples (must be a single, common type).
ids (list of str, optional) – Sample IDs to include. Defaults to all samples found in analysis_dir/test_dir.
show (bool) – If True (default), displays the figure immediately and returns None. If False, returns the figure without displaying it.

Returns:

The methylation heatmap with gene track, or None if show is True.

Return type:

go.Figure or None

Raises:

ValueError – If no sample IDs are available, the gene is unknown, it has no CpG probes on the array type used, or the selected samples span multiple array types and array_type was not given explicitly.
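
The layout described for visualize_gene, with CpGs ordered by genomic position and each one assigned an evenly spaced heatmap column, can be sketched as follows. The helper heatmap_columns and the example coordinates are hypothetical; the sketch shows only the ordering and column assignment, not the plotting.

```python
def heatmap_columns(cpg_positions):
    """Order CpGs by genomic position and pair each with its heatmap column.

    cpg_positions maps CpG IDs to genomic positions. The result is a list of
    (cpg_id, genomic_position, column_index) tuples; a connector line would
    join genomic_position on the gene track to column_index in the heatmap.
    """
    ordered = sorted(cpg_positions.items(), key=lambda item: item[1])
    return [(cpg, pos, col) for col, (cpg, pos) in enumerate(ordered)]


# Made-up coordinates, for illustration only.
layout = heatmap_columns({"cgB": 500, "cgA": 100, "cgC": 120})
```

Because the columns are evenly spaced, densely clustered CpGs remain readable in the heatmap, while the connector lines preserve the information about their true spacing.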

visualize_region(chromosome, start, end, array_type=None, ids=None, show=True)[source]

Visualizes methylation across an arbitrary genomic region.

Like visualize_gene, but for any region given as chromosome/start/end instead of a gene symbol – useful for loci without a gene annotation, or custom regions of interest.

Parameters:

chromosome (str or Chromosome) – Chromosome (e.g. “chr3” or “3”).
start (int) – Region start position (genomic coordinate).
end (int) – Region end position (genomic coordinate).
array_type (str or ArrayType, optional) – Array type used to look up CpGs in the region. Defaults to the type detected from the selected samples (must be a single, common type).
ids (list of str, optional) – Sample IDs to include. Defaults to all samples found in analysis_dir/test_dir.
show (bool) – If True (default), displays the figure immediately and returns None. If False, returns the figure without displaying it.

Returns:

The methylation heatmap with region track, or None if show is True.

Return type:

go.Figure or None

Raises:

ValueError – If chromosome is invalid, no sample IDs are available, the region has no CpG probes on the array type used, or the selected samples span multiple array types and array_type was not given explicitly.
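
Selecting the probes for a region amounts to normalizing the chromosome name and filtering by coordinate. The sketch below is an illustration only: both helpers are hypothetical, the set of valid chromosomes (1 to 22, X, Y) and the inclusive interval bounds are assumptions, and mepylome's own Chromosome type and manifest lookup may behave differently.

```python
def normalize_chromosome(chromosome):
    """Accept 'chr3' or '3' (any case) and return the 'chr3' form."""
    name = str(chromosome).strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    # Assumption: autosomes 1-22 plus X and Y are the valid names.
    valid = {str(i) for i in range(1, 23)} | {"X", "Y"}
    if name not in valid:
        raise ValueError(f"Invalid chromosome: {chromosome!r}")
    return "chr" + name


def probes_in_region(probes, chromosome, start, end):
    """probes: iterable of (cpg_id, chromosome, position) tuples."""
    chrom = normalize_chromosome(chromosome)
    hits = [
        (pos, cpg)
        for cpg, probe_chrom, pos in probes
        if normalize_chromosome(probe_chrom) == chrom and start <= pos <= end
    ]
    if not hits:
        raise ValueError("The region has no CpG probes.")
    return [cpg for pos, cpg in sorted(hits)]  # ordered by genomic position
```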

mepylome.analysis.utils

Auxiliary methods for the methylation analysis.

class mepylome.analysis.utils.IdatHandler(analysis_dir, *, annotation=None, test_dir=None, test_ids=None, overlap=False, analysis_ids=None)[source]

A class for handling IDAT files with annotation.

Includes reading annotation from various file formats and provides description lookups for methylation classes.

Parameters:

analysis_dir (str or Path) – The directory where the IDAT files are located.
annotation (str or Path, optional) – The path to the annotation file. Defaults to None.
test_dir (Path or None, optional) – Directory for test files, including new cases or validation IDAT files or other test cases. Defaults to None.
overlap (bool, optional) – If True, restricts the sample paths to only those present in both the IDAT files and the annotation file. Defaults to False.
analysis_ids (list, optional) –
A list of sample IDs within analysis_dir.
- If provided, only these samples will be used.
- If None, all available IDAT files in analysis_dir will be used.
Defaults to None.

Note: The IDs may be converted to Sentrix format during initialization if the IDs in the annotation and IDs in analysis_dir do not match directly.
test_ids (list, optional) –
A list of sample IDs within test_dir.
- If provided, only these samples will be used.
- If None, all available IDAT files in test_dir will be used. Defaults to None.
Note: The IDs may be converted to Sentrix format during initialization if the IDs in the annotation and IDs in analysis_dir do not match directly.

analysis_dir

The directory path where the IDAT files are located.

Type:: Path

test_dir

Directory for test files, including new cases or validation IDAT files or other test cases. Defaults to None.

Type:: Path or None, optional

overlap

A flag indicating whether to restrict sample paths to only those present in both the IDAT files and the annotation file.

Type:: bool

id_to_path

A dictionary where the keys are sample IDs and the values are the file paths of IDAT files (from both analysis_dir and test_dir).

Type:: dict

annotation

The path to the annotation file. Defaults to None. If not provided, the first spreadsheet file found in self.analysis_dir will be used as the annotation.

Type:: Path

annotation_df

A DataFrame containing the annotation data, if loaded.

Type:: pandas.DataFrame or None

samples_annotated

A DataFrame containing the samples as index and the annotation in the columns.

Type:: pandas.DataFrame or None

selected_columns

A list of selected columns from the annotated samples, initialized with the first column.

Type:: list

analysis_ids

A list of sample IDs from analysis_dir that are actually used after filtering and optional conversion to Sentrix IDs.

Type:: list

test_ids

A list of sample IDs from test_dir that are actually used after filtering and optional conversion to Sentrix IDs.

Type:: list

Raises:

ValueError –

- If any sample in analysis_ids is not found in analysis_dir.
- If any sample in test_ids is not found in test_dir.
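
The filtering behavior described for analysis_ids, test_ids, and overlap can be sketched with plain lists and sets. The helper select_sample_ids is hypothetical and ignores the Sentrix ID conversion mentioned above; it shows only the validation and intersection logic.

```python
def select_sample_ids(found_ids, requested_ids=None, annotated_ids=None,
                      overlap=False):
    """Pick the sample IDs to use from those found in a directory."""
    selected = list(found_ids)
    if requested_ids is not None:
        requested = set(requested_ids)
        missing = sorted(requested - set(selected))
        if missing:
            raise ValueError(f"Samples not found: {missing}")
        selected = [i for i in selected if i in requested]
    if overlap and annotated_ids is not None:
        annotated = set(annotated_ids)
        # Keep only samples present both on disk and in the annotation.
        selected = [i for i in selected if i in annotated]
    return selected
```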

features(columns=None, separator='|')[source]

Combines specified columns into a single label per sample.

If columns is not provided, it defaults to the first column in samples_annotated or selected_columns if they are available. The function joins the values from the specified columns for each sample, converting them to strings and joining them with the specified separator.

Parameters:

columns (list, str, or None) – List of column names (or a single column name) to use for creating the label. If None, defaults to the first column in samples_annotated or selected_columns if not None.
separator (str) – The separator used to join values from the columns. Default is “|”.

Returns:

A Series of combined labels, indexed by sample IDs.

Return type:

pd.Series

Example

>>> idat_handler.features(columns=["GEO", "CNVs"])
sample_1    SGT_103|Balanced
sample_2    SGT_056|Balanced
sample_3    SGT_276|Balanced
dtype: object
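
The joining step itself is simple and can be reproduced without pandas. In the sketch below, combine_features is a hypothetical stand-in that works on nested dicts instead of the samples_annotated DataFrame; it converts each value to a string and joins the values with the separator, as described above.

```python
def combine_features(rows, columns, separator="|"):
    """rows: {sample_id: {column: value}}; return {sample_id: joined label}."""
    if isinstance(columns, str):
        columns = [columns]  # a single column name is also accepted
    return {
        sample_id: separator.join(str(row[col]) for col in columns)
        for sample_id, row in rows.items()
    }


rows = {"sample_1": {"GEO": "SGT_103", "CNVs": "Balanced"}}
print(combine_features(rows, ["GEO", "CNVs"]))
# prints {'sample_1': 'SGT_103|Balanced'}
```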

init_parameters()[source]

Returns the initialization attributes.

Return type:: dict[str, Any]

mepylome.analysis.classifiers

Contains methods for supervised learning.

Supervised classifiers (random forest, k-nearest neighbors, neural networks) for predicting the methylation class.

class mepylome.analysis.classifiers.fit_and_evaluate_clf(X, y, X_test, id_test, save_path, clf, cv, n_jobs=1)[source]

Predicts the methylation class using a supervised learning classifier.

Uses supervised machine learning classifiers (Random Forest, K-Nearest Neighbors, Neural Networks, SVM, …) to predict the methylation class of the sample. Output will be written to disk.

Parameters:

X (pd.DataFrame) – Feature matrix (rows as samples, columns as features).
y (array-like) – Class labels.
X_test (array-like) – Value of the sample to be evaluated.
id_test (str) – Unique identifiers for the test samples to be evaluated.
save_path (str or Path) – Path where the classifiers and results will be saved/cached.
clf (list) –
Classifier to use. Can be:
- A scikit-learn classifier object or pipeline (trained or untrained).
- A pipeline string composed of arbitrary components joined by dashes (“-”), typically in the format “scaler-selector-classifier”. Each component can be specified using either an abbreviation or the full class name (e.g., “std” or “StandardScaler”). Possible values are:
  scaler:
  
  ”std”: Standard scaling (StandardScaler).
  
  ”minmax”: Min-max scaling (MinMaxScaler).
  
  ”robust”: Robust scaling (RobustScaler).
  
  ”power”: Power transformation (PowerTransformer).
  
  ”quantile”: Quantile transformation (QuantileTransformer).
  
  selector:
  
  ”kbest”: Select the best features (SelectKBest).
  
  ”top”: Top varying features (TopVarianceSelector).
  
  ”pca”: Principal component analysis (PCA).
  
  ”pca_auto”: Principal component analysis (PCA). Number of components is determined automatically.
  
  ”lda”: Linear Discriminant Analysis (LDA).
  
  clf:
  
  ”rf”: RandomForestClassifier.
  
  ”lr”: LogisticRegression.
  
  ”et”: ExtraTreesClassifier.
  
  ”knn”: KNeighborsClassifier.
  
  ”mlp”: MLPClassifier.
  
  ”svc”: Support Vector Classifier (SVC).
  
  ”ada”: AdaBoostClassifier.
  
  ”bag”: BaggingClassifier.
  
  ”dt”: DecisionTreeClassifier.
  
  ”gp”: GaussianProcessClassifier.
  
  ”hgb”: HistGradientBoostingClassifier.
  
  ”nb”: GaussianNB.
  
  ”perceptron”: Perceptron.
  
  ”qda”: Quadratic Discriminant Analysis (QDA).
  
  ”ridge”: RidgeClassifier.
  
  ”sgd”: SGDClassifier.
  
  Example: Using a feature selector and a classifier (SelectKBest selection and Logistic Regression): “kbest-lr”
- A custom class, that inherits from TrainedClassifier.
cv (int or cross-validation generator) – Determines the cross-validation splitting strategy.
n_jobs (int) – Number of parallel processes to run.
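
How a pipeline string such as “kbest-lr” breaks down into components can be illustrated with a short parser. This is a sketch, not mepylome's parser: expand_pipeline_string is a hypothetical name, the table contains only a subset of the abbreviations listed above, and the result is a list of class names as strings, not instantiated scikit-learn objects.

```python
ABBREVIATIONS = {
    # scalers
    "std": "StandardScaler", "minmax": "MinMaxScaler", "robust": "RobustScaler",
    "power": "PowerTransformer", "quantile": "QuantileTransformer",
    # selectors
    "kbest": "SelectKBest", "top": "TopVarianceSelector", "pca": "PCA",
    # classifiers
    "rf": "RandomForestClassifier", "lr": "LogisticRegression",
    "et": "ExtraTreesClassifier", "knn": "KNeighborsClassifier",
    "mlp": "MLPClassifier", "svc": "SVC",
}


def expand_pipeline_string(spec):
    """Split 'kbest-lr' style strings and expand known abbreviations."""
    parts = [part.strip() for part in spec.split("-")]
    if not all(parts):
        raise ValueError(f"Empty component in pipeline string: {spec!r}")
    # Unknown components are kept as-is, since full class names are allowed.
    return [ABBREVIATIONS.get(part, part) for part in parts]


print(expand_pipeline_string("kbest-lr"))
# prints ['SelectKBest', 'LogisticRegression']
```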

Returns:

prediction (DataFrame): DataFrame containing the predicted probabilities for each class.
model (object): The trained classifier object.
metrics (dict): Dict containing classifier metrics.
reports (dict): Dict of evaluation report (both ‘txt’ and ‘html’) for each sample.

Return type:

ClassifierResult