API Reference

mepylome.dtypes.idat

Contains an IDAT file parser.

class mepylome.dtypes.idat.IdatParser(file, *, intensity_only=False, array_type_only=False)[source]

Reads and parses an IDAT file.

Stores all extracted values from the IDAT file as attributes.

Parameters:
  • file (str or file-like object) – Path to the IDAT file or a file-like object. Can also be a gzipped IDAT file.

  • intensity_only (bool, optional) – Whether to read only intensity values, which makes parsing faster. Defaults to False.

Examples

>>> filepath = "/path/to/idat/file_Grn.idat"
>>> idat_data = IdatParser(filepath)
>>> ids = idat_data.illumina_ids
>>> print(idat_data)
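
Because file may be a path, a file-like object, or a gzipped IDAT file, it can help to see how such input might be opened uniformly. The following is a minimal sketch and not the parser's actual implementation: open_idat and has_idat_magic are hypothetical helpers, and the check assumes the commonly described IDAT layout in which the file starts with the ASCII bytes "IDAT".

```python
import gzip
import io
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_idat(file):
    """Return a binary stream for a path or file-like object (hypothetical helper).

    Gzip-compressed input is detected by its magic bytes and decompressed.
    """
    raw = open(file, "rb") if isinstance(file, (str, os.PathLike)) else file
    head = raw.read(2)
    raw.seek(0)
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=raw, mode="rb")
    return raw


def has_idat_magic(stream):
    """Check for the ASCII 'IDAT' marker assumed to start an IDAT file."""
    return stream.read(4) == b"IDAT"


# Demo with in-memory data instead of a real IDAT file.
payload = b"IDAT" + bytes(8)
plain_ok = has_idat_magic(open_idat(io.BytesIO(payload)))
zipped_ok = has_idat_magic(open_idat(io.BytesIO(gzip.compress(payload))))
```

Detecting compression from the content rather than from the file extension means that file-like objects without a name are handled the same way as paths.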

mepylome.dtypes.manifests

Module for handling Illumina array manifest files.

This module contains a single class Manifest for reading and processing Illumina array manifest files, which contain information about probes and their characteristics.

class mepylome.dtypes.manifests.Manifest(array_type=None, raw_path=None, proc_path=None, download_proc=True)[source]

Provides an object interface to an Illumina array manifest file.

This class provides functionality for reading and processing Illumina array manifest files. A manifest can be loaded automatically based on the array type or provided as a raw manifest file. On first use, the necessary data is downloaded if needed, transformed, and saved locally, which might take some time. On subsequent uses, the manifest is loaded from this processed file. During a running session, all loaded manifests are cached in memory.

Parameters:
  • array_type (str or ArrayType) – The type of array to process. Use either ArrayType (ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC, ArrayType.ILLUMINA_EPIC_V2) or corresponding string (‘450k’, ‘epic’, ‘epicv2’, ‘msa48’)

  • proc_path (str or Path) – The path to the local processed manifest file (default: None).

  • raw_path (str or Path, optional) – Path to the raw manifest file. Default is None.

  • download_proc (bool, optional) – If True and there is no locally saved processed manifest file, attempts to download the processed manifest file instead of the raw one. Defaults to True.

Examples

>>> # To initialize a manifest object for Illumina 450k array:
>>> manifest = Manifest("450k")
>>> manifest
>>> # To initialize a manifest object for Illumina EPIC array
>>> manifest = Manifest(ArrayType.ILLUMINA_EPIC)
>>> type_1 = manifest.probe_info(ProbeType.ONE)
>>> # To load all manifests when first used:
>>> Manifest.load()
control_address(control_type=None)[source]

Returns address IDs of all control probes of the specified type.

Return type:

Series

property control_data_frame: DataFrame

Pandas data frame of all manifest control probes.

property data_frame: DataFrame

Pandas data frame of all manifest probes.

static load(array_types=None)[source]

Loads specified manifests into memory.

Parameters:

array_types (list or ArrayType, optional) – List of array types or a single array type to load. Defaults to all available types.

Return type:

None

Examples

>>> # Load all manifests:
>>> Manifest.load()
>>> # Load specific manifests:
>>> Manifest.load(
...     [ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC]
... )
>>> Manifest.load("epicv2")
property methylation_probes: ndarray

All type I and II probes.

probe_info(probe_type, channel=None)[source]

Retrieves information about probes of a specified type and channel.

Parameters:
  • probe_type (ProbeType) – The type of probe (I, II, SnpI, SnpII, Control).

  • channel (Channel, optional) – The color channel (RED or GRN). Defaults to None.

Returns:

DataFrame containing information about the specified probes.

Return type:

DataFrame

Raises:
  • ValueError – If probe_type is not a valid ProbeType or if channel is not a valid Channel.

property snp_data_frame: DataFrame

SNP probes from the manifest data frame.

mepylome.dtypes.beads

Contains classes and functions for processing Illumina methylation arrays.

It includes methods for extracting methylation information, various preprocessing techniques, normalization, and data handling.

class mepylome.dtypes.beads.MethylData(data=None, file=None, prep='illumina', seed=None)[source]

Represents methylated and unmethylated intensity data from RawData.

This class provides methods for preprocessing Illumina methylation data and computing beta values from methylated and unmethylated intensities.

Parameters:
  • data (RawData) – RawData object containing raw intensity data.

  • file (str) – Path to file or dir or list of paths containing raw intensity data.

  • prep (str) – Preprocessing method. Options: “illumina”, “swan”, “noob”.

  • seed (int, optional) – Seed value used for random number generation in the SWAN preprocessing method. Default is None.

Note

If ‘data’ is not provided, it will attempt to create a RawData object using the specified ‘file’.

Raises:
  • ValueError – If neither ‘data’ nor ‘file’ is provided.

  • ValueError – If ‘data’ is provided but is not of type ‘RawData’.

Examples

>>> methyl_data = MethylData(raw_data)
>>> methyl_data = MethylData(file=file_path, prep="swan")
property betas: DataFrame

Returns beta values.

betas_at(cpgs=None, fill=0.5)[source]

Calculates beta values for specified CpG sites.

Parameters:
  • cpgs (array-like) – Array of CpG IDs.

  • fill (float) – Value to fill for CpGs not found in the used manifest or equal to NaN.

Returns:

DataFrame containing beta values for specified CpGs.

Return type:

pandas.DataFrame

Note:

If ‘cpgs’ is None, all CpGs from the used manifest are considered.
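
The role of fill can be illustrated with a small sketch in plain Python. It is not the library's implementation: betas_at_sketch is a hypothetical function, dictionaries stand in for the DataFrames, and the beta value is assumed to follow the standard definition M / (M + U) without any offset in the denominator.

```python
import math


def betas_at_sketch(methylated, unmethylated, cpgs=None, fill=0.5):
    """Beta values for the requested CpGs; missing or NaN entries become `fill`."""
    if cpgs is None:
        cpgs = list(methylated)  # stand-in for "all CpGs from the manifest"
    result = {}
    for cpg in cpgs:
        m = methylated.get(cpg)
        u = unmethylated.get(cpg)
        if m is None or u is None or m + u == 0:
            beta = float("nan")  # CpG not measured, or no signal at all
        else:
            beta = m / (m + u)
        result[cpg] = fill if math.isnan(beta) else beta
    return result


meth = {"cg0001": 900.0, "cg0002": 100.0, "cg0003": 0.0}
unmeth = {"cg0001": 100.0, "cg0002": 300.0, "cg0003": 0.0}
betas = betas_at_sketch(meth, unmeth, cpgs=["cg0001", "cg0002", "cg0003", "cg9999"])
```

Here cg0003 (no signal) and cg9999 (absent from the data) both receive the fill value, so the result always has one entry per requested CpG.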

property grn: DataFrame

Normalized green intensity by probe ID.

Type:

DataFrame

property methylated: DataFrame

Methylated intensity values indexed by IlmnID.

Type:

DataFrame

property mvalues: DataFrame

Returns M-values.
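
For orientation, M-values are conventionally defined as the logit of the beta value on a log2 scale. The sketch below shows that standard conversion; it is an assumption that this matches the library's computation exactly, and both beta_to_m and the eps clipping are illustrative choices.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value, M = log2(beta / (1 - beta)).

    `eps` (an illustrative choice) keeps the logarithm finite at 0 and 1.
    """
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clipped / (1.0 - clipped))


m_half = beta_to_m(0.5)  # equal methylated and unmethylated signal
m_high = beta_to_m(0.8)  # mostly methylated
m_low = beta_to_m(0.2)   # mostly unmethylated
```

A beta of 0.5 maps to an M-value of 0, and values symmetric around 0.5 map to M-values of equal magnitude and opposite sign.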

preprocess_illumina()[source]

Performs preprocessing using Illumina's method.

This function implements preprocessing for Illumina methylation microarrays as used in Genome Studio, the standard software provided by Illumina.

Return type:

None

Details:

This implementation is adapted from ‘minfi’.

preprocess_noob(offset=15, dye_method='single')[source]

The Noob preprocessing method.

Description:

Noob (normal-exponential out-of-band) is a background correction method with dye-bias normalization.

Parameters:
  • offset (float) – An offset for the normexp background correction.

  • dye_method (str) – How should dye bias correction be done: “single” for single sample approach, or “reference” for a reference array.

Return type:

None

References

TJ Triche, DJ Weisenberger, D Van Den Berg, PW Laird and KD Siegmund. Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res (2013) 41, e90. doi:10.1093/nar/gkt090.

preprocess_raw()[source]

Calculates methylated/unmethylated arrays without preprocessing.

Converts the Red/Green channel for an Illumina methylation array into methylation signal, without using any normalization.

Return type:

None

preprocess_swan()[source]

Subset-quantile Within Array Normalization (SWAN).

Return type:

None

Details:

The SWAN method has two parts. First, an average quantile distribution is created using a subset of probes defined to be biologically similar based on the number of CpGs underlying the probe body. This is achieved by randomly selecting N Infinium I and II probes that have 1, 2 and 3 underlying CpGs, where N is the minimum number of probes in the 6 sets of Infinium I and II probes with 1, 2 or 3 probe body CpGs. This results in a pool of 3N Infinium I and 3N Infinium II probes. The subset for each probe type is then sorted by increasing intensity. The value of each of the 3N pairs of observations is subsequently assigned to be the mean intensity of the two probe types for that row or “quantile”. This is the standard quantile procedure. The intensities of the remaining probes are then separately adjusted for each probe type using linear interpolation between the subset probes.

Implementation adapted from ‘minfi’

Note

SWAN uses a random subset of probes for between array normalization. To achieve reproducible results, set the seed.

References

J Maksimovic, L Gordon and A Oshlack (2012). SWAN: Subset quantile Within-Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biology 13, R44.

property red: DataFrame

Normalized red intensity by probe ID.

Type:

DataFrame

property unmethylated: DataFrame

Unmethylated intensity values indexed by IlmnID.

Type:

DataFrame

class mepylome.dtypes.beads.RawData(basenames, *, manifest=None)[source]

Represents raw intensity data extracted from IDAT files.

This class initializes with a list of basepaths to IDAT files and parses them to extract raw intensity data from the green and red channels.

Parameters:
  • basenames (list) – List of basepaths to IDAT files.

  • manifest (Manifest, optional) – The manifest associated with the array. If not provided, it will be determined from the probe count.

array_type

Type of Illumina array.

Type:

str

probes

List of probe names corresponding to the IDAT files.

Type:

list

ids

Array of probe IDs.

Type:

array

_grn

Array of raw intensity values from the green channel.

Type:

array

_red

Array of raw intensity values from the red channel.

Type:

array

Example

>>> idat_basepath0 = directory_path / "200925700125_R07C01"
>>> idat_basepath1 = directory_path / "200925700133_R02C01_Grn.idat"
>>> raw_data = RawData(idat_basepath0)
>>> raw_data = RawData([idat_basepath0, idat_basepath1])
property grn: DataFrame

Green channel raw intensity indexed by probe IDs.

Type:

DataFrame

property red: DataFrame

Red channel raw intensity indexed by probe IDs.

Type:

DataFrame

class mepylome.dtypes.beads.ReferenceMethylData(file, prep='illumina', save_to_disk=False)[source]

Stores and manages reference cases for different array types.

This class categorizes and processes reference IDAT files to create MethylData objects for different array types. It is intended for CNV neutral reference cases used in CNV calculation.

Parameters:
  • file (list) – List of file paths to IDAT files or directory containing IDAT files.

  • prep (str) – Preprocessing method. Options: “illumina”, “swan”, “noob”.

_methyl_data

Internal dictionary to cache MethylData objects for each array type.

Type:

dict

Raises:

ValueError – If no reference files are found for the specified array type.

Examples

>>> # 'directory' contains 450k, EPIC and EPICv2 idat files
>>> reference = ReferenceMethylData(file=directory, prep="illumina")
>>> sample_450k = MethylData(file=idat_file_450k)
>>> sample_epic = MethylData(file=idat_file_epic)
>>> sample_epicv2 = MethylData(file=idat_file_epicv2)
>>> # reference can be used for all types
>>> cnv_450k = CNV(sample_450k, reference)
>>> cnv_epic = CNV(sample_epic, reference)
>>> cnv_epicv2 = CNV(sample_epicv2, reference)
mepylome.dtypes.beads.idat_basepaths(files, only_valid=False)[source]

Returns unique basepaths from IDAT files or directory.

This function processes a list of IDAT files or a directory containing IDAT files and returns their basepaths by removing the file endings. The function ensures that there are no duplicate basepaths in the returned list and maintains the order of the files as they appear in the input.

Parameters:
  • files (path or list) – A file or directory path or a list of file paths.

  • only_valid (bool) – If True, only returns basepaths that point to valid IDAT file pairs. Defaults to False.

Returns:

A list of unique basepaths corresponding to the provided IDAT files. If a directory is provided, all IDAT files are recursively considered.

Return type:

list

Example

>>> idat_basepaths("/path/to/dir")
[PosixPath('/path/to/dir/file1'), PosixPath('/path/to/dir/file2')]
>>> idat_basepaths(["/path1/file1_Grn.idat", "/path2/file2_Red.idat"])
[PosixPath('/path1/file1'), PosixPath('/path2/file2')]
>>> idat_basepaths("/path/to/idat/file_Grn.idat.gz")
[PosixPath('/path/to/idat/file')]
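
The suffix stripping and order-preserving de-duplication described above can be sketched as follows. This is a simplified illustration, not the library's code: basepaths_sketch and IDAT_SUFFIX are hypothetical names, and directory traversal and the only_valid check are left out.

```python
import re
from pathlib import Path

# Matches the channel suffix of an IDAT file name, optionally gzipped.
IDAT_SUFFIX = re.compile(r"_(Grn|Red)\.idat(\.gz)?$")


def basepaths_sketch(files):
    """Strip IDAT endings and drop duplicates while keeping input order."""
    if isinstance(files, (str, Path)):
        files = [files]
    seen = {}
    for name in files:
        base = Path(IDAT_SUFFIX.sub("", str(name)))
        seen.setdefault(base, None)  # dicts preserve insertion order
    return list(seen)


paths = basepaths_sketch(
    ["/path1/file1_Grn.idat", "/path1/file1_Red.idat", "/path2/file2_Red.idat.gz"]
)
```

The green and red files of the same sample collapse to a single basepath, which is why the result contains two entries for three input files.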
mepylome.dtypes.beads.idat_paths_from_basenames(basenames)[source]

Returns paths to green and red IDAT files.

Parameters:

basenames (list) – List of basepaths for IDAT files.

Returns:

Paths to green and red IDAT files.

Return type:

tuple

Raises:

FileNotFoundError – If any IDAT file is not found.

mepylome.dtypes.beads.is_valid_idat_basepath(basepath)[source]

Checks if the given basepath(s) point to valid IDAT files.

Return type:

bool

mepylome.dtypes.cnv

Provides CNV analysis functionality including segmentation and plotting.

This module provides classes and functions for copy number variation (CNV) analysis.

class mepylome.dtypes.cnv.Annotation(manifest=None, array_type=None, gap=None, detail=None, bin_size=50000, min_probes_per_bin=15)[source]

Genomic annotations for CNV such as binning and gene locations.

Parameters:
  • manifest (Manifest, optional) – The manifest containing annotation details. Can be determined from array_type.

  • array_type (str, optional) – The type of array used for annotation. Can be determined from manifest.

  • gap (pyranges.PyRanges) – The genomic gaps. If unset default values will be used.

  • detail (pyranges.PyRanges, optional) – Detailed annotation (usually genes).

  • bin_size (int, optional) – The base-pair size of annotation bins. Defaults to 50000.

  • min_probes_per_bin (int, optional) – The minimum number of probes per bin. Defaults to 15.

manifest

The manifest to use.

Type:

Manifest

array_type

The array type of the manifest.

Type:

str

probes

The Illumina IDs of the manifest after adjusting the manifest to relevant genomic ranges.

Type:

list

gap

The genomic gaps to exclude from the CNV analysis.

Type:

pyranges.PyRanges

detail

Detailed annotation information (usually genes).

Type:

pyranges.PyRanges

bin_size

The base-pair size of the bins.

Type:

int

min_probes_per_bin

The minimum number of probes per bin.

Type:

int

chromsizes

Dictionary containing chromosome sizes.

Type:

dict

static default_gaps()[source]

Default genomic gaps.

Return type:

PyRanges

Details:

The default value of conumee2.

static default_genes()[source]

Default PyRanges object including gene names with coordinates.

Return type:

PyRanges

Details:

Data downloaded from: https://grch37.ensembl.org/biomart/martview

static load(array_types=None)[source]

Loads specified annotation into memory.

Parameters:

array_types (list or ArrayType, optional) – List of array types or a single array type to load. Defaults to all available types.

Return type:

None

Examples

>>> # Load all annotations:
>>> Annotation.load()
>>> # Load specific annotation:
>>> Annotation.load(
...     [ArrayType.ILLUMINA_450K, ArrayType.ILLUMINA_EPIC]
... )
>>> Annotation.load("epicv2")
make_bins()[source]

Creates equidistant bins and then removes genomic gaps.

Return type:

PyRanges
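
As a rough picture of what "equidistant bins with gaps removed" means, the sketch below tiles a single chromosome with fixed-width bins and then subtracts gap intervals. It uses plain tuples instead of PyRanges, make_bins_sketch is a hypothetical function, and trimming bins at gap boundaries (rather than dropping overlapping bins entirely) is an assumption.

```python
def make_bins_sketch(chrom_size, bin_size, gaps):
    """Tile [0, chrom_size) with fixed-width bins, then cut out gap intervals.

    `gaps` is a list of half-open (start, end) intervals.
    """
    bins = [
        (start, min(start + bin_size, chrom_size))
        for start in range(0, chrom_size, bin_size)
    ]
    for gap_start, gap_end in gaps:
        trimmed = []
        for start, end in bins:
            if end <= gap_start or start >= gap_end:
                trimmed.append((start, end))  # no overlap with this gap
                continue
            if start < gap_start:
                trimmed.append((start, gap_start))  # part left of the gap
            if end > gap_end:
                trimmed.append((gap_end, end))  # part right of the gap
        bins = trimmed
    return bins


bins = make_bins_sketch(chrom_size=250, bin_size=100, gaps=[(90, 120)])
```

With these toy numbers the gap shortens the first two bins, and the last bin is truncated at the chromosome end.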

merge_bins(bins)[source]

Merges adjacent bins until all contain a minimum of probes.

Return type:

PyRanges

static merge_bins_in_chromosome(bin_df, min_probes_per_bin)[source]

Merges adjacent bins until all contain a minimum of probes.

Parameters:
  • bin_df (DataFrame) – DataFrame containing bin information for a single chromosome.

  • min_probes_per_bin (int) – Minimum number of probes per bin required for merging.

Returns:

Merged bins in the chromosome.

Return type:

DataFrame
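
The idea of merging neighbouring bins until each holds enough probes can be sketched with a simple left-to-right pass. This greedy strategy is an illustrative assumption and may differ from the order in which the library merges bins; merge_bins_sketch and the tuple layout are hypothetical.

```python
def merge_bins_sketch(bins, min_probes_per_bin):
    """Greedily merge neighbouring bins until each holds enough probes.

    `bins` is a list of (start, end, n_probes) tuples sorted by position.
    """
    merged = []
    for start, end, n_probes in bins:
        if merged and merged[-1][2] < min_probes_per_bin:
            # The previous bin is still too sparse: extend it with this one.
            prev_start, _, prev_n = merged[-1]
            merged[-1] = (prev_start, end, prev_n + n_probes)
        else:
            merged.append((start, end, n_probes))
    if len(merged) > 1 and merged[-1][2] < min_probes_per_bin:
        # A sparse trailing bin is folded into its left neighbour.
        start, end, n_probes = merged.pop()
        prev_start, _, prev_n = merged[-1]
        merged[-1] = (prev_start, end, prev_n + n_probes)
    return merged


bins = [(0, 100, 20), (100, 200, 4), (200, 300, 6), (300, 400, 9), (400, 500, 3)]
merged = merge_bins_sketch(bins, min_probes_per_bin=15)
```

The first bin already satisfies the threshold and is kept; the remaining four sparse bins end up as one bin with 22 probes.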

class mepylome.dtypes.cnv.CNV(sample, reference, annotation=None)[source]

Class for Copy Number Variation (CNV) analysis.

sample

MethylData object representing the sample.

Type:

MethylData

reference

MethylData object representing the CNV-neutral references.

Type:

MethylData

annotation

Annotation object containing genomic annotation information.

Type:

Annotation

bins

PyRanges object representing genomic bins.

Type:

PyRanges

probes

Index of probe IDs.

Type:

Index

coef

Coefficient of linear regression.

_ratio

Difference between observed sample intensity and expected intensity calculated by linear regression from references.

ratio

The values from _ratio as DataFrame with Illumina IDs as indices.

noise

Noise level. A quality measure for the sample bead.

detail

Detailed information (usually Genes).

segments

Segments calculated by circular binary segmentation.

Parameters:
  • sample (MethylData) – MethylData object representing the sample.

  • reference (MethylData or ReferenceMethylData) – MethylData object representing the reference, or ReferenceMethylData object for multiple references.

  • annotation (Annotation, optional) – Annotation object containing genomic annotation information. Defaults to annotation associated with the sample array type.

Examples

>>> sample = MethylData(file="path/to/idat/file")
>>> reference = MethylData(file="path/to/idat/reference/dir")
>>> cnv = CNV(sample, reference)
>>> cnv.set_bins()
>>> cnv.set_detail()
>>> cnv.set_segments()
>>> cnv.plot()
Raises:

ValueError – If sample does not contain exactly 1 probe, or if reference is not of type MethylData or ReferenceMethylData.

Reference:

Daenekas, B., Pérez, E., Boniolo, F., Stefan, S., Benfatto, S., Sill, M., Sturm, D., Jones, D. T. W., Capper, D., Zapatka, M., & Hovestadt, V. (2024). Conumee 2.0: enhanced copy-number variation analysis from DNA methylation arrays for humans and mice. In J. Kelso (Ed.), Bioinformatics (Vol. 40, Issue 2). Oxford University Press (OUP). https://doi.org/10.1093/bioinformatics/btae029

fit()[source]

Fits linear regression model to calculate CNV at every CpG site.

This method fits a linear regression model to the intensity data of the sample and reference and calculates the CNV at every CpG site.

Return type:

None
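
To make the regression step concrete, here is a minimal sketch with a single reference and ordinary least squares. It is a simplification under stated assumptions: the library fits against a set of references, whereas this example uses one; working on log2 intensities is assumed; and fit_ratio_sketch is a hypothetical function.

```python
import math


def fit_ratio_sketch(sample, reference):
    """Regress log2 sample intensity on log2 reference intensity.

    Returns (intercept, slope, ratio), where ratio is observed minus fitted
    log2 intensity per probe.
    """
    y = [math.log2(v) for v in sample]
    x = [math.log2(v) for v in reference]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ratio = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, ratio


reference = [1000.0, 2000.0, 4000.0, 8000.0]
sample = [1000.0, 2000.0, 8000.0, 8000.0]  # third probe has doubled intensity
intercept, slope, ratio = fit_ratio_sketch(sample, reference)
```

The residual of the third probe is the largest and positive, which is the kind of signal that later steps aggregate per bin to indicate a gain.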

plot()[source]

Generates and displays a plot of the CNV data.

Return type:

None

classmethod set_all(sample, reference, annotation=None, *, do_seg=True)[source]

Create a CNV object and perform CNV analysis.

Parameters:
  • sample (MethylData) – MethylData object representing the sample.

  • reference (MethylData or ReferenceMethylData) – MethylData object representing the reference, or ReferenceMethylData object for multiple references.

  • annotation (Annotation, optional) – Annotation object containing genomic annotation information. Defaults to annotation associated with the sample array type.

  • do_seg (bool, optional) – Indicates whether to perform segmentation, which can be computationally intensive. Defaults to True.

Returns:

CNV object with fitted data and optionally segmented.

Return type:

CNV

Examples

>>> cnv = CNV.set_all(sample, reference, do_seg=do_seg)
>>> # Note: This command is equivalent to:
>>> cnv = CNV(sample, reference)
>>> cnv.set_bins()
>>> cnv.set_detail()
>>> if do_seg:
...     cnv.set_segments()
set_bins()[source]

Calculates CNV within each bin based on the results of ‘fit’.

This method calculates copy number variation (CNV) within each bin by taking the median of the ratios obtained from the linear regression model fit in the ‘fit’ method.

Return type:

None
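
The per-bin aggregation can be pictured with a few lines of plain Python. This is only a sketch of taking the median ratio of the probes inside each bin; bin_medians_sketch is a hypothetical function, and positions and bins are simple integers and tuples rather than the genomic ranges the library works with.

```python
from statistics import median


def bin_medians_sketch(probe_positions, ratios, bins):
    """Median ratio of the probes falling into each half-open (start, end) bin."""
    result = {}
    for start, end in bins:
        inside = [
            r for pos, r in zip(probe_positions, ratios) if start <= pos < end
        ]
        result[(start, end)] = median(inside) if inside else None
    return result


positions = [10, 40, 70, 120, 150, 260]
ratios = [0.1, -0.2, 0.0, 0.5, 0.7, -0.4]
medians = bin_medians_sketch(
    positions, ratios, bins=[(0, 100), (100, 200), (200, 300), (300, 400)]
)
```

Using the median rather than the mean makes the per-bin value less sensitive to individual outlier probes.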

set_detail()[source]

Calculates CNV for the detail object based on the results of ‘fit’.

This method calculates copy number variation (CNV) for the detail object (usually genes) by aggregating the ratios obtained from the linear regression model fit in the ‘fit’ method for each genomic region specified in the detail object. The result includes the median ratio, variance, and count of probes within each region.

Return type:

None

set_itensity(methyl_data)[source]

Calculates intensity values from methylation data.

Return type:

None

set_segments()[source]

Sets CNV segments based on circular binary segmentation.

This method applies the circular binary segmentation (CBS) algorithm to identify copy number variation (CNV) segments in the dataset. It calculates the CNV segments for each chromosome and stores them in the ‘segments’ attribute of the object.

Return type:

None

write(path, data='all')[source]

Writes CNV data to disk as a zip file.

This method writes the CNV data to disk as a zip file containing CSV files. It allows specifying which data to include in the zip file, such as bins, detail, segments, and metadata.

Parameters:
  • path (str) – The path to save the zip file.

  • data (str or list of str, optional) – Specifies which data to include in the zip file. Valid options are “all”, “bins”, “detail”, “segments”, and “metadata”. Defaults to “all”.

Raises:

ValueError – If an invalid data option is specified.

Return type:

None
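
The output format (a zip archive of CSV files, filtered by the data option) can be sketched with the standard library. The example below is an illustration only: write_zip_sketch is hypothetical, the member names such as bins.csv are assumptions, and the tables are toy rows rather than real CNV results.

```python
import csv
import io
import zipfile

VALID = ("bins", "detail", "segments", "metadata")


def write_zip_sketch(target, tables, data="all"):
    """Write the selected tables as CSV members of a zip archive.

    `tables` maps a name to a list of rows (first row is the header);
    `target` is a path or a writable binary file object.
    """
    if data == "all":
        wanted = list(VALID)
    elif isinstance(data, str):
        wanted = [data]
    else:
        wanted = list(data)
    invalid = [name for name in wanted if name not in VALID]
    if invalid:
        raise ValueError(f"Invalid data option(s): {invalid}")
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in wanted:
            buffer = io.StringIO()
            csv.writer(buffer).writerows(tables[name])
            archive.writestr(f"{name}.csv", buffer.getvalue())


# Toy tables written to an in-memory archive.
tables = {name: [["key", "value"], [name, 1]] for name in VALID}
target = io.BytesIO()
write_zip_sketch(target, tables, data=["bins", "segments"])
members = zipfile.ZipFile(target).namelist()
```

Validating the requested names before opening the archive avoids leaving a partially written file behind when an invalid option is passed.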

mepylome.analysis.methyl

Methylation analysis tools including a Dash-based browser application.

This module provides a comprehensive set of tools for conducting methylation analysis. The core functionality is encapsulated in the MethylAnalysis class, which manages the methylation analysis process and executes an interactive web application for the exploration of methylation data.

class mepylome.analysis.methyl.MethylAnalysis(analysis_dir=None, *, annotation=None, reference_dir=None, output_dir=None, test_dir=None, prep='illumina', cpgs='auto', cpg_blacklist=None, n_cpgs=25000, classifiers=None, cv_default=5, n_jobs_clf=1, n_jobs_cnv=None, precalculate_cnv=False, load_full_betas=True, feature_matrix=None, overlap=False, analysis_ids=None, test_ids=None, cpg_selection='top', do_seg=False, host='localhost', port=8050, debug=False, umap_parms=None, use_gpu=False, verbose=1, balancing_feature=None)[source]

Main class for methylation analysis including a GUI application.

Main class for methylation analysis, providing methods for setting up analysis parameters, reading data, and running a Dash-based web application for data visualization.

Parameters:
  • analysis_dir (str or Path) – Directory containing IDAT files for analysis.

  • annotation (str or Path) – Path to an annotation spreadsheet used to map sample files located in both analysis_dir and test_dir. One of the columns must contain the ID corresponding to the IDAT files (such as SentrixID or ID from files downloaded from GEO). If not provided, the system will attempt to identify the correct column automatically. If the annotation file is missing, it will search for a spreadsheet within the analysis_dir if available. (default: None)

  • reference_dir (str or Path) – Directory containing CNV neutral reference IDAT files. Must be provided if you want to generate CNV plots. (default: None)

  • output_dir (str or Path) – Directory where output files will be saved (default: “/tmp/mepylome/analysis”).

  • test_dir (Path or None) – Directory for test files, including new cases for analysis or validation. Files uploaded via the GUI will be placed here. If set to None, the application will automatically use a temporary directory. (default: None)

  • prep (str) – Preprocessing method used for methylation microarrays: ‘illumina’, ‘swan’, or ‘noob’ (default: ‘illumina’).

  • cpgs (str, np.ndarray, list, set, or Path, optional) –

    Specifies the CpG sites to analyze. Possible values:

    1. A list, set, or NumPy array of official Illumina CpG site names.

    2. A path to a CSV file containing the CpG sites.

    3. A string specifying a predefined array type:

      • ’450k’ : The CpG sites from the Illumina 450k array.

      • ’epic’ : The CpG sites from the Illumina EPIC array.

      • ’epicv2’ : The CpG sites from the Illumina EPIC v2 array.

      • ’msa48’ : The CpG sites from the Illumina MSA array.

    4. A ‘+’-joined string of the options above combining multiple array types, returning the intersection of their CpG sites. For example:

    • ’450k+epic’ : CpG sites both in the 450k and EPIC arrays.

    • ’epic+epicv2’: CpG sites both in the EPIC and EPICv2 arrays.

    5. ‘auto’ (default): Automatically detects all array types from IDAT files in analysis_dir and returns the intersection of CpG sites. This process may take longer as all files need to be read and, if necessary, decompressed.

  • cpg_blacklist (set or list, optional) – A list or set of CpG sites to exclude. Default is None.

  • n_cpgs (int) – Number of CpG sites to select for UMAP (default: 25000).

  • classifiers (object or list of objects, optional) –

    Classifier model(s) (default: None). Each classifier can be provided as:

    • A dictionary containing:

      • ’model’ (object): The classifier model object as defined below (required).

      • ’name’ (str, optional): A name for the classifier (default: “Custom_Classifier_<index>”).

      • ’cv’ (int or cross-validation generator, optional): Cross-validation strategy (default: self.cv_default).

    • A classifier model object (e.g., RandomForestClassifier() or the string “vtl-kbest-rf”), in which case the ‘name’ and ‘cv’ are automatically generated (see above). A classifier model can be one of:

      • A scikit-learn classifier object (trained or untrained).

      • A string in the format “scaler-selector-classifier”. See the documentation of fit_and_evaluate_clf in mepylome.analysis.methyl_clf for all valid values.

      • A custom class that inherits from TrainedClassifier.

  • cv_default (int or cross-validation generator, optional) – Determines the default cross-validation splitting strategy (default: 5).

  • n_jobs_clf (int) – Number of parallel processes to run for classifying (default: 1). Choose -1 for using all available cores.

  • n_jobs_cnv (int, optional) – Number of parallel processes to use for CNV precalculation. If None, a reasonable number of cores will be automatically chosen based on the system and workload. (default: None)

  • precalculate_cnv (bool) – If set to True, CNV data will be precalculated before the main analysis. This process takes approximately 2-5 seconds per case initially, but it will improve performance during runtime by reducing computation time. (default: False)

  • load_full_betas (bool) –

    If True, loads beta values for all CpG sites into memory (when needed), enabling fast random access to the full methylation matrix. This can significantly increase memory usage.

    If False, only the specified n_cpgs CpG sites are loaded on demand. For supervised classifier training, the same reduced matrix (betas_sel) used for UMAP visualization is used. This greatly reduces memory consumption and is typically sufficient, though it may be slightly slower (default: True).

  • feature_matrix (pandas.DataFrame or numpy.ndarray, optional) – A user-provided feature matrix to be used for UMAP dimensionality reduction. If provided, this matrix will be used instead of betas_sel. If not provided (default is None), the betas_sel containing methylation beta values will be used for UMAP. (default: None)

  • overlap (bool) – Flag to analyze only samples that are both in the analysis directory and within the annotation file (default: False).

  • analysis_ids (list, optional) – A list of sample IDs. If provided, the analysis will be restricted to these samples only. If None, the analysis will include all available samples. (default: None)

  • test_ids (list, optional) – A list of sample IDs within test_dir. If provided, only these samples will be used. If None, all available IDAT files in test_dir will be used. (default: None)

  • cpg_selection (str) –

    Method to select CpG sites for UMAP (‘top’, ‘random’, or ‘balanced’) (default: ‘top’).

    • ’top’: Selects CpG sites with the highest variance.

    • ’random’: Selects CpG sites randomly.

    • ’balanced’: Selects the most varying CpG sites while ensuring a balanced distribution across groups based on balancing_feature. This method takes an equal number of sample files from `self.analysis_dir` for each group defined by balancing_feature. It is especially useful when the dataset is imbalanced, where some groups have significantly more samples than others.

  • balancing_feature (str) – Column in self.annotation used for balancing when cpg_selection=’balanced’. The balancing feature determines the groups/categories used to create a stratified selection of CpG sites.

  • do_seg (bool) – If set, enables segmentation analysis on CNV data and adds horizontal segmentation lines to the CNV plot. This will take an additional 2-5 seconds per sample. (default: False)

  • host (str) – Host address for the Dash application (default: ‘localhost’).

  • port (int) – Port number for the Dash application (default: 8050).

  • debug (bool) – Flag to enable debug mode for the Dash application (default: False).

  • umap_parms (dict) – Parameters for UMAP algorithm (default: {‘metric’: ‘manhattan’, ‘min_dist’: 0.1, ‘n_neighbors’: 15, ‘verbose’: True}).

  • use_gpu (bool) – Whether to use GPU acceleration for UMAP via cuML and CuPy (default: False). Set to True to enable GPU-backed UMAP computations, which can significantly speed up large datasets. This requires the cuml and cupy libraries to be installed, along with appropriate NVIDIA drivers and a working CUDA setup.

  • verbose (int) –

    Sets the (global) logging verbosity level:

    • 0: Errors and warnings only.

    • 1: Info, warnings, and errors (default).

    • 2: Debug, info, warnings, and errors.

Note

Many parameters can be modified within the GUI application after initialization, but not all.
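
The ‘+’-joined form of the cpgs parameter amounts to a set intersection across array types. The sketch below illustrates that idea with toy data; resolve_cpgs_sketch is a hypothetical function, and the CpG sets are made up rather than taken from real manifests.

```python
def resolve_cpgs_sketch(spec, cpgs_by_array):
    """Intersect the CpG sets of all array types named in a '+'-joined string."""
    names = spec.split("+")
    unknown = [name for name in names if name not in cpgs_by_array]
    if unknown:
        raise ValueError(f"Unknown array type(s): {unknown}")
    common = set(cpgs_by_array[names[0]])
    for name in names[1:]:
        common &= set(cpgs_by_array[name])
    return sorted(common)


# Toy CpG sets; real arrays contain hundreds of thousands of sites.
toy = {
    "450k": {"cg01", "cg02", "cg03"},
    "epic": {"cg02", "cg03", "cg04"},
    "epicv2": {"cg03", "cg04", "cg05"},
}
shared = resolve_cpgs_sketch("450k+epic", toy)
shared_all = resolve_cpgs_sketch("450k+epic+epicv2", toy)
```

Each additional array type can only shrink the result, which is worth keeping in mind when combining samples from several array generations.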

analysis_dir

Path to the directory containing IDAT files for analysis.

Type:

Path

annotation

Path to an annotation spreadsheet used to map sample files located in both analysis_dir and test_dir.

Type:

str or Path

overlap

Flag to analyze only samples that are both in the analysis directory and within the annotation file (default: False).

Type:

bool

analysis_ids

A list of sample IDs. The analysis will be restricted to these samples only. If None, the analysis will include all available samples.

Type:

list

test_ids

A list of sample IDs in ‘test_dir’ that will be used.

Type:

list

n_cpgs

Number of CpG sites to select for UMAP (default: 25000).

Type:

int

n_jobs_clf

Number of parallel processes to run for classifying. If equal to -1 all available cores will be used.

Type:

int

n_jobs_cnv

Number of parallel processes to use for CNV precalculation. If None, a reasonable number of cores will be automatically chosen based on the system and workload.

Type:

int

reference_dir

Directory containing CNV neutral reference IDAT files. Must be provided if you want to generate CNV plots.

Type:

str or Path

output_dir

Path to the directory where output files will be saved (default: “/tmp/mepylome/analysis”).

Type:

Path

test_dir

Directory for test files, including new cases for analysis or validation. Files uploaded via the GUI will be placed here. If set to None, the application will automatically use a temporary directory.

Type:

Path or None

prep

Preprocessing method used for methylation microarrays: ‘illumina’, ‘swan’, or ‘noob’ (default: ‘illumina’).

Type:

str

cpg_selection

Method to select CpG sites for UMAP (‘top’, ‘random’, or ‘balanced’) (default: ‘top’).

  • ‘top’: Selects CpG sites with the highest variance.

  • ‘random’: Selects CpG sites randomly.

  • ‘balanced’: Selects the most varying CpG sites while ensuring a balanced distribution across groups based on balancing_feature. This method takes an equal number of sample files from `self.analysis_dir` for each group defined by balancing_feature. It is especially useful when the dataset is imbalanced, where some groups have significantly more samples than others.

Type:

str
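
The difference between the ‘top’ and ‘random’ strategies can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: select_cpgs_sketch is hypothetical, a dictionary of per-sample beta values stands in for the beta matrix, and the ‘balanced’ strategy is omitted because it depends on the annotation groups.

```python
import random
from statistics import pvariance


def select_cpgs_sketch(betas, n_cpgs, method="top", seed=None):
    """Pick CpG IDs from `betas`, a dict mapping CpG ID to per-sample beta values."""
    if method == "top":
        # Rank CpGs by variance across samples, highest first.
        ranked = sorted(betas, key=lambda cpg: pvariance(betas[cpg]), reverse=True)
        return ranked[:n_cpgs]
    if method == "random":
        return random.Random(seed).sample(sorted(betas), n_cpgs)
    raise ValueError(f"Unsupported method: {method}")


betas = {
    "cg01": [0.10, 0.90, 0.50],  # highly variable
    "cg02": [0.50, 0.50, 0.50],  # constant
    "cg03": [0.40, 0.60, 0.50],  # mildly variable
}
top_two = select_cpgs_sketch(betas, n_cpgs=2, method="top")
random_two = select_cpgs_sketch(betas, n_cpgs=2, method="random", seed=0)
```

Variance-based selection drops the constant site cg02, which carries no information for separating samples in a UMAP embedding.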

balancing_feature

Column in annotation used for balancing when cpg_selection=’balanced’. The balancing feature determines the groups/categories used to create a stratified selection of CpG sites.

Type:

str

host

Host address for the Dash application (default: ‘localhost’).

Type:

str

port

Port number for the Dash application (default: 8050).

Type:

int

debug

Flag to enable debug mode for the Dash application (default: False).

Type:

bool

cnv_dir

Directory for CNV (Copy Number Variation) data, initially set to None.

Type:

Path

umap_dir

Directory for UMAP (Uniform Manifold Approximation and Projection) data, initially set to None.

Type:

Path

umap_cpgs

CpG sites for UMAP analysis, initially set to None.

Type:

numpy.array

precalculate_cnv

Flag to precalculate CNV information by invoking ‘precompute_cnvs’ (default: False).

Type:

bool

load_full_betas

If True, loads beta values for all CpG sites into memory (when needed), enabling fast random access to the full methylation matrix. This can significantly increase memory usage.

If False, only the specified n_cpgs CpG sites are loaded on demand. For supervised classifier training, the same reduced matrix (betas_sel) used for UMAP visualization is used. This greatly reduces memory consumption and is typically sufficient, though it may be slightly slower (default: True).

Type:

bool

betas_sel

DataFrame containing a selected subset of beta values used for dimensionality reduction. Initially set to None.

Type:

pandas.DataFrame

betas_all

DataFrame containing beta values for all CpG sites, initially set to None.

Type:

pandas.DataFrame

feature_matrix

A user-provided feature matrix to be used for UMAP dimensionality reduction. If provided, this matrix will be used instead of betas_sel for UMAP plots and instead of betas_all for classifying (default: None).

Type:

pandas.DataFrame or numpy.ndarray, optional

betas_dir

Path to the betas directory, initially set to None.

Type:

Path

umap_plot

Plot for UMAP, initially set to EMPTY_FIGURE.

Type:

plotly.Figure

umap_plot_path

Path to the CSV file containing the UMAP plot data, initially set to None.

Type:

Path

umap_df

DataFrame containing UMAP data, initially set to an empty DataFrame.

Type:

pandas.DataFrame

umap_parms

Parameters for UMAP algorithm (default: {‘metric’: ‘manhattan’, ‘min_dist’: 0.1, ‘n_neighbors’: 15, ‘verbose’: True}).

Type:

dict

use_gpu

Whether to use GPU acceleration for UMAP via cuML and CuPy (default: False). Set to True to enable GPU-backed UMAP computations, which can significantly speed up large datasets. This requires the cuml and cupy libraries to be installed, along with appropriate NVIDIA drivers and a working CUDA setup.

Type:

bool

raw_umap_plot

Raw UMAP plot data, initially set to None.

Type:

plotly.Figure

cnv_plot

Plot for CNV (Copy Number Variation) visualization, initially set to EMPTY_FIGURE.

Type:

plotly.Figure

cnv_id

ID for CNV (Copy Number Variation) sample, initially set to None.

Type:

str

dropdown_id

ID for dropdown selection, initially set to None.

Type:

list

ids

List of IDs, initially empty.

Type:

list

ids_to_highlight

IDs to highlight in the plot, initially set to empty list.

Type:

list

app

Dash application object, initially set to None.

Type:

dash.dash.Dash

Raises:

ValueError – If cpg_selection is not ‘top’, ‘balanced’, or ‘random’.

Examples

>>> # Basic usage
>>> from mepylome import MethylAnalysis
>>> analysis0 = MethylAnalysis()
>>> analysis0.run_app()
>>> # Usage if directories are known in advance
>>> analysis1 = MethylAnalysis(
...     analysis_dir='/path/to/idat/dir',
...     reference_dir='/path/to/reference/idat/dir',
...     annotation='/path/to/annotation/spread/sheet/with/2/cols',
...     output_dir='/path/to/mepylome/output',
... )
>>> analysis1.run_app()
property classifiers: list[dict[str, Any]]

Retrieves the configuration for classifiers.

This property returns a list of dictionaries, where each dictionary includes:

  • ‘name’ (str): A human-readable name for the classifier (e.g., ‘Random Forest’).

  • ‘model’ (object): The classifier model instance.

  • ‘cv’ (int or cross-validation generator): Determines the cross-validation splitting strategy.

Returns:

Classifier configurations.

Return type:

list of dict

classify(*, ids=None, values=None, clf_list)[source]

Classify samples using specified classifiers.

This method performs classification on given samples, defined either by ids or by values, using one or more supervised classifiers. The labels for classification are derived from selected_columns. Classification uses either a provided feature_matrix (custom features) or, by default, the CpG methylation data (betas_all). All samples in analysis_dir (or only those in analysis_ids, if set) that have a valid label are used for training.

Classifiers are applied to the data, and the method returns their predictions and performance reports.

Parameters:
  • ids (str, list, tuple, or None) – Sample IDs for prediction/classification. If values is provided, ids must be None.

  • values (pd.DataFrame, np.ndarray, or None) – Feature matrix for prediction/classification. If ids is provided, values must be None.

  • clf_list (object or list of objects) – A classifier model or a list of classifier models and configurations. This argument is handled the same way as self.classifiers. For full details on the format and options, refer to the docstring for self.classifiers.

Returns:

A list of ClassifierResult objects, each containing the following attributes:

  • prediction (pd.DataFrame): A DataFrame containing the predicted labels with their associated probabilities.

  • model (sklearn.base.BaseEstimator or TrainedClassifier): The trained classifier object used for prediction.

  • metrics (dict): A dictionary of evaluation metrics for the classifier, such as accuracy, precision, recall, etc.

  • reports (dict): A dictionary containing textual and HTML reports of the classifier’s performance. The keys are:

    • ”txt”: A plain-text report (e.g., classification report).

    • ”html”: An HTML-formatted report for richer visualization.

Return type:

list[ClassifierResult]

Outputs:

Log file: Contains training time, classifier performance metrics, and evaluation results for each classifier.

Raises:

ValueError – If not exactly one of ids or values is set.
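The rule that exactly one of ids and values must be given can be sketched as a small validation helper. This is an illustration only; resolve_prediction_input is a hypothetical name and does not come from mepylome.

```python
def resolve_prediction_input(ids=None, values=None):
    """Return ("ids", list_of_ids) or ("values", values).

    Raises ValueError unless exactly one of the two arguments is set.
    """
    # Both None or both set: the comparison is True in either invalid case.
    if (ids is None) == (values is None):
        raise ValueError("Exactly one of 'ids' or 'values' must be set.")
    if ids is not None:
        # A single ID string is wrapped so later code always sees a list.
        id_list = [ids] if isinstance(ids, str) else list(ids)
        return ("ids", id_list)
    return ("values", values)


single = resolve_prediction_input(ids="sample_1")
several = resolve_prediction_input(ids=("sample_1", "sample_2"))
```

Comparing the two `is None` checks for equality catches both invalid cases (neither argument set, or both set) in a single condition.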

cn_summary(ids)[source]

Create a copy number summary plot for the given samples.

This method generates an overview of CNV gain and loss patterns across chromosomes for a list of sample IDs. It returns both the visual plot and the data used to generate it.

Parameters:
  • ids (list of str) – A list of sample IDs to include in the CNV summary.

Returns:

  • plot (plotly.graph_objects.Figure): A Plotly figure showing CNV summary results.

  • df_cn_summary (pd.DataFrame): A DataFrame containing the data behind the plot.

Return type:

tuple

Raises:

ValueError – If do_seg is not True. CNV summary plots require segmentation to be enabled.

compute_umap()[source]

Applies the UMAP algorithm on ‘betas_sel’.

Saves the 2D embedding in ‘umap_df’ and on disk.

Raises:

AttributeError – If a dimension mismatch occurs, or if ‘betas_sel’ is not set.

Return type:

None

property cpgs: ndarray

Array of CpG sites to analyze, sorted in order.

When setting, the input should be the same as the cpgs argument in the constructor (__init__).

Raises:
  • ValueError – If the provided cpgs value is not a valid type or format.

get_app()[source]

Returns a Dash application object for methylation analysis.

Return type:

Dash

get_cnv(sample_id, extract=None)[source]

Retrieves the CNV information for a specified sample.

This method locates the IDAT file corresponding to the provided sample_id, processes it to generate CNV data if not already available, and reads the resulting CNV information from disk.

Parameters:
  • sample_id (str) – The identifier for the sample whose CNV data is to be retrieved.

  • extract (list) – Specifies the data to extract from the CNV analysis. Available options include:

    • “bins”: Raw CNV data at the bin level.

    • “detail”: Detailed CNV information (generally genes).

    • “segments”: Segmented CNV regions.

    • “metadata”: CNV analysis metadata.

Returns:

A tuple containing the following elements:
  • bins (DataFrame): Data representing CNV bins.

  • detail (DataFrame): Gene CNV information.

  • segments (DataFrame): Segmented CNV data.

If CNV data is not found or cannot be generated, returns None for each extract value.

Return type:

tuple
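The shape of the returned tuple, with one entry per requested extract value and None where no CNV data is available, can be sketched as follows. The helper select_cnv_parts, the dict input, the default of three parts, and the ValueError for unknown options are assumptions for illustration and are not taken from mepylome.

```python
VALID_PARTS = ("bins", "detail", "segments", "metadata")


def select_cnv_parts(cnv_data, extract=None):
    """Return one tuple entry per requested part, None where data is missing.

    cnv_data is a hypothetical dict of already-loaded CNV results, or None
    if the CNV data could not be found or generated.
    """
    # Assumed default: the three parts listed in the documented return value.
    if extract is None:
        requested = ["bins", "detail", "segments"]
    else:
        requested = list(extract)
    unknown = [part for part in requested if part not in VALID_PARTS]
    if unknown:
        raise ValueError(f"Unknown extract option(s): {unknown}")
    if cnv_data is None:
        return tuple(None for _ in requested)
    return tuple(cnv_data.get(part) for part in requested)


missing = select_cnv_parts(None, ["bins", "segments"])
ordered = select_cnv_parts({"bins": "B", "segments": "S"}, ["segments", "bins"])
```

Because the tuple follows the order of the requested parts, a caller can unpack it directly, for example `segments, bins = select_cnv_parts(data, ["segments", "bins"])`.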

property idat_handler: IdatHandler

Handles the management of IDAT files and associated metadata.

Returns:

An instance of IdatHandler configured with current settings.

Return type:

IdatHandler

make_cnv_plot(sample_id, genes_sel=None)[source]

Generates a copy number variation (CNV) plot for a specific sample.

This method generates a CNV plot for the specified sample and optionally highlights specific genes within the plot.

Parameters:
  • sample_id (str) – ID of the sample for which CNV plot is generated.

  • genes_sel (list or None, optional) – List of specific genes to highlight in the plot.

Raises:

FileNotFoundError – If the specified sample ID is not found in the analysis directory or if the reference directory does not exist.

Return type:

None

make_umap()[source]

Generates the UMAP plot.

This method extracts the beta values required for UMAP computation, computes the UMAP 2D embedding, and creates and displays the UMAP plot based on the computed embedding.

Return type:

None

make_umap_plot()[source]

Generates a UMAP plot from the given 2D embedding.

Generates the UMAP plot from the data provided in ‘umap_df’. The scatter plot color is based on selected columns in ‘idat_handler.selected_columns’.

Raises:

AttributeError – If a dimension mismatch occurs, or if ‘umap_df’ is not set.

Return type:

None

mlh1_report_pages(ids)[source]

Generate MLH1 promoter methylation report HTML pages.

Parameters:

ids (list of str) – Sample IDs.

Returns:

HTML reports, one per sample.

Return type:

list of str

precompute_cnvs(ids=None)[source]

Precalculates CNVs for all samples and saves them to disk.

This method performs CNV analysis, and writes the output to the configured CNV directory. If ids is not provided, the method will compute CNVs for all samples found in the analysis_dir.

Parameters:

ids (list, optional) – A list of sample IDs to process. If None, the function will compute CNVs for all samples in the analysis_dir. Default is None.

Return type:

None

Note

Precalculating CNVs improves performance but requires additional disk space in the output directory.

read_umap_plot_from_disk()[source]

Reads UMAP plot from disk if available from previous analysis.

Return type:

None

run_app(*, open_tab=False)[source]

Runs the mepylome Dash application.

Parameters:

open_tab (bool, optional) – Whether to automatically open a new browser tab with the application URL. Defaults to False.

Return type:

None

set_betas()[source]

Sets the beta values DataFrame (‘betas_sel’) for further analysis.

This method reads the IDAT files located in ‘analysis_dir’, extracts the beta values, and saves them locally in ‘output_dir’. Depending on the configuration (‘cpg_selection’ and ‘load_full_betas’ flags), it either extracts a subset of CpGs for UMAP computation or loads all CpGs for subsequent processing into memory.

Raises:

ValueError – If no valid samples are found.

Return type:

None

mepylome.analysis.methyl_aux

Auxiliary methods for the methylation analysis.

class mepylome.analysis.methyl_aux.IdatHandler(analysis_dir, *, annotation=None, test_dir=None, test_ids=None, overlap=False, analysis_ids=None)[source]

A class for handling IDAT files with annotation.

Includes reading annotation from various file formats and provides description lookups for methylation classes.

Parameters:
  • analysis_dir (str or Path) – The directory where the IDAT files are located.

  • annotation (str or Path, optional) – The path to the annotation file. Defaults to None.

  • test_dir (Path or None, optional) – Directory for test files, including new cases or validation IDAT files or other test cases. Defaults to None.

  • overlap (bool, optional) – If True, restricts the sample paths to only those present in both the IDAT files and the annotation file. Defaults to False.

  • analysis_ids (list, optional) –

    A list of sample IDs within analysis_dir.

    • If provided, only these samples will be used.

    • If None, all available IDAT files in analysis_dir will be used.

    Defaults to None.

    Note: The IDs may be converted to Sentrix format during initialization if the IDs in the annotation and IDs in analysis_dir do not match directly.

  • test_ids (list, optional) –

    A list of sample IDs within test_dir.

    • If provided, only these samples will be used.

    • If None, all available IDAT files in test_dir will be used.

    Defaults to None.

    Note: The IDs may be converted to Sentrix format during initialization if the IDs in the annotation and IDs in analysis_dir do not match directly.
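As background for the Sentrix conversion mentioned in the notes above: a Sentrix ID combines the numeric chip barcode with the array position (for example R01C01), and IDAT file names usually end in _Grn.idat or _Red.idat. The following sketch shows one way to recover such IDs from file names. The helper names and the regular expressions are assumptions, not mepylome's own conversion logic.

```python
import re
from pathlib import Path

# Sentrix ID: numeric barcode, underscore, array position such as R01C01.
SENTRIX_RE = re.compile(r"(\d+_R\d{2}C\d{2})")


def sentrix_id(path):
    """Return the Sentrix ID embedded in an IDAT file name, or None."""
    match = SENTRIX_RE.search(Path(path).name)
    return match.group(1) if match else None


def idat_basename(path):
    """Strip the channel suffix (_Grn/_Red) and the .idat(.gz) extension."""
    return re.sub(r"_(Grn|Red)\.idat(\.gz)?$", "", Path(path).name)


prefixed = sentrix_id("/data/GSM123_201234567890_R01C01_Grn.idat")
plain = idat_basename("/data/201234567890_R01C01_Red.idat.gz")
```

Matching on the Sentrix pattern rather than on the full file name lets IDs with an extra prefix (such as a GEO accession) still line up with annotation entries that list only the Sentrix ID.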

analysis_dir

The directory path where the IDAT files are located.

Type:

Path

test_dir

Directory for test files, including new cases or validation IDAT files or other test cases. Defaults to None.

Type:

Path or None, optional

overlap

A flag indicating whether to restrict sample paths to only those present in both the IDAT files and the annotation file.

Type:

bool
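The effect of the overlap flag can be sketched as an order-preserving intersection between the sample IDs found as IDAT files and those listed in the annotation. restrict_to_overlap is a hypothetical helper used only for illustration.

```python
def restrict_to_overlap(idat_ids, annotation_ids, overlap=False):
    """Return IDAT sample IDs, optionally limited to annotated ones.

    The order of idat_ids is preserved.
    """
    if not overlap:
        return list(idat_ids)
    annotated = set(annotation_ids)  # set lookup keeps the filter linear
    return [sample_id for sample_id in idat_ids if sample_id in annotated]


idat_ids = ["s1", "s2", "s3"]
annotation_ids = ["s3", "s1", "s9"]
kept = restrict_to_overlap(idat_ids, annotation_ids, overlap=True)
```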

id_to_path

A dictionary where the keys are sample IDs and the values are the file paths of IDAT files (from both analysis_dir and test_dir).

Type:

dict

annotation

The path to the annotation file. Defaults to None. If not provided, the first spreadsheet file found in self.analysis_dir will be used as the annotation.

Type:

Path

annotation_df

A DataFrame containing the annotation data, if loaded.

Type:

pandas.DataFrame or None

samples_annotated

A DataFrame containing the samples as index and the annotation in the columns.

Type:

pandas.DataFrame or None

selected_columns

A list of selected columns from the annotated samples, initialized with the first column.

Type:

list

analysis_ids

A list of sample IDs from analysis_dir that are actually used after filtering and optional conversion to Sentrix IDs.

Type:

list

test_ids

A list of sample IDs from test_dir that are actually used after filtering and optional conversion to Sentrix IDs.

Type:

list

Raises:

ValueError

  • If any sample in analysis_ids is not found in analysis_dir.

  • If any sample in test_ids is not found in test_dir.

features(columns=None, separator='|')[source]

Combines specified columns into a single label per sample.

If columns is not provided, it defaults to the first column in samples_annotated or selected_columns if they are available. The function joins the values from the specified columns for each sample, converting them to strings and joining them with the specified separator.

Parameters:
  • columns (list, str, or None) – List of column names (or a single column name) to use for creating the label. If None, defaults to the first column in samples_annotated or selected_columns if not None.

  • separator (str) – The separator used to join values from the columns. Default is “|”.

Returns:

A Series of combined labels, indexed by sample IDs.

Return type:

pd.Series

Example

>>> idat_handler.features(columns=["GEO", "CNVs"])
sample_1    SGT_103|Balanced
sample_2    SGT_056|Balanced
sample_3    SGT_276|Balanced
dtype: object
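The joining step behind the example above can be sketched without pandas. The helper combine_labels and its dict-of-dicts input are assumptions made for illustration; the actual method works on samples_annotated and returns a pd.Series.

```python
def combine_labels(rows, columns, separator="|"):
    """Join the chosen column values of each sample into one label.

    rows maps a sample ID to {column name: value}; values are cast to str.
    """
    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]
    return {
        sample_id: separator.join(str(row[col]) for col in columns)
        for sample_id, row in rows.items()
    }


rows = {
    "sample_1": {"GEO": "SGT_103", "CNVs": "Balanced"},
    "sample_2": {"GEO": "SGT_056", "CNVs": "Balanced"},
}
combined = combine_labels(rows, ["GEO", "CNVs"])
```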
init_parameters()[source]

Returns the initialization attributes.

Return type:

dict[str, Any]

mepylome.analysis.methyl_clf

Contains methods for supervised learning.

Supervised classifiers (random forest, k-nearest neighbors, neural networks) for predicting the methylation class.

class mepylome.analysis.methyl_clf.fit_and_evaluate_clf(X, y, X_test, id_test, save_path, clf, cv, n_jobs=1)[source]

Predicts the methylation class using a supervised learning classifier.

Uses supervised machine learning classifiers (Random Forest, K-Nearest Neighbors, Neural Networks, SVM, …) to predict the methylation class of the sample. Output will be written to disk.

Parameters:
  • X (pd.DataFrame) – Feature matrix (rows as samples, columns as features).

  • y (array-like) – Class labels.

  • X_test (array-like) – Value of the sample to be evaluated.

  • id_test (str) – Unique identifiers for the test samples to be evaluated.

  • save_path (str or Path) – Path where the classifiers and results will be saved/cached.

  • clf (list) –

    Classifier to use. Can be:

    • A scikit-learn classifier object or pipeline (trained or untrained).

    • A string in the format “scaler-selector-classifier”.

    • A pipeline string composed of arbitrary components joined by dashes (“-”). Each component can be specified using either an abbreviation or the full class name (e.g., “std” or “StandardScaler”). Possible values are:

      scaler:
      • ”std”: Standard scaling (StandardScaler).

      • ”minmax”: Min-max scaling (MinMaxScaler).

      • ”robust”: Robust scaling (RobustScaler).

      • ”power”: Power transformation (PowerTransformer).

      • ”quantile”: Quantile transformation (QuantileTransformer).

      selector:
      • ”kbest”: Select the best features (SelectKBest).

      • ”top”: Top varying features (TopVarianceSelector).

      • ”pca”: Principal component analysis (PCA).

      • ”pca_auto”: Principal component analysis (PCA). Number of components is determined automatically.

      • ”lda”: Linear Discriminant Analysis (LDA).

      clf:
      • ”rf”: RandomForestClassifier.

      • ”lr”: LogisticRegression.

      • ”et”: ExtraTreesClassifier.

      • ”knn”: KNeighborsClassifier.

      • ”mlp”: MLPClassifier.

      • ”svc”: Support Vector Classifier (SVC).

      • ”ada”: AdaBoostClassifier.

      • ”bag”: BaggingClassifier.

      • ”dt”: DecisionTreeClassifier.

      • ”gp”: GaussianProcessClassifier.

      • ”hgb”: HistGradientBoostingClassifier.

      • ”nb”: GaussianNB.

      • ”perceptron”: Perceptron.

      • ”qda”: Quadratic Discriminant Analysis (QDA).

      • ”ridge”: RidgeClassifier.

      • ”sgd”: SGDClassifier.

      Example: Using a feature selector and a classifier (SelectKBest selection and Logistic Regression): “kbest-lr”

    • A custom class that inherits from TrainedClassifier.

  • cv (int or cross-validation generator) – Determines the cross-validation splitting strategy.

  • n_jobs (int) – Number of parallel processes to run.

Returns:

  • prediction (DataFrame): DataFrame containing the predicted probabilities for each class.

  • model (object): The trained classifier object.

  • metrics (dict): Dict containing classifier metrics.

  • reports (dict): Dict of evaluation reports (both ‘txt’ and ‘html’) for each sample.

Return type:

ClassifierResult
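How a dash-separated pipeline string such as “kbest-lr” breaks down into components can be sketched with a small parser. The lookup tables below cover only a subset of the abbreviations listed for clf, and parse_pipeline_string is a hypothetical helper. The sketch only resolves names and does not build scikit-learn objects, so it says nothing about how mepylome constructs the actual pipeline.

```python
# Subset of the documented abbreviations, mapped to their class names.
SCALERS = {
    "std": "StandardScaler",
    "minmax": "MinMaxScaler",
    "robust": "RobustScaler",
    "power": "PowerTransformer",
    "quantile": "QuantileTransformer",
}
SELECTORS = {
    "kbest": "SelectKBest",
    "top": "TopVarianceSelector",
    "pca": "PCA",
}
CLASSIFIERS = {
    "rf": "RandomForestClassifier",
    "lr": "LogisticRegression",
    "et": "ExtraTreesClassifier",
    "knn": "KNeighborsClassifier",
    "mlp": "MLPClassifier",
    "svc": "SVC",
}
ROLES = (("scaler", SCALERS), ("selector", SELECTORS), ("clf", CLASSIFIERS))


def parse_pipeline_string(spec):
    """Split a pipeline string into (role, class name) pairs.

    Each dash-separated token may be an abbreviation or a full class name.
    """
    steps = []
    for token in spec.split("-"):
        for role, table in ROLES:
            if token in table:  # abbreviation, e.g. "std"
                steps.append((role, table[token]))
                break
            if token in table.values():  # full class name, e.g. "StandardScaler"
                steps.append((role, token))
                break
        else:
            # No table knew this token.
            raise ValueError(f"Unknown pipeline component: {token!r}")
    return steps


short_spec = parse_pipeline_string("kbest-lr")
mixed_spec = parse_pipeline_string("StandardScaler-pca-rf")
```

Here “kbest-lr” resolves to a SelectKBest selector followed by a LogisticRegression classifier, which matches the example given in the parameter description.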