Tutorial (library)
This tutorial will guide you through the steps to analyze methylation data
using the mepylome package. We’ll cover setting up the environment,
parsing IDAT files, working with manifest files, processing raw data,
performing copy number variation (CNV) analysis, and conducting methylation
analysis using UMAP plots with a GUI. You can find the code of this tutorial
as a script in
https://github.com/brj0/mepylome/blob/main/examples/rtd_tutorial.py.
Setup
First, import the required modules and set up the necessary directories:
>>> from pathlib import Path
>>> from mepylome import (
>>> CNV,
>>> IdatParser,
>>> Manifest,
>>> MethylData,
>>> RawData,
>>> ReferenceMethylData,
>>> )
Setup directories for the tutorial:
>>> DIR = Path.home() / "mepylome" / "tutorial"
>>> ANALYSIS_DIR = DIR / "tutorial_analysis"
>>> REFERENCE_DIR = DIR / "tutorial_reference"
Download and setup the necessary data files (approximately 653 MB):
>>> from mepylome.utils import setup_tutorial_files
>>> setup_tutorial_files(ANALYSIS_DIR, REFERENCE_DIR)
Parsing IDAT files
Define the IDAT file path:
>>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat"
Parse the IDAT file:
>>> idat_data = IdatParser(idat_file)
Print overview of parsed data:
>>> print(idat_data)
IdatParser(
file_size: 13686991
num_fields: 19
illumina_ids: array([ 1600101, 1600111, ..., 99810990, 99810992], dtype=int32)
probe_means: array([15629, 8469, ..., 7971, 943], dtype=uint16)
std_dev: array([1377, 408, ..., 702, 312], dtype=uint16)
n_beads: array([16, 7, ..., 6, 10], dtype=uint8)
mid_block: array([ 1600101, 1600111, ..., 99810990, 99810992], dtype=int32)
red_green: 0
mostly_null:
barcode: 200925700125
chip_type: BeadChip 8x5
mostly_a: R07C01
unknown_1:
unknown_2:
unknown_3:
unknown_4:
unknown_5:
unknown_6:
unknown_7:
)
The parsed data is available as attributes of the IdatParser object. For
example the Illumina IDs (probes IDs) can be accessed by:
>>> ids = idat_data.illumina_ids
>>> print(ids)
[ 1600101 1600111 1600115 ... 99810978 99810990 99810992]
Manifest files
The mepylome package includes a Manifest class that provides
functionality to download, process, and save Illumina manifest files
internally in a efficient format (stored in ~/.cache/mepylome/manifests). These
manifest files contain information about the CpG sites on the methylation
array, including genetic coordinates, probe types, and more.
Load the available manifest files for different array types.
>>> manifest_450k = Manifest("450k")
>>> manifest_epic = Manifest("epic")
>>> manifest_epic_v2 = Manifest("epicv2")
>>> manifest_msa48 = Manifest("msa48")
Note
The first time you run this, the manifest files will be downloaded and saved locally to ~/.cache/mepylome/manifests. This initial download might take some time.
Obtain values from attributes:
>>> probes_df = manifest_450k.data_frame
>>> controls_df = manifest_450k.control_data_frame
Print overview:
>>> print(probes_df)
IlmnID AddressA_ID AddressB_ID ... N_CpG End Probe_Type
0 cg13869341 62703328 16661461 ... 2 15865 1
1 cg14008030 27651330 -1 ... 2 18827 2
2 cg12045430 25703424 34666387 ... 7 29407 1
3 cg20826792 61731400 14693326 ... 7 29425 1
4 cg00381604 26752380 50693408 ... 6 29435 1
... ... ... ... ... ... ... ...
485572 rs1416770 28667385 -1 ... 0 -1 4
485573 rs1941955 33709340 -1 ... 0 -1 4
485574 rs2125573 25698376 -1 ... 0 -1 4
485575 rs2521373 12625304 -1 ... 0 -1 4
485576 rs4331560 10654345 -1 ... 0 -1 4
[485577 rows x 12 columns]
RawData
The RawData class extracts both raw green and raw red signal intensity
data from a IDAT file pair. You can initialize it using a base path to the
IDAT files (without the _Grn.idat / _Red.idat suffix), or by providing the
full path to either the Grn or Red IDAT file.
>>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Red.idat"
>>> # or
>>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat"
>>> # or
>>> idat_file = ANALYSIS_DIR / "200925700125_R07C01"
>>> raw_data = RawData(idat_file)
The data is saved within the following attributes:
>>> # Intensity signals
>>> raw_data.grn
>>> raw_data.red
>>> # Type of the array_type (e.g., 450k, EPIC)
>>> raw_data.array_type
>>> # Corresponding manifest file
>>> raw_data.manifest
>>> # IDs on the bead
>>> raw_data.ids
Print an overview of the raw data
>>> print(raw_data)
RawData():
**********
array_type: epic
manifest: epic
probes:
['200925700125_R07C01']
ids:
[ 1600101 1600111 1600115 ... 99810978 99810990 99810992]
_grn:
[[15629 8469 7015 ... 10228 7971 943]]
_red:
[[ 4429 1575 24955 ... 6594 15010 5336]]
grn:
200925700125_R07C01
1600101 15629
1600111 8469
1600115 7015
1600123 7975
1600131 938
... ...
99810958 6292
99810970 318
99810978 10228
99810990 7971
99810992 943
[1052641 rows x 1 columns]
red:
200925700125_R07C01
1600101 4429
1600111 1575
1600115 24955
1600123 17707
1600131 8967
... ...
99810958 1881
99810970 1936
99810978 6594
99810990 15010
99810992 5336
[1052641 rows x 1 columns]
RawData can also read multiple files of the same array type (used for reference files):
>>> idat_file0 = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat"
>>> idat_file1 = ANALYSIS_DIR / "200925700133_R02C01"
>>> raw_data_2 = RawData([idat_file0, idat_file1])
Alternatively, read all IDAT files in a directory (supports recursive search):
>>> raw_data_all = RawData(REFERENCE_DIR)
MethylData
The MethylData class allows for processing raw intensity data and can
calculate methylation signals as well as beta values. The raw data can be
preprocessed using one of the following methods: ‘illumina’ (default),
‘swan’, or ‘noob’. Initialize MethylData with raw data using the default
‘illumina’ preprocessing method.
>>> methyl_data = MethylData(raw_data)
>>> methyl_data_all = MethylData(raw_data_all)
Alternatively, you can explicitly specify the ‘illumina’ preprocessing method.
>>> methyl_data = MethylData(raw_data, prep="illumina")
You can also initialize MethylData directly from an IDAT file path, without
using RawData. This is the preferred method if you want to obtain
methylation signals or beta values.
>>> methyl_data = MethylData(file=idat_file)
Obtain various values via the attributes of the MethylData object:
>>> # The methylation signals for the green and red channels.
>>> methylated_signals = methyl_data.methylated
>>> unmethylated_signals = methyl_data.unmethylated
>>> # The corrected color signals.
>>> corrected_green_signals = methyl_data.grn
>>> corrected_red_signals = methyl_data.red
>>> # The type of the array used (e.g., 450k, EPIC, EPICv2).
>>> array_type = methyl_data.array_type
>>> # The corresponding manifest file.
>>> corresponding_manifest = methyl_data.manifest
Print an overview of the methylation data.
>>> print(methyl_data)
MethylData():
*************
array_type: epic
manifest: epic
probes:
['200925700125_R07C01']
_grn:
[[16785.56811897 9044.70326442 7472.74551323 ... 10946.40456039
8506.302329 908.14615612]]
_red:
[[ 3957.99684771 1303.20883282 23051.26201716 ... 5971.87773497
13800.43272211 4801.6873626 ]]
grn:
200925700125_R07C01
1600101 16785.568359
1600111 9044.703125
1600115 7472.745605
1600123 8510.626953
1600131 902.740540
... ...
99810958 6691.091309
99810970 232.442169
99810978 10946.404297
99810990 8506.302734
99810992 908.146179
[1052641 rows x 1 columns]
red:
200925700125_R07C01
1600101 3957.996826
1600111 1303.208862
1600115 23051.261719
1600123 16309.179688
1600131 8179.240234
... ...
99810958 1587.849731
99810970 1639.010620
99810978 5971.877930
99810990 13800.432617
99810992 4801.687500
[1052641 rows x 1 columns]
methylated:
200925700125_R07C01
IlmnID
cg14817997 2510.375488
cg26928153 10454.492188
cg16269199 7020.834473
cg13869341 30160.773438
cg14008030 19805.154297
... ...
cg10488260 2282.708496
cg14273923 12481.604492
cg09748881 6418.647461
cg07587934 9533.372070
cg16855331 5837.001465
[865859 rows x 1 columns]
unmethylated:
200925700125_R07C01
IlmnID
cg14817997 855.783081
cg26928153 926.525330
cg16269199 3892.054932
cg13869341 5986.760742
cg14008030 8202.495117
... ...
cg10488260 8199.704102
cg14273923 2489.212646
cg09748881 950.663391
cg07587934 5264.926270
cg16855331 15618.041992
[865859 rows x 1 columns]
Beta values are a indicator wheather a CpG is methylated or not. They can be calculated for all sites of the corresponding array:
>>> betas = methyl_data.betas
You can also access the beta values at specific CpG sites (here at all the sites of the EPICv2 manifest). Missing data will be replaced with fill.
>>> epicv2_cpgs = manifest_epic_v2.methylation_probes
>>> beta_specific = methyl_data.betas_at(epicv2_cpgs, fill=0.5)
Using Alternative Preprocessing Methods
Preprocess the raw data using the SWAN method.
>>> methyl_data_swan = MethylData(raw_data, prep="swan")
Preprocess the raw data using the NOOB method.
>>> methyl_data_noob = MethylData(raw_data, prep="noob")
See api for more information about SWAN and NOOB.
Copy number variation (CNV)
Copy number variations (CNV) are significant alterations in the genome involving the loss or gain of large DNA segments, often encompassing multiple genes. These variations are frequently linked to cancer development and can aid in tumor classification. The CNV profile can be calculated from signal intensity using methylation arrays. With the mepylome package, CNV can be efficiently calculated and visualized.
1. Set up analysis file
First, initialize your sample data for analysis:
>>> sample = methyl_data
2. Set Up Reference Data
Within the reference directory there must be multiple CNV-neutral IDAT pairs of the same array type as sample.
>>> reference = MethylData(file=REFERENCE_DIR)
Alternatively, if the reference directory contains IDAT files of multiple
array types, you can use ReferenceMethylData to load all files into
memory. This way, the reference object can be used for multiple array types.
The CNV class will automatically extract the files for the needed array type.
>>> reference_all = ReferenceMethylData(REFERENCE_DIR)
3. Initialize CNV Analysis
Create an instance of the CNV class for the analysis, and fit the data (this is basically a linear regression model comparing sample signal with the reference signals at each CpG site):
>>> cnv = CNV(sample, reference)
>>> # Alternative with ReferenceMethylData
>>> cnv = CNV(sample, reference_all)
4. Calculate CNV for Bins and Genes
Compute CNV values for genomic bins and gene regions:
>>> cnv.set_bins()
>>> cnv.set_detail()
5. Calculate CNV Segments
Use the binary circular segmentation algorithm for genome segmentation:
>>> cnv.set_segments()
Note
For this step, additional packages must be installed (see installation).
6. Streamlined Analysis
Alternatively, perform all CNV computations in a single call:
>>> cnv = CNV.set_all(sample, reference)
>>> # or
>>> cnv = CNV.set_all(sample, reference_all)
7. Visualize CNV Data
Display an interactive plot using Plotly, where genes can be highlighted:
>>> cnv.plot()
Methylation Analysis
1. Set up analysis object and run GUI in browser
For methylation analysis, ensure you have the setup described in General Setup
First, import the MethylAnalysis class from the mepylome.analysis module.
>>> from mepylome.analysis import MethylAnalysis
Create an instance of MethylAnalysis with the specified analysis and reference directories.
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> )
You can print an overview of the parameters of the object:
>>> print(methyl_analysis)
MethylAnalysis():
*****************
analysis_dir:
/home/username/mepylome/tutorial/tutorial_analysis
annotation:
/home/username/mepylome/tutorial/tutorial_analysis/annotation.csv
app:
None
betas_sel:
None
betas_all:
None
betas_dir:
/tmp/mepylome/analysis/betas-tutorial_analysis-illumina-3b616e0e24a8b0e2d443b777b8ad8b61
cnv_dir:
/tmp/mepylome/analysis/cnv-tutorial_analysis-tutorial_reference-illumina-False-f77099f7bcb04262a0456d122215ed4d
cnv_id:
None
cnv_plot:
Figure({
'data': [], 'layout': {'template': '...', 'yaxis': {'range': [-2, 2...
cpg_selection:
top
cpgs:
['cg00000029' 'cg00000103' 'cg00000109' ... 'ch.X.97651759F'
'ch.X.97737721F' 'ch.X.98007042R']
debug:
False
do_seg:
False
dropdown_id:
[]
host:
localhost
ids:
Index(['200925700133_R04C01', '201530470054_R05C01', '201869690203_R03C01',
'201904410008_R05C01', '201869690168_R08C01', '201530470054_R01C01',
'201869690203_R06C01', '200925700133_R03C01', '201904410008_R02C01',
'201904410008_R04C01', '201530470054_R02C01', '201530470054_R03C01',
'201904410008_R03C01', '201870610040_R03C01', '200925700133_R02C01',
'200925700125_R07C01', '201530470054_R04C01', '201870610040_R04C01',
'200925700133_R05C01', '201904410008_R06C01'],
dtype='object')
ids_to_highlight:
None
load_full_betas:
False
n_cpgs:
25000
output_dir:
/tmp/mepylome/analysis
overlap:
False
port:
8050
precalculate_cnv:
False
prep:
illumina
raw_umap_plot:
Figure({
'data': [{'customdata': array([['Chondrosarcoma', 'methylation clas...
reference_dir:
/home/username/mepylome/tutorial/tutorial_reference
selected_columns:
['Diagnosis']
umap_cpgs:
None
umap_df:
Umap_x Umap_y ... Colour Umap_color
200925700133_R04C01 7.089545 1.715319 ... 2E3092 Osteosarcoma (high-grade)
201530470054_R05C01 8.902815 1.723695 ... 6B66AE Osteoblastoma
201869690203_R03C01 3.801413 5.398326 ... 6282C2 Chondrosarcoma
201904410008_R05C01 6.164794 5.332215 ... 7C7E82 Control (muscle tissue)
201869690168_R08C01 3.486633 4.991620 ... 6282C2 Chondrosarcoma
201530470054_R01C01 8.228397 1.753262 ... 6B66AE Osteoblastoma
201869690203_R06C01 3.823109 4.597386 ... 6282C2 Chondrosarcoma
200925700133_R03C01 7.502202 1.312939 ... 2E3092 Osteosarcoma (high-grade)
201904410008_R02C01 6.569446 5.207128 ... 7C7E82 Control (muscle tissue)
201904410008_R04C01 6.959955 4.955368 ... 7C7E82 Control (muscle tissue)
201530470054_R02C01 8.313415 2.253792 ... 6B66AE Osteoblastoma
201530470054_R03C01 8.599755 1.344881 ... 6B66AE Osteoblastoma
201904410008_R03C01 6.801136 5.651977 ... 7C7E82 Control (muscle tissue)
201870610040_R03C01 4.220203 5.398547 ... 6282C2 Chondrosarcoma
200925700133_R02C01 6.599361 2.232107 ... 2E3092 Osteosarcoma (high-grade)
200925700125_R07C01 6.635824 1.511574 ... 2E3092 Osteosarcoma (high-grade)
201530470054_R04C01 8.751877 2.230259 ... 6B66AE Osteoblastoma
201870610040_R04C01 4.270664 4.871753 ... 6282C2 Chondrosarcoma
200925700133_R05C01 7.345413 2.310878 ... 2E3092 Osteosarcoma (high-grade)
201904410008_R06C01 6.281984 4.778615 ... 7C7E82 Control (muscle tissue)
[20 rows x 13 columns]
umap_dir:
/tmp/mepylome/analysis/umap-tutorial_analysis-illumina-25000-top-6ff1e8e7ec2b3a2f8d856ee634404085
umap_plot_path:
/tmp/mepylome/analysis/umap-tutorial_analysis-illumina-25000-top-6ff1e8e7ec2b3a2f8d856ee634404085/umap_plot.csv
upload_dir:
/tmp/mepylome/analysis/upload-tutorial_analysis-c8e2f8ac691e9c3deba2e880aa7c5251
To enable interactive analysis, you can launch a GUI. This will open a new tab in your web browser and you can start a GUI-based methylation analysis.
>>> methyl_analysis.run_app(open_tab=True)
MethylAnalysis has multiple parameters. For example, you can provide a custom set of CpG sites. For instance, you can set:
>>> cpgs = Manifest("epic").methylation_probes[:10000]
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> cpgs=cpgs,
>>> )
Mepylome saves all results in a temporary directory (‘/tmp/mepylome’). You can provide a custom directory for output:
>>> OUTPUT_DIR = DIR / "output_dir"
>>> OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> output_dir=OUTPUT_DIR,
>>> )
Here is a more comprehensive example with multiple custom parameters:
>>> TEST_DIR = DIR / "test_dir"
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> output_dir=OUTPUT_DIR,
>>> # New cases for validation, excluded from classifier training.
>>> test_dir=TEST_DIR,
>>> # Load beta values for all CpG sites into memory
>>> load_full_betas=True,
>>> # Use SWAN preprocessing method
>>> prep="swan",
>>> # Provide annotation file (if not already in analysis_dir)
>>> annotation=ANALYSIS_DIR / "annotation.csv",
>>> # Number of CpGs for UMAP analysis
>>> n_cpgs=5000,
>>> # Analyze the 'best' CpG sites
>>> cpg_selection="top",
>>> # Show segmentation intervals in CNV plot
>>> do_seg=True,
>>> )
Many parameters can be modified within the GUI application after initialization, but not all.
2. Set up beta values and generate UMAP
All calculations that can be performed within the GUI can also be done manually. For example, to extract the beta values:
>>> methyl_analysis.set_betas()
The beta values are then stored in:
>>> methyl_analysis.betas_sel
cg15836656 cg12823387 cg25563772 ... cg22115994 cg27601809 cg00444740
200925700133_R04C01 0.117822 0.058700 0.058626 ... 0.295462 0.738677 0.474659
201530470054_R05C01 0.087871 0.050942 0.065696 ... 0.119668 0.305578 0.105583
201869690203_R03C01 0.810276 0.735967 0.721423 ... 0.751743 0.200545 0.757986
201904410008_R05C01 0.892641 0.777778 0.840475 ... 0.731927 0.636772 0.737969
201869690168_R08C01 0.811684 0.886028 0.888461 ... 0.725186 0.186773 0.812097
201530470054_R01C01 0.065221 0.050916 0.024315 ... 0.069606 0.408356 0.102764
201869690203_R06C01 0.948498 0.892334 0.859396 ... 0.776809 0.080326 0.875806
200925700133_R03C01 0.123428 0.193491 0.173582 ... 0.406198 0.754906 0.231158
201904410008_R02C01 0.868791 0.838671 0.922514 ... 0.704358 0.789430 0.617703
201904410008_R04C01 0.791002 0.818421 0.770333 ... 0.707137 0.851243 0.643282
201530470054_R02C01 0.058704 0.092474 0.122352 ... 0.173067 0.606287 0.340654
201530470054_R03C01 0.046851 0.043868 0.106606 ... 0.050913 0.281679 0.151072
201904410008_R03C01 0.718867 0.819853 0.899481 ... 0.701382 0.788357 0.659056
201870610040_R03C01 0.783809 0.905159 0.851122 ... 0.822050 0.104641 0.862166
200925700133_R02C01 0.054682 0.076460 0.089566 ... 0.087220 0.665932 0.159535
200925700125_R07C01 0.048604 0.093311 0.129750 ... 0.457459 0.867225 0.323574
201530470054_R04C01 0.048784 0.036592 0.048904 ... 0.105748 0.214249 0.056179
201870610040_R04C01 0.870172 0.896742 0.842515 ... 0.659235 0.134309 0.786843
200925700133_R05C01 0.048411 0.153956 0.164447 ... 0.462702 0.662564 0.290133
201904410008_R06C01 0.825491 0.828089 0.876774 ... 0.752596 0.833333 0.645860
[20 rows x 5000 columns]
To perform the UMAP algorithm:
>>> methyl_analysis.compute_umap()
The result of the UMAP algorithm is then stored in:
>>> methyl_analysis.umap_df
Umap_x Umap_y ... ID Colour
200925700133_R04C01 6.644683 -7.024573 ... REFERENCE_SAMPLE 367 2E3092
201530470054_R05C01 8.254219 -8.493701 ... REFERENCE_SAMPLE 109 6B66AE
201869690203_R03C01 5.185830 -5.596291 ... REFERENCE_SAMPLE 78 6282C2
201904410008_R05C01 7.058737 -5.482265 ... REFERENCE_SAMPLE 2 7C7E82
201869690168_R08C01 4.727582 -5.662395 ... REFERENCE_SAMPLE 80 6282C2
201530470054_R01C01 8.136439 -7.440626 ... REFERENCE_SAMPLE 113 6B66AE
201869690203_R06C01 4.729284 -5.012035 ... REFERENCE_SAMPLE 76 6282C2
200925700133_R03C01 6.980153 -7.641444 ... REFERENCE_SAMPLE 368 2E3092
201904410008_R02C01 7.460121 -5.769195 ... REFERENCE_SAMPLE 5 7C7E82
201904410008_R04C01 8.094703 -5.589474 ... REFERENCE_SAMPLE 3 7C7E82
201530470054_R02C01 7.892371 -7.894600 ... REFERENCE_SAMPLE 112 6B66AE
201530470054_R03C01 7.611932 -8.378901 ... REFERENCE_SAMPLE 111 6B66AE
201904410008_R03C01 7.850667 -6.202358 ... REFERENCE_SAMPLE 4 7C7E82
201870610040_R03C01 5.266904 -4.993761 ... REFERENCE_SAMPLE 79 6282C2
200925700133_R02C01 6.583021 -7.879497 ... REFERENCE_SAMPLE 369 2E3092
200925700125_R07C01 6.138983 -7.349401 ... REFERENCE_SAMPLE 370 2E3092
201530470054_R04C01 8.566432 -7.849686 ... REFERENCE_SAMPLE 110 6B66AE
201870610040_R04C01 5.656568 -5.504724 ... REFERENCE_SAMPLE 77 6282C2
200925700133_R05C01 7.385612 -7.223170 ... REFERENCE_SAMPLE 366 2E3092
201904410008_R06C01 7.607561 -5.099674 ... REFERENCE_SAMPLE 1 7C7E82
[20 rows x 12 columns]
To generate the UMAP plot:
>>> methyl_analysis.make_umap_plot()
The Plotly object for the UMAP plot is then available in methyl_analysis.umap_plot.
3. CN-summary plots
Mepylome has the ability to calculate CN-summary plots, which provide an overview of copy number variations (CNV) across samples. To calculate these plots, you need to activate segmentation by setting ‘do_seg’ to True when initializing the analysis.
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> do_seg=True,
>>> )
The annotation dataframe contains metadata for each sample, including diagnosis information.
>>> annotation_df = methyl_analysis.idat_handler.samples_annotated
Loop through each unique diagnosis in the dataset to generate CN-summary plots for all diagnosis categories.
>>> for diagnosis in annotation_df["Diagnosis"].unique():
>>> # Filter sample IDs for the current diagnosis group.
>>> sample_ids = annotation_df[annotation_df["Diagnosis"] == diagnosis].index
>>> # Generate CN-summary plot and retrieve data (if needed for further
>>> # analysis).
>>> cn_plot, df_cn_summary = methyl_analysis.cn_summary(sample_ids)
>>> # Update the plot layout with a title for the diagnosis group and label
>>> # the y-axis.
>>> cn_plot.update_layout(
>>> title=f"CN-summary for {diagnosis}",
>>> yaxis_title="Proportion of CNV gains/losses",
>>> )
>>> # Display the CN-summary plot in the browser.
>>> cn_plot.show()
Supervised Classification
1. Preimplemented Classifier
Methylome allows the use of preimplemented classifiers to predict methylation classes for samples. Below are examples of performing predictions on sample IDs and randomly generated values. Do setup as before:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.ensemble import (
>>> ExtraTreesClassifier,
>>> RandomForestClassifier,
>>> )
>>> from sklearn.feature_selection import (
>>> SelectKBest,
>>> )
>>> from sklearn.pipeline import Pipeline
>>> from mepylome.analysis import MethylAnalysis
>>> methyl_analysis = MethylAnalysis(
>>> analysis_dir=ANALYSIS_DIR,
>>> reference_dir=REFERENCE_DIR,
>>> )
Perform prediction on sample IDs:
>>> # Retrieve all sample IDs
>>> ids = methyl_analysis.idat_handler.ids
>>> # Set the annotation column for classification
>>> methyl_analysis.idat_handler.selected_columns = ["Diagnosis"]
>>> # Perform classification using a preimplemented classifier (pipeline with no
>>> # scaler, SelectKBest as selector and RandomForestClassifier for
>>> # classification) and cross-validation
>>> clf_out = methyl_analysis.classify(ids=ids, clf_list="vtl-kbest-rf")
>>> # Output accuracy scores:
>>> print("Accuracy Scores:", clf_out[0].metrics["accuracy_scores"])
Accuracy Scores: [1.0, 1.0, 1.0, 1.0, 1.0]
>>> # Detailed classifier report of first sample:
>>> print(clf_out[0].reports["txt"][0])
201530470054_R05C01
===================
Pipeline Structure:
- Scaler : passthrough
- Feature_selection : SelectKBest
k: 10000
- Classifier : RandomForestClassifier
Metrics:
- Method : 5-fold cross validation
- Samples : 20
- Features : 865859
- Accuracy : 1.0000 (SD 0.0000)
- AUC : 1.0000 (SD 0.0000)
- F1-Score : 1.0000 (SD 0.0000)
- Precision : 1.0000 (SD 0.0000)
- Recall : 1.0000 (SD 0.0000)
Classification Probability:
--------------------------------------
- Osteoblastoma : 74.00 %
- Osteosarcoma (high-grade) : 23.00 %
- Control (muscle tissue) : 2.00 %
- Chondrosarcoma : 1.00 %
--------------------------------------
Perform prediction on beta values:
>>> # Number of CpGs in the dataset
>>> methyl_analysis.set_betas()
>>> n_cpgs = methyl_analysis.betas_all.shape[1]
>>> # Generate random beta values for 10 artificial samples
>>> random_beta_values = pd.DataFrame(
>>> np.random.rand(10, n_cpgs), columns=methyl_analysis.betas_all.columns
>>> )
>>> # Perform classification on random values:
>>> clf_out = methyl_analysis.classify(
>>> values=random_beta_values, clf_list="vtl-kbest-rf"
>>> )
2. Train Your Own Classifiers
You can train custom classifiers using the scikit-learn API and incorporate them into the analysis. Below is an example using a Random Forest classifier.
>>> # Set the annotation column for classification
>>> methyl_analysis.idat_handler.selected_columns = ["Diagnosis"]
>>> # Extract features (X) and target labels (y)
>>> X = methyl_analysis.betas_all
>>> y = methyl_analysis.idat_handler.features()
>>> # Exclude test indices and invalid samples
>>> valid_indices = [
>>> i
>>> for i, (idx, label) in enumerate(zip(X.index, y, strict=True))
>>> if label and idx not in methyl_analysis.idat_handler.test_ids
>>> ]
>>> X = X.iloc[valid_indices]
>>> y = [y[i] for i in valid_indices]
>>> # Choose classifier (maybe with tuned parameters)
>>> rf_clf = RandomForestClassifier(
>>> n_estimators=500,
>>> max_depth=20,
>>> min_samples_split=5,
>>> min_samples_leaf=2,
>>> bootstrap=True,
>>> class_weight="balanced",
>>> random_state=42,
>>> )
>>> # Train the classifier
>>> rf_clf.fit(X, y)
>>> # If the classifier is already trained, mepylome will not repeat the training
>>> # (and cross-validation) process
>>> ids = methyl_analysis.idat_handler.ids
>>> clf_out = methyl_analysis.classify(ids=ids, clf_list=rf_clf)
>>> print(clf_out[0].reports["txt"][0])
201530470054_R05C01
===================
Pipeline Structure:
- Classifier: RandomForestClassifier
class_weight: balanced
max_depth: 20
min_samples_leaf: 2
min_samples_split: 5
n_estimators: 500
random_state: 42
Metrics:
- Samples : 20
- Features : 865859
Classification Probability:
--------------------------------------
- Osteoblastoma : 83.19 %
- Osteosarcoma (high-grade) : 9.87 %
- Control (muscle tissue) : 4.49 %
- Chondrosarcoma : 2.45 %
--------------------------------------
3. Use Untrained Scikit-Learn Classifiers
Untrained classifiers can also be passed directly for classification. Mepylome will then execute the training and cross-validation process. Below is an example using an Extra Trees classifier.
>>> # Define an untrained Extra Trees classifier
>>> et_clf = ExtraTreesClassifier(n_estimators=300, random_state=0)
>>> # Perform classification
>>> ids = methyl_analysis.idat_handler.ids
>>> clf_out = methyl_analysis.classify(ids=ids, clf_list=et_clf)
>>> print(clf_out[0].reports["txt"][0])
201530470054_R05C01
===================
Pipeline Structure:
- Classifier: ExtraTreesClassifier
n_estimators: 300
random_state: 0
Metrics:
- Method : 5-fold cross validation
- Samples : 20
- Features : 865859
- Accuracy : 1.0000 (SD 0.0000)
- AUC : 1.0000 (SD 0.0000)
- F1-Score : 1.0000 (SD 0.0000)
- Precision : 1.0000 (SD 0.0000)
- Recall : 1.0000 (SD 0.0000)
Classification Probability:
--------------------------------------
- Osteoblastoma : 77.00 %
- Osteosarcoma (high-grade) : 13.00 %
- Control (muscle tissue) : 6.00 %
- Chondrosarcoma : 4.00 %
--------------------------------------
4. Create a Custom Classifier
You can define a custom trained classifier by implementing a class that inherits from TrainedClassifier. This allows you to use non-standard models or apply additional functionality.
>>> # Define a Custom Classifier
>>> from mepylome.analysis.methyl_clf import TrainedClassifier
>>> class CustomClassifier(TrainedClassifier):
>>> def __init__(self, clf):
>>> self.clf = clf
>>> self._classes = clf.classes_
>>> def predict_proba(self, betas, id_=None):
>>> return self.clf.predict_proba(betas)
>>> def classes(self):
>>> return self._classes
>>> def info(self, output_format="txt"):
>>> return "This text will be printed in reports."
>>> def metrics(self):
>>> return {"Key0": "Value0", "Key1": "Value1"}
>>> # Initialize the custom classifier
>>> custom_clf = CustomClassifier(rf_clf)
>>> # Perform classification
>>> clf_out = methyl_analysis.classify(ids=ids, clf_list=custom_clf)
>>> print(clf_out[0].reports["txt"][0])
201530470054_R05C01
===================
This text will be printed in reports.
Classification Probability:
--------------------------------------
- Osteoblastoma : 83.19 %
- Osteosarcoma (high-grade) : 9.87 %
- Control (muscle tissue) : 4.49 %
- Chondrosarcoma : 2.45 %
--------------------------------------
5. Add a Classifier into the MethylAnalysis Object
Finally, you can integrate a classifier directly into the MethylAnalysis object for seamless use in workflows and the interactive GUI app.
>>> # Add a custom classifier into the MethylAnalysis object
>>> pipeline = Pipeline(
>>> [
>>> ("feature_selection", SelectKBest(k=20000)),
>>> ("classifier", RandomForestClassifier(n_estimators=500)),
>>> ]
>>> )
>>> methyl_analysis.classifiers = {"model": pipeline, "name": "Custom RF"}
>>> # Launch the interactive app with the integrated classifier. Now the classifier
>>> # is available in the GUI under the name 'Custom RF'.
>>> methyl_analysis.run_app(open_tab=True)
For further details and advanced usage, refer to the mepylome documentation.