Tutorial (library) ================== This tutorial will guide you through the steps to analyze methylation data using the ``mepylome`` package. We'll cover setting up the environment, parsing IDAT files, working with manifest files, processing raw data, performing copy number variation (CNV) analysis, and conducting methylation analysis using UMAP plots with a GUI. You can find the code of this tutorial as a script in https://github.com/brj0/mepylome/blob/main/examples/rtd_tutorial.py. .. contents:: Contents of Tutorial :depth: 3 Setup ----- First, import the required modules and set up the necessary directories: .. code-block:: python >>> from pathlib import Path >>> from mepylome import ( >>> CNV, >>> IdatParser, >>> Manifest, >>> MethylData, >>> RawData, >>> ReferenceMethylData, >>> ) Setup directories for the tutorial: .. code-block:: python >>> DIR = Path.home() / "mepylome" / "tutorial" >>> ANALYSIS_DIR = DIR / "tutorial_analysis" >>> REFERENCE_DIR = DIR / "tutorial_reference" Download and setup the necessary data files (approximately 653 MB): .. code-block:: python >>> from mepylome.utils import setup_tutorial_files >>> setup_tutorial_files(ANALYSIS_DIR, REFERENCE_DIR) Parsing IDAT files ------------------ Define the IDAT file path: .. code-block:: python >>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat" Parse the IDAT file: .. code-block:: python >>> idat_data = IdatParser(idat_file) Print overview of parsed data: .. code-block:: python >>> print(idat_data) IdatParser( file_size: 13686991 num_fields: 19 illumina_ids: array([ 1600101, 1600111, ..., 99810990, 99810992], dtype=int32) probe_means: array([15629, 8469, ..., 7971, 943], dtype=uint16) std_dev: array([1377, 408, ..., 702, 312], dtype=uint16) n_beads: array([16, 7, ..., 6, 10], dtype=uint8) mid_block: array([ 1600101, 1600111, ..., 99810990, 99810992], dtype=int32) red_green: 0 mostly_null: barcode: 200925700125 chip_type: BeadChip 8x5 mostly_a: R07C01 unknown_1: unknown_2: unknown_3: unknown_4: unknown_5: unknown_6: unknown_7: ) The parsed data is available as attributes of the ``IdatParser`` object. For example the Illumina IDs (probes IDs) can be accessed by: .. code-block:: python >>> ids = idat_data.illumina_ids >>> print(ids) [ 1600101 1600111 1600115 ... 99810978 99810990 99810992] Manifest files -------------- The mepylome package includes a ``Manifest`` class that provides functionality to download, process, and save Illumina manifest files internally in a efficient format (stored in ~/.cache/mepylome/manifests). These manifest files contain information about the CpG sites on the methylation array, including genetic coordinates, probe types, and more. Load the available manifest files for different array types. .. code-block:: python >>> manifest_450k = Manifest("450k") >>> manifest_epic = Manifest("epic") >>> manifest_epic_v2 = Manifest("epicv2") >>> manifest_msa48 = Manifest("msa48") .. note:: The first time you run this, the manifest files will be downloaded and saved locally to ~/.cache/mepylome/manifests. This initial download might take some time. Obtain values from attributes: .. code-block:: python >>> probes_df = manifest_450k.data_frame >>> controls_df = manifest_450k.control_data_frame Print overview: .. code-block:: python >>> print(probes_df) IlmnID AddressA_ID AddressB_ID ... N_CpG End Probe_Type 0 cg13869341 62703328 16661461 ... 2 15865 1 1 cg14008030 27651330 -1 ... 2 18827 2 2 cg12045430 25703424 34666387 ... 7 29407 1 3 cg20826792 61731400 14693326 ... 7 29425 1 4 cg00381604 26752380 50693408 ... 6 29435 1 ... ... ... ... ... ... ... ... 485572 rs1416770 28667385 -1 ... 0 -1 4 485573 rs1941955 33709340 -1 ... 0 -1 4 485574 rs2125573 25698376 -1 ... 0 -1 4 485575 rs2521373 12625304 -1 ... 0 -1 4 485576 rs4331560 10654345 -1 ... 0 -1 4 [485577 rows x 12 columns] RawData ------- The ``RawData`` class extracts both raw green and raw red signal intensity data from a IDAT file pair. You can initialize it using a base path to the IDAT files (without the _Grn.idat / _Red.idat suffix), or by providing the full path to either the Grn or Red IDAT file. .. code-block:: python >>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Red.idat" >>> # or >>> idat_file = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat" >>> # or >>> idat_file = ANALYSIS_DIR / "200925700125_R07C01" >>> raw_data = RawData(idat_file) The data is saved within the following attributes: .. code-block:: python >>> # Intensity signals >>> raw_data.grn >>> raw_data.red >>> # Type of the array_type (e.g., 450k, EPIC) >>> raw_data.array_type >>> # Corresponding manifest file >>> raw_data.manifest >>> # IDs on the bead >>> raw_data.ids Print an overview of the raw data .. code-block:: python >>> print(raw_data) RawData(): ********** array_type: epic manifest: epic probes: ['200925700125_R07C01'] ids: [ 1600101 1600111 1600115 ... 99810978 99810990 99810992] _grn: [[15629 8469 7015 ... 10228 7971 943]] _red: [[ 4429 1575 24955 ... 6594 15010 5336]] grn: 200925700125_R07C01 1600101 15629 1600111 8469 1600115 7015 1600123 7975 1600131 938 ... ... 99810958 6292 99810970 318 99810978 10228 99810990 7971 99810992 943 [1052641 rows x 1 columns] red: 200925700125_R07C01 1600101 4429 1600111 1575 1600115 24955 1600123 17707 1600131 8967 ... ... 99810958 1881 99810970 1936 99810978 6594 99810990 15010 99810992 5336 [1052641 rows x 1 columns] RawData can also read multiple files of the same array type (used for reference files): .. code-block:: python >>> idat_file0 = ANALYSIS_DIR / "200925700125_R07C01_Grn.idat" >>> idat_file1 = ANALYSIS_DIR / "200925700133_R02C01" >>> raw_data_2 = RawData([idat_file0, idat_file1]) Alternatively, read all IDAT files in a directory (supports recursive search): .. code-block:: python >>> raw_data_all = RawData(REFERENCE_DIR) MethylData ---------- The ``MethylData`` class allows for processing raw intensity data and can calculate methylation signals as well as beta values. The raw data can be preprocessed using one of the following methods: 'illumina' (default), 'swan', or 'noob'. Initialize MethylData with raw data using the default 'illumina' preprocessing method. .. code-block:: python >>> methyl_data = MethylData(raw_data) >>> methyl_data_all = MethylData(raw_data_all) Alternatively, you can explicitly specify the 'illumina' preprocessing method. .. code-block:: python >>> methyl_data = MethylData(raw_data, prep="illumina") You can also initialize MethylData directly from an IDAT file path, without using ``RawData``. This is the preferred method if you want to obtain methylation signals or beta values. .. code-block:: python >>> methyl_data = MethylData(file=idat_file) Obtain various values via the attributes of the MethylData object: .. code-block:: python >>> # The methylation signals for the green and red channels. >>> methylated_signals = methyl_data.methylated >>> unmethylated_signals = methyl_data.unmethylated >>> # The corrected color signals. >>> corrected_green_signals = methyl_data.grn >>> corrected_red_signals = methyl_data.red >>> # The type of the array used (e.g., 450k, EPIC, EPICv2). >>> array_type = methyl_data.array_type >>> # The corresponding manifest file. >>> corresponding_manifest = methyl_data.manifest Print an overview of the methylation data. .. code-block:: python >>> print(methyl_data) MethylData(): ************* array_type: epic manifest: epic probes: ['200925700125_R07C01'] _grn: [[16785.56811897 9044.70326442 7472.74551323 ... 10946.40456039 8506.302329 908.14615612]] _red: [[ 3957.99684771 1303.20883282 23051.26201716 ... 5971.87773497 13800.43272211 4801.6873626 ]] grn: 200925700125_R07C01 1600101 16785.568359 1600111 9044.703125 1600115 7472.745605 1600123 8510.626953 1600131 902.740540 ... ... 99810958 6691.091309 99810970 232.442169 99810978 10946.404297 99810990 8506.302734 99810992 908.146179 [1052641 rows x 1 columns] red: 200925700125_R07C01 1600101 3957.996826 1600111 1303.208862 1600115 23051.261719 1600123 16309.179688 1600131 8179.240234 ... ... 99810958 1587.849731 99810970 1639.010620 99810978 5971.877930 99810990 13800.432617 99810992 4801.687500 [1052641 rows x 1 columns] methylated: 200925700125_R07C01 IlmnID cg14817997 2510.375488 cg26928153 10454.492188 cg16269199 7020.834473 cg13869341 30160.773438 cg14008030 19805.154297 ... ... cg10488260 2282.708496 cg14273923 12481.604492 cg09748881 6418.647461 cg07587934 9533.372070 cg16855331 5837.001465 [865859 rows x 1 columns] unmethylated: 200925700125_R07C01 IlmnID cg14817997 855.783081 cg26928153 926.525330 cg16269199 3892.054932 cg13869341 5986.760742 cg14008030 8202.495117 ... ... cg10488260 8199.704102 cg14273923 2489.212646 cg09748881 950.663391 cg07587934 5264.926270 cg16855331 15618.041992 [865859 rows x 1 columns] Beta values are a indicator wheather a CpG is methylated or not. They can be calculated for all sites of the corresponding array: .. code-block:: python >>> betas = methyl_data.betas You can also access the beta values at specific CpG sites (here at all the sites of the EPICv2 manifest). Missing data will be replaced with `fill`. .. code-block:: python >>> epicv2_cpgs = manifest_epic_v2.methylation_probes >>> beta_specific = methyl_data.betas_at(epicv2_cpgs, fill=0.5) Using Alternative Preprocessing Methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Preprocess the raw data using the SWAN method. .. code-block:: python >>> methyl_data_swan = MethylData(raw_data, prep="swan") Preprocess the raw data using the NOOB method. .. code-block:: python >>> methyl_data_noob = MethylData(raw_data, prep="noob") See `api `_ for more information about SWAN and NOOB. Copy number variation (CNV) --------------------------- Copy number variations (CNV) are significant alterations in the genome involving the loss or gain of large DNA segments, often encompassing multiple genes. These variations are frequently linked to cancer development and can aid in tumor classification. The CNV profile can be calculated from signal intensity using methylation arrays. With the mepylome package, CNV can be efficiently calculated and visualized. **1. Set up analysis file** First, initialize your sample data for analysis: .. code-block:: python >>> sample = methyl_data **2. Set Up Reference Data** Within the reference directory there must be multiple CNV-neutral IDAT pairs of the **same array type** as `sample`. .. code-block:: python >>> reference = MethylData(file=REFERENCE_DIR) Alternatively, if the reference directory contains IDAT files of multiple array types, you can use ``ReferenceMethylData`` to load all files into memory. This way, the reference object can be used for multiple array types. The CNV class will automatically extract the files for the needed array type. .. code-block:: python >>> reference_all = ReferenceMethylData(REFERENCE_DIR) **3. Initialize CNV Analysis** Create an instance of the CNV class for the analysis, and fit the data (this is basically a linear regression model comparing `sample` signal with the `reference` signals at each CpG site): .. code-block:: python >>> cnv = CNV(sample, reference) >>> # Alternative with ReferenceMethylData >>> cnv = CNV(sample, reference_all) **4. Calculate CNV for Bins and Genes** Compute CNV values for genomic bins and gene regions: .. code-block:: python >>> cnv.set_bins() >>> cnv.set_detail() **5. Calculate CNV Segments** Use the binary circular segmentation algorithm for genome segmentation: .. code-block:: python >>> cnv.set_segments() .. note:: For this step, additional packages must be installed (see `installation `_). **6. Streamlined Analysis** Alternatively, perform all CNV computations in a single call: .. code-block:: python >>> cnv = CNV.set_all(sample, reference) >>> # or >>> cnv = CNV.set_all(sample, reference_all) **7. Visualize CNV Data** Display an interactive plot using Plotly, where genes can be highlighted: .. code-block:: python >>> cnv.plot() Methylation Analysis -------------------- **1. Set up analysis object and run GUI in browser** For methylation analysis, ensure you have the setup described in :ref:`general_setup` First, import the `MethylAnalysis` class from the `mepylome.analysis` module. .. code-block:: python >>> from mepylome.analysis import MethylAnalysis Create an instance of `MethylAnalysis` with the specified analysis and reference directories. .. code-block:: python >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> ) You can print an overview of the parameters of the object: .. code-block:: python >>> print(methyl_analysis) MethylAnalysis(): ***************** analysis_dir: /home/username/mepylome/tutorial/tutorial_analysis annotation: /home/username/mepylome/tutorial/tutorial_analysis/annotation.csv app: None betas_sel: None betas_all: None betas_dir: /tmp/mepylome/analysis/betas-tutorial_analysis-illumina-3b616e0e24a8b0e2d443b777b8ad8b61 cnv_dir: /tmp/mepylome/analysis/cnv-tutorial_analysis-tutorial_reference-illumina-False-f77099f7bcb04262a0456d122215ed4d cnv_id: None cnv_plot: Figure({ 'data': [], 'layout': {'template': '...', 'yaxis': {'range': [-2, 2... cpg_selection: top cpgs: ['cg00000029' 'cg00000103' 'cg00000109' ... 'ch.X.97651759F' 'ch.X.97737721F' 'ch.X.98007042R'] debug: False do_seg: False dropdown_id: [] host: localhost ids: Index(['200925700133_R04C01', '201530470054_R05C01', '201869690203_R03C01', '201904410008_R05C01', '201869690168_R08C01', '201530470054_R01C01', '201869690203_R06C01', '200925700133_R03C01', '201904410008_R02C01', '201904410008_R04C01', '201530470054_R02C01', '201530470054_R03C01', '201904410008_R03C01', '201870610040_R03C01', '200925700133_R02C01', '200925700125_R07C01', '201530470054_R04C01', '201870610040_R04C01', '200925700133_R05C01', '201904410008_R06C01'], dtype='object') ids_to_highlight: None load_full_betas: False n_cpgs: 25000 output_dir: /tmp/mepylome/analysis overlap: False port: 8050 precalculate_cnv: False prep: illumina raw_umap_plot: Figure({ 'data': [{'customdata': array([['Chondrosarcoma', 'methylation clas... reference_dir: /home/username/mepylome/tutorial/tutorial_reference selected_columns: ['Diagnosis'] umap_cpgs: None umap_df: Umap_x Umap_y ... Colour Umap_color 200925700133_R04C01 7.089545 1.715319 ... 2E3092 Osteosarcoma (high-grade) 201530470054_R05C01 8.902815 1.723695 ... 6B66AE Osteoblastoma 201869690203_R03C01 3.801413 5.398326 ... 6282C2 Chondrosarcoma 201904410008_R05C01 6.164794 5.332215 ... 7C7E82 Control (muscle tissue) 201869690168_R08C01 3.486633 4.991620 ... 6282C2 Chondrosarcoma 201530470054_R01C01 8.228397 1.753262 ... 6B66AE Osteoblastoma 201869690203_R06C01 3.823109 4.597386 ... 6282C2 Chondrosarcoma 200925700133_R03C01 7.502202 1.312939 ... 2E3092 Osteosarcoma (high-grade) 201904410008_R02C01 6.569446 5.207128 ... 7C7E82 Control (muscle tissue) 201904410008_R04C01 6.959955 4.955368 ... 7C7E82 Control (muscle tissue) 201530470054_R02C01 8.313415 2.253792 ... 6B66AE Osteoblastoma 201530470054_R03C01 8.599755 1.344881 ... 6B66AE Osteoblastoma 201904410008_R03C01 6.801136 5.651977 ... 7C7E82 Control (muscle tissue) 201870610040_R03C01 4.220203 5.398547 ... 6282C2 Chondrosarcoma 200925700133_R02C01 6.599361 2.232107 ... 2E3092 Osteosarcoma (high-grade) 200925700125_R07C01 6.635824 1.511574 ... 2E3092 Osteosarcoma (high-grade) 201530470054_R04C01 8.751877 2.230259 ... 6B66AE Osteoblastoma 201870610040_R04C01 4.270664 4.871753 ... 6282C2 Chondrosarcoma 200925700133_R05C01 7.345413 2.310878 ... 2E3092 Osteosarcoma (high-grade) 201904410008_R06C01 6.281984 4.778615 ... 7C7E82 Control (muscle tissue) [20 rows x 13 columns] umap_dir: /tmp/mepylome/analysis/umap-tutorial_analysis-illumina-25000-top-6ff1e8e7ec2b3a2f8d856ee634404085 umap_plot_path: /tmp/mepylome/analysis/umap-tutorial_analysis-illumina-25000-top-6ff1e8e7ec2b3a2f8d856ee634404085/umap_plot.csv upload_dir: /tmp/mepylome/analysis/upload-tutorial_analysis-c8e2f8ac691e9c3deba2e880aa7c5251 To enable interactive analysis, you can launch a GUI. This will open a new tab in your web browser and you can start a GUI-based methylation analysis. .. code-block:: python >>> methyl_analysis.run_app(open_tab=True) `MethylAnalysis` has multiple parameters. For example, you can provide a custom set of CpG sites. For instance, you can set: .. code-block:: python >>> cpgs = Manifest("epic").methylation_probes[:10000] >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> cpgs=cpgs, >>> ) Mepylome saves all results in a temporary directory ('/tmp/mepylome'). You can provide a custom directory for output: .. code-block:: python >>> OUTPUT_DIR = DIR / "output_dir" >>> OUTPUT_DIR.mkdir(parents=True, exist_ok=True) >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> output_dir=OUTPUT_DIR, >>> ) Here is a more comprehensive example with multiple custom parameters: .. code-block:: python >>> TEST_DIR = DIR / "test_dir" >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> output_dir=OUTPUT_DIR, >>> # New cases for validation, excluded from classifier training. >>> test_dir=TEST_DIR, >>> # Load beta values for all CpG sites into memory >>> load_full_betas=True, >>> # Use SWAN preprocessing method >>> prep="swan", >>> # Provide annotation file (if not already in analysis_dir) >>> annotation=ANALYSIS_DIR / "annotation.csv", >>> # Number of CpGs for UMAP analysis >>> n_cpgs=5000, >>> # Analyze the 'best' CpG sites >>> cpg_selection="top", >>> # Show segmentation intervals in CNV plot >>> do_seg=True, >>> ) Many parameters can be modified within the GUI application after initialization, but not all. **2. Set up beta values and generate UMAP** All calculations that can be performed within the GUI can also be done manually. For example, to extract the beta values: .. code-block:: python >>> methyl_analysis.set_betas() The beta values are then stored in: .. code-block:: python >>> methyl_analysis.betas_sel cg15836656 cg12823387 cg25563772 ... cg22115994 cg27601809 cg00444740 200925700133_R04C01 0.117822 0.058700 0.058626 ... 0.295462 0.738677 0.474659 201530470054_R05C01 0.087871 0.050942 0.065696 ... 0.119668 0.305578 0.105583 201869690203_R03C01 0.810276 0.735967 0.721423 ... 0.751743 0.200545 0.757986 201904410008_R05C01 0.892641 0.777778 0.840475 ... 0.731927 0.636772 0.737969 201869690168_R08C01 0.811684 0.886028 0.888461 ... 0.725186 0.186773 0.812097 201530470054_R01C01 0.065221 0.050916 0.024315 ... 0.069606 0.408356 0.102764 201869690203_R06C01 0.948498 0.892334 0.859396 ... 0.776809 0.080326 0.875806 200925700133_R03C01 0.123428 0.193491 0.173582 ... 0.406198 0.754906 0.231158 201904410008_R02C01 0.868791 0.838671 0.922514 ... 0.704358 0.789430 0.617703 201904410008_R04C01 0.791002 0.818421 0.770333 ... 0.707137 0.851243 0.643282 201530470054_R02C01 0.058704 0.092474 0.122352 ... 0.173067 0.606287 0.340654 201530470054_R03C01 0.046851 0.043868 0.106606 ... 0.050913 0.281679 0.151072 201904410008_R03C01 0.718867 0.819853 0.899481 ... 0.701382 0.788357 0.659056 201870610040_R03C01 0.783809 0.905159 0.851122 ... 0.822050 0.104641 0.862166 200925700133_R02C01 0.054682 0.076460 0.089566 ... 0.087220 0.665932 0.159535 200925700125_R07C01 0.048604 0.093311 0.129750 ... 0.457459 0.867225 0.323574 201530470054_R04C01 0.048784 0.036592 0.048904 ... 0.105748 0.214249 0.056179 201870610040_R04C01 0.870172 0.896742 0.842515 ... 0.659235 0.134309 0.786843 200925700133_R05C01 0.048411 0.153956 0.164447 ... 0.462702 0.662564 0.290133 201904410008_R06C01 0.825491 0.828089 0.876774 ... 0.752596 0.833333 0.645860 [20 rows x 5000 columns] To perform the UMAP algorithm: .. code-block:: python >>> methyl_analysis.compute_umap() The result of the UMAP algorithm is then stored in: .. code-block:: python >>> methyl_analysis.umap_df Umap_x Umap_y ... ID Colour 200925700133_R04C01 6.644683 -7.024573 ... REFERENCE_SAMPLE 367 2E3092 201530470054_R05C01 8.254219 -8.493701 ... REFERENCE_SAMPLE 109 6B66AE 201869690203_R03C01 5.185830 -5.596291 ... REFERENCE_SAMPLE 78 6282C2 201904410008_R05C01 7.058737 -5.482265 ... REFERENCE_SAMPLE 2 7C7E82 201869690168_R08C01 4.727582 -5.662395 ... REFERENCE_SAMPLE 80 6282C2 201530470054_R01C01 8.136439 -7.440626 ... REFERENCE_SAMPLE 113 6B66AE 201869690203_R06C01 4.729284 -5.012035 ... REFERENCE_SAMPLE 76 6282C2 200925700133_R03C01 6.980153 -7.641444 ... REFERENCE_SAMPLE 368 2E3092 201904410008_R02C01 7.460121 -5.769195 ... REFERENCE_SAMPLE 5 7C7E82 201904410008_R04C01 8.094703 -5.589474 ... REFERENCE_SAMPLE 3 7C7E82 201530470054_R02C01 7.892371 -7.894600 ... REFERENCE_SAMPLE 112 6B66AE 201530470054_R03C01 7.611932 -8.378901 ... REFERENCE_SAMPLE 111 6B66AE 201904410008_R03C01 7.850667 -6.202358 ... REFERENCE_SAMPLE 4 7C7E82 201870610040_R03C01 5.266904 -4.993761 ... REFERENCE_SAMPLE 79 6282C2 200925700133_R02C01 6.583021 -7.879497 ... REFERENCE_SAMPLE 369 2E3092 200925700125_R07C01 6.138983 -7.349401 ... REFERENCE_SAMPLE 370 2E3092 201530470054_R04C01 8.566432 -7.849686 ... REFERENCE_SAMPLE 110 6B66AE 201870610040_R04C01 5.656568 -5.504724 ... REFERENCE_SAMPLE 77 6282C2 200925700133_R05C01 7.385612 -7.223170 ... REFERENCE_SAMPLE 366 2E3092 201904410008_R06C01 7.607561 -5.099674 ... REFERENCE_SAMPLE 1 7C7E82 [20 rows x 12 columns] To generate the UMAP plot: .. code-block:: python >>> methyl_analysis.make_umap_plot() The Plotly object for the UMAP plot is then available in `methyl_analysis.umap_plot`. **3. CN-summary plots** Mepylome has the ability to calculate CN-summary plots, which provide an overview of copy number variations (CNV) across samples. To calculate these plots, you need to activate segmentation by setting 'do_seg' to True when initializing the analysis. .. code-block:: python >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> do_seg=True, >>> ) The annotation dataframe contains metadata for each sample, including diagnosis information. .. code-block:: python >>> annotation_df = methyl_analysis.idat_handler.samples_annotated Loop through each unique diagnosis in the dataset to generate CN-summary plots for all diagnosis categories. .. code-block:: python >>> for diagnosis in annotation_df["Diagnosis"].unique(): >>> # Filter sample IDs for the current diagnosis group. >>> sample_ids = annotation_df[annotation_df["Diagnosis"] == diagnosis].index >>> # Generate CN-summary plot and retrieve data (if needed for further >>> # analysis). >>> cn_plot, df_cn_summary = methyl_analysis.cn_summary(sample_ids) >>> # Update the plot layout with a title for the diagnosis group and label >>> # the y-axis. >>> cn_plot.update_layout( >>> title=f"CN-summary for {diagnosis}", >>> yaxis_title="Proportion of CNV gains/losses", >>> ) >>> # Display the CN-summary plot in the browser. >>> cn_plot.show() Supervised Classification ------------------------- **1. Preimplemented Classifier** Methylome allows the use of preimplemented classifiers to predict methylation classes for samples. Below are examples of performing predictions on sample IDs and randomly generated values. Do setup as before: .. code-block:: python >>> import numpy as np >>> import pandas as pd >>> from sklearn.ensemble import ( >>> ExtraTreesClassifier, >>> RandomForestClassifier, >>> ) >>> from sklearn.feature_selection import ( >>> SelectKBest, >>> ) >>> from sklearn.pipeline import Pipeline >>> from mepylome.analysis import MethylAnalysis >>> methyl_analysis = MethylAnalysis( >>> analysis_dir=ANALYSIS_DIR, >>> reference_dir=REFERENCE_DIR, >>> ) Perform prediction on sample IDs: .. code-block:: python >>> # Retrieve all sample IDs >>> ids = methyl_analysis.idat_handler.ids >>> # Set the annotation column for classification >>> methyl_analysis.idat_handler.selected_columns = ["Diagnosis"] >>> # Perform classification using a preimplemented classifier (pipeline with no >>> # scaler, SelectKBest as selector and RandomForestClassifier for >>> # classification) and cross-validation >>> clf_out = methyl_analysis.classify(ids=ids, clf_list="vtl-kbest-rf") >>> # Output accuracy scores: >>> print("Accuracy Scores:", clf_out[0].metrics["accuracy_scores"]) Accuracy Scores: [1.0, 1.0, 1.0, 1.0, 1.0] >>> # Detailed classifier report of first sample: >>> print(clf_out[0].reports["txt"][0]) 201530470054_R05C01 =================== Pipeline Structure: - Scaler : passthrough - Feature_selection : SelectKBest k: 10000 - Classifier : RandomForestClassifier Metrics: - Method : 5-fold cross validation - Samples : 20 - Features : 865859 - Accuracy : 1.0000 (SD 0.0000) - AUC : 1.0000 (SD 0.0000) - F1-Score : 1.0000 (SD 0.0000) - Precision : 1.0000 (SD 0.0000) - Recall : 1.0000 (SD 0.0000) Classification Probability: -------------------------------------- - Osteoblastoma : 74.00 % - Osteosarcoma (high-grade) : 23.00 % - Control (muscle tissue) : 2.00 % - Chondrosarcoma : 1.00 % -------------------------------------- Perform prediction on beta values: .. code-block:: python >>> # Number of CpGs in the dataset >>> methyl_analysis.set_betas() >>> n_cpgs = methyl_analysis.betas_all.shape[1] >>> # Generate random beta values for 10 artificial samples >>> random_beta_values = pd.DataFrame( >>> np.random.rand(10, n_cpgs), columns=methyl_analysis.betas_all.columns >>> ) >>> # Perform classification on random values: >>> clf_out = methyl_analysis.classify( >>> values=random_beta_values, clf_list="vtl-kbest-rf" >>> ) **2. Train Your Own Classifiers** You can train custom classifiers using the scikit-learn API and incorporate them into the analysis. Below is an example using a Random Forest classifier. .. code-block:: python >>> # Set the annotation column for classification >>> methyl_analysis.idat_handler.selected_columns = ["Diagnosis"] >>> # Extract features (X) and target labels (y) >>> X = methyl_analysis.betas_all >>> y = methyl_analysis.idat_handler.features() >>> # Exclude test indices and invalid samples >>> valid_indices = [ >>> i >>> for i, (idx, label) in enumerate(zip(X.index, y, strict=True)) >>> if label and idx not in methyl_analysis.idat_handler.test_ids >>> ] >>> X = X.iloc[valid_indices] >>> y = [y[i] for i in valid_indices] >>> # Choose classifier (maybe with tuned parameters) >>> rf_clf = RandomForestClassifier( >>> n_estimators=500, >>> max_depth=20, >>> min_samples_split=5, >>> min_samples_leaf=2, >>> bootstrap=True, >>> class_weight="balanced", >>> random_state=42, >>> ) >>> # Train the classifier >>> rf_clf.fit(X, y) >>> # If the classifier is already trained, mepylome will not repeat the training >>> # (and cross-validation) process >>> ids = methyl_analysis.idat_handler.ids >>> clf_out = methyl_analysis.classify(ids=ids, clf_list=rf_clf) >>> print(clf_out[0].reports["txt"][0]) 201530470054_R05C01 =================== Pipeline Structure: - Classifier: RandomForestClassifier class_weight: balanced max_depth: 20 min_samples_leaf: 2 min_samples_split: 5 n_estimators: 500 random_state: 42 Metrics: - Samples : 20 - Features : 865859 Classification Probability: -------------------------------------- - Osteoblastoma : 83.19 % - Osteosarcoma (high-grade) : 9.87 % - Control (muscle tissue) : 4.49 % - Chondrosarcoma : 2.45 % -------------------------------------- **3. Use Untrained Scikit-Learn Classifiers** Untrained classifiers can also be passed directly for classification. Mepylome will then execute the training and cross-validation process. Below is an example using an Extra Trees classifier. .. code-block:: python >>> # Define an untrained Extra Trees classifier >>> et_clf = ExtraTreesClassifier(n_estimators=300, random_state=0) >>> # Perform classification >>> ids = methyl_analysis.idat_handler.ids >>> clf_out = methyl_analysis.classify(ids=ids, clf_list=et_clf) >>> print(clf_out[0].reports["txt"][0]) 201530470054_R05C01 =================== Pipeline Structure: - Classifier: ExtraTreesClassifier n_estimators: 300 random_state: 0 Metrics: - Method : 5-fold cross validation - Samples : 20 - Features : 865859 - Accuracy : 1.0000 (SD 0.0000) - AUC : 1.0000 (SD 0.0000) - F1-Score : 1.0000 (SD 0.0000) - Precision : 1.0000 (SD 0.0000) - Recall : 1.0000 (SD 0.0000) Classification Probability: -------------------------------------- - Osteoblastoma : 77.00 % - Osteosarcoma (high-grade) : 13.00 % - Control (muscle tissue) : 6.00 % - Chondrosarcoma : 4.00 % -------------------------------------- **4. Create a Custom Classifier** You can define a custom trained classifier by implementing a class that inherits from `TrainedClassifier`. This allows you to use non-standard models or apply additional functionality. .. code-block:: python >>> # Define a Custom Classifier >>> from mepylome.analysis.methyl_clf import TrainedClassifier >>> class CustomClassifier(TrainedClassifier): >>> def __init__(self, clf): >>> self.clf = clf >>> self._classes = clf.classes_ >>> def predict_proba(self, betas, id_=None): >>> return self.clf.predict_proba(betas) >>> def classes(self): >>> return self._classes >>> def info(self, output_format="txt"): >>> return "This text will be printed in reports." >>> def metrics(self): >>> return {"Key0": "Value0", "Key1": "Value1"} >>> # Initialize the custom classifier >>> custom_clf = CustomClassifier(rf_clf) >>> # Perform classification >>> clf_out = methyl_analysis.classify(ids=ids, clf_list=custom_clf) >>> print(clf_out[0].reports["txt"][0]) 201530470054_R05C01 =================== This text will be printed in reports. Classification Probability: -------------------------------------- - Osteoblastoma : 83.19 % - Osteosarcoma (high-grade) : 9.87 % - Control (muscle tissue) : 4.49 % - Chondrosarcoma : 2.45 % -------------------------------------- **5. Add a Classifier into the MethylAnalysis Object** Finally, you can integrate a classifier directly into the `MethylAnalysis` object for seamless use in workflows and the interactive GUI app. .. code-block:: python >>> # Add a custom classifier into the MethylAnalysis object >>> pipeline = Pipeline( >>> [ >>> ("feature_selection", SelectKBest(k=20000)), >>> ("classifier", RandomForestClassifier(n_estimators=500)), >>> ] >>> ) >>> methyl_analysis.classifiers = {"model": pipeline, "name": "Custom RF"} >>> # Launch the interactive app with the integrated classifier. Now the classifier >>> # is available in the GUI under the name 'Custom RF'. >>> methyl_analysis.run_app(open_tab=True) For further details and advanced usage, refer to the mepylome documentation.