Loading processed methylation data and checking beta distributions

Loading Beta Values and Metadata

We will continue working with the dataset referenced in the methylprep general walkthrough: GSE147391. It was processed from the command line with the following command:

>>> python -m methylprep process -d <filepath> --all

Where <filepath> is the directory where the raw IDAT files are stored. We use --all to tell methylprep to give us all the possible output files, which include:

  • beta_values.pkl
  • poobah_values.pkl
  • control_probes.pkl
  • m_values.pkl
  • noob_meth_values.pkl
  • noob_unmeth_values.pkl
  • meth_values.pkl
  • unmeth_values.pkl
  • sample_sheet_meta_data.pkl

As well as folders that contain the processed methylation data in .csv files.

[1]:
import methylcheck
from pathlib import Path
filepath = Path('/Users/patriciagirardi/tutorial/GPL21145')

betas = methylcheck.load(filepath)
betas.head()
Files: 100%|██████████| 1/1 [00:00<00:00,  5.92it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.222s)
[1]:
203163220027_R01C01 203163220027_R02C01 203163220027_R03C01 203163220027_R04C01 203163220027_R05C01 203163220027_R06C01 203163220027_R07C01 203163220027_R08C01 203175700025_R01C01 203175700025_R02C01 203175700025_R03C01 203175700025_R04C01 203175700025_R05C01 203175700025_R06C01 203175700025_R07C01 203175700025_R08C01
IlmnID
cg00000029 0.852 0.749 0.739 NaN 0.891 0.896 0.757 0.734 0.806 0.881 0.740 0.885 0.693 0.779 0.617 0.891
cg00000103 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
cg00000109 0.943 0.916 0.884 0.951 0.936 0.774 0.932 0.920 0.898 0.938 0.931 0.953 0.899 0.889 0.734 0.892
cg00000155 0.960 0.962 0.958 0.958 0.963 0.959 0.965 0.969 0.959 0.956 0.964 0.961 0.959 0.967 0.961 0.960
cg00000158 0.969 0.972 0.971 0.968 0.968 0.964 0.970 0.974 0.964 0.969 0.960 0.968 0.965 0.965 0.975 0.966

You may also use methylcheck.load_both() to load both the beta values and metadata at the same time. Note that methylcheck expects the formatting used by methylprep in this command.

[2]:
df, meta = methylcheck.load_both(filepath)
meta.head()
INFO:methylcheck.load_processed:Found several meta_data files; attempting to match each with its respective beta_values files in same folders.
WARNING:methylcheck.load_processed:Columns in sample sheet meta data files does not match for these files and cannot be combined:['/Users/patriciagirardi/tutorial/GPL21145/GSE147391_GPL21145_meta_data.pkl', '/Users/patriciagirardi/tutorial/GPL21145/sample_sheet_meta_data.pkl']
INFO:methylcheck.load_processed:Multiple meta_data found. Only loading the first file.
INFO:methylcheck.load_processed:Loading 16 samples.
Files: 100%|██████████| 1/1 [00:00<00:00,  5.98it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.2s)
INFO:methylcheck.load_processed:meta.Sample_IDs match data.index (OK)
[2]:
GSM_ID Sample_Name Sentrix_ID Sentrix_Position source histological diagnosis description Sample_ID
2 GSM4429898 Grade II rep3 203163220027 R01C01 Resected glioma Diffuse astrocytoma (II) Glioma 203163220027_R01C01
3 GSM4429899 Grade III rep1 203163220027 R02C01 Resected glioma Anaplastic astrocytoma (III) Glioma 203163220027_R02C01
12 GSM4429908 Grade IV rep5 203163220027 R03C01 Resected glioma Glioblastoma (IV) Glioma 203163220027_R03C01
15 GSM4429911 Grade IV rep8 203163220027 R04C01 Resected glioma Glioblastoma (IV) Glioma 203163220027_R04C01
6 GSM4429902 Grade II rep6 203163220027 R05C01 Resected glioma Oligodendroglioma (II) Glioma 203163220027_R05C01

Loading Other Types of Data

Formats supported by methylcheck.load() are:

[‘beta_value’, ‘m_value’, ‘meth’, ‘meth_df’, ‘noob_df’, ‘sesame’, ‘beta_csv’]

where ‘beta_value’ is the default.

[3]:
# the unnormalized probe values
(meth,unmeth) = methylcheck.load(filepath, format='meth_df')
meth.head()
100%|██████████| 16/16 [00:00<00:00, 488.67it/s]
100%|██████████| 16/16 [00:00<00:00, 1480.91it/s]
INFO:methylcheck.load_processed:(865859, 16) (865859, 16)
[3]:
203163220027_R01C01 203163220027_R02C01 203163220027_R03C01 203163220027_R04C01 203163220027_R05C01 203163220027_R06C01 203163220027_R07C01 203163220027_R08C01 203175700025_R01C01 203175700025_R02C01 203175700025_R03C01 203175700025_R04C01 203175700025_R05C01 203175700025_R06C01 203175700025_R07C01 203175700025_R08C01
IlmnID
cg00000029 933.0 874.0 845.0 528.0 1579.0 1243.0 893.0 741.0 1016.0 1272.0 1035.0 1423.0 917.0 904.0 743.0 1386.0
cg00000103 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
cg00000109 2597.0 2428.0 2309.0 3082.0 2542.0 2382.0 2710.0 1930.0 2029.0 2551.0 2385.0 3010.0 2219.0 1716.0 2132.0 2158.0
cg00000155 3491.0 4397.0 3956.0 4059.0 4314.0 4059.0 4403.0 4302.0 3865.0 3734.0 4996.0 5297.0 3787.0 5216.0 4389.0 4404.0
cg00000158 5556.0 5534.0 5384.0 4833.0 5381.0 6078.0 5678.0 4738.0 4938.0 4690.0 5658.0 6767.0 6030.0 5780.0 6300.0 5050.0
[4]:
# noob-normalized probe values
(noob_meth,noob_unmeth) = methylcheck.load(filepath, format='noob_df')
noob_meth.head()
100%|██████████| 16/16 [00:00<00:00, 550.34it/s]
100%|██████████| 16/16 [00:00<00:00, 1499.81it/s]
INFO:methylcheck.load_processed:(865859, 16), (865859, 16)
[4]:
203163220027_R01C01 203163220027_R02C01 203163220027_R03C01 203163220027_R04C01 203163220027_R05C01 203163220027_R06C01 203163220027_R07C01 203163220027_R08C01 203175700025_R01C01 203175700025_R02C01 203175700025_R03C01 203175700025_R04C01 203175700025_R05C01 203175700025_R06C01 203175700025_R07C01 203175700025_R08C01
IlmnID
cg00000029 1174.0 1123.0 973.0 NaN 1745.0 1355.0 1074.0 812.0 1342.0 1497.0 1286.0 1696.0 1018.0 956.0 820.0 1524.0
cg00000103 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
cg00000109 3372.0 3003.0 2728.0 3645.0 2950.0 2766.0 3160.0 2185.0 2684.0 3191.0 2902.0 3526.0 2488.0 1883.0 2384.0 2368.0
cg00000155 4647.0 5488.0 4754.0 4858.0 5390.0 5019.0 5198.0 5042.0 5300.0 4875.0 6244.0 6430.0 4254.0 6116.0 4892.0 5048.0
cg00000158 7810.0 7100.0 6721.0 5904.0 6912.0 7875.0 6893.0 5625.0 6925.0 6278.0 7188.0 8478.0 7100.0 6859.0 7247.0 5887.0
[5]:
# m_values
m_values = methylcheck.load(filepath, format='m_value')
m_values.head()
Files: 100%|██████████| 1/1 [00:00<00:00,  5.41it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.22s)
[5]:
203163220027_R01C01 203163220027_R02C01 203163220027_R03C01 203163220027_R04C01 203163220027_R05C01 203163220027_R06C01 203163220027_R07C01 203163220027_R08C01 203175700025_R01C01 203175700025_R02C01 203175700025_R03C01 203175700025_R04C01 203175700025_R05C01 203175700025_R06C01 203175700025_R07C01 203175700025_R08C01
IlmnID
cg00000029 2.525 1.579 1.500 NaN 3.028 3.109 1.643 1.466 2.055 2.890 1.508 2.940 1.175 1.819 0.685 3.027
cg00000103 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
cg00000109 4.047 3.444 2.930 4.269 3.875 1.775 3.780 3.516 3.133 3.912 3.748 4.358 3.162 3.008 1.468 3.045
cg00000155 4.582 4.681 4.528 4.518 4.717 4.538 4.797 4.960 4.564 4.444 4.738 4.606 4.547 4.885 4.641 4.567
cg00000158 4.960 5.135 5.042 4.928 4.909 4.748 4.996 5.238 4.746 4.958 4.568 4.920 4.788 4.795 5.284 4.844

Note: If you point to a folder with multiple batches of samples, of different array types, you’ll get an error. All samples in the path you choose need to have the same number of probes (same array type). Here is the message you’ll get in this scenario.

ValueError: all the input array dimensions for the concatenation axis must match exactly,
but along dimension 0, the array at index 0 has size 226618 and the array at index 1 has size 244827

This shouldn’t happen if you download and process data using methylprep (which will automatically detect array types and create separate folders for each array type). If you didn’t process with methylprep, the workaround is manually moving the files from different array-types into separate folders and loading them separately.

The meth format for methylcheck.load() returns a list of SampleDataContainer objects.

sdc[0]._SampleDataContainer__data_frame will give you a dataframe of the first SampleDataContainer in your list called sdc.

Loading Data from .csv Files

methylcheck assumes you want to load data the fastest way, using a single high-performance python3 pickled dataframe. But there are times when you want to load from CSV output files instead. One use case is where you want to examine probes that were filtered out by poobah (p-value probe detection). The CSVs contain this information, whereas the pickled dataframe has it removed by default.

If you wish to load the beta values from CSVs, point the function to the parent directory where your CSVs are and it will automatically go through that directory recursively to concatinate all of the beta value columns from each CSV present. Please note that this will take longer than reading it from the beta pickle.

This function will by default mask the beta values of failed probes with NaN (p<=0.05). If you wish to view your raw beta values without any influence by poobah, add the argument no_poobah=True.

[6]:
df = methylcheck.load(filepath, format='beta_csv')
df
Files:   0%|          | 0/16 [00:00<?, ?it/s]INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Files: 100%|██████████| 16/16 [00:20<00:00,  1.29s/it]
INFO:methylcheck.load_processed:merging...
100%|██████████| 16/16 [00:00<00:00, 524.09it/s]
[6]:
203163220027_R02C01 203163220027_R06C01 203163220027_R08C01 203163220027_R03C01 203163220027_R07C01 203163220027_R04C01 203163220027_R01C01 203163220027_R05C01 203175700025_R02C01 203175700025_R06C01 203175700025_R08C01 203175700025_R03C01 203175700025_R07C01 203175700025_R04C01 203175700025_R01C01 203175700025_R05C01
IlmnID
cg00000029 0.749 0.896 0.734 0.739 0.757 NaN 0.852 0.891 0.881 0.779 0.891 0.740 0.617 0.885 0.806 0.693
cg00000103 0.922 0.950 0.654 0.931 0.712 0.924 0.943 0.931 0.943 0.920 0.731 0.809 0.611 0.365 0.946 0.934
cg00000109 0.916 0.774 0.920 0.884 0.932 0.951 0.943 0.936 0.938 0.889 0.892 0.931 0.734 0.953 0.898 0.899
cg00000155 0.962 0.959 0.969 0.958 0.965 0.958 0.960 0.963 0.956 0.967 0.960 0.964 0.961 0.961 0.959 0.959
cg00000158 0.972 0.964 0.974 0.971 0.970 0.968 0.969 0.968 0.969 0.965 0.966 0.960 0.975 0.968 0.964 0.965
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
rs9363764 0.544 0.544 0.592 0.062 0.058 0.431 0.962 0.036 0.046 0.562 0.046 0.054 0.541 0.551 0.971 0.966
rs939290 0.055 0.591 0.974 0.051 0.965 0.809 0.966 0.070 0.964 0.578 0.601 0.535 0.625 0.603 0.962 0.965
rs951295 0.557 0.032 0.566 0.515 0.955 0.966 0.548 0.975 0.024 0.073 0.556 0.035 0.974 0.523 0.029 0.949
rs966367 0.966 0.468 0.028 0.956 0.499 0.036 0.029 0.478 0.488 0.483 0.526 0.034 0.455 0.034 0.027 0.051
rs9839873 0.968 0.618 0.975 0.642 0.671 0.962 0.969 0.969 0.969 0.613 0.669 0.967 0.974 0.972 0.970 0.050

865918 rows × 16 columns

Checking Beta Distributions

One of the first steps users should take when loading their data is examining the beta distribution and ensuring that the expected bimodal distribution is present. Samples whose beta distributions don’t fit the bimodal distribution are more likely to be poor quality samples that need to be filtered out (you may be expecting a different distribution based on what kind of data you’re looking at, of course, but the standard is bimodal with peaks around 0 and 1).

methylcheck includes a beta density plot function that is similar to seaborn’s kde plot. If the sample size is small enough (<30), sample names will be included in the legend of the plot, so you may identify outlier samples easily.

[7]:
methylcheck.beta_density_plot(df)
WARNING:methylcheck.samples.postprocessQC:Your data contains 7585 missing probe values per sample, (121361 overall). For a list per sample, use verbose=True
../_images/docs_loading-data_16_1.png

This data looks relatively clean! Most data on GEO should be high quality, but it always pays to check.

You may also want to get an idea of the mean beta distribution (useful for comparing multiple datasets, or for looking at a dataset before/after quality control measures.) Check the filtering probes section for examples on comparing beta distributions and what filtering can do to your data’s mean beta distribution.

[8]:
methylcheck.mean_beta_plot(df)
../_images/docs_loading-data_18_0.png