Loading processed methylation data and checking beta distributions¶

Loading Beta Values and Metadata¶

We will continue working with the dataset referenced in the methylprep general walkthrough: GSE147391. It was processed from the command line with the following command:

>>> python -m methylprep process -d <filepath> --all

Where <filepath> is the directory where the raw IDAT files are stored. We use --all to tell methylprep to give us all the possible output files, which include:

beta_values.pkl
poobah_values.pkl
control_probes.pkl
m_values.pkl
noob_meth_values.pkl
noob_unmeth_values.pkl
meth_values.pkl
unmeth_values.pkl
sample_sheet_meta_data.pkl

As well as folders that contain the processed methylation data in .csv files.

[1]:

import methylcheck
from pathlib import Path
filepath = Path('/Users/patriciagirardi/tutorial/GPL21145')

betas = methylcheck.load(filepath)
betas.head()

Files: 100%|██████████| 1/1 [00:00<00:00,  5.92it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.222s)

[1]:

	203163220027_R01C01	203163220027_R02C01	203163220027_R03C01	203163220027_R04C01	203163220027_R05C01	203163220027_R06C01	203163220027_R07C01	203163220027_R08C01	203175700025_R01C01	203175700025_R02C01	203175700025_R03C01	203175700025_R04C01	203175700025_R05C01	203175700025_R06C01	203175700025_R07C01	203175700025_R08C01
IlmnID
cg00000029	0.852	0.749	0.739	NaN	0.891	0.896	0.757	0.734	0.806	0.881	0.740	0.885	0.693	0.779	0.617	0.891
cg00000103	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
cg00000109	0.943	0.916	0.884	0.951	0.936	0.774	0.932	0.920	0.898	0.938	0.931	0.953	0.899	0.889	0.734	0.892
cg00000155	0.960	0.962	0.958	0.958	0.963	0.959	0.965	0.969	0.959	0.956	0.964	0.961	0.959	0.967	0.961	0.960
cg00000158	0.969	0.972	0.971	0.968	0.968	0.964	0.970	0.974	0.964	0.969	0.960	0.968	0.965	0.965	0.975	0.966

You may also use methylcheck.load_both() to load both the beta values and metadata at the same time. Note that methylcheck expects the formatting used by methylprep in this command.

[2]:

df, meta = methylcheck.load_both(filepath)
meta.head()

INFO:methylcheck.load_processed:Found several meta_data files; attempting to match each with its respective beta_values files in same folders.
WARNING:methylcheck.load_processed:Columns in sample sheet meta data files does not match for these files and cannot be combined:['/Users/patriciagirardi/tutorial/GPL21145/GSE147391_GPL21145_meta_data.pkl', '/Users/patriciagirardi/tutorial/GPL21145/sample_sheet_meta_data.pkl']
INFO:methylcheck.load_processed:Multiple meta_data found. Only loading the first file.
INFO:methylcheck.load_processed:Loading 16 samples.
Files: 100%|██████████| 1/1 [00:00<00:00,  5.98it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.2s)
INFO:methylcheck.load_processed:meta.Sample_IDs match data.index (OK)

[2]:

	GSM_ID	Sample_Name	Sentrix_ID	Sentrix_Position	source	histological diagnosis	description	Sample_ID
2	GSM4429898	Grade II rep3	203163220027	R01C01	Resected glioma	Diffuse astrocytoma (II)	Glioma	203163220027_R01C01
3	GSM4429899	Grade III rep1	203163220027	R02C01	Resected glioma	Anaplastic astrocytoma (III)	Glioma	203163220027_R02C01
12	GSM4429908	Grade IV rep5	203163220027	R03C01	Resected glioma	Glioblastoma (IV)	Glioma	203163220027_R03C01
15	GSM4429911	Grade IV rep8	203163220027	R04C01	Resected glioma	Glioblastoma (IV)	Glioma	203163220027_R04C01
6	GSM4429902	Grade II rep6	203163220027	R05C01	Resected glioma	Oligodendroglioma (II)	Glioma	203163220027_R05C01

Loading Other Types of Data¶

Formats supported by methylcheck.load() are:

[‘beta_value’, ‘m_value’, ‘meth’, ‘meth_df’, ‘noob_df’, ‘sesame’, ‘beta_csv’]

where ‘beta_value’ is the default.

[3]:

# the unnormalized probe values
(meth,unmeth) = methylcheck.load(filepath, format='meth_df')
meth.head()

100%|██████████| 16/16 [00:00<00:00, 488.67it/s]
100%|██████████| 16/16 [00:00<00:00, 1480.91it/s]
INFO:methylcheck.load_processed:(865859, 16) (865859, 16)

[3]:

	203163220027_R01C01	203163220027_R02C01	203163220027_R03C01	203163220027_R04C01	203163220027_R05C01	203163220027_R06C01	203163220027_R07C01	203163220027_R08C01	203175700025_R01C01	203175700025_R02C01	203175700025_R03C01	203175700025_R04C01	203175700025_R05C01	203175700025_R06C01	203175700025_R07C01	203175700025_R08C01
IlmnID
cg00000029	933.0	874.0	845.0	528.0	1579.0	1243.0	893.0	741.0	1016.0	1272.0	1035.0	1423.0	917.0	904.0	743.0	1386.0
cg00000103	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
cg00000109	2597.0	2428.0	2309.0	3082.0	2542.0	2382.0	2710.0	1930.0	2029.0	2551.0	2385.0	3010.0	2219.0	1716.0	2132.0	2158.0
cg00000155	3491.0	4397.0	3956.0	4059.0	4314.0	4059.0	4403.0	4302.0	3865.0	3734.0	4996.0	5297.0	3787.0	5216.0	4389.0	4404.0
cg00000158	5556.0	5534.0	5384.0	4833.0	5381.0	6078.0	5678.0	4738.0	4938.0	4690.0	5658.0	6767.0	6030.0	5780.0	6300.0	5050.0

[4]:

# noob-normalized probe values
(noob_meth,noob_unmeth) = methylcheck.load(filepath, format='noob_df')
noob_meth.head()

100%|██████████| 16/16 [00:00<00:00, 550.34it/s]
100%|██████████| 16/16 [00:00<00:00, 1499.81it/s]
INFO:methylcheck.load_processed:(865859, 16), (865859, 16)

[4]:

	203163220027_R01C01	203163220027_R02C01	203163220027_R03C01	203163220027_R04C01	203163220027_R05C01	203163220027_R06C01	203163220027_R07C01	203163220027_R08C01	203175700025_R01C01	203175700025_R02C01	203175700025_R03C01	203175700025_R04C01	203175700025_R05C01	203175700025_R06C01	203175700025_R07C01	203175700025_R08C01
IlmnID
cg00000029	1174.0	1123.0	973.0	NaN	1745.0	1355.0	1074.0	812.0	1342.0	1497.0	1286.0	1696.0	1018.0	956.0	820.0	1524.0
cg00000103	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
cg00000109	3372.0	3003.0	2728.0	3645.0	2950.0	2766.0	3160.0	2185.0	2684.0	3191.0	2902.0	3526.0	2488.0	1883.0	2384.0	2368.0
cg00000155	4647.0	5488.0	4754.0	4858.0	5390.0	5019.0	5198.0	5042.0	5300.0	4875.0	6244.0	6430.0	4254.0	6116.0	4892.0	5048.0
cg00000158	7810.0	7100.0	6721.0	5904.0	6912.0	7875.0	6893.0	5625.0	6925.0	6278.0	7188.0	8478.0	7100.0	6859.0	7247.0	5887.0

[5]:

# m_values
m_values = methylcheck.load(filepath, format='m_value')
m_values.head()

Files: 100%|██████████| 1/1 [00:00<00:00,  5.41it/s]
INFO:methylcheck.load_processed:loaded data (865859, 16) from 1 pickled files (0.22s)

[5]:

	203163220027_R01C01	203163220027_R02C01	203163220027_R03C01	203163220027_R04C01	203163220027_R05C01	203163220027_R06C01	203163220027_R07C01	203163220027_R08C01	203175700025_R01C01	203175700025_R02C01	203175700025_R03C01	203175700025_R04C01	203175700025_R05C01	203175700025_R06C01	203175700025_R07C01	203175700025_R08C01
IlmnID
cg00000029	2.525	1.579	1.500	NaN	3.028	3.109	1.643	1.466	2.055	2.890	1.508	2.940	1.175	1.819	0.685	3.027
cg00000103	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
cg00000109	4.047	3.444	2.930	4.269	3.875	1.775	3.780	3.516	3.133	3.912	3.748	4.358	3.162	3.008	1.468	3.045
cg00000155	4.582	4.681	4.528	4.518	4.717	4.538	4.797	4.960	4.564	4.444	4.738	4.606	4.547	4.885	4.641	4.567
cg00000158	4.960	5.135	5.042	4.928	4.909	4.748	4.996	5.238	4.746	4.958	4.568	4.920	4.788	4.795	5.284	4.844

Note: If you point to a folder with multiple batches of samples, of different array types, you’ll get an error. All samples in the path you choose need to have the same number of probes (same array type). Here is the message you’ll get in this scenario.

ValueError: all the input array dimensions for the concatenation axis must match exactly,
but along dimension 0, the array at index 0 has size 226618 and the array at index 1 has size 244827

This shouldn’t happen if you download and process data using methylprep (which will automatically detect array types and create separate folders for each array type). If you didn’t process with methylprep, the workaround is manually moving the files from different array-types into separate folders and loading them separately.

The meth format for methylcheck.load() returns a list of SampleDataContainer objects.

sdc[0]._SampleDataContainer__data_frame will give you a dataframe of the first SampleDataContainer in your list called sdc.

Loading Data from .csv Files¶

methylcheck assumes you want to load data the fastest way, using a single high-performance python3 pickled dataframe. But there are times when you want to load from CSV output files instead. One use case is where you want to examine probes that were filtered out by poobah (p-value probe detection). The CSVs contain this information, whereas the pickled dataframe has it removed by default.

If you wish to load the beta values from CSVs, point the function to the parent directory where your CSVs are and it will automatically go through that directory recursively to concatinate all of the beta value columns from each CSV present. Please note that this will take longer than reading it from the beta pickle.

This function will by default mask the beta values of failed probes with NaN (p<=0.05). If you wish to view your raw beta values without any influence by poobah, add the argument no_poobah=True.

[6]:

df = methylcheck.load(filepath, format='beta_csv')
df

Files:   0%|          | 0/16 [00:00<?, ?it/s]INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Files: 100%|██████████| 16/16 [00:20<00:00,  1.29s/it]
INFO:methylcheck.load_processed:merging...
100%|██████████| 16/16 [00:00<00:00, 524.09it/s]

[6]:

	203163220027_R02C01	203163220027_R06C01	203163220027_R08C01	203163220027_R03C01	203163220027_R07C01	203163220027_R04C01	203163220027_R01C01	203163220027_R05C01	203175700025_R02C01	203175700025_R06C01	203175700025_R08C01	203175700025_R03C01	203175700025_R07C01	203175700025_R04C01	203175700025_R01C01	203175700025_R05C01
IlmnID
cg00000029	0.749	0.896	0.734	0.739	0.757	NaN	0.852	0.891	0.881	0.779	0.891	0.740	0.617	0.885	0.806	0.693
cg00000103	0.922	0.950	0.654	0.931	0.712	0.924	0.943	0.931	0.943	0.920	0.731	0.809	0.611	0.365	0.946	0.934
cg00000109	0.916	0.774	0.920	0.884	0.932	0.951	0.943	0.936	0.938	0.889	0.892	0.931	0.734	0.953	0.898	0.899
cg00000155	0.962	0.959	0.969	0.958	0.965	0.958	0.960	0.963	0.956	0.967	0.960	0.964	0.961	0.961	0.959	0.959
cg00000158	0.972	0.964	0.974	0.971	0.970	0.968	0.969	0.968	0.969	0.965	0.966	0.960	0.975	0.968	0.964	0.965
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
rs9363764	0.544	0.544	0.592	0.062	0.058	0.431	0.962	0.036	0.046	0.562	0.046	0.054	0.541	0.551	0.971	0.966
rs939290	0.055	0.591	0.974	0.051	0.965	0.809	0.966	0.070	0.964	0.578	0.601	0.535	0.625	0.603	0.962	0.965
rs951295	0.557	0.032	0.566	0.515	0.955	0.966	0.548	0.975	0.024	0.073	0.556	0.035	0.974	0.523	0.029	0.949
rs966367	0.966	0.468	0.028	0.956	0.499	0.036	0.029	0.478	0.488	0.483	0.526	0.034	0.455	0.034	0.027	0.051
rs9839873	0.968	0.618	0.975	0.642	0.671	0.962	0.969	0.969	0.969	0.613	0.669	0.967	0.974	0.972	0.970	0.050

865918 rows × 16 columns

Checking Beta Distributions¶

One of the first steps users should take when loading their data is examining the beta distribution and ensuring that the expected bimodal distribution is present. Samples whose beta distributions don’t fit the bimodal distribution are more likely to be poor quality samples that need to be filtered out (you may be expecting a different distribution based on what kind of data you’re looking at, of course, but the standard is bimodal with peaks around 0 and 1).

methylcheck includes a beta density plot function that is similar to seaborn’s kde plot. If the sample size is small enough (<30), sample names will be included in the legend of the plot, so you may identify outlier samples easily.

[7]:

methylcheck.beta_density_plot(df)

WARNING:methylcheck.samples.postprocessQC:Your data contains 7585 missing probe values per sample, (121361 overall). For a list per sample, use verbose=True

This data looks relatively clean! Most data on GEO should be high quality, but it always pays to check.

You may also want to get an idea of the mean beta distribution (useful for comparing multiple datasets, or for looking at a dataset before/after quality control measures.) Check the filtering probes section for examples on comparing beta distributions and what filtering can do to your data’s mean beta distribution.

[8]:

methylcheck.mean_beta_plot(df)