Custom QC with pOOBAH Values¶

This tutorial is for users who want more control over the quality control (QC) of their beta values. Methylprep applies some QC automatically by default; here we go over how to perform those steps manually, with customizable parameters.

[1]:

import methylcheck
import pandas as pd
import numpy as np

Path to the processed files (downloaded and processed with the methylprep package):

[2]:

fpath = 'data/GPL13534/'

Load the Beta Values in a dataframe¶

In the dataframe below, each column is a probe on the methylation array and each row is a sample in the dataset. The loader returns the opposite orientation (probes in rows, samples in columns), so the result is transposed with .T to get this layout.

We pass format='beta_csv' to methylcheck.load because this loads the raw beta values without any processing. By default, methylprep performs some QC on the beta values automatically and saves the resulting values in beta_values.pkl. Specifically, it removes failed probes using the SeSAMe pOOBAH method, in which a probe is classified as failed for a given sample when its p-value is >= 0.05.

To mask beta values with the pOOBAH p-values yourself, you must specify no_poobah=True. Otherwise, the failed values are masked automatically when the CSV files are loaded into a dataframe.

[22]:

betas = methylcheck.load('data/GPL13534', format='beta_csv', no_poobah=True).T
#betas.index.name = 'Samples'
print(betas.shape)
betas.head()

Files: 100%|██████████| 121/121 [00:52<00:00,  2.30it/s]
INFO:methylcheck.load_processed:merging...
100%|██████████| 121/121 [00:00<00:00, 692.51it/s]

(121, 485577)

[22]:

IlmnID	cg00000029	cg00000108	cg00000109	cg00000165	cg00000236	cg00000289	cg00000292	cg00000321	cg00000363	cg00000622	...	rs7746156	rs798149	rs845016	rs877309	rs9292570	rs9363764	rs939290	rs951295	rs966367	rs9839873
9996247040_R03C02	0.796	0.961	0.853	0.246	0.902	0.583	0.930	0.465	0.398	0.008	...	0.468	0.374	0.079	0.016	0.978	0.544	0.961	0.981	0.883	0.614
9996247040_R03C01	0.887	0.960	0.801	0.271	0.902	0.672	0.953	0.341	0.552	0.015	...	0.971	0.396	0.059	0.444	0.967	0.042	0.537	0.967	0.582	0.333
3998909005_R06C01	0.847	0.972	0.914	0.187	0.950	0.820	0.900	0.345	0.375	0.014	...	0.512	0.018	0.469	0.539	0.019	0.951	0.551	0.969	0.959	0.950
3998909005_R06C02	0.900	0.966	0.909	0.232	0.922	0.749	0.943	0.326	0.397	0.014	...	0.508	0.984	0.922	0.024	0.019	0.962	0.582	0.536	0.945	0.944
3998909206_R01C02	0.885	0.957	0.911	0.152	0.922	0.797	0.926	0.391	0.404	0.016	...	0.045	0.977	0.491	0.531	0.480	0.075	0.969	0.535	0.946	0.162

5 rows × 485577 columns

When the betas are loaded from the CSV files, the resulting dataframe still contains control probes. The cell below shows how to remove the control probes (the columns whose IDs start with rs) from your betas dataframe.

[23]:

# Select the control probes: columns whose probe ID starts with 'rs'
rs_probes = betas.columns[betas.columns.str.startswith('rs')]
betas_nocontrol = betas.drop(rs_probes, axis=1)
print(betas_nocontrol.shape)
# Sort the samples (rows) by ID
betas_nocontrol = betas_nocontrol.sort_index()
betas_nocontrol

(121, 485512)

[23]:

IlmnID	cg00000029	cg00000108	cg00000109	cg00000165	cg00000236	cg00000289	cg00000292	cg00000321	cg00000363	cg00000622	...	ch.X.93511680F	ch.X.938089F	ch.X.94051109R	ch.X.94260649R	ch.X.967194F	ch.X.97129969R	ch.X.97133160R	ch.X.97651759F	ch.X.97737721F	ch.X.98007042R
100946230055_R04C01	0.864	0.971	0.925	0.288	0.935	0.654	0.945	0.378	0.468	0.010	...	0.046	0.036	0.031	0.171	0.150	0.108	0.076	0.022	0.063	0.078
100946230056_R04C01	0.854	0.978	0.930	0.215	0.932	0.639	0.971	0.421	0.397	0.012	...	0.034	0.040	0.034	0.105	0.149	0.094	0.059	0.019	0.058	0.057
100946230056_R04C02	0.879	0.958	0.866	0.257	0.899	0.604	0.971	0.191	0.560	0.012	...	0.038	0.057	0.045	0.253	0.232	0.186	0.090	0.022	0.073	0.128
101032570143_R04C02	0.837	0.968	0.911	0.334	0.918	0.765	0.951	0.435	0.456	0.011	...	0.034	0.098	0.058	0.329	0.392	0.447	0.206	0.032	0.070	0.142
101032570152_R04C01	0.813	0.971	0.928	0.164	0.934	0.810	0.955	0.358	0.372	0.011	...	0.044	0.058	0.036	0.171	0.350	0.222	0.137	0.031	0.061	0.152
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996247054_R03C01	0.840	0.957	0.871	0.253	0.917	0.673	0.932	0.374	0.374	0.012	...	0.050	0.037	0.040	0.234	0.355	0.177	0.052	0.029	0.055	0.099
9996247054_R03C02	0.864	0.963	0.864	0.194	0.885	0.682	0.929	0.393	0.418	0.019	...	0.047	0.048	0.029	0.155	0.283	0.187	0.065	0.037	0.086	0.180
9996247055_R03C01	0.817	0.956	0.842	0.292	0.887	0.671	0.903	0.423	0.490	0.014	...	0.037	0.042	0.026	0.191	0.395	0.248	0.057	0.031	0.067	0.132
9996247055_R03C02	0.801	0.965	0.869	0.357	0.893	0.647	0.962	0.340	0.467	0.013	...	0.046	0.045	0.025	0.157	0.384	0.143	0.059	0.024	0.055	0.097
9996247056_R05C02	0.837	0.976	0.945	0.244	0.957	0.759	0.927	0.402	0.454	0.013	...	0.049	0.030	0.025	0.052	0.123	0.046	0.045	0.026	0.065	0.063

121 rows × 485512 columns
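
The same filter can also be written as a boolean column mask, which avoids building an intermediate list of probe names. The following is a minimal sketch on toy data; the sample names and values are made up for illustration and are not taken from the dataset above.

```python
import pandas as pd

# Toy samples-by-probes frame; IDs and values are illustrative only.
toy = pd.DataFrame(
    {
        "cg00000029": [0.80, 0.85],
        "rs7746156": [0.47, 0.97],
        "ch.X.967194F": [0.15, 0.15],
    },
    index=["sample_B", "sample_A"],
)

# Keep every column whose name does not start with 'rs', then sort the samples.
is_rs = toy.columns.str.startswith("rs")
toy_nocontrol = toy.loc[:, ~is_rs].sort_index()

print(toy_nocontrol.columns.tolist())  # ['cg00000029', 'ch.X.967194F']
print(toy_nocontrol.index.tolist())    # ['sample_A', 'sample_B']
```

Either form should give the same result; the mask version is mostly a matter of taste.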

Load p-values in a dataframe¶

This cell reads the pOOBAH p-values into a dataframe, which should have the same dimensions as the betas dataframe once the control probes have been removed. Each cell holds the p-value of one probe for one sample. If a p-value is >= 0.05, that probe is considered likely to have failed for that sample. A failed probe is one whose true signal is not distinguishable from the background fluorescence.

[19]:

p = pd.read_pickle('data/GPL13534/poobah_values.pkl').T
#p.index.name = 'Samples'
print(p.shape)
assert p.shape == betas_nocontrol.shape
print(f'Number of p-values >= 0.05: {(p>=0.05).sum().sum()}')
p = p.sort_index()  # sort the samples (rows) by ID
p.head()

(121, 485512)
Number of p-values >= 0.05: 1546688

[19]:

IlmnID	cg00000029	cg00000108	cg00000109	cg00000165	cg00000236	cg00000289	cg00000321	cg00000363	...	ch.X.93511680F	ch.X.938089F	ch.X.94051109R	ch.X.94260649R	ch.X.967194F	ch.X.97129969R	ch.X.97133160R	ch.X.97651759F	ch.X.97737721F	ch.X.98007042R
100946230055_R04C01	0.003	0.000	0.002	0.038	0.002	0.057	0.002	0.001	...	NaN	0.006	0.004	0.073	NaN	0.037	NaN	0.001	NaN	NaN
100946230056_R04C01	0.004	0.000	0.002	0.026	0.001	0.082	0.001	0.001	...	NaN	0.004	0.003	0.038	NaN	0.035	NaN	0.001	NaN	NaN
100946230056_R04C02	0.018	0.001	0.028	0.096	0.007	0.130	0.005	0.002	...	NaN	0.010	0.006	0.127	NaN	0.082	NaN	0.001	NaN	NaN
101032570143_R04C02	0.004	0.001	0.004	0.054	0.002	0.045	0.001	0.001	...	NaN	0.025	0.014	0.241	NaN	0.535	NaN	0.003	NaN	NaN
101032570152_R04C01	0.002	0.000	0.002	0.023	0.001	0.018	0.002	0.001	...	NaN	0.014	0.006	0.086	NaN	0.117	NaN	0.003	NaN	NaN

5 rows × 485512 columns
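
Matching shapes do not by themselves guarantee matching labels. As an optional extra check, the sketch below uses toy data (all names are illustrative) to confirm that two frames carry the same sample and probe labels, and then reorders one to match the other with pandas' `reindex_like`.

```python
import pandas as pd

betas_toy = pd.DataFrame(
    {"cg01": [0.9, 0.8], "cg02": [0.1, 0.2]},
    index=["s1", "s2"],
)
# Same labels as betas_toy, but rows and columns arrive in a different order.
p_toy = pd.DataFrame(
    {"cg02": [0.20, 0.01], "cg01": [0.03, 0.00]},
    index=["s2", "s1"],
)

assert p_toy.shape == betas_toy.shape                # passes, yet the order differs
assert set(p_toy.index) == set(betas_toy.index)      # same samples
assert set(p_toy.columns) == set(betas_toy.columns)  # same probes

# Reorder p_toy so that its rows and columns line up with betas_toy.
p_aligned = p_toy.reindex_like(betas_toy)
print(p_aligned.index.equals(betas_toy.index))      # True
print(p_aligned.columns.equals(betas_toy.columns))  # True
```

pandas aligns on labels in most element-wise operations anyway, so this check is mainly useful for catching missing or extra samples and probes early.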

Mask Beta values where probe fails¶

When the p-value of a probe for a specific sample is >= 0.05, the probe has likely failed, which means that its beta value may not be accurate. It is therefore a good idea to mask these beta values with NaN.

[24]:

cutoff = 0.05
betas_filtered = betas_nocontrol.mask((p>=cutoff), np.nan)

print(betas_filtered.shape)
print(f'Masked {betas_filtered.isna().sum().sum() - betas_nocontrol.isna().sum().sum()} beta values')
betas_filtered

(121, 485512)
Masked 1546688 beta values

[24]:

IlmnID	cg00000029	cg00000108	cg00000109	cg00000165	cg00000236	cg00000289	cg00000292	cg00000321	cg00000363	cg00000622	...	ch.X.93511680F	ch.X.938089F	ch.X.94051109R	ch.X.94260649R	ch.X.967194F	ch.X.97129969R	ch.X.97133160R	ch.X.97651759F	ch.X.97737721F	ch.X.98007042R
100946230055_R04C01	0.864	0.971	0.925	0.288	0.935	NaN	0.945	0.378	0.468	0.010	...	0.046	0.036	0.031	NaN	0.150	0.108	0.076	0.022	0.063	0.078
100946230056_R04C01	0.854	0.978	0.930	0.215	0.932	NaN	0.971	0.421	0.397	0.012	...	0.034	0.040	0.034	0.105	0.149	0.094	0.059	0.019	0.058	0.057
100946230056_R04C02	0.879	0.958	0.866	NaN	0.899	NaN	0.971	0.191	0.560	0.012	...	0.038	0.057	0.045	NaN	0.232	NaN	0.090	0.022	0.073	0.128
101032570143_R04C02	0.837	0.968	0.911	NaN	0.918	0.765	0.951	0.435	0.456	0.011	...	0.034	0.098	0.058	NaN	0.392	NaN	0.206	0.032	0.070	0.142
101032570152_R04C01	0.813	0.971	0.928	0.164	0.934	0.810	0.955	0.358	0.372	0.011	...	0.044	0.058	0.036	NaN	0.350	NaN	0.137	0.031	0.061	0.152
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996247054_R03C01	0.840	0.957	0.871	0.253	0.917	0.673	0.932	0.374	0.374	0.012	...	0.050	0.037	0.040	NaN	0.355	NaN	0.052	0.029	0.055	0.099
9996247054_R03C02	0.864	0.963	0.864	0.194	0.885	NaN	0.929	0.393	0.418	0.019	...	0.047	0.048	0.029	0.155	0.283	NaN	0.065	0.037	0.086	0.180
9996247055_R03C01	0.817	0.956	0.842	NaN	0.887	0.671	0.903	0.423	0.490	0.014	...	0.037	0.042	0.026	NaN	0.395	NaN	0.057	0.031	0.067	0.132
9996247055_R03C02	0.801	0.965	0.869	NaN	0.893	NaN	0.962	0.340	0.467	0.013	...	0.046	0.045	0.025	0.157	0.384	0.143	0.059	0.024	0.055	0.097
9996247056_R05C02	0.837	0.976	0.945	0.244	0.957	0.759	0.927	0.402	0.454	0.013	...	0.049	0.030	0.025	0.052	0.123	0.046	0.045	0.026	0.065	0.063

121 rows × 485512 columns
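
To make the masking rule concrete, here is a small sketch on toy data (probe names, sample names, and values are invented for illustration). It applies the same `>= cutoff` comparison and `DataFrame.mask` call to a 2 x 2 example.

```python
import numpy as np
import pandas as pd

betas_toy = pd.DataFrame(
    {"cg01": [0.90, 0.80], "cg02": [0.10, 0.20]},
    index=["s1", "s2"],
)
p_toy = pd.DataFrame(
    {"cg01": [0.001, 0.050], "cg02": [0.300, 0.010]},
    index=["s1", "s2"],
)

cutoff = 0.05
failed = p_toy >= cutoff         # True where the probe failed for that sample
masked = betas_toy.mask(failed)  # failed cells become NaN by default

print(int(failed.sum().sum()))   # 2
print(masked)
```

Note that a missing p-value compares as False under `>=`, so beta values whose p-value is NaN are left unmasked by this rule.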

Start here if you have already masked your beta values based on p-values or had that done automatically¶

Remove Samples based on Percent or Number of Failed Probes¶

[40]:

percent_cutoff = 0.2  # maximum fraction of failed probes allowed per sample (20% = 0.2)
# Keep the samples whose fraction of masked (NaN) probes does not exceed the cutoff
qc_betas = betas_filtered[~((betas_filtered.T.isna().sum() / betas_filtered.shape[1]) > percent_cutoff)]

# To remove samples based on a number of failed probes rather than a percentage, use the following 2 lines instead:
#number_cutoff = 20000
#qc_betas = betas_filtered[~(betas_filtered.T.isna().sum() >=  number_cutoff)]

print(f'{betas_filtered.shape[0] - qc_betas.shape[0]} sample(s) removed because of pOOBAH failure')
print(f'Sample(s) removed: {set(betas_filtered.index) - set(qc_betas.index)}')
print(qc_betas.shape)
qc_betas

1 sample(s) removed because of pOOBAH failure
Sample(s) removed: {'101032570169_R04C02'}
(120, 485512)

[40]:

IlmnID	cg00000029	cg00000108	cg00000109	cg00000165	cg00000236	cg00000289	cg00000292	cg00000321	cg00000363	cg00000622	...	ch.X.93511680F	ch.X.938089F	ch.X.94051109R	ch.X.94260649R	ch.X.967194F	ch.X.97129969R	ch.X.97133160R	ch.X.97651759F	ch.X.97737721F	ch.X.98007042R
100946230055_R04C01	0.864	0.971	0.925	0.288	0.935	NaN	0.945	0.378	0.468	0.010	...	0.046	0.036	0.031	NaN	0.150	0.108	0.076	0.022	0.063	0.078
100946230056_R04C01	0.854	0.978	0.930	0.215	0.932	NaN	0.971	0.421	0.397	0.012	...	0.034	0.040	0.034	0.105	0.149	0.094	0.059	0.019	0.058	0.057
100946230056_R04C02	0.879	0.958	0.866	NaN	0.899	NaN	0.971	0.191	0.560	0.012	...	0.038	0.057	0.045	NaN	0.232	NaN	0.090	0.022	0.073	0.128
101032570143_R04C02	0.837	0.968	0.911	NaN	0.918	0.765	0.951	0.435	0.456	0.011	...	0.034	0.098	0.058	NaN	0.392	NaN	0.206	0.032	0.070	0.142
101032570152_R04C01	0.813	0.971	0.928	0.164	0.934	0.810	0.955	0.358	0.372	0.011	...	0.044	0.058	0.036	NaN	0.350	NaN	0.137	0.031	0.061	0.152
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996247054_R03C01	0.840	0.957	0.871	0.253	0.917	0.673	0.932	0.374	0.374	0.012	...	0.050	0.037	0.040	NaN	0.355	NaN	0.052	0.029	0.055	0.099
9996247054_R03C02	0.864	0.963	0.864	0.194	0.885	NaN	0.929	0.393	0.418	0.019	...	0.047	0.048	0.029	0.155	0.283	NaN	0.065	0.037	0.086	0.180
9996247055_R03C01	0.817	0.956	0.842	NaN	0.887	0.671	0.903	0.423	0.490	0.014	...	0.037	0.042	0.026	NaN	0.395	NaN	0.057	0.031	0.067	0.132
9996247055_R03C02	0.801	0.965	0.869	NaN	0.893	NaN	0.962	0.340	0.467	0.013	...	0.046	0.045	0.025	0.157	0.384	0.143	0.059	0.024	0.055	0.097
9996247056_R05C02	0.837	0.976	0.945	0.244	0.957	0.759	0.927	0.402	0.454	0.013	...	0.049	0.030	0.025	0.052	0.123	0.046	0.045	0.026	0.065	0.063

120 rows × 485512 columns
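
The percentage-based filter can be wrapped in a small reusable function. This is only a sketch: `drop_failed_samples` and its parameter name are hypothetical helpers, not part of methylcheck, and the data is a toy example.

```python
import numpy as np
import pandas as pd

def drop_failed_samples(betas, max_failed_fraction=0.2):
    """Return the rows (samples) whose fraction of NaN probes is at most the cutoff."""
    failed_fraction = betas.isna().mean(axis=1)  # per-sample fraction of NaNs
    return betas.loc[failed_fraction <= max_failed_fraction]

# Toy samples-by-probes frame; values are illustrative only.
toy = pd.DataFrame(
    {
        "cg01": [0.9, np.nan, 0.8],
        "cg02": [0.1, np.nan, 0.2],
        "cg03": [0.5, 0.4, 0.6],
        "cg04": [0.3, 0.3, np.nan],
    },
    index=["s1", "s2", "s3"],
)

kept = drop_failed_samples(toy, max_failed_fraction=0.25)
print(kept.index.tolist())  # ['s1', 's3']
```

Here s2 is dropped because half of its probes are NaN, while s3 sits exactly at the cutoff and is kept, matching the strict `>` comparison used in the cell above.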

Drop out Probes with a Percentage of NaNs¶

To drop probes that are NaN in all samples, or in more than a chosen percentage of samples, use the code below. However, in some scenarios you will need to add those probe columns back, so use this step only if necessary.

[46]:

threshold = 0.95  # minimum fraction of samples with a non-NaN value required to keep a probe
final_betas = qc_betas.dropna(axis=1, thresh=int(threshold*qc_betas.shape[0]))
print(f'{qc_betas.shape[1] - final_betas.shape[1]} probe(s) removed because of NaNs')
#print(f'Probe(s) removed: {set(qc_betas.columns) - set(final_betas.columns)}') #could be a long output
final_betas

40600 probe(s) removed because of NaNs

[46]:

IlmnID	cg00000029	cg00000108	cg00000236	cg00000292	cg00000321	cg00000363	cg00000622	cg00000658	cg00000714	cg00000721	...	ch.X.92543860F	ch.X.92554290F	ch.X.93511680F	ch.X.938089F	ch.X.94051109R	ch.X.967194F	ch.X.97133160R	ch.X.97651759F	ch.X.97737721F	ch.X.98007042R
100946230055_R04C01	0.864	0.971	0.935	0.945	0.378	0.468	0.010	0.862	0.249	0.936	...	0.026	0.023	0.046	0.036	0.031	0.150	0.076	0.022	0.063	0.078
100946230056_R04C01	0.854	0.978	0.932	0.971	0.421	0.397	0.012	0.907	0.278	0.954	...	0.022	0.023	0.034	0.040	0.034	0.149	0.059	0.019	0.058	0.057
100946230056_R04C02	0.879	0.958	0.899	0.971	0.191	0.560	0.012	0.871	0.235	0.923	...	0.029	0.022	0.038	0.057	0.045	0.232	0.090	0.022	0.073	0.128
101032570143_R04C02	0.837	0.968	0.918	0.951	0.435	0.456	0.011	0.912	0.343	0.938	...	0.033	0.038	0.034	0.098	0.058	0.392	0.206	0.032	0.070	0.142
101032570152_R04C01	0.813	0.971	0.934	0.955	0.358	0.372	0.011	0.835	0.333	0.957	...	0.027	0.032	0.044	0.058	0.036	0.350	0.137	0.031	0.061	0.152
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996247054_R03C01	0.840	0.957	0.917	0.932	0.374	0.374	0.012	0.904	0.411	0.923	...	0.030	0.028	0.050	0.037	0.040	0.355	0.052	0.029	0.055	0.099
9996247054_R03C02	0.864	0.963	0.885	0.929	0.393	0.418	0.019	0.884	0.335	0.927	...	0.033	0.037	0.047	0.048	0.029	0.283	0.065	0.037	0.086	0.180
9996247055_R03C01	0.817	0.956	0.887	0.903	0.423	0.490	0.014	0.871	0.391	0.894	...	0.026	0.032	0.037	0.042	0.026	0.395	0.057	0.031	0.067	0.132
9996247055_R03C02	0.801	0.965	0.893	0.962	0.340	0.467	0.013	0.892	0.309	0.932	...	0.033	0.028	0.046	0.045	0.025	0.384	0.059	0.024	0.055	0.097
9996247056_R05C02	0.837	0.976	0.957	0.927	0.402	0.454	0.013	0.912	0.299	0.945	...	0.029	0.030	0.049	0.030	0.025	0.123	0.045	0.026	0.065	0.063

120 rows × 444912 columns
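
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-NaN values a column needs in order to be kept, not the number of NaNs that triggers removal. A minimal sketch on toy data (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Four samples, three probes; values are illustrative only.
toy = pd.DataFrame(
    {
        "cg01": [0.9, 0.8, 0.7, 0.6],        # 4 valid values
        "cg02": [0.1, np.nan, 0.2, 0.3],     # 3 valid values
        "cg03": [np.nan, np.nan, 0.5, 0.4],  # 2 valid values
    },
    index=["s1", "s2", "s3", "s4"],
)

threshold = 0.75
min_valid = int(threshold * toy.shape[0])  # int(3.0) == 3
kept = toy.dropna(axis=1, thresh=min_valid)
print(kept.columns.tolist())  # ['cg01', 'cg02']
```

Because `int()` truncates toward zero, a fractional product is rounded down, which makes the filter slightly more permissive than the nominal threshold.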

Another way to identify a bad sample is to predict the sex of each sample and compare the prediction with the recorded sex, if that information is available. A mismatch indicates that the sample needs further investigation and could potentially be removed.
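
Once predictions are available, the comparison itself is a few lines of pandas. The sketch below assumes the predicted and recorded sexes are already held in two Series indexed by sample ID; the Series names, labels, and values are hypothetical, and the prediction step itself is not shown.

```python
import pandas as pd

# Hypothetical inputs: predicted sex from a prediction tool, recorded sex from a sample sheet.
predicted = pd.Series({"s1": "F", "s2": "M", "s3": "M"}, name="predicted_sex")
recorded = pd.Series({"s1": "F", "s2": "F", "s3": "M"}, name="recorded_sex")

# Join on sample ID and flag disagreements.
comparison = pd.concat([predicted, recorded], axis=1)
comparison["mismatch"] = comparison["predicted_sex"] != comparison["recorded_sex"]

flagged = comparison.index[comparison["mismatch"]].tolist()
print(flagged)  # ['s2']
```

Flagged samples are candidates for closer inspection rather than automatic removal.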

If you plan to use your beta values in a machine learning model, you may want to filter out the probes on the sex chromosomes to avoid introducing sex bias into your model.
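
One way to do this with plain pandas is to look up each probe's chromosome and drop the X and Y probes. The sketch below assumes a probe-to-chromosome mapping is available, for example from the array manifest; `probe_chrom` and all IDs are hypothetical toy values. methylcheck also documents an `exclude_sex_control_probes` function for this purpose; check its documentation for the exact arguments.

```python
import pandas as pd

# Hypothetical probe-to-chromosome lookup, as might be derived from an array manifest.
probe_chrom = pd.Series({"cg01": "1", "cg02": "X", "cg03": "Y", "cg04": "7"})

# Toy samples-by-probes frame; values are illustrative only.
betas_toy = pd.DataFrame(
    [[0.9, 0.1, 0.5, 0.3], [0.8, 0.2, 0.6, 0.4]],
    index=["s1", "s2"],
    columns=["cg01", "cg02", "cg03", "cg04"],
)

sex_probes = probe_chrom.index[probe_chrom.isin(["X", "Y"])]
# intersection() guards against probes listed in the lookup but absent from the data.
autosomal = betas_toy.drop(columns=betas_toy.columns.intersection(sex_probes))
print(autosomal.columns.tolist())  # ['cg01', 'cg04']
```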