Custom QC with pOOBAH Values

This tutorial is for users who want more control over the quality control of their beta values. Methylprep provides some automatic QC by default, but in this tutorial we will go over how to perform those steps manually, with customizable parameters.

[1]:
import methylcheck
import pandas as pd
import numpy as np

Filepath of the processed files (downloading and processing performed with the Methylprep package)

[2]:
fpath = 'data/GPL13534/'

Load the Beta Values into a dataframe

Each column is a probe on the methylation array and each row is a sample in the dataset. Note that methylcheck.load returns probes as rows and samples as columns, so if you want the dataframe in this orientation, you will need to transpose it (the .T below).

We use format='beta_csv' in methylcheck.load because it loads the raw beta values without any processing. By default, methylprep performs some QC on the beta values automatically and saves the resulting values in beta_values.pkl. Specifically, it masks failed probes using Sesame's pOOBAH method, where a probe is classified as failed when its p-value >= 0.05.

If you want to apply the pOOBAH masking to the beta values yourself, you must specify no_poobah=True. Otherwise, methylcheck will mask them automatically when the CSVs are loaded into a dataframe.
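
For comparison, here is a minimal sketch of the same call without no_poobah=True, which lets methylcheck apply the pOOBAH mask during loading so that failed probes are already NaN in the resulting dataframe:

# load beta values with pOOBAH masking applied automatically during loading
betas_masked = methylcheck.load('data/GPL13534', format='beta_csv').T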

[22]:
betas = methylcheck.load('data/GPL13534', format='beta_csv', no_poobah=True).T
#betas.index.name = 'Samples'
print(betas.shape)
betas.head()
Files: 100%|██████████| 121/121 [00:52<00:00,  2.30it/s]
INFO:methylcheck.load_processed:merging...
100%|██████████| 121/121 [00:00<00:00, 692.51it/s]
(121, 485577)
[22]:
IlmnID cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289 cg00000292 cg00000321 cg00000363 cg00000622 ... rs7746156 rs798149 rs845016 rs877309 rs9292570 rs9363764 rs939290 rs951295 rs966367 rs9839873
9996247040_R03C02 0.796 0.961 0.853 0.246 0.902 0.583 0.930 0.465 0.398 0.008 ... 0.468 0.374 0.079 0.016 0.978 0.544 0.961 0.981 0.883 0.614
9996247040_R03C01 0.887 0.960 0.801 0.271 0.902 0.672 0.953 0.341 0.552 0.015 ... 0.971 0.396 0.059 0.444 0.967 0.042 0.537 0.967 0.582 0.333
3998909005_R06C01 0.847 0.972 0.914 0.187 0.950 0.820 0.900 0.345 0.375 0.014 ... 0.512 0.018 0.469 0.539 0.019 0.951 0.551 0.969 0.959 0.950
3998909005_R06C02 0.900 0.966 0.909 0.232 0.922 0.749 0.943 0.326 0.397 0.014 ... 0.508 0.984 0.922 0.024 0.019 0.962 0.582 0.536 0.945 0.944
3998909206_R01C02 0.885 0.957 0.911 0.152 0.922 0.797 0.926 0.391 0.404 0.016 ... 0.045 0.977 0.491 0.531 0.480 0.075 0.969 0.535 0.946 0.162

5 rows × 485577 columns

When loading the betas from the CSVs, the resulting dataframe still contains the control (SNP) probes, whose IlmnIDs begin with 'rs'. The cell below shows how to remove all of these probes from your betas dataframe.

[23]:
rs_probes = betas.columns[betas.columns.str.startswith('rs')]
betas_nocontrol = betas.drop(rs_probes, axis=1)
print(betas_nocontrol.shape)
betas_nocontrol = betas_nocontrol.sort_index() # sort samples (rows) by sample ID
betas_nocontrol
(121, 485512)
[23]:
IlmnID cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289 cg00000292 cg00000321 cg00000363 cg00000622 ... ch.X.93511680F ch.X.938089F ch.X.94051109R ch.X.94260649R ch.X.967194F ch.X.97129969R ch.X.97133160R ch.X.97651759F ch.X.97737721F ch.X.98007042R
100946230055_R04C01 0.864 0.971 0.925 0.288 0.935 0.654 0.945 0.378 0.468 0.010 ... 0.046 0.036 0.031 0.171 0.150 0.108 0.076 0.022 0.063 0.078
100946230056_R04C01 0.854 0.978 0.930 0.215 0.932 0.639 0.971 0.421 0.397 0.012 ... 0.034 0.040 0.034 0.105 0.149 0.094 0.059 0.019 0.058 0.057
100946230056_R04C02 0.879 0.958 0.866 0.257 0.899 0.604 0.971 0.191 0.560 0.012 ... 0.038 0.057 0.045 0.253 0.232 0.186 0.090 0.022 0.073 0.128
101032570143_R04C02 0.837 0.968 0.911 0.334 0.918 0.765 0.951 0.435 0.456 0.011 ... 0.034 0.098 0.058 0.329 0.392 0.447 0.206 0.032 0.070 0.142
101032570152_R04C01 0.813 0.971 0.928 0.164 0.934 0.810 0.955 0.358 0.372 0.011 ... 0.044 0.058 0.036 0.171 0.350 0.222 0.137 0.031 0.061 0.152
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996247054_R03C01 0.840 0.957 0.871 0.253 0.917 0.673 0.932 0.374 0.374 0.012 ... 0.050 0.037 0.040 0.234 0.355 0.177 0.052 0.029 0.055 0.099
9996247054_R03C02 0.864 0.963 0.864 0.194 0.885 0.682 0.929 0.393 0.418 0.019 ... 0.047 0.048 0.029 0.155 0.283 0.187 0.065 0.037 0.086 0.180
9996247055_R03C01 0.817 0.956 0.842 0.292 0.887 0.671 0.903 0.423 0.490 0.014 ... 0.037 0.042 0.026 0.191 0.395 0.248 0.057 0.031 0.067 0.132
9996247055_R03C02 0.801 0.965 0.869 0.357 0.893 0.647 0.962 0.340 0.467 0.013 ... 0.046 0.045 0.025 0.157 0.384 0.143 0.059 0.024 0.055 0.097
9996247056_R05C02 0.837 0.976 0.945 0.244 0.957 0.759 0.927 0.402 0.454 0.013 ... 0.049 0.030 0.025 0.052 0.123 0.046 0.045 0.026 0.065 0.063

121 rows × 485512 columns

Load p-values into a dataframe

This reads the pOOBAH values into a dataframe, which should have the same dimensions as the betas dataframe. Each cell in this dataframe is the p-value for a given probe in a given sample. If a p-value is >= 0.05, it is more likely that that probe failed for that sample. A failed probe means that the probe's true signal is not distinguishable from the background fluorescence.

[19]:
p = pd.read_pickle('data/GPL13534/poobah_values.pkl').T
#p.index.name = 'Samples'
print(p.shape)
assert p.shape == betas_nocontrol.shape
print(f'Number of p-values >= 0.05: {(p>=0.05).sum().sum()}')
p = p.sort_index() # sort samples (rows) to match the betas dataframe
p.head()
(121, 485512)
Number of p-values >= 0.05: 1546688
[19]:
IlmnID cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289 cg00000292 cg00000321 cg00000363 cg00000622 ... ch.X.93511680F ch.X.938089F ch.X.94051109R ch.X.94260649R ch.X.967194F ch.X.97129969R ch.X.97133160R ch.X.97651759F ch.X.97737721F ch.X.98007042R
100946230055_R04C01 0.003 0.000 0.002 0.038 0.002 0.057 0.0 0.002 0.001 0.0 ... NaN 0.006 0.004 0.073 NaN 0.037 NaN 0.001 NaN NaN
100946230056_R04C01 0.004 0.000 0.002 0.026 0.001 0.082 0.0 0.001 0.001 0.0 ... NaN 0.004 0.003 0.038 NaN 0.035 NaN 0.001 NaN NaN
100946230056_R04C02 0.018 0.001 0.028 0.096 0.007 0.130 0.0 0.005 0.002 0.0 ... NaN 0.010 0.006 0.127 NaN 0.082 NaN 0.001 NaN NaN
101032570143_R04C02 0.004 0.001 0.004 0.054 0.002 0.045 0.0 0.001 0.001 0.0 ... NaN 0.025 0.014 0.241 NaN 0.535 NaN 0.003 NaN NaN
101032570152_R04C01 0.002 0.000 0.002 0.023 0.001 0.018 0.0 0.002 0.001 0.0 ... NaN 0.014 0.006 0.086 NaN 0.117 NaN 0.003 NaN NaN

5 rows × 485512 columns

Mask Beta values where probes fail

When the p-value of a probe for a given sample is >= 0.05, it is more likely that the probe has failed, which means that its beta value may not be accurate. Because of this, it is a good idea to mask these beta values with NaN.

[24]:
cutoff = 0.05
betas_filtered = betas_nocontrol.mask((p>=cutoff), np.nan)

print(betas_filtered.shape)
print(f'Masked {betas_filtered.isna().sum().sum() - betas_nocontrol.isna().sum().sum()} beta values')
betas_filtered
(121, 485512)
Masked 1546688 beta values
[24]:
IlmnID cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289 cg00000292 cg00000321 cg00000363 cg00000622 ... ch.X.93511680F ch.X.938089F ch.X.94051109R ch.X.94260649R ch.X.967194F ch.X.97129969R ch.X.97133160R ch.X.97651759F ch.X.97737721F ch.X.98007042R
100946230055_R04C01 0.864 0.971 0.925 0.288 0.935 NaN 0.945 0.378 0.468 0.010 ... 0.046 0.036 0.031 NaN 0.150 0.108 0.076 0.022 0.063 0.078
100946230056_R04C01 0.854 0.978 0.930 0.215 0.932 NaN 0.971 0.421 0.397 0.012 ... 0.034 0.040 0.034 0.105 0.149 0.094 0.059 0.019 0.058 0.057
100946230056_R04C02 0.879 0.958 0.866 NaN 0.899 NaN 0.971 0.191 0.560 0.012 ... 0.038 0.057 0.045 NaN 0.232 NaN 0.090 0.022 0.073 0.128
101032570143_R04C02 0.837 0.968 0.911 NaN 0.918 0.765 0.951 0.435 0.456 0.011 ... 0.034 0.098 0.058 NaN 0.392 NaN 0.206 0.032 0.070 0.142
101032570152_R04C01 0.813 0.971 0.928 0.164 0.934 0.810 0.955 0.358 0.372 0.011 ... 0.044 0.058 0.036 NaN 0.350 NaN 0.137 0.031 0.061 0.152
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996247054_R03C01 0.840 0.957 0.871 0.253 0.917 0.673 0.932 0.374 0.374 0.012 ... 0.050 0.037 0.040 NaN 0.355 NaN 0.052 0.029 0.055 0.099
9996247054_R03C02 0.864 0.963 0.864 0.194 0.885 NaN 0.929 0.393 0.418 0.019 ... 0.047 0.048 0.029 0.155 0.283 NaN 0.065 0.037 0.086 0.180
9996247055_R03C01 0.817 0.956 0.842 NaN 0.887 0.671 0.903 0.423 0.490 0.014 ... 0.037 0.042 0.026 NaN 0.395 NaN 0.057 0.031 0.067 0.132
9996247055_R03C02 0.801 0.965 0.869 NaN 0.893 NaN 0.962 0.340 0.467 0.013 ... 0.046 0.045 0.025 0.157 0.384 0.143 0.059 0.024 0.055 0.097
9996247056_R05C02 0.837 0.976 0.945 0.244 0.957 0.759 0.927 0.402 0.454 0.013 ... 0.049 0.030 0.025 0.052 0.123 0.046 0.045 0.026 0.065 0.063

121 rows × 485512 columns

Start here if you have already masked your beta values based on p-values or had that done automatically

Remove Samples based on Percent or Number of Failed Probes
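
Before choosing a cutoff, it can help to look at the distribution of per-sample failure rates; a minimal sketch using the p dataframe loaded above:

# fraction of probes with pOOBAH p-value >= 0.05 in each sample
failure_rates = (p >= 0.05).sum(axis=1) / p.shape[1]
print(failure_rates.describe())
print(failure_rates.sort_values(ascending=False).head(10)) # worst samples first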

[40]:
percent_cutoff = 0.2 #use a percent in decimal format (20% = 0.2)
qc_betas = betas_filtered[~((betas_filtered.T.isna().sum() / betas_filtered.shape[1]) > percent_cutoff)]

#if you want to remove samples based off a number threshold rather than a percentage, use the following 2 lines:
#number_cutoff = 20000
#qc_betas = betas_filtered[~(betas_filtered.T.isna().sum() >=  number_cutoff)]

print(f'{betas_filtered.shape[0] - qc_betas.shape[0]} sample(s) removed because of pOOBAH failure')
print(f'Sample(s) removed: {set(betas_filtered.index) - set(qc_betas.index)}')
print(qc_betas.shape)
qc_betas
1 sample(s) removed because of pOOBAH failure
Sample(s) removed: {'101032570169_R04C02'}
(120, 485512)
[40]:
IlmnID cg00000029 cg00000108 cg00000109 cg00000165 cg00000236 cg00000289 cg00000292 cg00000321 cg00000363 cg00000622 ... ch.X.93511680F ch.X.938089F ch.X.94051109R ch.X.94260649R ch.X.967194F ch.X.97129969R ch.X.97133160R ch.X.97651759F ch.X.97737721F ch.X.98007042R
100946230055_R04C01 0.864 0.971 0.925 0.288 0.935 NaN 0.945 0.378 0.468 0.010 ... 0.046 0.036 0.031 NaN 0.150 0.108 0.076 0.022 0.063 0.078
100946230056_R04C01 0.854 0.978 0.930 0.215 0.932 NaN 0.971 0.421 0.397 0.012 ... 0.034 0.040 0.034 0.105 0.149 0.094 0.059 0.019 0.058 0.057
100946230056_R04C02 0.879 0.958 0.866 NaN 0.899 NaN 0.971 0.191 0.560 0.012 ... 0.038 0.057 0.045 NaN 0.232 NaN 0.090 0.022 0.073 0.128
101032570143_R04C02 0.837 0.968 0.911 NaN 0.918 0.765 0.951 0.435 0.456 0.011 ... 0.034 0.098 0.058 NaN 0.392 NaN 0.206 0.032 0.070 0.142
101032570152_R04C01 0.813 0.971 0.928 0.164 0.934 0.810 0.955 0.358 0.372 0.011 ... 0.044 0.058 0.036 NaN 0.350 NaN 0.137 0.031 0.061 0.152
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996247054_R03C01 0.840 0.957 0.871 0.253 0.917 0.673 0.932 0.374 0.374 0.012 ... 0.050 0.037 0.040 NaN 0.355 NaN 0.052 0.029 0.055 0.099
9996247054_R03C02 0.864 0.963 0.864 0.194 0.885 NaN 0.929 0.393 0.418 0.019 ... 0.047 0.048 0.029 0.155 0.283 NaN 0.065 0.037 0.086 0.180
9996247055_R03C01 0.817 0.956 0.842 NaN 0.887 0.671 0.903 0.423 0.490 0.014 ... 0.037 0.042 0.026 NaN 0.395 NaN 0.057 0.031 0.067 0.132
9996247055_R03C02 0.801 0.965 0.869 NaN 0.893 NaN 0.962 0.340 0.467 0.013 ... 0.046 0.045 0.025 0.157 0.384 0.143 0.059 0.024 0.055 0.097
9996247056_R05C02 0.837 0.976 0.945 0.244 0.957 0.759 0.927 0.402 0.454 0.013 ... 0.049 0.030 0.025 0.052 0.123 0.046 0.045 0.026 0.065 0.063

120 rows × 485512 columns

Drop Probes Exceeding a Percentage of NaNs

If you want to drop probes that are all NaNs, or that exceed a certain percentage of NaNs, use the code below. Here, threshold is the minimum fraction of samples that must have a non-NaN value for a probe to be kept (0.95 keeps probes with values in at least 95% of samples). However, there are some scenarios where you will have to add those probe columns back, so only use this step if you need to.

[46]:
threshold = 0.95
final_betas = qc_betas.dropna(axis=1, thresh = int(threshold*qc_betas.shape[0]))
print(f'{qc_betas.shape[1] - final_betas.shape[1]} probe(s) removed because of NaNs')
#print(f'Probe(s) removed: {set(qc_betas.columns) - set(final_betas.columns)}') #could be a long output
final_betas
40600 probe(s) removed because of NaNs
[46]:
IlmnID cg00000029 cg00000108 cg00000236 cg00000292 cg00000321 cg00000363 cg00000622 cg00000658 cg00000714 cg00000721 ... ch.X.92543860F ch.X.92554290F ch.X.93511680F ch.X.938089F ch.X.94051109R ch.X.967194F ch.X.97133160R ch.X.97651759F ch.X.97737721F ch.X.98007042R
100946230055_R04C01 0.864 0.971 0.935 0.945 0.378 0.468 0.010 0.862 0.249 0.936 ... 0.026 0.023 0.046 0.036 0.031 0.150 0.076 0.022 0.063 0.078
100946230056_R04C01 0.854 0.978 0.932 0.971 0.421 0.397 0.012 0.907 0.278 0.954 ... 0.022 0.023 0.034 0.040 0.034 0.149 0.059 0.019 0.058 0.057
100946230056_R04C02 0.879 0.958 0.899 0.971 0.191 0.560 0.012 0.871 0.235 0.923 ... 0.029 0.022 0.038 0.057 0.045 0.232 0.090 0.022 0.073 0.128
101032570143_R04C02 0.837 0.968 0.918 0.951 0.435 0.456 0.011 0.912 0.343 0.938 ... 0.033 0.038 0.034 0.098 0.058 0.392 0.206 0.032 0.070 0.142
101032570152_R04C01 0.813 0.971 0.934 0.955 0.358 0.372 0.011 0.835 0.333 0.957 ... 0.027 0.032 0.044 0.058 0.036 0.350 0.137 0.031 0.061 0.152
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996247054_R03C01 0.840 0.957 0.917 0.932 0.374 0.374 0.012 0.904 0.411 0.923 ... 0.030 0.028 0.050 0.037 0.040 0.355 0.052 0.029 0.055 0.099
9996247054_R03C02 0.864 0.963 0.885 0.929 0.393 0.418 0.019 0.884 0.335 0.927 ... 0.033 0.037 0.047 0.048 0.029 0.283 0.065 0.037 0.086 0.180
9996247055_R03C01 0.817 0.956 0.887 0.903 0.423 0.490 0.014 0.871 0.391 0.894 ... 0.026 0.032 0.037 0.042 0.026 0.395 0.057 0.031 0.067 0.132
9996247055_R03C02 0.801 0.965 0.893 0.962 0.340 0.467 0.013 0.892 0.309 0.932 ... 0.033 0.028 0.046 0.045 0.025 0.384 0.059 0.024 0.055 0.097
9996247056_R05C02 0.837 0.976 0.957 0.927 0.402 0.454 0.013 0.912 0.299 0.945 ... 0.029 0.030 0.049 0.030 0.025 0.123 0.045 0.026 0.065 0.063

120 rows × 444912 columns
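
At this point, you may want to save the QC'd betas for downstream analysis. A minimal sketch, using a hypothetical output filename:

# save the filtered beta values alongside the other processed files
final_betas.to_pickle('data/GPL13534/qc_beta_values.pkl')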

Another way to tell if a sample is bad is to predict the sex of your samples and compare the predicted sex to the recorded sex, if that information is available. If the predicted sex does not match the recorded sex, this is an indicator that the sample needs to be investigated further and could potentially be removed.
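
methylcheck includes a sex predictor that works from the folder of processed files. Below is a minimal sketch, assuming the get_sex function in your methylcheck version returns a per-sample dataframe of predictions (the exact column names, such as predicted_sex, may vary by version):

# predict sex from the processed output folder
sex_df = methylcheck.get_sex('data/GPL13534')
print(sex_df.head())
# compare against recorded metadata, e.g. a hypothetical 'actual_sex' series indexed by sample ID:
# mismatches = sex_df[sex_df['predicted_sex'] != actual_sex]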

If you are planning on using your beta values in a machine learning model, you may also want to filter out the sex-linked probes to avoid introducing sex bias into your model.
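
A minimal sketch of one way to do this, assuming your methylcheck version provides exclude_sex_control_probes (check the documentation for the exact signature); the function is assumed here to expect probes as rows, so the dataframe is transposed in and back out:

# drop sex-chromosome probes (and any remaining control probes) for the 450k array
betas_no_sex = methylcheck.exclude_sex_control_probes(final_betas.T, '450k', no_sex=True, no_control=True).T
print(betas_no_sex.shape)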