Custom QC with pOOBAH Vales¶
This tutorial is meant for those who want to have more customization to their quality control of beta values. Methylprep provides some automatic QC by default, but in this tutorial, we will go over how to do this manually, and with customizable parameters.
[1]:
import methylcheck
import pandas as pd
import numpy as np
Filepath of the processed files (Download and processing performed with Methylprep package)
[2]:
fpath = 'data/GPL13534/'
Load the Beta Values in a dataframe¶
The columns are each probe in the methylation array and the rows are each sample in the dataset. Note that if you want the dataframe in this orientation, you will need to transpose it.
The reason behind why we are using the format='beta_csv'
in methylcheck.load
is because this loads the raw beta values without any processing. By default, methylprep does some QC on the beta values automatically and saves those new beta values in beta_values.pkl
. Specifically, it removes failed probes using Sesame pOOBAH method where a specific probe is classified as failed when the p-value >= 0.05.
If you want to use the pOOBAH to mask beta values yourself, you must specify no_poobah=True
. Otherwise, it will mask them automatically when the CSV is loaded into a dataframe.
[22]:
betas = methylcheck.load('data/GPL13534', format='beta_csv', no_poobah=True).T
#betas.index.name = 'Samples'
print(betas.shape)
betas.head()
Files: 100%|██████████| 121/121 [00:52<00:00, 2.30it/s]
INFO:methylcheck.load_processed:merging...
100%|██████████| 121/121 [00:00<00:00, 692.51it/s]
(121, 485577)
[22]:
IlmnID | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | ... | rs7746156 | rs798149 | rs845016 | rs877309 | rs9292570 | rs9363764 | rs939290 | rs951295 | rs966367 | rs9839873 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9996247040_R03C02 | 0.796 | 0.961 | 0.853 | 0.246 | 0.902 | 0.583 | 0.930 | 0.465 | 0.398 | 0.008 | ... | 0.468 | 0.374 | 0.079 | 0.016 | 0.978 | 0.544 | 0.961 | 0.981 | 0.883 | 0.614 |
9996247040_R03C01 | 0.887 | 0.960 | 0.801 | 0.271 | 0.902 | 0.672 | 0.953 | 0.341 | 0.552 | 0.015 | ... | 0.971 | 0.396 | 0.059 | 0.444 | 0.967 | 0.042 | 0.537 | 0.967 | 0.582 | 0.333 |
3998909005_R06C01 | 0.847 | 0.972 | 0.914 | 0.187 | 0.950 | 0.820 | 0.900 | 0.345 | 0.375 | 0.014 | ... | 0.512 | 0.018 | 0.469 | 0.539 | 0.019 | 0.951 | 0.551 | 0.969 | 0.959 | 0.950 |
3998909005_R06C02 | 0.900 | 0.966 | 0.909 | 0.232 | 0.922 | 0.749 | 0.943 | 0.326 | 0.397 | 0.014 | ... | 0.508 | 0.984 | 0.922 | 0.024 | 0.019 | 0.962 | 0.582 | 0.536 | 0.945 | 0.944 |
3998909206_R01C02 | 0.885 | 0.957 | 0.911 | 0.152 | 0.922 | 0.797 | 0.926 | 0.391 | 0.404 | 0.016 | ... | 0.045 | 0.977 | 0.491 | 0.531 | 0.480 | 0.075 | 0.969 | 0.535 | 0.946 | 0.162 |
5 rows × 485577 columns
When loading the betas from the CSV, there are still control probes in your resulting dataframe. The cell below shows how to remove all of the control probes from you betas dataframe.
[23]:
rs_probes = betas.columns[betas.columns.str.startswith('rs')]
betas_nocontrol = betas.drop(rs_probes, axis=1)
print(betas_nocontrol.shape)
betas_nocontrol = betas_nocontrol.T[betas_nocontrol.index.sort_values()].T
betas_nocontrol
(121, 485512)
[23]:
IlmnID | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | ... | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.94260649R | ch.X.967194F | ch.X.97129969R | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100946230055_R04C01 | 0.864 | 0.971 | 0.925 | 0.288 | 0.935 | 0.654 | 0.945 | 0.378 | 0.468 | 0.010 | ... | 0.046 | 0.036 | 0.031 | 0.171 | 0.150 | 0.108 | 0.076 | 0.022 | 0.063 | 0.078 |
100946230056_R04C01 | 0.854 | 0.978 | 0.930 | 0.215 | 0.932 | 0.639 | 0.971 | 0.421 | 0.397 | 0.012 | ... | 0.034 | 0.040 | 0.034 | 0.105 | 0.149 | 0.094 | 0.059 | 0.019 | 0.058 | 0.057 |
100946230056_R04C02 | 0.879 | 0.958 | 0.866 | 0.257 | 0.899 | 0.604 | 0.971 | 0.191 | 0.560 | 0.012 | ... | 0.038 | 0.057 | 0.045 | 0.253 | 0.232 | 0.186 | 0.090 | 0.022 | 0.073 | 0.128 |
101032570143_R04C02 | 0.837 | 0.968 | 0.911 | 0.334 | 0.918 | 0.765 | 0.951 | 0.435 | 0.456 | 0.011 | ... | 0.034 | 0.098 | 0.058 | 0.329 | 0.392 | 0.447 | 0.206 | 0.032 | 0.070 | 0.142 |
101032570152_R04C01 | 0.813 | 0.971 | 0.928 | 0.164 | 0.934 | 0.810 | 0.955 | 0.358 | 0.372 | 0.011 | ... | 0.044 | 0.058 | 0.036 | 0.171 | 0.350 | 0.222 | 0.137 | 0.031 | 0.061 | 0.152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9996247054_R03C01 | 0.840 | 0.957 | 0.871 | 0.253 | 0.917 | 0.673 | 0.932 | 0.374 | 0.374 | 0.012 | ... | 0.050 | 0.037 | 0.040 | 0.234 | 0.355 | 0.177 | 0.052 | 0.029 | 0.055 | 0.099 |
9996247054_R03C02 | 0.864 | 0.963 | 0.864 | 0.194 | 0.885 | 0.682 | 0.929 | 0.393 | 0.418 | 0.019 | ... | 0.047 | 0.048 | 0.029 | 0.155 | 0.283 | 0.187 | 0.065 | 0.037 | 0.086 | 0.180 |
9996247055_R03C01 | 0.817 | 0.956 | 0.842 | 0.292 | 0.887 | 0.671 | 0.903 | 0.423 | 0.490 | 0.014 | ... | 0.037 | 0.042 | 0.026 | 0.191 | 0.395 | 0.248 | 0.057 | 0.031 | 0.067 | 0.132 |
9996247055_R03C02 | 0.801 | 0.965 | 0.869 | 0.357 | 0.893 | 0.647 | 0.962 | 0.340 | 0.467 | 0.013 | ... | 0.046 | 0.045 | 0.025 | 0.157 | 0.384 | 0.143 | 0.059 | 0.024 | 0.055 | 0.097 |
9996247056_R05C02 | 0.837 | 0.976 | 0.945 | 0.244 | 0.957 | 0.759 | 0.927 | 0.402 | 0.454 | 0.013 | ... | 0.049 | 0.030 | 0.025 | 0.052 | 0.123 | 0.046 | 0.045 | 0.026 | 0.065 | 0.063 |
121 rows × 485512 columns
Load p-values in a dataframe¶
This is reading in the pOOBAH values to a dataframe, and should have the same dimensions as the betas dataframe. Each cell in this dataframe is a p-value for each probe for a specific sample. If a p-value is >=0.05
, then it’s more likely that that specific probe for that sample failed. A failed probe means that the true probes signal is not istiguishable from the background fluorescence.
[19]:
p = pd.read_pickle('data/GPL13534/poobah_values.pkl').T
#p.index.name = 'Samples'
print(p.shape)
assert p.shape == betas_nocontrol.shape
print(f'Number of p-values >= 0.05: {(p>=0.05).sum().sum()}')
p = p.T[p.index.sort_values()].T
p.head()
(121, 485512)
Number of p-values >= 0.05: 1546688
[19]:
IlmnID | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | ... | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.94260649R | ch.X.967194F | ch.X.97129969R | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100946230055_R04C01 | 0.003 | 0.000 | 0.002 | 0.038 | 0.002 | 0.057 | 0.0 | 0.002 | 0.001 | 0.0 | ... | NaN | 0.006 | 0.004 | 0.073 | NaN | 0.037 | NaN | 0.001 | NaN | NaN |
100946230056_R04C01 | 0.004 | 0.000 | 0.002 | 0.026 | 0.001 | 0.082 | 0.0 | 0.001 | 0.001 | 0.0 | ... | NaN | 0.004 | 0.003 | 0.038 | NaN | 0.035 | NaN | 0.001 | NaN | NaN |
100946230056_R04C02 | 0.018 | 0.001 | 0.028 | 0.096 | 0.007 | 0.130 | 0.0 | 0.005 | 0.002 | 0.0 | ... | NaN | 0.010 | 0.006 | 0.127 | NaN | 0.082 | NaN | 0.001 | NaN | NaN |
101032570143_R04C02 | 0.004 | 0.001 | 0.004 | 0.054 | 0.002 | 0.045 | 0.0 | 0.001 | 0.001 | 0.0 | ... | NaN | 0.025 | 0.014 | 0.241 | NaN | 0.535 | NaN | 0.003 | NaN | NaN |
101032570152_R04C01 | 0.002 | 0.000 | 0.002 | 0.023 | 0.001 | 0.018 | 0.0 | 0.002 | 0.001 | 0.0 | ... | NaN | 0.014 | 0.006 | 0.086 | NaN | 0.117 | NaN | 0.003 | NaN | NaN |
5 rows × 485512 columns
Mask Beta values where probe fails¶
When the p-value of a probe for a specific sample >=0.05
, it is more likely that the probe has failed, which means that the beta value for that probe may not be accurate. Because of this, it is a good idea to mask these beta values with a NULL
value.
[24]:
cutoff = 0.05
betas_filtered = betas_nocontrol.mask((p>=cutoff), np.nan)
print(betas_filtered.shape)
print(f'Masked {betas_filtered.isna().sum().sum() - betas_nocontrol.isna().sum().sum()} beta values')
betas_filtered
(121, 485512)
Masked 1546688 beta values
[24]:
IlmnID | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | ... | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.94260649R | ch.X.967194F | ch.X.97129969R | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100946230055_R04C01 | 0.864 | 0.971 | 0.925 | 0.288 | 0.935 | NaN | 0.945 | 0.378 | 0.468 | 0.010 | ... | 0.046 | 0.036 | 0.031 | NaN | 0.150 | 0.108 | 0.076 | 0.022 | 0.063 | 0.078 |
100946230056_R04C01 | 0.854 | 0.978 | 0.930 | 0.215 | 0.932 | NaN | 0.971 | 0.421 | 0.397 | 0.012 | ... | 0.034 | 0.040 | 0.034 | 0.105 | 0.149 | 0.094 | 0.059 | 0.019 | 0.058 | 0.057 |
100946230056_R04C02 | 0.879 | 0.958 | 0.866 | NaN | 0.899 | NaN | 0.971 | 0.191 | 0.560 | 0.012 | ... | 0.038 | 0.057 | 0.045 | NaN | 0.232 | NaN | 0.090 | 0.022 | 0.073 | 0.128 |
101032570143_R04C02 | 0.837 | 0.968 | 0.911 | NaN | 0.918 | 0.765 | 0.951 | 0.435 | 0.456 | 0.011 | ... | 0.034 | 0.098 | 0.058 | NaN | 0.392 | NaN | 0.206 | 0.032 | 0.070 | 0.142 |
101032570152_R04C01 | 0.813 | 0.971 | 0.928 | 0.164 | 0.934 | 0.810 | 0.955 | 0.358 | 0.372 | 0.011 | ... | 0.044 | 0.058 | 0.036 | NaN | 0.350 | NaN | 0.137 | 0.031 | 0.061 | 0.152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9996247054_R03C01 | 0.840 | 0.957 | 0.871 | 0.253 | 0.917 | 0.673 | 0.932 | 0.374 | 0.374 | 0.012 | ... | 0.050 | 0.037 | 0.040 | NaN | 0.355 | NaN | 0.052 | 0.029 | 0.055 | 0.099 |
9996247054_R03C02 | 0.864 | 0.963 | 0.864 | 0.194 | 0.885 | NaN | 0.929 | 0.393 | 0.418 | 0.019 | ... | 0.047 | 0.048 | 0.029 | 0.155 | 0.283 | NaN | 0.065 | 0.037 | 0.086 | 0.180 |
9996247055_R03C01 | 0.817 | 0.956 | 0.842 | NaN | 0.887 | 0.671 | 0.903 | 0.423 | 0.490 | 0.014 | ... | 0.037 | 0.042 | 0.026 | NaN | 0.395 | NaN | 0.057 | 0.031 | 0.067 | 0.132 |
9996247055_R03C02 | 0.801 | 0.965 | 0.869 | NaN | 0.893 | NaN | 0.962 | 0.340 | 0.467 | 0.013 | ... | 0.046 | 0.045 | 0.025 | 0.157 | 0.384 | 0.143 | 0.059 | 0.024 | 0.055 | 0.097 |
9996247056_R05C02 | 0.837 | 0.976 | 0.945 | 0.244 | 0.957 | 0.759 | 0.927 | 0.402 | 0.454 | 0.013 | ... | 0.049 | 0.030 | 0.025 | 0.052 | 0.123 | 0.046 | 0.045 | 0.026 | 0.065 | 0.063 |
121 rows × 485512 columns
Start here if you have already masked your beta values based on p-values or had that done automatically¶
Remove Samples based on Percent or Number of Failed Probes¶
[40]:
percent_cutoff = 0.2 #use a percent in decimal format (20% = 0.2)
qc_betas = betas_filtered[~((betas_filtered.T.isna().sum() / betas_filtered.shape[1]) > 0.2)]
#if you want to remove samples based off a number threshold rather than a percentage, use the following 2 lines:
#number_cutoff = 20000
#qc_betas = betas_filtered[~(betas_filtered.T.isna().sum() >= number_cutoff)]
print(f'{betas_filtered.shape[0] - qc_betas.shape[0]} sample(s) removed because of pOOBAH failure')
print(f'Sample(s) removed: {set(betas_filtered.index) - set(qc_betas.index)}')
print(qc_betas.shape)
qc_betas
1 sample(s) removed because of poobah failure
Sample(s) removed: {'101032570169_R04C02'}
(120, 485512)
[40]:
IlmnID | cg00000029 | cg00000108 | cg00000109 | cg00000165 | cg00000236 | cg00000289 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | ... | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.94260649R | ch.X.967194F | ch.X.97129969R | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100946230055_R04C01 | 0.864 | 0.971 | 0.925 | 0.288 | 0.935 | NaN | 0.945 | 0.378 | 0.468 | 0.010 | ... | 0.046 | 0.036 | 0.031 | NaN | 0.150 | 0.108 | 0.076 | 0.022 | 0.063 | 0.078 |
100946230056_R04C01 | 0.854 | 0.978 | 0.930 | 0.215 | 0.932 | NaN | 0.971 | 0.421 | 0.397 | 0.012 | ... | 0.034 | 0.040 | 0.034 | 0.105 | 0.149 | 0.094 | 0.059 | 0.019 | 0.058 | 0.057 |
100946230056_R04C02 | 0.879 | 0.958 | 0.866 | NaN | 0.899 | NaN | 0.971 | 0.191 | 0.560 | 0.012 | ... | 0.038 | 0.057 | 0.045 | NaN | 0.232 | NaN | 0.090 | 0.022 | 0.073 | 0.128 |
101032570143_R04C02 | 0.837 | 0.968 | 0.911 | NaN | 0.918 | 0.765 | 0.951 | 0.435 | 0.456 | 0.011 | ... | 0.034 | 0.098 | 0.058 | NaN | 0.392 | NaN | 0.206 | 0.032 | 0.070 | 0.142 |
101032570152_R04C01 | 0.813 | 0.971 | 0.928 | 0.164 | 0.934 | 0.810 | 0.955 | 0.358 | 0.372 | 0.011 | ... | 0.044 | 0.058 | 0.036 | NaN | 0.350 | NaN | 0.137 | 0.031 | 0.061 | 0.152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9996247054_R03C01 | 0.840 | 0.957 | 0.871 | 0.253 | 0.917 | 0.673 | 0.932 | 0.374 | 0.374 | 0.012 | ... | 0.050 | 0.037 | 0.040 | NaN | 0.355 | NaN | 0.052 | 0.029 | 0.055 | 0.099 |
9996247054_R03C02 | 0.864 | 0.963 | 0.864 | 0.194 | 0.885 | NaN | 0.929 | 0.393 | 0.418 | 0.019 | ... | 0.047 | 0.048 | 0.029 | 0.155 | 0.283 | NaN | 0.065 | 0.037 | 0.086 | 0.180 |
9996247055_R03C01 | 0.817 | 0.956 | 0.842 | NaN | 0.887 | 0.671 | 0.903 | 0.423 | 0.490 | 0.014 | ... | 0.037 | 0.042 | 0.026 | NaN | 0.395 | NaN | 0.057 | 0.031 | 0.067 | 0.132 |
9996247055_R03C02 | 0.801 | 0.965 | 0.869 | NaN | 0.893 | NaN | 0.962 | 0.340 | 0.467 | 0.013 | ... | 0.046 | 0.045 | 0.025 | 0.157 | 0.384 | 0.143 | 0.059 | 0.024 | 0.055 | 0.097 |
9996247056_R05C02 | 0.837 | 0.976 | 0.945 | 0.244 | 0.957 | 0.759 | 0.927 | 0.402 | 0.454 | 0.013 | ... | 0.049 | 0.030 | 0.025 | 0.052 | 0.123 | 0.046 | 0.045 | 0.026 | 0.065 | 0.063 |
120 rows × 485512 columns
Drop out Probes with a Percentage of NaNs¶
If you want to drop the probes with either all NaNs or a percentage of NaNs, use this code below. However, there are some scenarios where you will have to add back those probe columns, so only use this step if you have to.
[46]:
threshold = 0.95
final_betas = qc_betas.dropna(axis=1, thresh = int(threshold*qc_betas.shape[0]))
print(f'{qc_betas.shape[1] - final_betas.shape[1]} probe(s) removed because of NaNs')
#print(f'Sample(s) removed: {set(qc_betas.columns) - set(final_betas.columns)}') #could be a long output
final_betas
40600 probe(s) removed because of NaNs
[46]:
IlmnID | cg00000029 | cg00000108 | cg00000236 | cg00000292 | cg00000321 | cg00000363 | cg00000622 | cg00000658 | cg00000714 | cg00000721 | ... | ch.X.92543860F | ch.X.92554290F | ch.X.93511680F | ch.X.938089F | ch.X.94051109R | ch.X.967194F | ch.X.97133160R | ch.X.97651759F | ch.X.97737721F | ch.X.98007042R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100946230055_R04C01 | 0.864 | 0.971 | 0.935 | 0.945 | 0.378 | 0.468 | 0.010 | 0.862 | 0.249 | 0.936 | ... | 0.026 | 0.023 | 0.046 | 0.036 | 0.031 | 0.150 | 0.076 | 0.022 | 0.063 | 0.078 |
100946230056_R04C01 | 0.854 | 0.978 | 0.932 | 0.971 | 0.421 | 0.397 | 0.012 | 0.907 | 0.278 | 0.954 | ... | 0.022 | 0.023 | 0.034 | 0.040 | 0.034 | 0.149 | 0.059 | 0.019 | 0.058 | 0.057 |
100946230056_R04C02 | 0.879 | 0.958 | 0.899 | 0.971 | 0.191 | 0.560 | 0.012 | 0.871 | 0.235 | 0.923 | ... | 0.029 | 0.022 | 0.038 | 0.057 | 0.045 | 0.232 | 0.090 | 0.022 | 0.073 | 0.128 |
101032570143_R04C02 | 0.837 | 0.968 | 0.918 | 0.951 | 0.435 | 0.456 | 0.011 | 0.912 | 0.343 | 0.938 | ... | 0.033 | 0.038 | 0.034 | 0.098 | 0.058 | 0.392 | 0.206 | 0.032 | 0.070 | 0.142 |
101032570152_R04C01 | 0.813 | 0.971 | 0.934 | 0.955 | 0.358 | 0.372 | 0.011 | 0.835 | 0.333 | 0.957 | ... | 0.027 | 0.032 | 0.044 | 0.058 | 0.036 | 0.350 | 0.137 | 0.031 | 0.061 | 0.152 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9996247054_R03C01 | 0.840 | 0.957 | 0.917 | 0.932 | 0.374 | 0.374 | 0.012 | 0.904 | 0.411 | 0.923 | ... | 0.030 | 0.028 | 0.050 | 0.037 | 0.040 | 0.355 | 0.052 | 0.029 | 0.055 | 0.099 |
9996247054_R03C02 | 0.864 | 0.963 | 0.885 | 0.929 | 0.393 | 0.418 | 0.019 | 0.884 | 0.335 | 0.927 | ... | 0.033 | 0.037 | 0.047 | 0.048 | 0.029 | 0.283 | 0.065 | 0.037 | 0.086 | 0.180 |
9996247055_R03C01 | 0.817 | 0.956 | 0.887 | 0.903 | 0.423 | 0.490 | 0.014 | 0.871 | 0.391 | 0.894 | ... | 0.026 | 0.032 | 0.037 | 0.042 | 0.026 | 0.395 | 0.057 | 0.031 | 0.067 | 0.132 |
9996247055_R03C02 | 0.801 | 0.965 | 0.893 | 0.962 | 0.340 | 0.467 | 0.013 | 0.892 | 0.309 | 0.932 | ... | 0.033 | 0.028 | 0.046 | 0.045 | 0.025 | 0.384 | 0.059 | 0.024 | 0.055 | 0.097 |
9996247056_R05C02 | 0.837 | 0.976 | 0.957 | 0.927 | 0.402 | 0.454 | 0.013 | 0.912 | 0.299 | 0.945 | ... | 0.029 | 0.030 | 0.049 | 0.030 | 0.025 | 0.123 | 0.045 | 0.026 | 0.065 | 0.063 |
120 rows × 444912 columns
Another way to tell if your sample is bad is to predict the sex of your samples and compare the predicted sex to the actual sex, if that information is available. If the predicted sex does not match the actual sex, this is an indicator that the sample needs to be investigated further, and could potentially be removed.
If you are planning on using your beta values for a machine learning model, you may want to filter out the sex probes to get rid of any sex bias in your model.