Filtering problem probes

Probes can be filtered by exclusion criteria, by publication source, or because they are sex-linked or array controls.

[1]:
%load_ext autoreload
%autoreload 2

## THE HARD WAY -- when this hasn't been pip installed yet.
#import methylprep and methylcheck -- adjust paths relative to this folder.
import os
print(os.getcwd())

import sys
methylcheck_path = os.path.abspath(os.path.join('..'))
if methylcheck_path not in sys.path:
    sys.path.insert(0,methylcheck_path)
import methylcheck

methylprep_path = os.path.abspath(os.path.join('../../methylprep'))
if methylprep_path not in sys.path:
    sys.path.append(methylprep_path)
import methylprep

# ignore warnings for now
import warnings
warnings.filterwarnings('ignore')
print(methylcheck.__path__, methylprep.__path__)

## THE EASY WAY -- when both packages have been pip installed.
# import methylcheck
# import methylprep
# dir()
/Users/mmaxmeister/legx/methylcheck/docs
['/Users/mmaxmeister/legx/methylcheck/methylcheck'] ['/Users/mmaxmeister/legx/methylprep/methylprep']

All available probe exclusion lists

[2]:
criteria = ['Chen2013', 'Price2013', 'Naeem2014', 'DacaRoszak2015',
            'Polymorphism', 'CrossHybridization', 'BaseColorChange', 'RepeatSequenceElements']
EPIC_criteria = ['McCartney2016', 'Zhou2016', 'Polymorphism', 'CrossHybridization', 'BaseColorChange', 'RepeatSequenceElements']

print('450k')
for crit in criteria:
    print(crit, len(methylcheck.list_problem_probes('450k', [crit])))
print('EPIC')
for crit in EPIC_criteria:
    print(crit, len(methylcheck.list_problem_probes('EPIC', [crit])))

450k
Chen2013 265410
Price2013 213246
Naeem2014 128695
DacaRoszak2015 89678
Polymorphism 289952
CrossHybridization 92524
BaseColorChange 359
RepeatSequenceElements 96631
EPIC
McCartney2016 326267
Zhou2016 178671
Polymorphism 346033
CrossHybridization 108172
BaseColorChange 406
RepeatSequenceElements 0
[27]:
# read in the sample sheet for the experiment
baseDir = "example_data/GSE69852/"
# generate a dataframe of beta values for these samples with pipeline.
df = methylprep.run_pipeline(baseDir, betas=True)
100%|██████████| 6/6 [00:59<00:00,  9.84s/it]
[28]:
#import importlib
#importlib.reload(methylcheck)
df.head()
[28]:
            9247377093_R02C01  9247377093_R03C01  9247377093_R06C02  9247377085_R04C02  9247377093_R05C01  9247377093_R02C02
IlmnID
cg00035864           0.236234           0.287561           0.318016           0.308176           0.239339           0.161795
cg00061679           0.427194           0.395514           0.456510           0.525169           0.523010           0.549533
cg00063477           0.929039           0.927137           0.940222           0.932739           0.930215           0.931468
cg00121626           0.481058           0.357316           0.328793           0.330045           0.403873           0.313132
cg00223952           0.044029           0.040062           0.038420           0.022201           0.027155           0.022284
[32]:
sketchy_probes_list = methylcheck.list_problem_probes('450k', ['Chen2013','Polymorphism'])
df2 = methylcheck.exclude_probes(df, sketchy_probes_list)
methylcheck.mean_beta_compare(df,df2)
Of 485512 probes, 290858 matched, yielding 194654 probes after filtering.
../_images/docs_filtering_probes_6_1.png
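
The matched count reported above (290858) is far smaller than the sum of the two individual list sizes (265410 + 289952), which is consistent with the lists overlapping heavily and each probe being counted once. The following is a minimal sketch of that idea using plain Python sets. The probe IDs are made up, and `combine_lists` and `drop_probes` are hypothetical helpers for illustration, not the methylcheck implementation.

```python
# Hypothetical exclusion lists keyed by criterion; the probe IDs are made up.
exclusion_lists = {
    "Chen2013": {"cg0001", "cg0002", "cg0003"},
    "Polymorphism": {"cg0002", "cg0003", "cg0004"},
}


def combine_lists(lists, criteria):
    """Return the union of the exclusion lists named in `criteria`."""
    combined = set()
    for name in criteria:
        combined |= lists[name]
    return combined


def drop_probes(betas, problem_probes):
    """Return a copy of `betas` (probe ID -> beta value) without the problem probes."""
    return {probe: value for probe, value in betas.items()
            if probe not in problem_probes}


# Toy beta values standing in for one sample column of the DataFrame.
betas = {"cg0001": 0.24, "cg0004": 0.43, "cg0005": 0.93, "cg0006": 0.48}

combined = combine_lists(exclusion_lists, ["Chen2013", "Polymorphism"])
filtered = drop_probes(betas, combined)
matched = len(betas) - len(filtered)

print(f"Combined list size: {len(combined)} (sum of list sizes would be 6)")
print(f"Of {len(betas)} probes, {matched} matched, "
      f"yielding {len(filtered)} probes after filtering.")
```

Only probes that are present in both the data and the combined list count as matched, so the matched count can be smaller than the list itself.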

Be careful: you can apply an EPIC probe list to a 450k dataset and the call will succeed, but the resulting filtering will not be appropriate for that array.

[34]:
sketchy_probes_list = methylcheck.list_problem_probes('EPIC', ['McCartney2016'])
df2 = methylcheck.exclude_probes(df, sketchy_probes_list)
methylcheck.mean_beta_compare(df,df2)
Of 485512 probes, 151418 matched, yielding 334094 probes after filtering.
../_images/docs_filtering_probes_8_1.png
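
One way to catch this kind of mismatch is to check how much of the exclusion list is actually present in the data before filtering. In the run above, the McCartney2016 list has 326267 entries but only 151418 matched the 450k data, which is under half of the list. The sketch below is an assumption-laden illustration: `list_coverage` and `check_list_fits` are hypothetical helpers, the probe IDs are made up, and the 0.9 threshold is an arbitrary choice rather than a methylcheck default.

```python
def list_coverage(dataset_probes, problem_probes):
    """Fraction of the exclusion list that is present in the dataset."""
    problem = set(problem_probes)
    if not problem:
        return 0.0
    return len(problem & set(dataset_probes)) / len(problem)


def check_list_fits(dataset_probes, problem_probes, min_coverage=0.9):
    """Return True when most of the exclusion list exists in the dataset.

    `min_coverage` is an arbitrary illustrative threshold.
    """
    return list_coverage(dataset_probes, problem_probes) >= min_coverage


# Toy data: probe IDs from a dataset, one matching list, one mismatched list.
dataset_probes = {"cg01", "cg02", "cg03", "cg04"}
matching_list = {"cg01", "cg02"}
mismatched_list = {"cg01", "cg90", "cg91", "cg92"}

for name, probes in [("matching", matching_list), ("mismatched", mismatched_list)]:
    coverage = list_coverage(dataset_probes, probes)
    verdict = "ok" if check_list_fits(dataset_probes, probes) else "check array type"
    print(f"{name}: {coverage:.0%} of list found in data -> {verdict}")
```

With a real DataFrame, the dataset probe IDs would come from `df.index`.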
[35]:
## Maximum filtering happens by default (when no criteria are passed).
[36]:
sketchy_probes_list = methylcheck.list_problem_probes('450k')
df3 = methylcheck.exclude_probes(df, sketchy_probes_list)
methylcheck.mean_beta_compare(df,df3)
Of 485512 probes, 341057 matched, yielding 144455 probes after filtering.
../_images/docs_filtering_probes_10_1.png
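
To compare how aggressive each filtering run was, the retained fraction can be computed from the counts printed by the three `exclude_probes` calls in this notebook. This is a small arithmetic sketch; the labels and the `retained_percent` helper are illustrative names, and the numbers are copied from the outputs shown here.

```python
TOTAL_PROBES = 485512  # probes in the unfiltered 450k DataFrame, as reported above

# Matched (excluded) probe counts copied from the three exclude_probes() runs.
matched_counts = {
    "450k: Chen2013 + Polymorphism": 290858,
    "EPIC: McCartney2016 (mismatched array)": 151418,
    "450k: all criteria (default)": 341057,
}


def retained_percent(total, matched):
    """Percentage of probes left after removing the matched ones."""
    return 100.0 * (total - matched) / total


summary = {label: round(retained_percent(TOTAL_PROBES, n), 1)
           for label, n in matched_counts.items()}

for label, pct in summary.items():
    print(f"{label}: {pct}% of probes retained")
```

Default filtering keeps roughly 30% of the probes in this dataset, so it is worth confirming that the probes of interest survive before continuing with downstream analysis.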
[37]:
# density plot of the underlying samples after maximum filtering
methylcheck.beta_density_plot(df3)
6
../_images/docs_filtering_probes_11_1.png

Other filtering techniques, such as MDS and cumulative_sum, are demonstrated in other example notebooks.
