Quality Control Example

using methylprep and methylcheck and GSE69852 from the GEO database

Import modules

[1]:
# add directory of function to path
import sys
sys.path.append("/Users/mmaxmeister/legx/methylcheck/")
sys.path.append("/Users/mmaxmeister/legx/methylprep/") # even though methylprep is pip installed, this is still necessary on my machine (MM)
import methylprep
%matplotlib inline
[2]:
# not a module yet.
import os
print(os.getcwd())
os.chdir('/Users/mmaxmeister/legx/methylcheck')
import methylcheck
dir(methylcheck)
/Users/mmaxmeister/legx/methylcheck/docs
[2]:
['DNA_mAge_Hannum',
 'NullHandler',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'beta_density_plot',
 'beta_mds_plot',
 'cli',
 'cumulative_sum_beta_distribution',
 'detect_array',
 'exclude_probes',
 'exclude_sex_control_probes',
 'filters',
 'getLogger',
 'list_problem_probes',
 'mean_beta_compare',
 'mean_beta_plot',
 'postprocessQC']

Importing methylprep data

[5]:
# print(help(methylprep.run_pipeline))
# data_dir = "/Users/mmaxmeister/legx/methylprep/docs/example_data/GSE69852"

# SAMPLE DATA
# obtain and gunzip raw data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100850&format=file first.
# ABOUT GSE100850: Epigenetic alterations detected in the genome of very young breast cancer patients:
#    identification of new biomarkers

### using methylprep, you would generate a sample output file called "beta_values.pkl" this way:
#data_dir = "/Users/mmaxmeister/GSE100850_RAW"
#df = methylprep.run_pipeline(data_dir, betas=True) # makes df into a betas sheet. (the consolidate function within it)

#In this example, we've already done that. So just load the data from disk, like this:
import pandas as pd
df = pd.read_pickle("docs/example_data/beta_values.pkl")

Filtering functions

[6]:
# command line autodetects array type. Use this to get it / confirm it.
print(methylcheck.detect_array(df))
df.shape
EPIC
[6]:
(865859, 39)
[7]:
df = methylcheck.exclude_sex_control_probes(df, 'EPIC', verbose=True)
df.shape
Array EPIC: Removed 19627 sex linked probes and 695 internal control probes from 39 samples. 846232 probes remaining.
Discrepancy between number of probes to exclude (20322) and number actually removed (19627): 695
It appears that your sample had no control probes, or that the control probe names didn't match the manifest (EPIC).
[7]:
(846232, 39)
[8]:
#print(help(methylcheck.list_problem_probes))
# if you pass in '450k' or 'EPIC' you'll get different types of filtering.
sketchy_probes = methylcheck.list_problem_probes('EPIC')
filtered = methylcheck.filters.exclude_probes(df, sketchy_probes)
Of 846232 probes, 381361 matched, yielding 464871 probes after filtering.

QC plots

[9]:
# Shows probe beta values for each sample as a separate trace.
methylcheck.beta_density_plot(df)
39
../_images/docs_methylprep_methylcheck_example_10_1.png
[10]:
# This shows all samples as a single trace.
methylcheck.mean_beta_plot(df)
../_images/docs_methylprep_methylcheck_example_11_0.png

Visual filtering functions

[10]:
# MDS filtering -- you can exclude outlier samples based on falling outside
# some standard deviation cutoff from the center of a 2D plot.
# This is an interactive function.
mds_filtered, mds_excluded = methylcheck.beta_mds_plot(df)
Your data needed to be transposed (df = df.transpose()).
(39, 846232)
Making sure that probes are in columns (the second number should be larger than the first).
Starting MDS fit_transform. this may take a while.
You can now remove outliers based on their transformed beta values
 falling outside a range, defined by the sample standard deviation.
Your acceptable value range: x=(-100.0 to 100.0), y=(-93.0 to 93.0).
../_images/docs_methylprep_methylcheck_example_13_1.png
Original samples (39, 2) vs filtered (30, 2)
Your scale factor was: 1.5
Enter new scale factor, <enter> to accept and save:
[18]:
# adjust the cutoff value to exclude outliers. Default is 0.7.
df_outliers_removed = methylcheck.cumulative_sum_beta_distribution(mds_filtered, cutoff=0.5)
1it [00:00,  6.06it/s]
Calculating area under curve for each sample.
30it [00:03,  7.16it/s]
../_images/docs_methylprep_methylcheck_example_14_3.png
[19]:
# effect of MDS filtering
methylcheck.mean_beta_compare(df, mds_filtered)
../_images/docs_methylprep_methylcheck_example_15_0.png
[20]:
# effect of cumulative sum filtering
methylcheck.mean_beta_compare(df, df_outliers_removed, verbose=True)
Your data needed to be transposed (df = df.transpose()).
../_images/docs_methylprep_methylcheck_example_16_1.png
[ ]: