Quality Control Example

using methylprep and methylcheck and GSE69852 from the GEO database

Import modules

# add directory of function to path
import sys
sys.path.append("/Users/mmaxmeister/legx/methylprep/") # even though methylprep is pip installed, this is still necessary on my machine (MM)
import methylprep
%matplotlib inline
# not a module yet.
import os
import methylcheck

Importing methylprep data

# print(help(methylprep.run_pipeline))
# data_dir = "/Users/mmaxmeister/legx/methylprep/docs/example_data/GSE69852"

# obtain and gunzip raw data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100850&format=file first.
# ABOUT GSE100850: Epigenetic alterations detected in the genome of very young breast cancer patients:
#    identification of new biomarkers

### using methylprep, you would generate a sample output file called "beta_values.pkl" this way:
#data_dir = "/Users/mmaxmeister/GSE100850_RAW"
#df = methylprep.run_pipeline(data_dir, betas=True) # makes df into a betas sheet. (the consolidate function within it)

#In this example, we've already done that. So just load the data from disk, like this:
import pandas as pd
df = pd.read_pickle("docs/example_data/beta_values.pkl")

Filtering functions

# command line autodetects array type. Use this to get it / confirm it.
(865859, 39)
df = methylcheck.exclude_sex_control_probes(df, 'EPIC', verbose=True)
Array EPIC: Removed 19627 sex linked probes and 695 internal control probes from 39 samples. 846232 probes remaining.
Discrepancy between number of probes to exclude (20322) and number actually removed (19627): 695
It appears that your sample had no control probes, or that the control probe names didn't match the manifest (EPIC).
(846232, 39)
# if you pass in '450k' or 'EPIC' you'll get different types of filtering.
sketchy_probes = methylcheck.list_problem_probes('EPIC')
filtered = methylcheck.filters.exclude_probes(df, sketchy_probes)
Of 846232 probes, 381361 matched, yielding 464871 probes after filtering.

QC plots

# Shows probe beta values for each sample as a separate trace.
# This shows all samples as a single trace.

Visual filtering functions

# MDS filtering -- you can exclude outlier samples based on falling outside
# some standard deviation cutoff from the center of a 2D plot.
# This is an interactive function.
mds_filtered, mds_excluded = methylcheck.beta_mds_plot(df)
Your data needed to be transposed (df = df.transpose()).
(39, 846232)
Making sure that probes are in columns (the second number should be larger than the first).
Starting MDS fit_transform. this may take a while.
You can now remove outliers based on their transformed beta values
 falling outside a range, defined by the sample standard deviation.
Your acceptable value range: x=(-100.0 to 100.0), y=(-93.0 to 93.0).
Original samples (39, 2) vs filtered (30, 2)
Your scale factor was: 1.5
Enter new scale factor, <enter> to accept and save:
# adjust the cutoff value to exclude outliers. Default is 0.7.
df_outliers_removed = methylcheck.cumulative_sum_beta_distribution(mds_filtered, cutoff=0.5)
1it [00:00,  6.06it/s]
Calculating area under curve for each sample.
30it [00:03,  7.16it/s]
# effect of MDS filtering
methylcheck.mean_beta_compare(df, mds_filtered)
# effect of cumulative sum filtering
methylcheck.mean_beta_compare(df, df_outliers_removed, verbose=True)
Your data needed to be transposed (df = df.transpose()).
[ ]: