Quality Control Example¶
using methylprep and methylcheck and GSE69852 from the GEO database¶
# add directory of function to path import sys sys.path.append("/Users/mmaxmeister/legx/methylcheck/") sys.path.append("/Users/mmaxmeister/legx/methylprep/") # even though methylprep is pip installed, this is still necessary on my machine (MM) import methylprep %matplotlib inline
# not a module yet. import os print(os.getcwd()) os.chdir('/Users/mmaxmeister/legx/methylcheck') import methylcheck dir(methylcheck)
['DNA_mAge_Hannum', 'NullHandler', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'beta_density_plot', 'beta_mds_plot', 'cli', 'cumulative_sum_beta_distribution', 'detect_array', 'exclude_probes', 'exclude_sex_control_probes', 'filters', 'getLogger', 'list_problem_probes', 'mean_beta_compare', 'mean_beta_plot', 'postprocessQC']
Importing methylprep data¶
# print(help(methylprep.run_pipeline)) # data_dir = "/Users/mmaxmeister/legx/methylprep/docs/example_data/GSE69852" # SAMPLE DATA # obtain and gunzip raw data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100850&format=file first. # ABOUT GSE100850: Epigenetic alterations detected in the genome of very young breast cancer patients: # identification of new biomarkers ### using methylprep, you would generate a sample output file called "beta_values.pkl" this way: #data_dir = "/Users/mmaxmeister/GSE100850_RAW" #df = methylprep.run_pipeline(data_dir, betas=True) # makes df into a betas sheet. (the consolidate function within it) #In this example, we've already done that. So just load the data from disk, like this: import pandas as pd df = pd.read_pickle("docs/example_data/beta_values.pkl")
# command line autodetects array type. Use this to get it / confirm it. print(methylcheck.detect_array(df)) df.shape
df = methylcheck.exclude_sex_control_probes(df, 'EPIC', verbose=True) df.shape
Array EPIC: Removed 19627 sex linked probes and 695 internal control probes from 39 samples. 846232 probes remaining. Discrepancy between number of probes to exclude (20322) and number actually removed (19627): 695 It appears that your sample had no control probes, or that the control probe names didn't match the manifest (EPIC).
#print(help(methylcheck.list_problem_probes)) # if you pass in '450k' or 'EPIC' you'll get different types of filtering. sketchy_probes = methylcheck.list_problem_probes('EPIC') filtered = methylcheck.filters.exclude_probes(df, sketchy_probes)
Of 846232 probes, 381361 matched, yielding 464871 probes after filtering.
# Shows probe beta values for each sample as a separate trace. methylcheck.beta_density_plot(df)
# This shows all samples as a single trace. methylcheck.mean_beta_plot(df)
Visual filtering functions¶
# MDS filtering -- you can exclude outlier samples based on falling outside # some standard deviation cutoff from the center of a 2D plot. # This is an interactive function. mds_filtered, mds_excluded = methylcheck.beta_mds_plot(df)
Your data needed to be transposed (df = df.transpose()). (39, 846232) Making sure that probes are in columns (the second number should be larger than the first). Starting MDS fit_transform. this may take a while. You can now remove outliers based on their transformed beta values falling outside a range, defined by the sample standard deviation. Your acceptable value range: x=(-100.0 to 100.0), y=(-93.0 to 93.0).
Original samples (39, 2) vs filtered (30, 2) Your scale factor was: 1.5 Enter new scale factor, <enter> to accept and save:
# adjust the cutoff value to exclude outliers. Default is 0.7. df_outliers_removed = methylcheck.cumulative_sum_beta_distribution(mds_filtered, cutoff=0.5)
1it [00:00, 6.06it/s]
Calculating area under curve for each sample.
30it [00:03, 7.16it/s]
# effect of MDS filtering methylcheck.mean_beta_compare(df, mds_filtered)
# effect of cumulative sum filtering methylcheck.mean_beta_compare(df, df_outliers_removed, verbose=True)
Your data needed to be transposed (df = df.transpose()).