# API Reference¶

 methylcheck.cli methylcheck.run_pipeline(df, **kwargs) Run a variety of probe and sample filters in tandem, then plot results methylcheck.run_qc(path) Generates all QC plots for a dataset in the path provided. methylcheck.read_geo(filepath[, verbose, …]) Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values methylcheck.load([filepath, format, …]) Methylsuite’s all-purpose data loading function. methylcheck.load_both([filepath, format, …]) Creates and returns TWO objects (data and meta_data) from the given filepath. methylcheck.qc_signal_intensity([…]) Suggests sample outliers based on methylated and unmethylated signal intensity. methylcheck.plot_M_vs_U([…]) plot methylated vs unmethylated probe intensities methylcheck.plot_controls([path, subset, …]) internal array QC controls (available with the –save_control or –all methylprep process option) methylcheck.plot_beta_by_type(beta_df[, …]) compare betas for type I and II probes – (inspired by the plotBetasByType() function) methylcheck.probes methylcheck.list_problem_probes(array[, …]) Function to create a list of probes to exclude from downstream processes. methylcheck.exclude_probes(df, probe_list) Exclude probes from a dataframe of sample beta values. methylcheck.exclude_sex_control_probes(df, array) Exclude probes from an array, and return a filtered array. methylcheck.drop_nan_probes(df[, silent, …]) accounts for df shape (probes in rows or cols) so dropna() will work. methylcheck.samples methylcheck.sample_plot(df, **kwargs) A more intuitive alias of beta_density_plot(), since not all values are beta distributions. methylcheck.beta_density_plot(df[, verbose, …]) Returns a plot of beta values for each sample in a batch of samples as a separate line. methylcheck.mean_beta_plot(df[, verbose, …]) Returns a plot of the average beta values for all probes in a batch of samples. methylcheck.mean_beta_compare(df1, df2[, …]) Use this function to compare two dataframes, pre-vs-post filtering and removal of outliers. methylcheck.beta_mds_plot(df[, …]) Performs multidimensional scaling on a dataframe of samples methylcheck.combine_mds(*args, **kwargs) To combine (or segment) datasets for multidimensional scaling analysis methylcheck.cumulative_sum_beta_distribution(df) Attempts to filter outlier samples based on the cumulative area under the curve exceeding a reasonable value (cutoff). methylcheck.predict methylcheck.get_sex(data_source[, …]) This will calculate and predict the sex of each sample. methylcheck.assign(df[, two_pass]) Manually and interactively assign each sample to a group, based on beta-value distribution shape.

methylcheck.load_processed.load(filepath='.', format='beta_value', file_stem='', verbose=False, silent=False, column_names=None, no_poobah=False, pval_cutoff=0.05, no_filter=True)

When methylprep processes large datasets, you use the ‘batch_size’ option to keep memory and file size more manageable. Use the load helper function to quickly load and combine all of those parts into a single data frame of beta-values or m-values.

Doing this with pandas is about 8 times slower than using numpy in the intermediate step.

If no arguments are supplied, it will load all files in current directory that have a ‘beta_values_X.pkl’ pattern.

Arguments:
filepath:
Where to look for all the pickle files of processed data.
format: (‘beta_value’, ‘m_value’, ‘meth’, ‘meth_df’, ‘noob_df’, ‘beta_csv’, ‘sesame’)

This also allows processed.csv file data to be loaded. If you need meth and unmeth values, choose ‘meth’ and it will return a data_containers object with the ‘meth’ and ‘unmeth’ values, exactly like the data_containers object returned by methylprep.run_pipeline.

If you choose ‘meth_df’ or ‘noob_df’ it will load the pickled meth and unmeth dataframes from the folder specified.

column_names:
if your csv files contain column names that differ from those expected, you can specify them as a list of strings by default it looks for [‘noob_meth’, ‘noob_unmeth’] or [‘meth’, ‘unmeth’] or [‘beta_value’] or [‘m_value’] Note: if you csv data has probe names in a column that is not the FIRST column, or is not named “IlmnID”, you should specify it with column_names and put it first in the list, like [‘illumina_id’, ‘noob_meth’, ‘noob_umeth’].
no_poobah:
if loading from CSVs, and there is a column for probe p-values (the poobah_pval column), the default is to filter out probes that fail the p < 0.05 cutoff. if you specify ‘no_poobah’=True, it will load everything, regardless of p-values.
pval_cutoff:
if applying poobah (pvalue probe detection based on poor signal to noise) this specifies the threashold for cutoff (0.05 by default)
no_filter: (default = True)
if False, removes probes that illumina, the manufacturer, claimed are sketchy in 2019 for a select list of newer EPIC Sentrix_IDs. only affects ‘beta_value’ and ‘m_value’ output; no effect on meth/unmeth raw/NOOB intensity values returned.
file_stem: (string)
Older versions (pre v1.3.0) of methylprep processed with batch_size created a bunch of generically named files, such as ‘beta_values_1.pkl’, ‘beta_values_2.pkl’, ‘beta_values_3.pkl’, and so on. IF you rename these or provide a custom name during processing, provide that name here to load them all. (i.e. if your pickle file is called ‘GSE150999_beta_values_X.pkl’, then your file_stem is ‘GSE150999_’)
verbose:
outputs more processing messages.
silent:
suppresses all processing messages, even warnings.
Use cases and format:
format = beta_value:
you have beta_values.pkl file in the path specified and want a dataframe returned or you have a bunch of beta_values_1.pkl files in the path and want them merged and returned as one dataframe (when using ‘batch_size’ option in methylprep.run_pipeline() you’ll get multiple files saved)
format = m_value:
you have m_values.pkl file in the path specified and want a dataframe returned or you have a bunch of m_values_1.pkl files in the path and want them merged and returned as one dataframe
format = meth: (data_containers)
you have processed CSV files in the path specified and want a data_container returned
format = meth_df: (dataframe)
you have processed CSV files in the path specified and want a dataframe returned take the data_containers object returned and run methylcheck.container_to_pkl(containers, save=True) function on it.
format = noob_df: (dataframe)
loads noob_meth_values.pkl and noob_unmeth_values.pkl and returns two dataframes in a list
format = sesame:
for reading csvs processed using R’s sesame package. It has a different format (Probe_ID, ind_beta, ind_negs, ind_poob) per sample. Only those probes that pass the p-value cutoff will be included.
format = beta_csv:
for reading processed.csv files from methylprep, and forcing it NOT to load from the pickled beta dataframe file, if present.
format = poobah_csv:
similar to beta_csv, this pulls poobah p-values for all probes out of all processed CSV files into one dataframe. These p-values will include failed probes and probes that would be filterd by quality_mask. ‘poobah’ excludes these.
format = poobah:
reads the ‘poobah_values.pkl’ file and returns a dataframe of p-values. Note failed / poor-quality probes are replaced with NaN.

Note

Science on p-value cutoff:
This function defaults to a p-value cutoff of 0.05, which is typical for scientific tests. There is currently no consensus on what percent of a sample’s probes can fail. For example, if a sample has 860,000 probes and 5% of them fail, should you reject the whole sample from the batch? For large batch industrial scale testing, the authors assign some limit, like 5%, 10%, 20%, 30%, etc as a cutoff. And methylcheck’s run_qc() function defaults to 10 percent. But the academics we spoke to don’t automatically throw out any samples. Because it depends. Cancer samples have lots of anueploidy (an abnormal number of chromosomes in a haploid set) and lost chromosomes, so one would expect no signal for these CpG sites. So those researchers wouldn’t throw out samples unless most of the sample fails. People are working on deriving a calibration curve from public GEO data as a guide, and give a frame of reference, but none exist yet. And public data rarely includes failed samples.

Note

• modified this from methylprep on 2020-02-20 to allow for data_containers to be returned as option
• v0.6.3: added ‘no_filter’ step that automatically removes probes that illumina, the manufacturer, claims are sketchy for certain Catalog IDs. (Disable this with no_filter=True)
methylcheck.load_processed.load_both(filepath='.', format='beta_value', file_stem='', verbose=False, silent=False, column_names=None, rename_samples=False, sample_names='Sample_Name')

Creates and returns TWO objects (data and meta_data) from the given filepath. Confirms sample names match.

Returns TWO objects (data, meta) as dataframes for analysis. If meta_data files are found in multiple folders, it will read them all and try to match to the samples in the beta_values pickles by sample ID.
Arguments:
filepath:
Where to look for all the pickle files of processed data.
format:
‘beta_values’, ‘m_value’, or some other custom file pattern.
file_stem (string):
By default, methylprep process with batch_size creates a bunch of generically named files, such as ‘beta_values_1.pkl’, ‘beta_values_2.pkl’, ‘beta_values_3.pkl’, and so on. IF you rename these or provide a custom name during processing, provide that name here. (i.e. if your pickle file is called ‘GSE150999_beta_values_X.pkl’, then your file_stem is ‘GSE150999_’)
column_names:
if your processed csv files contain column names that differ from those expected, you can specify them as a list of strings by default it looks for [‘noob_meth’, ‘noob_unmeth’] or [‘meth’, ‘unmeth’] or [‘beta_value’] or [‘m_value’] Note: if you csv data has probe names in a column that is not the FIRST column, or is not named “IlmnID”, you should specify it with column_names and put it first in the list, like [‘illumina_id’, ‘noob_meth’, ‘noob_umeth’].
rename_samples:
if your meta_data contains a ‘Sample_Name’ column, the returned data and meta_data will have index and columns renamed to Sample_Names instead of Sample_IDs, respectively.
sample_name (string):
the column name to use in meta dataframe for sample names. Assumes ‘Sample_Name’ if unspecified.
verbose:
outputs more processing messages.
silent:
suppresses all processing messages, even warnings.
methylcheck.load_processed.container_to_pkl(containers, output='betas', save=True)

simple helper function to convert a list of SampleDataContainer objects to a df and pickle it.

save (True|False)
whether to save the data to disk, in the current directory
output (‘betas’|’m_value’|’meth’|’noob’|’copy_number’)
reads processed CSVs and consolidates into a single dataframe, with samples in columns and probes in rows: betas – saves ‘beta_values.pkl’ m_value – saves ‘m_values.pkl’ meth – saves uncorrected ‘meth_values.pkl’ and ‘unmeth_values.pkl’ noob – saves ‘meth_noob_values.pkl’ and ‘unmeth_noob_values.pkl’ copy_number – saves ‘copy_number_values.pkl’
this is for loading a bunch of processed csv files into containers, then into betas  import methylcheck as m files = '/Volumes/202761400007' containers = m.load(files, 'meth') df = m.load_processed.container_to_beta(containers) 
methylcheck.read_geo_processed.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values

in the CSV/TXT/XLSX file and turn it into a clean dataframe, with probe ids in the index/rows. Version 3 (introduced June 2020)

• looks for /d_RxxCxx patterned headings and an probe index
• sets index in df to probes
• sets columns to sample names
• forces probe values to be floats, if strings/mixed
• if filename has ‘intensit’ or ‘signal’ in it, this converts to betas and saves even if filename doesn’t match, if columns have Methylated in them, it will convert and save
• returns the usable dataframe

as_beta == True – converts meth/unmeth into a df of sample betas. column_pattern=None (Sample21 | Sample_21 | Sample 21) – some string of characters that precedes the number part of each sample in the columns of the file to be ingested.

FIXED:

[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth [x] BUG: can’t parse matrix_… files if uses underscores instead of spaces around sample numbers, or where sampleXXX has no separator. [x] handle processed files with sample_XX [x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False [x] pass in sample_column names from header parser so that logic is in one place

(makes the output much larger, so add kwarg to exclude this)

[x] demicals (default 3) – round all probe beta/intensity/p values returned to this number of decimal places. [x] bug: can only recognize beta samples if ‘sample’ in column name, or sentrix_id pattern matches columns.

need to expand this to handle arbitrary sample naming styles (limited to one column per sample patterns)
TODO:
[-] BUG: meth_unmeth_pval works as_beta but not returning full data yet [-] multiline header not working with all files yet. [-] _family GSM123456-tbl-1.txt files not detected yet
notes:
this makes inferences based on strings in the filename, and based on the column names.
methylcheck.read_geo_processed.detect_header_pattern(test, filename, return_sample_column_names=False)

test is a dataframe with first 100 rows of the data set, and all columns. makes all the assumptions easier to read in one place.

betas non-normalized matrix_processed matrix_signal series_matrix – read_series_matrix(file, detect_only=True) methylated_signal_intensities and unmethylated_signal_intensities _family

TODO: GSM12345-tbl-1.txt type files (in _family.tar.gz packages) are possible, but needs more work. TODO: combining two files with meth/unmeth values

• numbered samples handled differently from sample_ids in columns
• won’t detect columns with no separators in strings

## ReportPDF Report Builder class¶

class methylcheck.reports.qc_report.ReportPDF(**kwargs)

ReportPDF allows you to build custom QC reports.

To use:

• First, initialize the report and pass in kwargs, like myReport = ReportPDF(**kwargs)
• Next, run myReport.run_qc() to fill it in.
• Third, you must run myReport.pdf.close() after run_qc() to save the file to disk.
• You can supply kwargs to specify which QC plots to include - and supply a list of chart names to control the order of objects in the report. - if you pass in ‘order’ in kwargs, any page you omit in the order will be omitted from the final report. - You may pass in a custom table to override one of the built-in pages.
• include ‘path’ with the path to your processed pickle files.
• include an optional ‘outpath’ for where to save the pdf report.

kwargs:

• processing params - filename - poobah_min_percent (e.g. at least 80% of probes must pass for sample to pass) - pval_cutoff (e.g. set alpha at 0.05) - outpath - path - runme: Default is not to actually generate all the parts of PDF with report.run_qc() then report.pdf.close(), but setting this to True will do everything at once.

• front page text - title - author - subject - keywords

• if ‘debug=True’ is in kwargs, - then it will return a report without any parts that failed.

• tests - poobah: includes a table with each sample and percent of probes that passed p-value signal detection - gct: includes GCT scores (bisulfite conversion completeness) in poobah table - mds: performs multidimensional scaling to identify and report on sample outliers

• plots - beta_density_plot - M_vs_U (default False) - M_vs_U_compare (default False) – shows the effect of all processing steps vs raw intensity - qc_signal_intensity - controls (A battery of probe performance plots) - probe_types

• customizing plots - poobah_colormap (pass in the matplotlib colormap name to override the meta_mds default colormap)

This also overrides the default colormap used in M_vs_U plot.

• extend_poobah_range (Default: True will show 7 colors for poobah failure range on beta_mds_plot, max 30%; False will show only 5, max 20%)
• cutoff_line – False to disable cutoff line on qc_signal_intensity and M_vs_U plots
• appendix_fontsize (default 12 point) – specify an int for other fontsize

custom tables:

pass in arbitrary data using kwarg custom_tables as list of dictionaries with this structure:

python custom_tables=[ { 'title': "some title, optional", # NOTE: chart titles must be unique! 'col_names': ["<list of strings>"], 'row_names': ["<list of strings, optional>"], 'data': ["<list of lists, with order matching col_names>"], 'order_after': "string name of the plot this should come after. It cannot appear first in list.", 'font_size': <can be None, int, 'auto' (shrink to page), or 'truncate' (chop of long values to make fit)> }, {"<...second table here...>"} ] 

If ‘order_after’ is None, the custom table will be inserted at the beginning of the report. If there are multiple custom tables and all have ‘order_after’ set to None, the first table in the list gets inserted, then the next one, sequentially, so that the last table inserted will be the first table to appear.

Pre-processing pipeline:

Probe-level (w/explanations of suggested exclusions)
• Links to recommended probe exclusion lists/files/papers
• Background subtraction and normalization (‘noob’)
• Detection p-value (‘neg’ vs ‘oob’)
• Dye-bias correction (from SeSAMe)
Sample-level (w/explanations of suggested exclusions)
Detection p-value (% failed probes) - custom detection (% failed, of those in a user-defined-list supplied to function) MDS
Suggested for customer to do on their own
• Sex check
• Age check
• SNP check
__init__(**kwargs)

Initialize self. See help(type(self)) for accurate signature.

exec_summary()
QC exec summary
sample_name/ID probe % failures probe_failure pass auto-qc result (only if present in kwargs passed in, otherwise omitted) MDS pass signal intensity pass if any fails, fail it (so overall pass)
table 2: meta

array type (detect from data) number of samples (from data) processing pipeline version number (passed in) date processed (passed in) avg probe failure rate percent of samples that failed any failures from ‘control probes’

reqs a way to capture warnings of data-off-chart
open_error_buffer()

preparing a stream of log messages to add to report appendix.

page_of_paragraphs(para_list, pdf, line_height='double', fontsize=None)

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.text.html (0,0) is lower left; (1,1) is upper right.

This version estimates the size of each paragraph and moves the origin downward accordingly. Thsis is tricky because the anchors are lower left, not upper left.

It is ok if a paragraph contains whitespace line breaks, OR each paragraph is one long line to be wrapped here. Also - if a paragraph wraps, this accounts for it in total lines count, so everything fits on a page. The default fontsize is 12 if not specified.

page_of_text(text, pdf, fontsize=None)

text is a single big string of text, with whitespace for line breaks. https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.text.html (0,0) is lower left; (1,1) is upper right

parse_custom_tables(tables)

tables is a list of { ‘title’: “some title, optional”, ‘col_names’: [list of strings], ‘row_names’: [list of strings, optional], ‘data’: [list of lists, with order matching col_names], ‘order_after’: string name of the plot this should come after. It cannot appear first in list. }

to_table(list_of_lists, col_names, row_names=None, add_title='', font_size='auto')
• embeds a table in a PDF page.
• attempts to split long tables into multiple pages.
• should warn if table is too wide to fit.
• font_size: - auto: let matplotlib figure out the best size. This breaks if one column or value is soooooooper long. - truncate: IN FUTURE, chop off each value that is too long, to force it to fit on a page. Data loss. - INT: set the font for this table at this size (for control freaks) - None: default – use font size 12.

## Run QC pipeline¶

Run a variety of probe and sample filters in tandem, then plot results

by specifying all of your options at once, instead of running every part of methylcheck in piacemeal fashion.

this is analogous to using the methylcheck CLI, but for notebooks/scripts

required:
df: (required)
• data as a DataFrame of beta values (or DataFrame of m_values)
• sample names in columns and probes in rows
parameters:
verbose: (True/False)
default: False – shows extra info about processing if True
silent: (True/False)
default: False – suppresses all warnings/info
exclude_sex:
filters out probes on sex-chromosomes
exclude_control:
filters out illumina control probes
exclude_all:
filters out the most probes (sex-linked, control, and all sketchy-listed probes from papers)

exclude: (list of strings, shorthand references to papers with sketchy probes to exclude)

If the array is 450K the publications may include:
'Chen2013' 'Price2013' 'Zhou2016' 'Naeem2014' 'DacaRoszak2015'
If the array is EPIC the publications may include:
'Zhou2016' 'McCartney2016'
or these reasons:
'Polymorphism' 'CrossHybridization' 'BaseColorChange' 'RepeatSequenceElements'
or use 'exclude_all':
to do maximum filtering, including all of these papers’ lists.
plot: (list of strings)
[‘mean_beta_plot’, ‘beta_density_plot’, ‘cumulative_sum_beta_distribution’, ‘beta_mds_plot’, ‘all’] if ‘all’, then all of these plots will be generated. if omitted, no plots are created.
save_plots: (True|False)
default: False
export (True|False):
default: False – will export the filtered df as a pkl file if True
note:
this pipeline cannot also apply the array-level methylcheck.run_qc() function because that relies on additional probe information that may not be present. Everything in this pipeline applies to a dataframe of beta or m-values for a set of samples.
returns:
a filtered dataframe object

## filtering probes¶

methylcheck.probes.exclude_sex_control_probes(df, array, no_sex=True, no_control=True, verbose=False)

Exclude probes from an array, and return a filtered array.

df: dataframe of beta values or m-values. array: type of array used.

{‘27k’, ‘450k’, ‘EPIC’, ‘EPICPLUS’, ‘MOUSE’} or {‘IlluminaHumanMethylation27k’,’IlluminaHumanMethylation450k’,’IlluminaHumanMethylationEPIC’, ‘mouse’} or {‘27k’, ‘450k’, ‘epic’, ‘epic+’, ‘mouse’}
no_sex: bool
(default True) if True, will remove all probes that target X and Y chromosome locations, as they are sex specific – and lead to multiple clusters when trying to detect and remove outliers (noisy data).
no_control: bool
(default True) if True, removes Illumina’s internal control probes.
verbose: bool
(default False) reports out on number of probes removed.
a dataframe with samples removed.
methylcheck.probes.exclude_probes(df, probe_list)

Exclude probes from a dataframe of sample beta values. Use list_problem_probes() to obtain a list of probes (or pass in the names of ‘Criteria’ from problem probes), then pass that in as a probe_list along with the dataframe of beta values (array)

Resolves a problem whereby probe lists have basic names, but samples have additional meta data added. Example:

probe list
[‘cg24168924’, ‘cg15886294’, ‘cg05943251’, ‘cg05579622’, ‘cg01797553’, ‘cg14885690’, ‘cg12490816’, ‘cg02631583’, ‘cg17361593’, ‘cg15000031’, ‘cg21515494’, ‘cg17219246’, ‘cg10838001’, ‘cg13913475’, ‘cg00492169’, ‘cg20352786’, ‘cg05932698’, ‘cg06736139’, ‘cg08333283’, ‘cg10010298’, ‘cg25984048’, ‘cg27287823’, ‘cg19269713’, ‘cg12456833’, ‘cg26161708’, ‘cg04984052’, ‘cg00033806’, ‘cg23255774’, ‘cg10717379’, ‘cg00880984’, ‘cg01818617’, ‘cg18563133’, ‘cg15895341’, ‘cg08155050’, ‘cg06820286’, ‘cg04325909’, ‘cg15094920’, ‘cg08037129’, ‘cg11161730’, ‘cg06044537’, ‘cg11936560’, ‘cg12404870’, ‘cg12670496’, ‘cg01473643’, ‘cg08605930’, ‘cg16553354’, ‘cg22175254’, ‘cg22966295’, ‘cg07346931’, ‘cg06234741’]
sample probe names
Index([‘cg00000029_II_F_C_rep1_EPIC’, ‘cg00000103_II_F_C_rep1_EPIC’, ‘cg00000109_II_F_C_rep1_EPIC’, ‘cg00000155_II_F_C_rep1_EPIC’,

‘cg00000158_II_F_C_rep1_EPIC’, ‘cg00000165_II_R_C_rep1_EPIC’, ‘cg00000221_II_R_C_rep1_EPIC’, ‘cg00000236_II_R_C_rep1_EPIC’, … ‘ch.9.98957343R_II_R_O_rep1_EPIC’, ‘ch.9.98959675F_II_F_O_rep1_EPIC’, ‘ch.9.98989607R_II_R_O_rep1_EPIC’, ‘ch.9.991104F_II_F_O_rep1_EPIC’]

This chops off anything after the first underscore, and compares with probe_list to see if percent match increases. It then drops probes from array that match probe_list, at least partially.

ADDED: checking whether array.index is string or int type. Regardless, this should work and not alter the original index. ADDED v0.6.4: pass in a string like ‘illumina’ or ‘McCartney2016’ and it will fetch the list for you.

ref: https://bioconductor.org/packages/devel/bioc/vignettes/sesame/inst/doc/sesame.html#howwhy-probes-are-masked SESAME probe exclusion lists were pulled using these R commands:

EPIC_Zhou = sesameDataGet(‘EPIC.probeInfo’)$mask # 104454 probes HM450_Zhou <- sesameDataGet(‘HM450.probeInfo’))$mask # 65144 probes
methylcheck.probes.problem_probe_reasons(array, criteria=None)

Returns a dataframe of probes to be excluded, based on recommendations from the literature. Mouse and 27k arrays are not supported.

array: string

name for type of array used ‘IlluminaHumanMethylationEPIC’, ‘IlluminaHumanMethylation450k’ This shorthand names are also okay:

{‘EPIC’,’EPIC+’,’450k’,’27k’,’MOUSE’, ‘mouse’, ‘epic’, ‘450k’}
criteria: list

List of the publications to use when excluding probes. If the array is 450K the publications may include:

‘Chen2013’ ‘Price2013’ ‘Zhou2016’ ‘Naeem2014’ ‘DacaRoszak2015’ ‘Sesame’ – uses the default mask imported from sesame
If the array is EPIC the publications may include:
‘Zhou2016’ ‘McCartney2016’ ‘Sesame’ – uses the default mask imported from sesame
or these reasons:
‘Polymorphism’ ‘CrossHybridization’ ‘BaseColorChange’ ‘RepeatSequenceElements’

If no publication list is specified, probes from all publications will be added to the exclusion list. If more than one publication is specified, all probes from all publications in the list will be added to the exclusion list.

returns: dataframe
this returns a dataframe showing how each probe in the list is categorized for exclusion (based on criteria: reasons and paper-refs). This output is not suitable for other functions that just expect a list of probe names.
methylcheck.probes.list_problem_probes(array, criteria=None, custom_list=None)

Function to create a list of probes to exclude from downstream processes.

By default, all probes that have been noted in the literature to have polymorphisms, cross-hybridization, repeat sequence elements and base color changes are included in the DEFAULT exclusion list.

• You can customize the exclusion list by passing in either publication shortnames or criteria into the function.
• you can combine pubs and reasons into the same list of exclusion criteria.
• if a publication doesn’t match your array type, it will raise an error and tell you.

Including any of these labels in pubs (publications) or criteria (described below) will result in these probes NOT being included in the final exclusion list.

User also has ability to add custom list of probes to include in final returned list.

Parameters:

array: string

name for type of array used ‘IlluminaHumanMethylationEPIC’, ‘IlluminaHumanMethylation450k’ This shorthand names are also okay:

{'EPIC','EPIC+','450k','27k','MOUSE'}
criteria: list

List of the publications to use when excluding probes. If the array is 450K the publications may include: 'Chen2013' 'Price2013' 'Zhou2016' 'Naeem2014' 'DacaRoszak2015'

If the array is EPIC the publications may include: 'Zhou2016' 'McCartney2016'

If array is EPIC or EPIC+, specifying 'illumina' will remove 998 probes the manufacturer has recommended be excluded. The defects only affected a small number of EPIC arrays produced.

If no publication list is specified, probes from all publications will be added to the exclusion list. If more than one publication is specified, all probes from all publications in the list will be added to the exclusion list.

criteria: lists

List of the criteria to use when excluding probes. List may contain the following exculsion criteria: ’Polymorphism’

‘CrossHybridization’ ‘BaseColorChange’ ‘RepeatSequenceElements’ ‘illumina’

If no criteria list is specified, all critera will be excluded. If more than one criteria is specified, all probes meeting any of the listed criteria will be added to the exclusion list.

custom_list: list, default None
User-provided list of probes to be excluded. These probe names have to match the probe names in your data exactly.

Returns:

probe_exclusion_list: list
List containing probe identifiers to be excluded
or probe_exclusion_dataframe: dataframe
DataFrame containing probe names as index and reason | paper_reference as columns

If you supply no criteria (default), then maximum filtering occurs:

• EPIC will have 389050 probes removed
• 450k arrays will have 341057 probes removed

Reason lists for 450k and probes removed:

• Daca-Roszak_etal_2015 (96427)
• Chen_etal_2013 (445389)
• Naeem_etal_2014 (146590)
• Price_etal_2013 (284476)
• Zhou_etal_2016 (184302)
• Polymorphism (796290)
• CrossHybridization (211330)
• BaseColorChange (359)
• RepeatSequenceElements (149205)

Reason lists for epic and probes removed:

• McCartney_etal_2016 (384537)
• Zhou_etal_2016 (293870)
• CrossHybridization (173793)
• Polymorphism (504208)
• BaseColorChange (406)
methylcheck.probes.drop_nan_probes(df, silent=False, verbose=False)

accounts for df shape (probes in rows or cols) so dropna() will work.

the method used inside MDS may be faster, but doesn’t tell you which probes were dropped.

## plotting functions¶

methylcheck.samples.assign(df, two_pass=False)
Manually and interactively assign each sample to a group, based on beta-value distribution shape.
This function is not documented or supported anymore.
how:
sorts samples by the position of their peak intensity in beta dist, then lets the human assign a number to each. assumes that peaks that differ by > 0.1 of the 0 to 1 range are different clusters. fills in expected cluster #.
options:
if two_pass==True: it lets user go through every cluster a second time and split larger clusters further. sometimes a cluster isn’t obviously two groups until it is separated from the larger dataset.
methylcheck.samples.plot_assigned_groups(df, user_defined_groups)

takes the ‘sample: group’ dictionary and plots each sub-set. returns a lookup dict of each cluster –> [list of samples].

methylcheck.samples.mean_beta_plot(df, verbose=False, save=False, silent=False)

Returns a plot of the average beta values for all probes in a batch of samples.

Input (df):
• a dataframe with probes in rows and sample_ids in columns.
• to get this formatted import, use methylprep.consolidate_values_for_sheet(),

as this will return a matrix of beta-values for a batch of samples (by default).

methylcheck.samples.beta_density_plot(df, verbose=False, save=False, silent=False, reduce=0.1, plot_title=None, ymax=None, return_fig=False, full_range=False, highlight_samples=None, figsize=(12, 9), show_labels=None, filename='beta.png')

Returns a plot of beta values for each sample in a batch of samples as a separate line. Y-axis values is an arbitrary scale, similar to a histogram of probes that have a given beta value. X-axis values are beta values (0 to 1) for a single samples

Input (df):
• a dataframe with probes in rows and sample_ids in columns.
• to get this formatted import, use methylprep.consolidate_values_for_sheet(), as this will return a matrix of beta-values for a batch of samples (by default).
Returns:
None (but if return_fig is True, returns the figure object instead of showing plot)
Parameters:
verbose:
display extra messages
save:
if True, saves a copy of the plot as a png file.
silent:
if True, eliminates all messages (useful for automation and scripting)
reduce:
when working with datasets and you don’t need publication quality “exact” plots, supply a float between 0 and 1 to sample the probe data for plotting. We recommend 0.1, which plots 10% of the 450k or 860k probes, and doesn’t distort the distribution much. Values below 0.001 (860 probes out of 860k) will show some sampling distortion. Using 0.1 will speed up plotting 10-fold of large batches.
ymax (None):
If defined, upper limit of plot will not exceed this value. But it y-range can be smaller if values are less than this range.
full_range: (False)
if True, x-axis will be auto-scaled, instead of fixed in the 0-to-1.0 range.
return_fig: (False)
if True, returns figure object instead of showing plot.
highlight_samples:
a string or list of df col-names that, if provided, will highlight sample(s) in blue and bold in plot returned. all other samples in df will be grayed out. Useful for QC reports.
figsize:
tuple of width, height, with 12,9 being default if ommitted.

show_labels: By default, sample names appear in a legend if there are <30 samples. Otherwise, ommitted. Use this to force legend on or off.

Note:
if the sample_ids in df.index are not unique, it will make them so for the purpose of plotting.
methylcheck.samples.sample_plot(df, **kwargs)

A more intuitive alias of beta_density_plot(), since not all values are beta distributions. Note: This changes the beta_density_plot() defaults to show a reduced, faster version of probe data, sampling 10% of probes present for10-fold faster processing time.

methylcheck.samples.cumulative_sum_beta_distribution(df, cutoff=0.7, verbose=False, save=False, silent=False)

Attempts to filter outlier samples based on the cumulative area under the curve exceeding a reasonable value (cutoff). This method only works on poor quality samples that are better identified using ControlReporter summary (CLI: ‘methylcheck controls’)

Inputs:
DataFrame – wide format (probes in columns, samples in rows) cutoff (default 0.7) silent – suppresses figure, so justs returns transformed data if False. if save==True: saves figure to disk.
Returns:
dataframe with subjects removed that exceed cutoff value.
methylcheck.samples.beta_mds_plot(df, filter_stdev=1.5, verbose=False, save=False, silent=False, multi_params={'draw_box': True}, plot_removed=False, nafill='quick', poobah=None, palette=None, labels=None, extend_poobah_range=True, plot=True)

Performs multidimensional scaling on a dataframe of samples

Arguments

df:
dataframe of beta values for a batch of samples (rows are probes; cols are samples)
filter_stdev:
a value (unit: standard deviations) between 0 and 3 (typically) that represents the fraction of samples to include, based on the standard deviation of this batch of samples. So using the default value of 1.5 means that all samples whose MDS-transformed beta sort_values are within +/- 1.5 standard deviations of the average beta are retained in the data returned.
plot_removed:
if True, displays a plot of samples’ beta-distributions that were removed by MDS filtering. ignored if silent=True.
nafill: (‘quick’ | ‘impute’)
by default, most samples will contain missing values where probes failed the signal-noise detection in methylprep. By default, it will use the fastest method of filling in samples from adjacent sample’s probe values with the ‘quick’ method. Or, if you want it to use the average value for all samples for each probe, use ‘impute’, which will be much slower.
poobah:
path to poobah_values.pkl file. Default is None. If supplied, this will color code dots according to percent of failed probes for each sample as a second dimension of QC on the plot. Does not filter or affect the output dataframe returned.
palette:
Optional - Specify a matplotlib/seaborn palette name, such as ‘CMRmap_r’, ‘coolwarm’, or ‘nipy_spectral’. Default is ‘twilight’.
labels:
pass in a dictionary with sample names found in df columns and a (number or string) representing the groups to assign samples to. Use this to color-code the samples against a known classification scheme, such as cell type, and observe whether the MDS clustering pattern aligns with this input parameter. This feature is not compatible with poobah or multi_params.
extend_poobah_range:
True means 7 colors appear covering 0-30%. False means 5 colors and 0-20%. Default is True.
multi_params:
is a dict, passed into this function from a multi-compare-MDS wrapper function, containing: {return_plot_obj=True, fig=None, ax=None, draw_box=False, xy_lim=None, color_num=0, PSF=1.2 – plot scale factor (margin beyond points to display)}

Options

verbose:
silent:
• if running from command line in an automated process, you can run in silent mode to suppress any user interaction.
• In this case, whatever filter_stdev you assign is the final value, and a file will be processed with that param.
plot: (default True)
• plot is False, this suppresses plots (images) from being generated and shown on screen.
• .png files are still saved if save == True.

Returns

Returns a filtered dataframe. If return_plot_obj is True, it returns the plot, for making overlays in methylize.

Requires

pandas, numpy, pyplot, sklearn.manifold.MDS

Notes

this will remove probes from ALL samples in batch from consideration if any samples contain NaN (missing values) for that probe.
methylcheck.samples.mean_beta_compare(df1, df2, save=False, verbose=False, silent=False)

Use this function to compare two dataframes, pre-vs-post filtering and removal of outliers. args:

the first argument (df1) is the “pre” dataframe of samples the second argument (df2) is the “post” dataframe of samples
kwargs:
verbose: additional output silent: suppresses figure, so no output unless save==True too.
methylcheck.samples.combine_mds(*args, **kwargs)

To combine (or segment) datasets for multidimensional scaling analysis

Use this function on multiple dataframes to combine datasets, or to visualize parts of the same dataset in separate colors. It is a wrapper of methylcheck.beta_mds_plot() and applies multidimensional scaling to cluster similar samples based on patterns in probe values, as well as identify possible outlier samples (and exclude them).

• combine datasets,
• run MDS,
• see how each dataset (or subset) overlaps with the others on a plot,
• exclude outlier samples based on a composite cutoff box (the average bounds of the component data sets)
• calculate the percent of data excluded from the group
• *args:
• pass in any number of pandas dataframes, and it will combine them into one mds plot.
• alternatively, you may pass in a list of filepaths as strings, and it will attempt to load these files as pickles.

but they must be pickles of pandas dataframes containing beta values or m-values

• silent: (default False)
(automated processing mode) if True, suppresses most information and avoids prompting user for anything. silent mode processes data but doesn’t show the plot.
• save: (default False)
if True, saves the plot png to disk.
• verbose: (default False)
if True, prints extra debug information to screen or logger.
• filter_stdev:
how broadly should you retain samples? units are standard deviations, defaults to 1.5 STDEV. if you increase this number, fewer outlier samples will be removed.
• returns a dataframe of transformed samples
methylcheck.qc_plot.run_qc(path)

Generates all QC plots for a dataset in the path provided. if process –all was used to create control probes and raw values for QC, because it uses four output files:

• beta_values.pkl
• control_probes.pkl
• meth_values.pkl or noob_meth_values.pkl
• unmeth_values.pkl or noob_unmeth_values.pkl

output is all to screen, so best to use in a jupyter notebook. If you prefer output in a PDF, use ReportPDF instead.

Note: this will only look in the path folder; it doesn’t do a recursive search for matching files.

methylcheck.qc_plot.plot_beta_by_type(beta_df, probe_type='all', return_fig=False, silent=False, on_lambda=False)

compare betas for type I and II probes – (inspired by the plotBetasByType() function)

Plot the overall density distribution of beta values and the density distributions of the Infinium I or II probe types 1 distribution plot; user defines type (I or II infinium)

Doesn’t work with 27k arrays because they are all of the same type, Infinium Type I.
options:
return_fig: (default False) if True, returns a list of figure objects instead of showing plots.
methylcheck.qc_plot.qc_signal_intensity(data_containers=None, path=None, meth=None, unmeth=None, poobah=None, palette=None, noob=True, silent=False, verbose=False, plot=True, cutoff_line=True, bad_sample_cutoff=11.5, return_fig=False)

Suggests sample outliers based on methylated and unmethylated signal intensity.

path
to csv files processed using methylprep these have “noob_meth” and “noob_unmeth” columns per sample file this function can use. if you want it to processed data uncorrected data.
data_containers
output from the methylprep.run_pipeline() command when run in a script or notebook. you can also recreate the list of datacontainers using methylcheck.load(<filepath>,’meth’)
(meth and unmeth)
if you chose process –all you can load the raw intensities like this, and pass them in: meth = pd.read_pickle(‘meth_values.pkl’) unmeth = pd.read_pickle(‘unmeth_values.pkl’) THIS will run the fastest.
(meth and unmeth and poobah)

if poobah=None (default): Does nothing if poobah=False: suppresses this color if poobah=dataframe: color-codes samples according to percent probe failure range,

but only if you pass in meth and unmeth dataframes too, not data_containers object.

if poobah=True: looks for poobah_values.pkl in the path provided.

cutoff_line: True will draw the line; False omits it. bad_sample_cutoff (default 11.5): set the cutoff for determining good vs bad samples, based on signal intensities of meth and unmeth fluorescence channels. 10.5 was borrowed from minfi’s internal defaults. noob: use noob-corrected meth/unmeth values verbose: additional messages plot: if True (default), shows a plot. if False, this function returns the median values per sample of meth and unmeth probes. return_fig (False default), if True, and plot is True, returns a figure object instead of showing plot. compare: if the processed data contains both noob and uncorrected values, it will plot both in different colors palette: if using poobah to color code, you can specify a Seaborn palette to use.

this will draw a diagonal line on plots

TODO:
doesn’t return both types of data if using compare and not plotting doesn’t give good error message for compare
methylcheck.qc_plot.plot_M_vs_U(data_containers_or_path=None, meth=None, unmeth=None, poobah=None, noob=True, silent=False, verbose=False, plot=True, compare=False, return_fig=False, palette=None, cutoff_line=True)

plot methylated vs unmethylated probe intensities

PATH to csv files processed using methylprep
these have “noob_meth” and “noob_unmeth” columns per sample file this function can use. if you want it to processed data uncorrected data. (If there is a poobah_values.pkl file in this PATH, it will use the file to color code points)
data_containers = run_pipeline(data_dir = ‘somepath’,
save_uncorrected=True, sample_sheet_filepath=’samplesheet.csv’)

you can also recreate the list of datacontainers using methylcheck.load(<filepath>,’meth’)

(meth and unmeth)
if you chose process –all you can load the raw intensities like this, and pass them in: meth = pd.read_pickle(‘meth_values.pkl’) unmeth = pd.read_pickle(‘unmeth_values.pkl’) THIS will run the fastest.
poobah
filepath: You may supply the file path to the p-value detection dataframe. If supplied, it will color code points on the plot. False: set poobah to False to suppress this coloring. None (default): if there is a poobah_values.pkl file in your path, it will use it.
optional params:

noob: use noob-corrected meth/unmeth values verbose: additional messages plot: if True (default), shows a plot. if False, this function returns the median values per sample of meth and unmeth probes. return_fig: (False default), if True (and plot is true), returns the figure object instead of showing it. compare:

if the processed data contains both noob and uncorrected values, it will plot both in different colors the compare option will not work with using the ‘meth’ and ‘unmeth’ inputs, only with path or data_containers.
cutoff_line: True will draw a diagonal line on plots.
the cutoff line is based on the X-Y scale of the plot, which depends on the range of intensity values in your data set.
TODO:
doesn’t return both types of data if using compare and not plotting doesn’t give good error message for compare
methylcheck.qc_plot.plot_controls(path=None, subset='all', return_fig=False)

internal array QC controls (available with the –save_control or –all methylprep process option)

path
can either be a path to the file, or a path to the folder containing a file called ‘control_probes.pkl’, or it can be the dictionary of control dataframes in control_probes.pkl.
subset (‘staining’ | ‘negative’ | ‘hybridization’ | ‘extension’ | ‘bisulfite’ |
‘non-polymorphic’ | ‘target-removal’ | ‘specificity’ | ‘all’):

‘all’ will plot every control function (default)

return_fig (False)
if True, returns a list of matplotlib.pyplot figure objects INSTEAD of showing then. Used in QC ReportPDF.

if there are more than 30 samples, plots will not have sample names on x-axis.

methylcheck.qc_plot.bis_conversion_control(path_or_df, use_median=False, on_lambda=False, verbose=False)

GCT score: requires path to noob_meth or raw meth_values.pkl; or you can pass in a meth dataframe. use_median: not supported yet. Always uses mean of probe values

## sex prediction¶

methylcheck.predict.get_sex(data_source, array_type=None, verbose=False, plot=False, save=False, on_lambda=False, median_cutoff=-2, include_probe_failure_percent=True, poobah_cutoff=20, custom_label=None, return_fig=False, return_labels=False)

This will calculate and predict the sex of each sample.

the “data_source” can be any one of:
path – to a folder with csv data that contains processed sample data path – to a folder with the ‘meth_values.pkl’ and ‘unmeth_values.pkl’ dataframes path – to a folder also containing samplesheet pkl and poobah_values.pkl, if you want to compare predicted sex with actual sex. data_containers – object created from methylprep.run_pipeline() or methylcheck.load(path, ‘meth’) tuple of (meth, unmeth) dataframes
array_type (string)
enum: {‘27k’,’450k’,’epic’,’epic+’,’mouse’} if not specified, it will load the data from data_source and determine the array for you.
median_cutoff (default is -2)
the minimum difference in the medians of X and Y probe copy numbers to assign male or female (copied from the minfi sex predict function)
include_probe_failure_percent:
True: includes poobah percent per sample as column in the output table and on the plot. Note: you must supply a ‘path’ as data_source to include poobah in plots.
poobah_cutoff
The maximum percent of sample probes that can fail before the sample fails. Default is 20 (percent) Has no effect if include_probe_failure_percent is False.
plot
True: creates a plot, with option to save as image or return_fig.
save
True: saves the plot, if plot is True
return_fig
If True, returns a pyplot figure instead of a dataframe. Default is False. Note: return_fig will not show a plot on screen.
return_labels: (requires plot == True)
When using poobah_cutoff, the figure only includes A-Z,1…N labels on samples on plot to make it easier to read. So to get what sample_ids these labels correspond to, you can rerun the function with return_labels=True and it will skip plotting and just return a dictionary with sample_ids and these labels, to embed in a PDF report if you like.
custom_label:
Option to provide a dictionary with keys as sample_ids and values as labels to apply to samples. e.g. add more data about samples to the multi-dimensional QC plot

while providing a filepath is the easiest way, you can also pass in a data_containers object, a list of data_containers containing raw meth/unmeth values, instead. This object is produced by methylprep.run_pipeline, or by using methylcheck.load(filepath, format=’meth’) and lets you customize the import if your files were not prepared using methylprep (non-standand CSV columns, for example)

If a poobah_values.pkl file can be found in path, the dataframe returned will also include percent of probes for X and Y chromosomes that failed quality control, and warn the user if any did. This feature won’t work if a containers object or tuple of dataframes is passed in, instead of a path.

Note: ~90% of Y probes should fail if the sample is female. That chromosome is missing.

methylcheck.predict.infer_strain(beta_df_or_filepath, manifest=None)

Uses SNPS to identify the mouse strain (among 36 possibilities), using an internal lookup ref table from sesame.

argument:
beta_df_or_filepath
note that beta_df does not contain the snps, but you can provide access to a control_probes.pkl file and it will load and pull the snps for analysis. Or, if you use methylcheck.load(<path>, format=’beta_csv’) the beta_df WILL contain snps.
manifest (default: None)
It load the mouse manifest by default. But if you want to provide a custom manifest, you can.

Note

This function calculates Variant allele frequency (VAF) in an intermediate step: - VAF is the percentage of sequence reads observed matching a specific DNA variant divided by the overall coverage at that locus. - VAF is a surrogate measure of the proportion of DNA molecules in the original specimen carrying the variant.

Possible Matching Strains:

‘DBA_1J’, ‘DBA_2J’, ‘AKR_J’, ‘C57L_J’, ‘129S5SvEvBrd’, ‘BALB_cJ’, ‘C57BL_10J’, ‘BTBR_T+_Itpr3tf_J’, ‘LP_J’, ‘A_J’, ‘129S1_SvImJ’, ‘RF_J’, ‘LEWES_EiJ’, ‘PWK_PhJ’, ‘BUB_BnJ’, ‘SPRET_EiJ’, ‘MOLF_EiJ’, ‘NZB_B1NJ’, ‘NZO_HlLtJ’, ‘NZW_LacJ’, ‘KK_HiJ’, ‘129P2_OlaHsd’, ‘C3H_HeJ’, ‘WSB_EiJ’, ‘CBA_J’, ‘C3H_HeH’, ‘NOD_ShiLtJ’, ‘C57BR_cdJ’, ‘CAST_EiJ’, ‘ZALENDE_EiJ’, ‘C58_J’, ‘C57BL_6NJ’, ‘ST_bJ’, ‘I_LnJ’, ‘SEA_GnJ’, ‘FVB_NJ’