methylcheck.load

methylcheck.load(filepath='.', format='beta_value', file_stem='', verbose=False, silent=False, column_names=None, no_poobah=False, pval_cutoff=0.05, no_filter=True)

Methylsuite’s all-purpose data loading function.

When methylprep processes large datasets, you use the ‘batch_size’ option to keep memory and file size more manageable. Use the load helper function to quickly load and combine all of those parts into a single data frame of beta-values or m-values.

Doing this with pandas is about 8 times slower than using numpy in the intermediate step.

If no arguments are supplied, it will load all files in the current directory that have a ‘beta_values_X.pkl’ pattern.
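The file-name pattern above can be illustrated with a small standard-library sketch. This is not methylcheck's own implementation: `find_batch_files`, its `kind` parameter, and the file names in the demo are hypothetical, and the exact matching rules the library applies are an assumption here. It only shows how numbered batch files, with or without a `file_stem` prefix, could be collected in numeric order.

```python
import re
import tempfile
from pathlib import Path


def find_batch_files(folder, file_stem="", kind="beta_values"):
    """Return file names shaped like '<file_stem><kind>_<N>.pkl', sorted by N.

    Hypothetical helper for illustration; not part of methylcheck.
    """
    pattern = re.compile(rf"^{re.escape(file_stem)}{re.escape(kind)}_(\d+)\.pkl$")
    matches = []
    for path in Path(folder).iterdir():
        found = pattern.match(path.name)
        if found:
            # Keep the batch number so that 10 sorts after 2, not before it.
            matches.append((int(found.group(1)), path.name))
    return [name for _, name in sorted(matches)]


# Demo in a temporary folder filled with empty, hypothetically named files.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "beta_values_1.pkl",
        "beta_values_2.pkl",
        "beta_values_10.pkl",
        "GSE150999_beta_values_1.pkl",
        "m_values_1.pkl",
    ]:
        (Path(tmp) / name).touch()
    default_files = find_batch_files(tmp)
    renamed_files = find_batch_files(tmp, file_stem="GSE150999_")
```

Sorting on the parsed batch number rather than on the file name avoids the usual string-ordering surprise where ‘beta_values_10.pkl’ would come before ‘beta_values_2.pkl’.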

Arguments:
filepath:
Where to look for all the pickle files of processed data.
format: (‘beta_value’, ‘m_value’, ‘meth’, ‘meth_df’, ‘noob_df’, ‘beta_csv’, ‘sesame’)

This also allows processed.csv file data to be loaded. If you need meth and unmeth values, choose ‘meth’ and it will return a data_containers object with the ‘meth’ and ‘unmeth’ values, exactly like the data_containers object returned by methylprep.run_pipeline.

If you choose ‘meth_df’ or ‘noob_df’ it will load the pickled meth and unmeth dataframes from the folder specified.

column_names:
if your csv files contain column names that differ from those expected, you can specify them as a list of strings. By default it looks for [‘noob_meth’, ‘noob_unmeth’] or [‘meth’, ‘unmeth’] or [‘beta_value’] or [‘m_value’]. Note: if your csv data has probe names in a column that is not the FIRST column, or is not named “IlmnID”, you should specify it with column_names and put it first in the list, like [‘illumina_id’, ‘noob_meth’, ‘noob_unmeth’].
no_poobah:
if loading from CSVs, and there is a column for probe p-values (the poobah_pval column), the default is to filter out probes that fail the p < 0.05 cutoff. If you specify no_poobah=True, it will load everything, regardless of p-values.
pval_cutoff:
if applying poobah (p-value probe detection based on poor signal to noise), this specifies the cutoff threshold (0.05 by default).
no_filter: (default = True)
if False, removes probes that Illumina, the manufacturer, claimed in 2019 are sketchy for a select list of newer EPIC Sentrix_IDs. Only affects ‘beta_value’ and ‘m_value’ output; it has no effect on the meth/unmeth raw/NOOB intensity values returned.
file_stem: (string)
Older versions (pre v1.3.0) of methylprep processed with batch_size created a bunch of generically named files, such as ‘beta_values_1.pkl’, ‘beta_values_2.pkl’, ‘beta_values_3.pkl’, and so on. If you rename these or provide a custom name during processing, provide that name here to load them all (e.g. if your pickle file is called ‘GSE150999_beta_values_X.pkl’, then your file_stem is ‘GSE150999_’).
verbose:
outputs more processing messages.
silent:
suppresses all processing messages, even warnings.
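As a rough illustration of how these arguments combine, the sketch below wraps three calls that use only the parameter names and values documented above. The wrapper function names and the `folder` argument are hypothetical, and whether column_names applies to every CSV-based format is an assumption. methylcheck is imported inside each function so the definitions can be read and parsed without the package installed.

```python
def load_renamed_batches(folder, stem="GSE150999_"):
    """Combine renamed batch pickles into one beta-value dataframe."""
    import methylcheck  # lazy import: only needed when the function is called

    return methylcheck.load(filepath=folder, format="beta_value", file_stem=stem)


def load_containers_from_csvs(folder, pval_cutoff=0.05):
    """Read processed CSVs whose probe-name column is not the first column."""
    import methylcheck

    return methylcheck.load(
        filepath=folder,
        format="meth",
        # Probe-name column goes first, as the column_names note describes.
        column_names=["illumina_id", "noob_meth", "noob_unmeth"],
        pval_cutoff=pval_cutoff,
    )


def load_csvs_without_poobah(folder):
    """Read beta values from processed CSVs and skip p-value filtering."""
    import methylcheck

    return methylcheck.load(filepath=folder, format="beta_csv", no_poobah=True)
```

Calling, for example, `load_renamed_batches("path/to/processed")` requires methylcheck to be installed and the folder to contain the matching pickle files.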
Use cases and format:
format = beta_value:
you have a beta_values.pkl file in the path specified and want a dataframe returned, or you have a bunch of beta_values_1.pkl files in the path and want them merged and returned as one dataframe. (When using the ‘batch_size’ option in methylprep.run_pipeline() you’ll get multiple files saved.)
format = m_value:
you have an m_values.pkl file in the path specified and want a dataframe returned, or you have a bunch of m_values_1.pkl files in the path and want them merged and returned as one dataframe.
format = meth: (data_containers)
you have processed CSV files in the path specified and want a data_container returned
format = meth_df: (dataframe)
you have processed CSV files in the path specified and want a dataframe returned. Take the data_containers object returned and run the methylcheck.container_to_pkl(containers, save=True) function on it.
format = noob_df: (dataframe)
loads noob_meth_values.pkl and noob_unmeth_values.pkl and returns two dataframes in a list
format = sesame:
for reading csvs processed using R’s sesame package. It has a different format (Probe_ID, ind_beta, ind_negs, ind_poob) per sample. Only those probes that pass the p-value cutoff will be included.
format = beta_csv:
for reading processed.csv files from methylprep, and forcing it NOT to load from the pickled beta dataframe file, if present.
format = poobah_csv:
similar to beta_csv, this pulls poobah p-values for all probes out of all processed CSV files into one dataframe. These p-values will include failed probes and probes that would be filtered by quality_mask. The ‘poobah’ format excludes these.
format = poobah:
reads the ‘poobah_values.pkl’ file and returns a dataframe of p-values. Note failed / poor-quality probes are replaced with NaN.
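The p-value handling mentioned for several formats above can be pictured with a plain-Python sketch. It is not the library's code: `mask_failed_probes`, the probe names, and the values are hypothetical, plain dicts stand in for dataframe columns, and treating a probe as passing only when its p-value is strictly below the cutoff is an assumption.

```python
import math


def mask_failed_probes(beta_values, p_values, pval_cutoff=0.05):
    """Replace the beta value of each probe that fails the cutoff with NaN.

    Hypothetical helper; a probe passes only if its p-value is < pval_cutoff.
    """
    return {
        probe: (beta if p_values[probe] < pval_cutoff else math.nan)
        for probe, beta in beta_values.items()
    }


# Hypothetical beta values and detection p-values for one sample.
betas = {"probe_a": 0.91, "probe_b": 0.12, "probe_c": 0.55}
pvals = {"probe_a": 0.001, "probe_b": 0.20, "probe_c": 0.049}

masked = mask_failed_probes(betas, pvals)
kept = [probe for probe, beta in masked.items() if not math.isnan(beta)]
```

Here `probe_b` fails the default 0.05 cutoff and is masked, while the other two probes keep their beta values.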

Note

Science on p-value cutoff:
This function defaults to a p-value cutoff of 0.05, which is typical for scientific tests. There is currently no consensus on what percent of a sample’s probes can fail before the sample should be rejected. For example, if a sample has 860,000 probes and 5% of them fail, should you reject the whole sample from the batch? For large-batch, industrial-scale testing, the authors assign some limit, like 5%, 10%, 20%, or 30%, as a cutoff, and methylcheck’s run_qc() function defaults to 10 percent. But the academics we spoke to don’t automatically throw out any samples, because it depends on the sample type. Cancer samples have lots of aneuploidy (an abnormal number of chromosomes in a haploid set) and lost chromosomes, so one would expect no signal for the affected CpG sites. Those researchers wouldn’t throw out samples unless most of the sample fails. People are working on deriving a calibration curve from public GEO data to serve as a guide and frame of reference, but none exists yet, and public data rarely includes failed samples.
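The arithmetic behind the sample-level question can be sketched as follows. The helper names and the simulated p-values are hypothetical, and whether a sample sitting exactly at the limit counts as passing is an assumption; the 860,000-probe, 5% example and the 10 percent limit come from the note above.

```python
def percent_failed(p_values, pval_cutoff=0.05):
    """Percentage of probes whose p-value is at or above the cutoff."""
    failed = sum(1 for p in p_values if p >= pval_cutoff)
    return 100.0 * failed / len(p_values)


def sample_passes(p_values, max_percent_failed=10.0, pval_cutoff=0.05):
    """True if the share of failed probes does not exceed the chosen limit."""
    return percent_failed(p_values, pval_cutoff) <= max_percent_failed


# Simulated sample: 860,000 probes, of which 43,000 (5%) fail detection.
sample_pvals = [0.5] * 43_000 + [0.001] * 817_000

failed_share = percent_failed(sample_pvals)
passes_at_10 = sample_passes(sample_pvals, max_percent_failed=10.0)
passes_at_3 = sample_passes(sample_pvals, max_percent_failed=3.0)
```

With these numbers the sample passes a 10 percent limit but would be rejected under a 3 percent limit, which is why the choice of limit matters more than the per-probe cutoff itself.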

Note

  • modified this from methylprep on 2020-02-20 to allow for data_containers to be returned as an option
  • v0.6.3: added ‘no_filter’ step that automatically removes probes that Illumina, the manufacturer, claims are sketchy for certain Catalog IDs. (Disable this with no_filter=True)