methylcheck.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)
Loads preprocessed GEO data into methylcheck. Attempts to find the sample beta/M_values in the CSV/TXT/XLSX file and turn them into a clean dataframe, with probe ids in the index/rows. Version 3 (introduced June 2020).
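
A minimal usage sketch, using only the keyword names and defaults shown in the signature above. The wrapper name `load_geo_betas` and the filename in the comment are hypothetical, and the import is deferred so the function can be defined without methylcheck installed.

```python
def load_geo_betas(filepath):
    """Load a processed GEO file as a dataframe with probe ids in the index."""
    # Deferred import: methylcheck is only needed when the function is called.
    import methylcheck

    # Keyword names and default values are copied from the documented signature.
    return methylcheck.read_geo(
        filepath,
        verbose=False,
        as_beta=True,
        rename_probe_column=True,
        decimals=3,
    )


# Hypothetical usage (the filename is a placeholder, not a real GEO file):
# df = load_geo_betas("GSE_example_processed.csv")
# print(df.shape)   # rows are probes, columns are samples
```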

  • reads a downloaded file in csv, xlsx, pickle, or txt format
  • looks for /d_RxxCxx patterned headings and a probe index
  • sets index in df to probes
  • sets columns to sample names
  • forces probe values to be floats, if strings/mixed
  • if the filename contains ‘intensit’ or ‘signal’, converts to betas and saves; even if the filename doesn’t match, converts and saves if the columns have Methylated in them
  • detects multi-line headers and adjusts dataframe columns accordingly
  • returns the usable dataframe
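
The heading detection described in the list can be illustrated with two regular expressions. This is a sketch under stated assumptions, not the parser methylcheck actually uses: both patterns, the function name `find_sample_columns`, and the example headings are hypothetical.

```python
import re

# Assumed shape of a Sentrix-style heading: digits, an underscore, then RxxCxx.
SENTRIX_RE = re.compile(r"\d+_R\d{2}C\d{2}")
# Assumed shape of a generic sample heading: "sample", an optional
# space or underscore, then the sample number.
SAMPLE_RE = re.compile(r"sample[ _]?\d+", re.IGNORECASE)


def find_sample_columns(headings):
    """Return the headings that look like per-sample columns."""
    return [h for h in headings
            if SENTRIX_RE.search(h) or SAMPLE_RE.search(h)]


# Made-up headings covering the naming styles mentioned in this section.
headings = ["ID_REF", "9247377085_R04C02", "Sample21",
            "Sample_22", "Sample 23", "Detection Pval"]
print(find_sample_columns(headings))
# ['9247377085_R04C02', 'Sample21', 'Sample_22', 'Sample 23']
```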

as_beta == True: converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21): some string of characters that precedes the number part of each sample in the columns of the file to be ingested.
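
To make the meth/unmeth to beta conversion concrete, here is a sketch using the commonly cited definition beta = M / (M + U + offset), rounded to `decimals` places. The offset of 100 is an assumption (the conventional Illumina value); this section does not say whether `read_geo` applies an offset. The helper `to_beta` and the intensities are hypothetical.

```python
def to_beta(meth, unmeth, offset=100, decimals=3):
    """Convert methylated/unmethylated intensities to a rounded beta value."""
    total = meth + unmeth + offset
    if total <= 0:
        # No usable signal: return NaN rather than divide by zero.
        return float("nan")
    return round(meth / total, decimals)


# Made-up (methylated, unmethylated) intensities for one probe in two samples.
samples = {"Sample_1": (9000.0, 900.0), "Sample_2": (500.0, 4500.0)}
betas = {name: to_beta(m, u) for name, (m, u) in samples.items()}
print(betas)
# {'Sample_1': 0.9, 'Sample_2': 0.098}
```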


  [x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
  [x] BUG: can’t parse matrix_… files if they use underscores instead of spaces around sample numbers, or where sampleXXX has no separator
  [x] handle processed files with sample_XX
  [x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
  [x] pass in sample_column names from header parser so that logic is in one place
      (makes the output much larger, so add kwarg to exclude this)
  [x] decimals (default 3): round all probe beta/intensity/p values returned to this number of decimal places
  [x] bug: can only recognize beta samples if ‘sample’ is in the column name, or the sentrix_id pattern matches columns
      need to expand this to handle arbitrary sample naming styles (limited to one-column-per-sample patterns)
  [-] BUG: meth_unmeth_pval works with as_beta but does not return full data yet
  [-] multiline header not working with all files yet
  [-] _family GSM123456-tbl-1.txt files not detected yet

This makes inferences based on strings in the filename and on the column names.