methylcheck.read_geo
- methylcheck.read_geo(filepath, verbose=False, debug=False, as_beta=True, column_pattern=None, test_only=False, rename_probe_column=True, decimals=3)
Use to load preprocessed GEO data into methylcheck. Attempts to find the sample beta values or M-values in the CSV/TXT/XLSX file and turn them into a clean dataframe, with probe IDs in the index (rows). Version 3 (introduced June 2020):
- reads a downloaded file in csv, xlsx, pickle, or txt format
- looks for /d_RxxCxx patterned headings and a probe index
- sets the index of the df to the probes
- sets the columns to the sample names
- forces probe values to be floats if they are strings or mixed types
- if the filename has ‘intensit’ or ‘signal’ in it, converts the values to betas and saves them; even if the filename doesn’t match, it will still convert and save if the columns have Methylated in them
- detects multi-line headers and adjusts the dataframe columns accordingly
- returns the usable dataframe
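The conversion from intensities to betas mentioned in the list above can be sketched in plain Python. This is a minimal illustration, not methylcheck's implementation: the helper name `to_beta`, the offset of 100 (the conventional Illumina offset), and the probe values are assumptions. Only the `decimals=3` default comes from the signature above.

```python
def to_beta(meth, unmeth, offset=100, decimals=3):
    """Hypothetical helper: beta = M / (M + U + offset), rounded.

    Inputs are coerced with float() first, mirroring the step that
    forces string or mixed probe values to floats.
    """
    m = float(meth)
    u = float(unmeth)
    return round(m / (m + u + offset), decimals)


# Made-up intensities for two probes; the first pair is given as strings.
probes = {
    "cg00000029": ("900", "0"),
    "cg00000108": (300, 600),
}
betas = {probe: to_beta(m, u) for probe, (m, u) in probes.items()}
print(betas)
```

Whether read_geo applies an offset at all is not stated here, so pass `offset=0` to get the plain ratio M / (M + U).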
as_beta == True: converts meth/unmeth into a df of sample betas.
column_pattern=None (Sample21 | Sample_21 | Sample 21): some string of characters that precedes the number part of each sample in the columns of the file to be ingested.
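To illustrate what `column_pattern` describes, the sketch below picks sample columns out of a header by matching a prefix, an optional separator, and a number. The regular expression, the helper `find_sample_columns`, and the header values are illustrative assumptions; read_geo's own matching logic may differ.

```python
import re


def find_sample_columns(columns, column_pattern="Sample"):
    """Hypothetical helper: keep columns shaped like <prefix><sep?><number>.

    Accepts 'Sample21', 'Sample_21' and 'Sample 21' style names.
    """
    regex = re.compile(rf"^{re.escape(column_pattern)}[ _]?(\d+)$", re.IGNORECASE)
    return [col for col in columns if regex.match(col.strip())]


# Made-up header row from a processed file.
header = ["ID_REF", "Sample21", "Sample_22", "Sample 23", "Detection Pval"]
print(find_sample_columns(header))
```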
- FIXED:
[x] handle files with .Signal_A and .Signal_B instead of Meth/Unmeth
[x] BUG: can’t parse matrix_… files if they use underscores instead of spaces around sample numbers, or where sampleXXX has no separator
[x] handle processed files with sample_XX
[x] returns IlmnID as index/probe column, unless ‘rename_probe_column’ == False
[x] pass in sample_column names from header parser so that logic is in one place (makes the output much larger, so add kwarg to exclude this)
[x] decimals (default 3): round all probe beta/intensity/p values returned to this number of decimal places
[x] bug: can only recognize beta samples if ‘sample’ is in the column name, or if the sentrix_id pattern matches the columns; need to expand this to handle arbitrary sample naming styles (limited to one-column-per-sample patterns)
- TODO:
[-] BUG: meth_unmeth_pval works with as_beta but does not return the full data yet
[-] multiline header not working with all files yet
[-] _family GSM123456-tbl-1.txt files not detected yet
- notes:
- this makes inferences based on strings in the filename, and based on the column names.
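As a rough picture of the string-based inference described in these notes, the sketch below classifies a file from its name and column headings. The function name, the return labels, the example filenames, and the regular expression for Sentrix-style headings are assumptions for illustration. Only the ‘intensit’, ‘signal’, and Methylated cues come from the description above.

```python
import re

# Assumed shape of a Sentrix-style heading, e.g. '9247377085_R04C02'.
SENTRIX_RE = re.compile(r"\d+_R\d{2}C\d{2}")


def guess_file_kind(filename, columns):
    """Hypothetical helper: guess whether a file holds intensities or betas."""
    name = filename.lower()
    if "intensit" in name or "signal" in name:
        return "intensities"
    # Even if the filename gives no hint, the column names can
    # (matched case-insensitively here, which is an assumption).
    if any("methylated" in col.lower() for col in columns):
        return "intensities"
    # Treating Sentrix-style sample columns as betas is also an assumption.
    if any(SENTRIX_RE.search(col) for col in columns):
        return "betas"
    return "unknown"


# Made-up filenames and headers.
print(guess_file_kind("example_signal_intensities.txt", ["ID_REF", "S1 Methylated"]))
print(guess_file_kind("example_matrix.csv", ["ID_REF", "9247377085_R04C02"]))
```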