Loading data from methylprep into methylcheck

Examples of how to load and use each file created by methylprep's process --all option.

Files produced:

1. beta_values.pkl
2. m_values.pkl
3. control_probes.pkl
4. noob_meth_values.pkl
5. noob_unmeth_values.pkl
6. meth_values.pkl
7. unmeth_values.pkl
8. sample_sheet_meta_data.pkl
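
Some of these files are optional, so it can help to confirm which ones are present before loading anything. The following is a minimal sketch using only the standard library. The helper name find_outputs is hypothetical; the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed above.
EXPECTED_FILES = [
    "beta_values.pkl",
    "m_values.pkl",
    "control_probes.pkl",
    "noob_meth_values.pkl",
    "noob_unmeth_values.pkl",
    "meth_values.pkl",
    "unmeth_values.pkl",
    "sample_sheet_meta_data.pkl",
]


def find_outputs(folder="."):
    """Return (present, missing) lists of the expected file names in `folder`."""
    folder = Path(folder)
    present = [name for name in EXPECTED_FILES if (folder / name).is_file()]
    missing = [name for name in EXPECTED_FILES if name not in present]
    return present, missing


if __name__ == "__main__":
    present, missing = find_outputs(".")
    print("present:", present)
    print("missing:", missing)
```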

The processed data used below comes from GSE49618:

[1]:
import methylcheck
import pandas as pd
from pathlib import Path
path = Path('.')
[2]:
beta_df = pd.read_pickle('beta_values.pkl')
beta_df.head()
[2]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 0.335582 0.196252 0.200789 0.351494 0.320310 0.330127 0.317109 0.181931 0.249313 0.328995 ... 0.177957 0.259148 0.355798 0.294141 0.367235 0.332893 0.232693 0.334410 0.303589 0.370401
cg00061679 0.322422 0.601861 0.588501 0.317582 0.726056 0.342334 0.334190 0.640104 0.254303 0.299355 ... 0.234025 0.387758 0.693347 0.319939 0.393521 0.347341 0.740913 0.663114 0.308237 0.375465
cg00063477 0.321620 0.938032 0.918401 0.329237 0.938475 0.352150 0.339434 0.919476 0.266977 0.315876 ... 0.056936 0.306449 0.923895 0.315743 0.360381 0.340766 0.942452 0.921626 0.316792 0.384024
cg00121626 0.291997 0.453785 0.449514 0.319824 0.491792 0.366443 0.302788 0.448257 0.260675 0.274178 ... 0.165645 0.308993 0.445349 0.291767 0.343996 0.330553 0.273489 0.410967 0.312232 0.352270
cg00223952 0.038150 0.045307 0.038419 0.022807 0.037974 0.039887 0.053729 0.038440 0.055672 0.060051 ... 0.046917 0.041977 0.030945 0.041563 0.034290 0.067948 0.053686 0.025340 0.052199 0.046804

5 rows × 21 columns

[3]:
m_df = pd.read_pickle('m_values.pkl')
m_df.head()
[3]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 -0.514961 -2.015454 -1.958699 -0.382834 -1.059516 -0.656712 -0.628848 -2.139236 -0.919563 -0.360902 ... -1.920971 -1.040128 -0.817558 -0.637878 -0.395978 -0.594394 -1.696670 -0.951421 -0.735929 -0.337719
cg00061679 -0.647032 0.616919 0.547593 -0.596738 1.448606 -0.517611 -0.528048 0.870316 -0.892182 -0.696319 ... -1.404918 -0.120540 1.235855 -0.462396 -0.238241 -0.529807 1.561527 1.031600 -0.730799 -0.347105
cg00063477 -0.639987 4.207996 3.884379 -0.500082 4.317613 -0.496591 -0.476495 3.842226 -0.742681 -0.520904 ... -3.993056 -0.719714 4.060894 -0.536133 -0.461240 -0.586145 4.541773 4.045979 -0.617881 -0.277707
cg00121626 -0.876430 -0.253725 -0.271588 -0.577208 -0.027303 -0.419393 -0.759770 -0.280896 -0.734815 -0.841415 ... -2.120818 -0.609998 -0.292979 -0.684757 -0.568112 -0.653359 -1.395094 -0.494596 -0.617087 -0.505953
cg00223952 -4.638848 -4.387776 -4.630668 -5.404377 -4.650842 -4.576496 -4.121644 -4.631290 -4.063162 -3.952639 ... -4.328896 -4.499491 -4.953910 -4.510208 -4.802441 -3.765488 -4.128769 -5.248283 -4.169188 -4.336239

5 rows × 21 columns

[4]:
meta_df = pd.read_pickle('sample_sheet_meta_data.pkl')
meta_df.head()
[4]:
Cheez BuffyCoat Sentrix_ID Sentrix_Position Sample_Group Sample_Name Sample_Plate Sample_Type Sub_Type Sample_Well Pool_ID GSM_ID Control Sample_ID
0 1 0 6285625091 R05C01 None 8.24 CD34 None Blood Whole None None GSM1185586 False 6285625091_R05C01
1 1 0 7796806148 R03C02 None 7.25 PMN None Blood Whole None None GSM1185602 True 7796806148_R03C02
2 1 0 7796806148 R01C02 None 7.25 PROS None Blood Whole None None GSM1185600 False 7796806148_R01C02
3 1 0 6285625091 R06C02 None 9.1 PMN None Blood Whole None None GSM1185593 False 6285625091_R06C02
4 2 0 6285625091 R03C01 None 8.10 CD19 None Blood Whole None None GSM1185584 False 6285625091_R03C01
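
The Sample_ID values match the column names of the beta and M-value dataframes, so the sample sheet can be used to select or relabel columns. The sketch below works on a small stand-in built from values shown above (three samples, two probes); with real data, beta_df and meta_df would come from the pickles. The .astype(bool) call is a precaution in case the Control column is stored with object dtype.

```python
import pandas as pd

# Stand-in data copied from the outputs above.
beta_df = pd.DataFrame(
    {
        "6285625091_R05C01": [0.335582, 0.322422],
        "7796806148_R03C02": [0.196252, 0.601861],
        "7796806148_R01C02": [0.200789, 0.588501],
    },
    index=pd.Index(["cg00035864", "cg00061679"], name="IlmnID"),
)
meta_df = pd.DataFrame(
    {
        "Sample_ID": ["6285625091_R05C01", "7796806148_R03C02", "7796806148_R01C02"],
        "GSM_ID": ["GSM1185586", "GSM1185602", "GSM1185600"],
        "Control": [False, True, False],
    }
)

# Keep only the samples that are not flagged as controls in the sample sheet.
is_control = meta_df["Control"].astype(bool)
non_control_ids = meta_df.loc[~is_control, "Sample_ID"].tolist()
non_control_beta = beta_df[non_control_ids]

# Relabel columns with GEO accessions instead of Sentrix IDs.
gsm_of = meta_df.set_index("Sample_ID")["GSM_ID"].to_dict()
beta_by_gsm = beta_df.rename(columns=gsm_of)

print(non_control_beta.columns.tolist())
print(beta_by_gsm.columns.tolist())
```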

These additional (optional) data types are used for quality control only, whereas m_values and beta_values are used for experiment analysis.

[5]:
# control probes use a different structure: a dictionary with sample names as keys and dataframes as values
controls = pd.read_pickle('control_probes.pkl')
list(controls.values())[0].head()
[5]:
          Control_Type   Color       Extended_Type       Mean_Value_Red  Mean_Value_Green  snp_beta  snp_meth  snp_unmeth
10627500  NEGATIVE       Purple      Negative 265        327.0           327.0             NaN       NaN       NaN
10673427  SPECIFICITY I  Lime        GT Mismatch 3 (PM)  406.0           10573.0           NaN       NaN       NaN
10714330  NORM_T         Purple      Norm_T46            2378.0          550.0             NaN       NaN       NaN
10721502  NEGATIVE       BlueViolet  Negative 472        168.0           309.0             NaN       NaN       NaN
10731326  NEGATIVE       Olive       Negative 583        253.0           364.0             NaN       NaN       NaN
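
Because controls is a dictionary of dataframes, per-sample summaries need a loop over its items. The sketch below averages the NEGATIVE control probes for each sample. The dictionary is a one-sample stand-in with rows copied from the output above, and the key "sample_A" is a placeholder for a real sample name.

```python
import pandas as pd

# One-sample stand-in; rows copied from the output above.
# "sample_A" is a placeholder key, not a real sample name.
controls = {
    "sample_A": pd.DataFrame(
        {
            "Control_Type": ["NEGATIVE", "SPECIFICITY I", "NORM_T", "NEGATIVE", "NEGATIVE"],
            "Mean_Value_Red": [327.0, 406.0, 2378.0, 168.0, 253.0],
            "Mean_Value_Green": [327.0, 10573.0, 550.0, 309.0, 364.0],
        },
        index=[10627500, 10673427, 10714330, 10721502, 10731326],
    )
}

# Collect one row of summary statistics per sample.
rows = {}
for sample_name, probes in controls.items():
    negative = probes[probes["Control_Type"] == "NEGATIVE"]
    rows[sample_name] = negative[["Mean_Value_Red", "Mean_Value_Green"]].mean()

# Samples as rows, channels as columns.
negative_background = pd.DataFrame(rows).T
print(negative_background)
```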
[6]:
nmeth_df = pd.read_pickle('noob_meth_values.pkl')
nmeth_df.head()
[6]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 182.287003 1965.806030 1103.547974 185.369995 2678.219971 222.192001 165.600998 1132.555054 90.123001 132.768997 ... 123.458000 125.944000 2092.079102 119.185997 246.345001 203.408005 1826.296997 1784.352051 160.177994 229.951004
cg00061679 187.768005 10542.798828 6613.498047 157.912994 9092.219727 205.153000 182.511993 6541.208008 93.703003 139.722000 ... 162.632004 203.518005 5621.078125 134.235992 277.039001 230.757004 9118.296875 5272.352051 172.156998 256.214996
cg00063477 182.520004 8286.798828 4687.498047 161.169998 6428.220215 234.156006 180.975998 5549.208008 93.879997 137.404999 ... 179.477997 163.300995 4411.078125 140.087997 252.128998 231.925003 5463.295898 4043.352051 161.623993 255.813995
cg00121626 171.238998 8786.798828 5729.498047 158.313995 7006.220215 256.240997 165.227005 6298.208008 85.889999 117.821999 ... 150.302994 141.572998 4946.078125 122.697998 236.746002 222.304993 3850.297119 4122.352051 150.281006 239.852997
cg00223952 414.841003 870.955017 483.114990 286.001007 584.151978 580.906006 569.999023 535.734009 469.842010 676.622986 ... 547.318970 598.908020 403.781006 448.001007 491.671997 956.994019 874.518005 301.476013 701.672974 716.931030

5 rows × 21 columns

[7]:
nunmeth_df = pd.read_pickle('noob_unmeth_values.pkl')
nunmeth_df.head()
[7]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 260.908997 7950.952148 4292.503906 242.007996 5583.116211 350.859009 256.619995 4992.636230 171.363007 170.789993 ... 470.295013 260.049011 3687.885986 186.014008 324.464996 307.622986 5922.219238 3451.472900 267.437012 290.865997
cg00061679 294.600006 6874.210938 4524.379883 239.322998 3330.531006 294.125000 263.619995 3577.764893 174.766998 227.020996 ... 432.302002 221.339996 2386.083008 185.332001 326.962006 333.596008 3088.548096 2578.552002 286.364014 326.178986
cg00063477 284.981995 447.441986 316.481995 228.356003 321.424011 330.776001 252.194000 385.976013 157.761002 197.591003 ... 2872.782959 269.579987 263.358002 203.589005 347.489014 348.674011 233.598007 243.843002 248.565994 310.326996
cg00121626 315.200989 10476.533203 6916.490234 236.690002 7140.096191 343.024994 280.459015 7652.217773 143.600998 211.906006 ... 657.078003 216.602997 6059.998047 197.835999 351.477997 350.220001 10128.132812 5808.500000 231.031006 341.026001
cg00223952 10358.999023 18252.509766 11991.741211 12154.230469 14698.804688 13883.028320 9938.714844 13300.982422 7869.617188 10490.825195 ... 11018.473633 13568.578125 12544.721680 10230.872070 13746.908203 13027.292969 15315.113281 11495.925781 12640.651367 14500.761719

5 rows × 21 columns
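
The beta and M values shown earlier can be reproduced from these NOOB-corrected intensities. For the rows displayed, the numbers are consistent with beta = meth / (meth + unmeth + 100) and M = log2((meth + 1) / (unmeth + 1)). The offsets 100 and 1 are inferred from the printed values rather than taken from methylprep's documentation, so the sketch below is a consistency check, not a statement of how the library computes them.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Beta value with an additive offset in the denominator (offset inferred, see text)."""
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, offset=1.0):
    """M value as a log2 ratio with an additive offset (offset inferred, see text)."""
    return math.log2((meth + offset) / (unmeth + offset))


# (probe, noob_meth, noob_unmeth, printed beta, printed M) for sample 6285625091_R05C01,
# copied from the outputs above.
rows = [
    ("cg00035864", 182.287003, 260.908997, 0.335582, -0.514961),
    ("cg00061679", 187.768005, 294.600006, 0.322422, -0.647032),
    ("cg00223952", 414.841003, 10358.999023, 0.038150, -4.638848),
]

for probe, meth, unmeth, beta_printed, m_printed in rows:
    print(
        probe,
        round(beta_value(meth, unmeth), 6), beta_printed,
        round(m_value(meth, unmeth), 6), m_printed,
    )
```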

[8]:
meth_df = pd.read_pickle('meth_values.pkl')
meth_df.head()
[8]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 389.0 2566.0 1503.0 409.0 3200.0 102.0 106.0 1586.0 105.0 157.0 ... 113.0 80.0 2471.0 137.0 142.0 0.0 2369.0 2140.0 303.0 110.0
cg00061679 412.0 11143.0 7013.0 289.0 9614.0 0.0 189.0 6995.0 126.0 191.0 ... 320.0 476.0 6000.0 220.0 299.0 157.0 9661.0 5628.0 358.0 253.0
cg00063477 390.0 8887.0 5087.0 305.0 6950.0 166.0 182.0 6003.0 127.0 180.0 ... 385.0 311.0 4790.0 248.0 174.0 163.0 6006.0 4399.0 310.0 251.0
cg00121626 339.0 9387.0 6129.0 291.0 7528.0 271.0 104.0 6752.0 78.0 72.0 ... 265.0 190.0 5325.0 158.0 86.0 112.0 4393.0 4478.0 252.0 167.0
cg00223952 934.0 1442.0 848.0 688.0 1091.0 1059.0 964.0 941.0 774.0 1051.0 ... 1013.0 1150.0 762.0 857.0 944.0 1522.0 1417.0 613.0 1243.0 1326.0

5 rows × 21 columns

[9]:
unmeth_df = pd.read_pickle('unmeth_values.pkl')
unmeth_df.head()
[9]:
6285625091_R05C01 7796806148_R03C02 7796806148_R01C02 6285625091_R06C02 6285625091_R03C01 7796806148_R04C01 7796806148_R02C01 7796806148_R02C02 6285625091_R01C02 6285625091_R02C01 ... 6285625091_R04C02 6285625091_R04C01 6285625091_R01C01 6285625091_R03C02 7796806148_R05C01 7796806148_R03C01 6285625091_R06C01 6285625091_R02C02 6285625091_R05C02 7796806148_R06C01
IlmnID
cg00035864 138.0 6127.0 4246.0 215.0 4067.0 231.0 80.0 4351.0 180.0 27.0 ... 316.0 252.0 2987.0 108.0 95.0 0.0 4250.0 2833.0 283.0 67.0
cg00061679 244.0 5392.0 4442.0 205.0 2613.0 0.0 115.0 3295.0 193.0 229.0 ... 217.0 127.0 2076.0 105.0 107.0 117.0 2439.0 2216.0 332.0 206.0
cg00063477 216.0 438.0 417.0 162.0 429.0 157.0 57.0 474.0 123.0 136.0 ... 1913.0 278.0 323.0 179.0 200.0 178.0 279.0 309.0 228.0 147.0
cg00121626 299.0 7851.0 6464.0 195.0 5072.0 203.0 193.0 6336.0 53.0 184.0 ... 656.0 109.0 4647.0 157.0 217.0 184.0 6938.0 4499.0 170.0 257.0
cg00223952 6117.0 13159.0 10754.0 8489.0 9951.0 10170.0 8688.0 10552.0 6015.0 7364.0 ... 5060.0 8691.0 9185.0 7515.0 11097.0 9878.0 10253.0 8519.0 8109.0 10014.0

5 rows × 21 columns
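
The cells above load each file with a separate pd.read_pickle call. When several files are needed at once, a loop can do the same thing. The helper name load_pickles below is hypothetical. Note that control_probes.pkl loads as a dictionary rather than a dataframe, and that pickle files should only be loaded from sources you trust.

```python
from pathlib import Path

import pandas as pd


def load_pickles(folder="."):
    """Load every .pkl file in `folder` into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(Path(folder).glob("*.pkl")):
        loaded[path.stem] = pd.read_pickle(path)
    return loaded


if __name__ == "__main__":
    data = load_pickles(".")
    # Report what was found and the Python type of each object.
    for name, obj in data.items():
        print(name, type(obj).__name__)
```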

Creating a dataframe from processed CSV files requires methylcheck.

[10]:
# Deprecated: containers are no longer needed for processing, but this option is still supported.
# Loading this way takes much longer than reading the pickled dataframes from disk.
containers = methylcheck.load('.', format='meth')
Files: 100%|██████████| 21/21 [00:19<00:00,  1.07it/s]
/Users/mmaxmeister/anaconda3/lib/python3.7/site-packages/tqdm/std.py:658: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  from pandas import Panel
INFO:methylcheck.load_processed:Produced a list of Sample objects (use obj._SampleDataContainer__data_frame to get values)...
[11]:
containers
[11]:
[<methylcheck.load_processed.SampleDataContainer at 0x7ff12235d5f8>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0e03cbd68>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff12235dc88>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff12039a0f0>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0e04ee710>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff1102c46a0>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff1102c4cc0>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0e04eef28>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0f8c6d940>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0f8c6deb8>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0e04ee940>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff1102c4f28>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25ba8>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25390>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff0f8c81e80>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25438>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25048>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25780>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e253c8>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25e80>,
 <methylcheck.load_processed.SampleDataContainer at 0x7ff105e25908>]
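
The INFO message above points to obj._SampleDataContainer__data_frame. That spelling is what Python's name mangling produces for an attribute written as self.__data_frame inside a class named SampleDataContainer. The class below is a hypothetical stand-in, not methylcheck's implementation (its name attribute and dict payload are assumptions); it only illustrates why the attribute has that name and how to collect it from each container.

```python
class SampleDataContainer:
    """Hypothetical stand-in used only to illustrate name mangling."""

    def __init__(self, name, values):
        self.name = name
        # A double leading underscore triggers name mangling:
        # the attribute is stored as _SampleDataContainer__data_frame.
        self.__data_frame = values


containers = [
    SampleDataContainer("sample_A", {"cg00035864": 0.335582}),
    SampleDataContainer("sample_B", {"cg00035864": 0.196252}),
]

# From outside the class, the mangled name is required.
frames = {c.name: c._SampleDataContainer__data_frame for c in containers}
print(frames)

# The unmangled name is not visible from outside the class.
print(hasattr(containers[0], "__data_frame"))
```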
[12]:
beta_df_from_csvs = methylcheck.load('7796806148', format='beta_value')
beta_df_from_csvs
Files: 100%|██████████| 9/9 [00:03<00:00,  2.77it/s]
merging...
100%|██████████| 9/9 [00:00<00:00, 944.47it/s]
[12]:
7796806148_R01C01 7796806148_R01C02 7796806148_R02C01 7796806148_R02C02 7796806148_R03C01 7796806148_R03C02 7796806148_R04C01 7796806148_R05C01 7796806148_R06C01
IlmnID
cg00035864 0.289062 0.201050 0.316895 0.182007 0.333008 0.196045 0.330078 0.366943 0.370117
cg00061679 0.310059 0.588867 0.333984 0.640137 0.346924 0.602051 0.342041 0.394043 0.375000
cg00063477 0.326904 0.917969 0.339111 0.918945 0.341064 0.937988 0.352051 0.360107 0.384033
cg00121626 0.304932 0.449951 0.302979 0.447998 0.331055 0.454102 0.365967 0.343994 0.352051
cg00223952 0.057007 0.037994 0.053986 0.037994 0.067993 0.045013 0.040009 0.033997 0.046997
... ... ... ... ... ... ... ... ... ...
cg27614706 0.954102 0.954102 0.940918 0.953125 0.936035 0.954102 0.957031 0.955078 0.966797
cg27619353 0.258057 0.304932 0.129028 0.201050 0.311035 0.337891 0.370117 0.172974 0.372070
cg27620176 0.972168 0.975098 0.974121 0.973145 0.977051 0.976074 0.977051 0.979004 0.977051
cg27647370 0.974121 0.973145 0.970215 0.973145 0.976074 0.972168 0.978027 0.975098 0.976074
cg27652464 0.049988 0.058014 0.062012 0.053986 0.053986 0.062012 0.050995 0.094971 0.067017

485512 rows × 9 columns
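
The values loaded from CSV differ slightly from the pickled ones. For example, cg00035864 in 7796806148_R01C02 is 0.200789 in the pickle output and 0.201050 here. The printed CSV values are what one would get by rounding to three decimals and storing the result as a 16-bit float. That is an inference from the output, not something stated by methylcheck. The sketch below checks the inference with the standard library and shows a tolerance-based comparison, which is safer than exact equality when mixing the two sources.

```python
import math
import struct


def to_float16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


pickled = 0.200789   # from the beta_values.pkl output above
from_csv = 0.201050  # from the CSV-loaded output above

# Round to three decimals, then store in half precision.
reconstructed = to_float16(round(pickled, 3))
print(f"{reconstructed:.6f}")

# Compare with an absolute tolerance instead of ==.
same_within_tolerance = math.isclose(pickled, from_csv, abs_tol=1e-3)
print(same_within_tolerance)
```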

[13]:
methylcheck.sample_plot(beta_df)
[Figure: output of methylcheck.sample_plot(beta_df)]
[14]:
methylcheck.qc_signal_intensity(meth=meth_df, unmeth=unmeth_df)
methylcheck.qc_signal_intensity(meth=nmeth_df, unmeth=nunmeth_df) # NOOB is slightly different
[Figure: signal intensity plot, first call]
List of Bad Samples
[]
[Figure: signal intensity plot, second call (NOOB)]
List of Bad Samples
[]
[15]:
methylcheck.plot_M_vs_U(meth=meth_df, unmeth=unmeth_df)
[Figure: output of methylcheck.plot_M_vs_U]
[16]:
methylcheck.plot_beta_by_type(beta_df, 'all')
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450_15017482_v1-2.CoreColumns.csv
Found 135476 type I probes.
[Figure: plot_beta_by_type output following the type I message]
Found 350036 type II probes.
[Figure: plot_beta_by_type output following the type II message]
[17]:
methylcheck.plot_controls(controls, 'all')
[Figures: plot_controls output, 2 images]
WARNING:methylcheck.qc_plot:Some Green Hyb (High) values exceed chart maximum and are not shown.
[Figures: plot_controls output, 6 more images]
[20]:
methylcheck.get_sex('.', plot=True) # or (meth_df, unmeth_df) tuple as alt input
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450_15017482_v1-2.CoreColumns.csv
/Users/mmaxmeister/methylcheck/methylcheck/predict/sex.py:20: RuntimeWarning: divide by zero encountered in log2
  return np.log2(meth+unmeth)
[Figure: output of methylcheck.get_sex with plot=True]
[20]:
x_median y_median predicted_sex
6285625091_R05C01 13.5 9.5 F
7796806148_R03C02 13.2 13.5 M
7796806148_R01C02 12.8 13.0 M
6285625091_R06C02 13.3 9.3 F
6285625091_R03C01 12.9 13.2 M
7796806148_R04C01 13.8 8.6 F
7796806148_R02C01 13.4 8.4 F
7796806148_R02C02 12.9 13.1 M
6285625091_R01C02 12.8 8.8 F
6285625091_R02C01 13.3 8.4 F
7796806148_R01C01 13.3 8.4 F
6285625091_R04C02 13.1 9.8 F
6285625091_R04C01 13.5 9.0 F
6285625091_R01C01 12.6 12.8 M
6285625091_R03C02 13.2 8.9 F
7796806148_R05C01 14.0 8.9 F
7796806148_R03C01 13.7 8.6 F
6285625091_R06C01 12.8 13.1 M
6285625091_R02C02 12.5 12.8 M
6285625091_R05C02 13.4 9.2 F
7796806148_R06C01 13.8 8.9 F
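
In the table above, samples predicted F have a y_median several units below their x_median, while samples predicted M have the two roughly equal. The warning in the cell output suggests these medians are on a log2(meth + unmeth) scale. The sketch below shows a rule with that shape. The cutoff of -2 is a hypothetical value chosen only because it separates the rows shown, and methylcheck's actual rule may differ.

```python
def predict_sex(x_median, y_median, cutoff=-2.0):
    """Return 'F' when the Y signal is well below the X signal, else 'M'.

    The cutoff is a hypothetical value, not taken from methylcheck.
    """
    return "F" if (y_median - x_median) < cutoff else "M"


# (sample, x_median, y_median, predicted_sex) copied from the table above.
rows = [
    ("6285625091_R05C01", 13.5, 9.5, "F"),
    ("7796806148_R03C02", 13.2, 13.5, "M"),
    ("7796806148_R01C02", 12.8, 13.0, "M"),
    ("6285625091_R06C02", 13.3, 9.3, "F"),
    ("6285625091_R04C02", 13.1, 9.8, "F"),
]

for sample, x, y, reported in rows:
    print(sample, predict_sex(x, y), reported)
```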

To run all QC functions and plots at once, use the convenience function run_qc(). Just specify the path to the methylprep processed files.

[21]:
methylcheck.qc_plot.run_qc('.')
[Figures: run_qc output, 2 images]
List of Bad Samples
[]
[Figures: run_qc output, 2 images]
WARNING:methylcheck.qc_plot:Some Green Hyb (High) values exceed chart maximum and are not shown.
[Figures: run_qc output, 6 images]
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450_15017482_v1-2.CoreColumns.csv
Found 135476 type I probes.
[Figure: run_qc output following the type I message]
Found 350036 type II probes.
[Figure: run_qc output following the type II message]