Outlier detection using Multidimensional Scaling (MDS)¶
Multidimensional scaling (MDS) is one method for comparing the similarity of samples–similar to PCA. MDS based on euclidean distance will also have identical results to PCA, because minimizing linear distance between points is effectively the same as maximizing linear correlation (the latter being what PCA does).
See StatQuest’s very helpful video on MDS and PCoA for more details.
methylcheck
allows users the option of filtering their data based on the results of MDS. The standard cut off is 1.5 standard deviations from the mean of the data, however users have the ability to adjust the cut off value if they are not satisfied by the filtering. After examining the MDS plot, press ‘enter’ to accept the cut-off as it is, or enter a new value and rerun the plot.
We will walk through an example of how to use MDS with this dataset from GEO: GSE111629. This is a large dataset (n=571) of patients with Parkinson’s Disease (n=335) and a group of controls (n=237). These blood samples were run on Illumina’s 450k arrays.
We downloaded the data from GEO and processed it with methylprep
using the following command:
>>> python -m methylprep process -d <filepath> --all
WARNING: This is a huge dataset and methylprep
will take +8 hours to process it. It will also eat up a lot of storage on your machine. We chose this data because it demonstrates the utility of the MDS function. We recommend users pick a different dataset to follow along this example!
[1]:
import methylcheck
import pandas as pd
from pathlib import Path
filepath = Path('/Users/patriciagirardi/methylcheck_tutorial')
[2]:
df = pd.read_pickle('~/methylcheck_tutorial/beta_values.pkl')
metadata = pd.read_pickle('~/methylcheck_tutorial/GSE111629_GPL13534_meta_data.pkl')
metadata.head()
[2]:
GSM_ID | Sample_Name | source | platform | title | disease state | age | gender | ethnicity | tissue | Sample_ID | Sentrix_ID | Sentrix_Position | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GSM3035401 | genomic DNA from 3999979001_R01C01 | X3999979001_R01C01 | GPL13534 | genomic DNA from 3999979001_R01C01 | Parkinson's disease (PD) | 74 | Female | Caucasian | whole blood | 3999979001_R01C01 | 3999979001 | R01C01 | |
1 | GSM3035402 | genomic DNA from 3999979001_R01C02 | X3999979001_R01C02 | GPL13534 | genomic DNA from 3999979001_R01C02 | PD-free control | 73 | Female | Caucasian | whole blood | 3999979001_R01C02 | 3999979001 | R01C02 | |
2 | GSM3035403 | genomic DNA from 3999979001_R02C01 | X3999979001_R02C01 | GPL13534 | genomic DNA from 3999979001_R02C01 | Parkinson's disease (PD) | 62 | Male | Caucasian | whole blood | 3999979001_R02C01 | 3999979001 | R02C01 | |
3 | GSM3035404 | genomic DNA from 3999979001_R02C02 | X3999979001_R02C02 | GPL13534 | genomic DNA from 3999979001_R02C02 | PD-free control | 72 | Male | Caucasian | whole blood | 3999979001_R02C02 | 3999979001 | R02C02 | |
4 | GSM3035405 | genomic DNA from 3999979001_R03C01 | X3999979001_R03C01 | GPL13534 | genomic DNA from 3999979001_R03C01 | PD-free control | 72 | Female | Caucasian | whole blood | 3999979001_R03C01 | 3999979001 | R03C01 |
[3]:
methylcheck.beta_mds_plot(df, filter_stdev=2, poobah=filepath)
INFO:methylcheck.samples.postprocessQC:81518.8 probe(s) [avg per sample] were missing values and removed from MDS calculations; 485512 remaining.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
571 original samples; 542 after filtering
Your scale factor was: 2
[3]:
3999979001_R01C01 | 3999979001_R01C02 | 3999979001_R02C01 | 3999979001_R02C02 | 3999979001_R03C01 | 3999979001_R03C02 | 3999979001_R04C01 | 3999979001_R04C02 | 3999979001_R05C01 | 3999979001_R05C02 | ... | 9721367028_R01C02 | 9721367028_R02C02 | 9721367028_R03C01 | 9721367028_R03C02 | 9721367028_R04C01 | 9721367028_R04C02 | 9721367028_R05C01 | 9721367028_R05C02 | 9721367028_R06C01 | 9721367028_R06C02 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IlmnID | |||||||||||||||||||||
cg00000029 | 0.586 | 0.553 | 0.586 | 0.551 | 0.533 | 0.698 | 0.713 | 0.413 | 0.367 | 0.655 | ... | 0.706 | 0.719 | 0.521 | 0.643 | 0.531 | 0.578 | 0.518 | 0.606 | NaN | 0.532 |
cg00000108 | 0.978 | 0.974 | 0.972 | 0.973 | 0.968 | 0.973 | 0.971 | 0.967 | 0.970 | 0.976 | ... | 0.969 | 0.974 | 0.970 | 0.973 | 0.972 | 0.968 | 0.969 | 0.975 | 0.966 | 0.968 |
cg00000109 | 0.917 | 0.925 | 0.905 | 0.934 | 0.922 | 0.920 | 0.913 | 0.918 | 0.933 | 0.923 | ... | 0.917 | 0.929 | 0.930 | 0.924 | 0.930 | 0.905 | 0.893 | 0.912 | 0.908 | 0.912 |
cg00000165 | 0.221 | 0.208 | NaN | 0.185 | NaN | NaN | NaN | NaN | 0.160 | 0.209 | ... | NaN | 0.209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cg00000236 | 0.812 | 0.881 | 0.867 | 0.888 | 0.843 | 0.852 | 0.889 | 0.815 | 0.809 | 0.887 | ... | 0.899 | 0.888 | 0.902 | 0.889 | 0.875 | 0.836 | 0.884 | 0.868 | 0.882 | 0.874 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ch.X.97129969R | NaN | 0.088 | NaN | NaN | 0.112 | NaN | 0.102 | NaN | NaN | 0.118 | ... | NaN | 0.110 | 0.103 | NaN | NaN | NaN | 0.104 | 0.077 | NaN | NaN |
ch.X.97133160R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.97651759F | 0.032 | 0.030 | 0.044 | 0.031 | 0.029 | 0.043 | 0.024 | 0.038 | 0.038 | 0.024 | ... | 0.028 | 0.025 | 0.032 | 0.039 | 0.041 | 0.042 | 0.027 | 0.027 | 0.022 | 0.043 |
ch.X.97737721F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.98007042R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
485512 rows × 542 columns
There are a few things to note with the MDS plot shown above. We’ve color-coded the samples in the plot by their poobah value failure rates; the samples with the highest failure rates (yellow, orange) tend to separate from the two clusters.
The other thing to note is that there are two clusters. With our color-coding scheme, we can only hope that these are the two groups (Parkinson’s Disease patients vs. controls). So we need to adjust our color-coding to examine that possibility!
[4]:
metadata['disease state'].value_counts()
[4]:
Parkinson's disease (PD) 335
PD-free control 237
Name: disease state, dtype: int64
[5]:
disease_state = metadata[['Sample_ID', 'disease state']]
disease_labels = dict(zip(disease_state['Sample_ID'], disease_state['disease state']))
There’s a small error in the metadata where one of the rows has an empty sample ID column, which will cause an error when we try to plot the mds based on the labels. See below how we call an empty string key from the labels dictionary and still get a value for the disease state. So we’ll make sure to drop that key/value pair out from our dictionary before we continue.
[6]:
disease_labels['']
[6]:
"Parkinson's disease (PD)"
[7]:
del disease_labels['']
Now our dictionary won’t error when we try to label samples in our MDS plot, so we can proceed with plotting the control vs. PD color-coding.
[8]:
methylcheck.beta_mds_plot(df, filter_stdev=2, labels=disease_labels)
INFO:methylcheck.samples.postprocessQC:81518.8 probe(s) [avg per sample] were missing values and removed from MDS calculations; 485512 remaining.
571 original samples; 542 after filtering
Your scale factor was: 2
[8]:
3999979001_R01C01 | 3999979001_R01C02 | 3999979001_R02C01 | 3999979001_R02C02 | 3999979001_R03C01 | 3999979001_R03C02 | 3999979001_R04C01 | 3999979001_R04C02 | 3999979001_R05C01 | 3999979001_R05C02 | ... | 9721367028_R01C02 | 9721367028_R02C02 | 9721367028_R03C01 | 9721367028_R03C02 | 9721367028_R04C01 | 9721367028_R04C02 | 9721367028_R05C01 | 9721367028_R05C02 | 9721367028_R06C01 | 9721367028_R06C02 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IlmnID | |||||||||||||||||||||
cg00000029 | 0.586 | 0.553 | 0.586 | 0.551 | 0.533 | 0.698 | 0.713 | 0.413 | 0.367 | 0.655 | ... | 0.706 | 0.719 | 0.521 | 0.643 | 0.531 | 0.578 | 0.518 | 0.606 | NaN | 0.532 |
cg00000108 | 0.978 | 0.974 | 0.972 | 0.973 | 0.968 | 0.973 | 0.971 | 0.967 | 0.970 | 0.976 | ... | 0.969 | 0.974 | 0.970 | 0.973 | 0.972 | 0.968 | 0.969 | 0.975 | 0.966 | 0.968 |
cg00000109 | 0.917 | 0.925 | 0.905 | 0.934 | 0.922 | 0.920 | 0.913 | 0.918 | 0.933 | 0.923 | ... | 0.917 | 0.929 | 0.930 | 0.924 | 0.930 | 0.905 | 0.893 | 0.912 | 0.908 | 0.912 |
cg00000165 | 0.221 | 0.208 | NaN | 0.185 | NaN | NaN | NaN | NaN | 0.160 | 0.209 | ... | NaN | 0.209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cg00000236 | 0.812 | 0.881 | 0.867 | 0.888 | 0.843 | 0.852 | 0.889 | 0.815 | 0.809 | 0.887 | ... | 0.899 | 0.888 | 0.902 | 0.889 | 0.875 | 0.836 | 0.884 | 0.868 | 0.882 | 0.874 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ch.X.97129969R | NaN | 0.088 | NaN | NaN | 0.112 | NaN | 0.102 | NaN | NaN | 0.118 | ... | NaN | 0.110 | 0.103 | NaN | NaN | NaN | 0.104 | 0.077 | NaN | NaN |
ch.X.97133160R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.97651759F | 0.032 | 0.030 | 0.044 | 0.031 | 0.029 | 0.043 | 0.024 | 0.038 | 0.038 | 0.024 | ... | 0.028 | 0.025 | 0.032 | 0.039 | 0.041 | 0.042 | 0.027 | 0.027 | 0.022 | 0.043 |
ch.X.97737721F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.98007042R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
485512 rows × 542 columns
It appears that disease state isn’t responsible for the clustering we’re seeing here, based on these results. The PD and control patients are evenly interspersed among the two clusters. There might be some other attribute in the metadata responsible for this, though. The best candidate would be something with a binary result in this dataset, like patient gender. So let’s examine what the MDS plot looks like when color-coding by patient gender.
[9]:
gender = metadata[['Sample_ID', 'gender']]
gender_labels = dict(zip(gender['Sample_ID'], gender['gender']))
del gender_labels['']
methylcheck.beta_mds_plot(df, filter_stdev=2, labels=gender_labels)
INFO:methylcheck.samples.postprocessQC:81518.8 probe(s) [avg per sample] were missing values and removed from MDS calculations; 485512 remaining.
571 original samples; 542 after filtering
Your scale factor was: 2
[9]:
3999979001_R01C01 | 3999979001_R01C02 | 3999979001_R02C01 | 3999979001_R02C02 | 3999979001_R03C01 | 3999979001_R03C02 | 3999979001_R04C01 | 3999979001_R04C02 | 3999979001_R05C01 | 3999979001_R05C02 | ... | 9721367028_R01C02 | 9721367028_R02C02 | 9721367028_R03C01 | 9721367028_R03C02 | 9721367028_R04C01 | 9721367028_R04C02 | 9721367028_R05C01 | 9721367028_R05C02 | 9721367028_R06C01 | 9721367028_R06C02 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IlmnID | |||||||||||||||||||||
cg00000029 | 0.586 | 0.553 | 0.586 | 0.551 | 0.533 | 0.698 | 0.713 | 0.413 | 0.367 | 0.655 | ... | 0.706 | 0.719 | 0.521 | 0.643 | 0.531 | 0.578 | 0.518 | 0.606 | NaN | 0.532 |
cg00000108 | 0.978 | 0.974 | 0.972 | 0.973 | 0.968 | 0.973 | 0.971 | 0.967 | 0.970 | 0.976 | ... | 0.969 | 0.974 | 0.970 | 0.973 | 0.972 | 0.968 | 0.969 | 0.975 | 0.966 | 0.968 |
cg00000109 | 0.917 | 0.925 | 0.905 | 0.934 | 0.922 | 0.920 | 0.913 | 0.918 | 0.933 | 0.923 | ... | 0.917 | 0.929 | 0.930 | 0.924 | 0.930 | 0.905 | 0.893 | 0.912 | 0.908 | 0.912 |
cg00000165 | 0.221 | 0.208 | NaN | 0.185 | NaN | NaN | NaN | NaN | 0.160 | 0.209 | ... | NaN | 0.209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cg00000236 | 0.812 | 0.881 | 0.867 | 0.888 | 0.843 | 0.852 | 0.889 | 0.815 | 0.809 | 0.887 | ... | 0.899 | 0.888 | 0.902 | 0.889 | 0.875 | 0.836 | 0.884 | 0.868 | 0.882 | 0.874 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ch.X.97129969R | NaN | 0.088 | NaN | NaN | 0.112 | NaN | 0.102 | NaN | NaN | 0.118 | ... | NaN | 0.110 | 0.103 | NaN | NaN | NaN | 0.104 | 0.077 | NaN | NaN |
ch.X.97133160R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.97651759F | 0.032 | 0.030 | 0.044 | 0.031 | 0.029 | 0.043 | 0.024 | 0.038 | 0.038 | 0.024 | ... | 0.028 | 0.025 | 0.032 | 0.039 | 0.041 | 0.042 | 0.027 | 0.027 | 0.022 | 0.043 |
ch.X.97737721F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.X.98007042R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
485512 rows × 542 columns
It looks like this might be the culprit of the two clusters in the MDS plot! There are a few outliers, but for the most part, the two clusters are made up of one gender each.
Utilizing one of the functions covered in our Filtering Probes section, we can pull out the beta values from X/Y probes and just examine beta values from probes on the autosomal chromosomes. If MDS is picking up on differences between patient sex and clustering the patients accordingly, we might be able to get at an underlying difference by removing sex from the equation.
[11]:
no_sex_probes = methylcheck.exclude_sex_control_probes(df, array='450k', no_sex=True, no_control=True)
methylcheck.beta_mds_plot(no_sex_probes, filter_stdev=2, labels=gender_labels)
INFO:methylcheck.samples.postprocessQC:79013.0 probe(s) [avg per sample] were missing values and removed from MDS calculations; 473864 remaining.
571 original samples; 539 after filtering
Your scale factor was: 2
[11]:
3999979001_R01C01 | 3999979001_R01C02 | 3999979001_R02C01 | 3999979001_R02C02 | 3999979001_R03C01 | 3999979001_R03C02 | 3999979001_R04C01 | 3999979001_R05C01 | 3999979001_R05C02 | 3999979001_R06C01 | ... | 9721367028_R01C02 | 9721367028_R02C02 | 9721367028_R03C01 | 9721367028_R03C02 | 9721367028_R04C01 | 9721367028_R04C02 | 9721367028_R05C01 | 9721367028_R05C02 | 9721367028_R06C01 | 9721367028_R06C02 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IlmnID | |||||||||||||||||||||
cg00000029 | 0.586 | 0.553 | 0.586 | 0.551 | 0.533 | 0.698 | 0.713 | 0.367 | 0.655 | 0.695 | ... | 0.706 | 0.719 | 0.521 | 0.643 | 0.531 | 0.578 | 0.518 | 0.606 | NaN | 0.532 |
cg00000108 | 0.978 | 0.974 | 0.972 | 0.973 | 0.968 | 0.973 | 0.971 | 0.970 | 0.976 | 0.976 | ... | 0.969 | 0.974 | 0.970 | 0.973 | 0.972 | 0.968 | 0.969 | 0.975 | 0.966 | 0.968 |
cg00000109 | 0.917 | 0.925 | 0.905 | 0.934 | 0.922 | 0.920 | 0.913 | 0.933 | 0.923 | 0.925 | ... | 0.917 | 0.929 | 0.930 | 0.924 | 0.930 | 0.905 | 0.893 | 0.912 | 0.908 | 0.912 |
cg00000165 | 0.221 | 0.208 | NaN | 0.185 | NaN | NaN | NaN | 0.160 | 0.209 | 0.265 | ... | NaN | 0.209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cg00000236 | 0.812 | 0.881 | 0.867 | 0.888 | 0.843 | 0.852 | 0.889 | 0.809 | 0.887 | 0.880 | ... | 0.899 | 0.888 | 0.902 | 0.889 | 0.875 | 0.836 | 0.884 | 0.868 | 0.882 | 0.874 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ch.9.98937537R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98957343R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98959675F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98989607R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.991104F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
473864 rows × 539 columns
We lost the separation between clusters as a result of pulling out sex probes. There’s still a chance that there is some kind of grouping going on with PD/control patients (for example, the top half of the data is all PD patients, while the bottom half is all control patients). So we’ll color-code by disease state to verify whether or not there’s any grouping going on.
[12]:
methylcheck.beta_mds_plot(no_sex_probes, filter_stdev=2, labels=disease_labels)
INFO:methylcheck.samples.postprocessQC:79013.0 probe(s) [avg per sample] were missing values and removed from MDS calculations; 473864 remaining.
571 original samples; 539 after filtering
Your scale factor was: 2
[12]:
3999979001_R01C01 | 3999979001_R01C02 | 3999979001_R02C01 | 3999979001_R02C02 | 3999979001_R03C01 | 3999979001_R03C02 | 3999979001_R04C01 | 3999979001_R05C01 | 3999979001_R05C02 | 3999979001_R06C01 | ... | 9721367028_R01C02 | 9721367028_R02C02 | 9721367028_R03C01 | 9721367028_R03C02 | 9721367028_R04C01 | 9721367028_R04C02 | 9721367028_R05C01 | 9721367028_R05C02 | 9721367028_R06C01 | 9721367028_R06C02 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IlmnID | |||||||||||||||||||||
cg00000029 | 0.586 | 0.553 | 0.586 | 0.551 | 0.533 | 0.698 | 0.713 | 0.367 | 0.655 | 0.695 | ... | 0.706 | 0.719 | 0.521 | 0.643 | 0.531 | 0.578 | 0.518 | 0.606 | NaN | 0.532 |
cg00000108 | 0.978 | 0.974 | 0.972 | 0.973 | 0.968 | 0.973 | 0.971 | 0.970 | 0.976 | 0.976 | ... | 0.969 | 0.974 | 0.970 | 0.973 | 0.972 | 0.968 | 0.969 | 0.975 | 0.966 | 0.968 |
cg00000109 | 0.917 | 0.925 | 0.905 | 0.934 | 0.922 | 0.920 | 0.913 | 0.933 | 0.923 | 0.925 | ... | 0.917 | 0.929 | 0.930 | 0.924 | 0.930 | 0.905 | 0.893 | 0.912 | 0.908 | 0.912 |
cg00000165 | 0.221 | 0.208 | NaN | 0.185 | NaN | NaN | NaN | 0.160 | 0.209 | 0.265 | ... | NaN | 0.209 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cg00000236 | 0.812 | 0.881 | 0.867 | 0.888 | 0.843 | 0.852 | 0.889 | 0.809 | 0.887 | 0.880 | ... | 0.899 | 0.888 | 0.902 | 0.889 | 0.875 | 0.836 | 0.884 | 0.868 | 0.882 | 0.874 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ch.9.98937537R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98957343R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98959675F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.98989607R | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ch.9.991104F | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
473864 rows × 539 columns
It appears that the differences between these two disease states aren’t being picked up by MDS in a meaningful way. We’ll do one more MDS plot on these filtered probes to check how many of those outlier samples have high detection p-value failure rates.
Also, we can define a variable here to access the samples that pass after they’ve been filtered by MDS. Then, we can compare that beta distribution to the original beta distribution to see how the MDS filtering affected it.
[13]:
mds_filtered = methylcheck.beta_mds_plot(no_sex_probes, filter_stdev=1.5, poobah=filepath)
INFO:methylcheck.samples.postprocessQC:79013.0 probe(s) [avg per sample] were missing values and removed from MDS calculations; 473864 remaining.
571 original samples; 448 after filtering
Your scale factor was: 1.5
[14]:
methylcheck.mean_beta_compare(df, mds_filtered)
The difference is a bit subtle since even the low quality data in this dataset is relatively high quality, but you can see that the green curve (MDS filtered samples) has higher and tighter peaks than the blue curve (unfiltered data). This means we were able to eliminate some of the samples that had a lot of beta values falling in that midrange between the peaks.