enrichment function

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import greatpy as great
import pandas as pd
from math import inf
from numpy import log,nan, int64,cov,corrcoef
from scipy.stats import pearsonr
from seaborn import scatterplot as sp
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

Compute the p-values

[3]:
test = "../data/tests/test_data/input/03_srf_hg19.bed"
regdom = "../data/human/hg19/regulatory_domain.bed"
great_out = "../data/tests/test_data/output/03_srf_hg19_output_great_webserver.tsv"
size = "../data/human/hg19/chr_size.bed"
[4]:
enrichment = great.tl.enrichment(
    test_file = test,
    regdom_file = regdom,
    chr_size_file = size,
    annotation_file = "../data/human/ontologies.csv",
    binom = True,
    hypergeom = True,
    )
enrichment = great.tl.set_fdr(enrichment)
enrichment = great.tl.set_bonferroni(enrichment)
enrichment
[4]:
go_term binom_p_value binom_fold_enrichment hypergeom_p_value hypergeometric_fold_enrichment intersection_size recall binom_fdr hypergeom_fdr binom_bonferroni hypergeom_bonferroni
GO:0072749 cellular response to cytochalasin B 2.21968e-12 2.27251e+05 4.28032e-02 2.33627e+01 5 5.00000e+00 4.61693e-09 4.65089e-01 9.23386e-09 1.00000e+00
GO:0051623 positive regulation of norepinephrine uptake 2.21968e-12 2.27251e+05 4.28032e-02 2.33627e+01 5 5.00000e+00 4.61693e-09 4.65089e-01 9.23386e-09 1.00000e+00
GO:0098973 structural constituent of postsynaptic actin c... 2.11740e-10 9.10526e+04 1.60543e-01 5.84068e+00 5 1.25000e+00 2.93613e-07 5.43858e-01 8.80839e-07 1.00000e+00
GO:0097433 dense body 6.40085e-10 1.60618e+04 1.41783e-03 1.16814e+01 8 1.33333e+00 6.65688e-07 3.65408e-01 2.66275e-06 1.00000e+00
GO:0032796 uropod organization 2.69880e-09 5.45449e+04 1.82991e-03 2.33627e+01 5 2.50000e+00 1.93909e-06 3.65408e-01 1.12270e-05 1.00000e+00
... ... ... ... ... ... ... ... ... ... ... ...
GO:0007156 homophilic cell adhesion via plasma membrane a... 9.99960e-01 1.12747e+02 9.84651e-01 3.89379e-01 3 1.66667e-02 1.00000e+00 1.00000e+00 1.00000e+00 1.00000e+00
GO:0007186 G protein-coupled receptor signaling pathway 9.99992e-01 2.46644e+02 1.00000e+00 3.07404e-01 19 1.47059e-02 1.00000e+00 1.00000e+00 1.00000e+00 1.00000e+00
GO:0016021 integral component of membrane 9.99997e-01 3.94493e+02 9.99998e-01 7.05221e-01 112 2.88958e-02 1.00000e+00 1.00000e+00 1.00000e+00 1.00000e+00
GO:0005887 integral component of plasma membrane 1.00000e+00 2.81852e+02 9.99970e-01 5.84068e-01 40 2.50000e-02 1.00000e+00 1.00000e+00 1.00000e+00 1.00000e+00
GO:0005886 plasma membrane 1.00000e+00 3.98798e+02 1.00000e+00 2.55733e-01 159 1.06776e-02 1.00000e+00 1.00000e+00 1.00000e+00 1.00000e+00

4160 rows × 11 columns

Compare p values between binomial and hypergeometric tests

[5]:
fig,ax = plt.subplots(2,2,figsize = (20,10))
great.pl.scatterplot(
    enrichment,
    "binom_p_value",
    "hypergeom_p_value",
    minus_log10 = False,ax = ax[0,0])
great.pl.scatterplot(
    enrichment,
    "binom_p_value",
    "hypergeom_p_value",
    minus_log10 = True,ax = ax[0,1])
great.pl.scatterplot(
    enrichment,
    "binom_fdr",
    "hypergeom_fdr",
    minus_log10 = False,ax = ax[1,0])
great.pl.scatterplot(
    enrichment,
    "binom_fdr",
    "hypergeom_fdr",
    minus_log10 = True,ax = ax[1,1])
plt.show()
../_images/notebooks_02_binom_vs_hypergeom_7_0.png
[7]:
pd.options.display.float_format = "{:.3f}".format
pd.DataFrame({
        "binom_vs_hyper": [cov(m = list(enrichment["binom_p_value"]), y = list(enrichment["hypergeom_p_value"]))[0][1],pearsonr(list(enrichment["binom_p_value"]), list(enrichment["hypergeom_p_value"]))[0]],
        "binom_fdr_vs_hyper_fdr":[cov(m = list(enrichment["binom_fdr"]), y = list(enrichment["hypergeom_fdr"]))[0][1],pearsonr(list(enrichment["binom_fdr"]), list(enrichment["hypergeom_fdr"]))[0]]},
        index = ["cov","pearson"])
[7]:
binom_vs_hyper binom_fdr_vs_hyper_fdr
cov 0.062 0.030
pearson 0.773 0.758

The covariance between the binomial and hypergeometric values is very close to 0 (e-2). There is therefore no or little correlation between the variance of these two data

As we can see here, the correlation is quite good (~``0.75``) between the binomial and hypergeometric values returned by great.

We can explain this difference by the fact that the two statistical tests are not susceptible to the same biases and are therefore complementary.

The biases for the two tests are :

  • The hypergeometric test may be biased by the size of the regulatory domains of the genes since isolated genes have very large regulatory domains and are therefore more likely to generate false positives

  • The binomial test can also be biased if a large number of genomic regions to be tested are associated with a small set of genes that can also generate false positives