enrichment function

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import greatpy as great
import pandas as pd
from math import inf
from numpy import log,nan, int64,cov,corrcoef
from scipy.stats import pearsonr
from seaborn import scatterplot as sp
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

Compute the p-values

[3]:

test = "../data/tests/test_data/input/03_srf_hg19.bed"
regdom = "../data/human/hg19/regulatory_domain.bed"
great_out = "../data/tests/test_data/output/03_srf_hg19_output_great_webserver.tsv"
size = "../data/human/hg19/chr_size.bed"

[4]:

enrichment = great.tl.enrichment(
    test_file = test,
    regdom_file = regdom,
    chr_size_file = size,
    annotation_file = "../data/human/ontologies.csv",
    binom = True,
    hypergeom = True,
    )
enrichment = great.tl.set_fdr(enrichment)
enrichment = great.tl.set_bonferroni(enrichment)
enrichment

[4]:

	go_term	binom_p_value	binom_fold_enrichment	hypergeom_p_value	hypergeometric_fold_enrichment	intersection_size	recall	binom_fdr	hypergeom_fdr	binom_bonferroni	hypergeom_bonferroni
GO:0072749	cellular response to cytochalasin B	2.21968e-12	2.27251e+05	4.28032e-02	2.33627e+01	5	5.00000e+00	4.61693e-09	4.65089e-01	9.23386e-09	1.00000e+00
GO:0051623	positive regulation of norepinephrine uptake	2.21968e-12	2.27251e+05	4.28032e-02	2.33627e+01	5	5.00000e+00	4.61693e-09	4.65089e-01	9.23386e-09	1.00000e+00
GO:0098973	structural constituent of postsynaptic actin c...	2.11740e-10	9.10526e+04	1.60543e-01	5.84068e+00	5	1.25000e+00	2.93613e-07	5.43858e-01	8.80839e-07	1.00000e+00
GO:0097433	dense body	6.40085e-10	1.60618e+04	1.41783e-03	1.16814e+01	8	1.33333e+00	6.65688e-07	3.65408e-01	2.66275e-06	1.00000e+00
GO:0032796	uropod organization	2.69880e-09	5.45449e+04	1.82991e-03	2.33627e+01	5	2.50000e+00	1.93909e-06	3.65408e-01	1.12270e-05	1.00000e+00
...	...	...	...	...	...	...	...	...	...	...	...
GO:0007156	homophilic cell adhesion via plasma membrane a...	9.99960e-01	1.12747e+02	9.84651e-01	3.89379e-01	3	1.66667e-02	1.00000e+00	1.00000e+00	1.00000e+00	1.00000e+00
GO:0007186	G protein-coupled receptor signaling pathway	9.99992e-01	2.46644e+02	1.00000e+00	3.07404e-01	19	1.47059e-02	1.00000e+00	1.00000e+00	1.00000e+00	1.00000e+00
GO:0016021	integral component of membrane	9.99997e-01	3.94493e+02	9.99998e-01	7.05221e-01	112	2.88958e-02	1.00000e+00	1.00000e+00	1.00000e+00	1.00000e+00
GO:0005887	integral component of plasma membrane	1.00000e+00	2.81852e+02	9.99970e-01	5.84068e-01	40	2.50000e-02	1.00000e+00	1.00000e+00	1.00000e+00	1.00000e+00
GO:0005886	plasma membrane	1.00000e+00	3.98798e+02	1.00000e+00	2.55733e-01	159	1.06776e-02	1.00000e+00	1.00000e+00	1.00000e+00	1.00000e+00

4160 rows × 11 columns

Compare p values between binomial and hypergeometric tests

[5]:

fig,ax = plt.subplots(2,2,figsize = (20,10))
great.pl.scatterplot(
    enrichment,
    "binom_p_value",
    "hypergeom_p_value",
    minus_log10 = False,ax = ax[0,0])
great.pl.scatterplot(
    enrichment,
    "binom_p_value",
    "hypergeom_p_value",
    minus_log10 = True,ax = ax[0,1])
great.pl.scatterplot(
    enrichment,
    "binom_fdr",
    "hypergeom_fdr",
    minus_log10 = False,ax = ax[1,0])
great.pl.scatterplot(
    enrichment,
    "binom_fdr",
    "hypergeom_fdr",
    minus_log10 = True,ax = ax[1,1])
plt.show()

../_images/notebooks_02_binom_vs_hypergeom_7_0.png

[7]:

pd.options.display.float_format = "{:.3f}".format
pd.DataFrame({
        "binom_vs_hyper": [cov(m = list(enrichment["binom_p_value"]), y = list(enrichment["hypergeom_p_value"]))[0][1],pearsonr(list(enrichment["binom_p_value"]), list(enrichment["hypergeom_p_value"]))[0]],
        "binom_fdr_vs_hyper_fdr":[cov(m = list(enrichment["binom_fdr"]), y = list(enrichment["hypergeom_fdr"]))[0][1],pearsonr(list(enrichment["binom_fdr"]), list(enrichment["hypergeom_fdr"]))[0]]},
        index = ["cov","pearson"])

[7]:

	binom_vs_hyper	binom_fdr_vs_hyper_fdr
cov	0.062	0.030
pearson	0.773	0.758

The covariance between the binomial and hypergeometric values is very close to 0 (e-2). There is therefore no or little correlation between the variance of these two data

As we can see here, the correlation is quite good (~``0.75``) between the binomial and hypergeometric values returned by great.

We can explain this difference by the fact that the two statistical tests are not susceptible to the same biases and are therefore complementary.

The biases for the two tests are :

The hypergeometric test may be biased by the size of the regulatory domains of the genes since isolated genes have very large regulatory domains and are therefore more likely to generate false positives

The binomial test can also be biased if a large number of genomic regions to be tested are associated with a small set of genes that can also generate false positives