greatpy


Implementation of GREAT in Python

Installation

You need to have Python 3.8 or newer installed on your system. If you don't have Python installed, we recommend installing Miniconda (https://docs.conda.io/en/latest/miniconda.html).

To install the latest development version of greatpy:

pip install git+https://github.com/theislab/greatpy.git@main

Notebooks

Information              Link
Create regdom            notebook
Enrichment               notebook
Plot                     notebook
Comparison with GREAT    notebook

Getting started

Please refer to the documentation.

What is GREAT?

GREAT (Genomic Regions Enrichment of Annotations Tool) is a bioinformatics tool that associates genomic regions with their most probable GO terms.

What can you do with greatpy?

greatpy is strongly inspired by GREAT and aims to give Helmholtz a stable, long-lived, and up-to-date Python version of the tool.

Figure taken from the GREAT article

1. Create regulatory domains from TSS

  • Build the regulatory domain of each gene from two tab-separated .bed files:

    • The TSS file should have the following columns: chromosome_number \t position \t strand \t gene_name.

    • The chromosome size file should have the following columns: chromosome_number \t chromosome_size.
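
The two layouts above can be illustrated with a small sketch that writes and re-reads both files using only the standard library. The gene names, coordinates, and file names are made up for the example, and writing the files without a header row is an assumption, since the expected header handling is not stated here.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows; gene names and positions are illustrative only.
tss_rows = [
    ("chr1", 11869, "+", "GENE_A"),  # chromosome_number, position, strand, gene_name
    ("chr1", 29570, "-", "GENE_B"),
]
chr_size_rows = [
    ("chr1", 248956422),  # chromosome_number, chromosome_size
]


def write_tsv(path, rows):
    """Write rows as tab-separated values, one record per line, no header."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def read_tsv(path):
    """Read a tab-separated file back into a list of string lists."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


out_dir = Path(tempfile.mkdtemp())
tss_path = out_dir / "tss.bed"
chr_size_path = out_dir / "chr_size.bed"
write_tsv(tss_path, tss_rows)
write_tsv(chr_size_path, chr_size_rows)

tss_back = read_tsv(tss_path)
chr_back = read_tsv(chr_size_path)
```

Paths produced this way could then be passed as tss_file and chr_sizes_file in the call below.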

regdom = greatpy.tl.create_regdom(
    tss_file=Input_TSS_path,  # eg : "../data/human/hg38/tss.bed"
    chr_sizes_file=Input_chromosome_size_path,  # eg : "../data/human/hg38/chr_size.bed"
    association_rule="Basalplusextention",
    out_path=path_save_output,
)

The association_rule parameter can take one of the following values:

  • Basalplusextention

  • OneCloset

  • TwoCloset

_images/association_rule.jpg
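
To make the basal-plus-extension rule more concrete, here is a minimal sketch of the idea as described in the GREAT article: each gene gets a basal domain around its TSS, which is then extended in both directions up to the neighbouring gene's basal domain, without exceeding a maximum extension. This is not greatpy's implementation. The function names, the 5 kb / 1 kb / 1 Mb defaults, and the use of the immediate neighbours in TSS order are assumptions made for the illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    tss: int
    strand: str  # "+" or "-"


def basal_domain(gene, chr_size, upstream=5_000, downstream=1_000):
    """Strand-aware basal domain around the TSS, clipped to the chromosome."""
    if gene.strand == "+":
        start, end = gene.tss - upstream, gene.tss + downstream
    else:
        start, end = gene.tss - downstream, gene.tss + upstream
    return max(0, start), min(chr_size, end)


def basal_plus_extension(genes, chr_size, max_extension=1_000_000):
    """Return {gene name: (start, end)} for genes located on ONE chromosome."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g, chr_size) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        # Extension stops at the neighbouring basal domains (or chromosome ends).
        prev_end = basal[i - 1][1] if i > 0 else 0
        next_start = basal[i + 1][0] if i + 1 < len(genes) else chr_size
        # The basal domain is always kept, even if neighbours overlap it.
        start = min(basal[i][0], max(gene.tss - max_extension, prev_end))
        end = max(basal[i][1], min(gene.tss + max_extension, next_start))
        domains[gene.name] = (max(0, start), min(chr_size, end))
    return domains


# Hypothetical genes on a 5 Mb chromosome.
example = basal_plus_extension(
    [Gene("A", 1_000_000, "+"), Gene("B", 1_010_000, "-"), Gene("C", 4_000_000, "+")],
    chr_size=5_000_000,
)
```

In this example, gene A extends left to the chromosome start and right only up to the basal domain of B, while B and C are limited on one side by the 1 Mb maximum extension.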

2. Get enrichment of GO terms in the test genomic regions

  • Analyzes the significance of proximal and distal cis-regulatory regions in the genome.

  • The following files are used as input:

    • The test file should have the following columns: chr \t chr_start \t chr_end.

    • The regulatory domain file should have the following columns: chr \t chr_start \t chr_end \t name \t tss \t strand.

    • The chromosome size file should have the following columns: chromosome_number \t chromosome_size.

    • The annotation file should have the following columns: ensembl \t id \t name \t ontology.group \t gene.name \t symbol.

res = greatpy.tl.enrichment(
    test_file=Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_file=regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    chr_size_file=chromosome_size_path_or_df,  # eg : "../data/human/hg38/chr_size.bed"
    annotation_file=annotation_path_or_df,  # eg : "../data/human/ontologies.csv"
)

This function also accepts optional arguments, such as:

  • binom (default True): should the binomial p-value be calculated?

  • hypergeom (default True): should the hypergeometric p-value be computed?
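
For intuition about what these two p-values measure, the sketch below computes both upper-tail probabilities with the standard library only. It follows the definitions in the GREAT article (a binomial test over genomic regions and a hypergeometric test over genes) and is not greatpy's own code. The function and parameter names are hypothetical.

```python
from math import comb


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    n: number of test regions
    p: fraction of the genome covered by the regulatory domains of genes
       annotated with the GO term
    k: number of test regions falling in those domains
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeom_sf(k, total, annotated, drawn):
    """P(X >= k) when `drawn` genes are picked among `total` genes.

    total: number of genes in the genome
    annotated: number of genes annotated with the GO term
    drawn: number of genes associated with at least one test region
    k: number of those genes that carry the GO term
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total, drawn)


# Toy numbers, chosen only for illustration.
p_binom = binom_sf(4, 10, 0.1)
p_hyper = hypergeom_sf(2, 20, 5, 4)
```

For large inputs, a numerical library such as SciPy would be preferable to exact integer sums.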

It is then possible to apply a Bonferroni and/or FDR correction to the found p-values:

res = greatpy.tl.set_fdr(res)
res = greatpy.tl.set_bonferroni(res)
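
As a reminder of what such corrections do, here is a minimal standalone sketch of the Bonferroni correction and of the Benjamini-Hochberg procedure. That the FDR correction applied by greatpy is Benjamini-Hochberg is an assumption, and the function names below are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

Bonferroni controls the probability of any false positive and is the more conservative of the two, while the FDR correction controls the expected proportion of false positives among the reported terms.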

3. Plot

1. Genomic distribution of the data

  • Number of associated genes per genomic region

  • Distance to the associated gene TSS for each genomic region studied

  • Absolute distance to the associated gene TSS for each genomic region studied

import matplotlib.pyplot as plt

import greatpy

# One panel per plot, drawn side by side.
fig, ax = plt.subplots(1, 3, figsize=(30, 8))
greatpy.pl.graph_nb_asso_per_peaks(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[0],
)
greatpy.pl.graph_dist_tss(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[1],
)
greatpy.pl.graph_absolute_dist_tss(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[2],
)
plt.show()

_images/plot1.png
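
The three quantities shown in these plots can be sketched in plain Python as follows. This is only an illustration of the idea, not what the greatpy plotting functions do internally. The regulatory domains and regions are made up, and associating a region with a gene through the region midpoint is a simplifying assumption.

```python
# Hypothetical regulatory domains: (gene, domain_start, domain_end, tss, strand).
regdom = [
    ("GENE_A", 0, 1_009_000, 1_000_000, "+"),
    ("GENE_B", 1_001_000, 2_010_000, 1_010_000, "-"),
]
# Hypothetical test regions: (start, end), all on the same chromosome.
regions = [(999_000, 999_200), (1_005_000, 1_005_400), (2_500_000, 2_500_100)]


def signed_distance(midpoint, tss, strand):
    """Negative values lie upstream of the TSS, positive values downstream."""
    offset = midpoint - tss
    return offset if strand == "+" else -offset


def associate(region, domains):
    """List (gene, signed distance) for each domain containing the midpoint."""
    midpoint = (region[0] + region[1]) // 2
    return [
        (gene, signed_distance(midpoint, tss, strand))
        for gene, start, end, tss, strand in domains
        if start <= midpoint < end
    ]


hits = [associate(region, regdom) for region in regions]
counts = [len(h) for h in hits]            # associated genes per region
signed = [d for h in hits for _, d in h]   # distance to the associated TSS
absolute = [abs(d) for d in signed]        # absolute distance to the TSS
```

Here the second region falls in two overlapping regulatory domains and is therefore counted twice in the distance distributions, while the third region has no associated gene.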
2. Enrichment by GO terms (dotplot), one input

# enrichment_df holds the enrichment results, e.g. `res` from step 2.
plot = enrichment_df.rename(columns={"binom_p_value": "p_value", "go_term": "name"})
plt.figure(figsize=(10, 10))
greatpy.pl.plot_enrich(plot)

_images/dotplot.png

3. Enrichment by GO terms (dotplot), multiple inputs

test = ["name_bindome_biosample_1", "name_bindome_biosample_2", "..."]
tmp_df = greatpy.tl.enrichment_multiple(
    tests=test,
    regdom_file="../data/human/hg38/regulatory_domain.bed",
    chr_size_file="../data/human/hg38/chr_size.bed",
    annotation_file="../data/human/ontologies.csv",
    binom=True,
    hypergeom=True,
)

_images/multidot.png

Note

Both tests (binomial and hypergeometric) are susceptible to biases that must be kept in mind when interpreting the results.

  • The hypergeometric test may be biased by the size of the gene regulatory domains: isolated genes have very large regulatory domains and are therefore more likely to generate false positives.

  • The binomial test may be biased when a large number of the tested genomic regions are associated with a small set of genes, which can also generate false positives.

These biases partially offset each other. The binomial test reduces the hypergeometric bias by accounting for the exact size of each gene's regulatory domain, and the hypergeometric test compensates for the binomial bias by counting each gene only once. The two tests are complementary and should be interpreted together before drawing conclusions.

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Citation

To cite greatpy:

@software{greatpy,
  author = {Ibarra, Mauger-Birocheau},
  doi = {},
  month = {},
  title = {{greatpy}},
  url = {https://github.com/theislab/greatpy},
  year = {2022}
}

References

References are available on the references page.