greatpy

Implementation of GREAT in Python

Installation

You need to have Python 3.8 or newer installed on your system. If you don’t have Python installed, we recommend installing Miniconda (https://docs.conda.io/en/latest/miniconda.html).

There are several alternative options to install greatpy:

Install the latest development version:

pip install git+https://github.com/theislab/greatpy.git@main

Notebook

Information	Link
Create regdom	notebook
Enrichment	notebook
Plot	notebook
Comparison with GREAT	notebook

Getting started

Please refer to the documentation, in particular the API documentation.

What is GREAT?

GREAT (Genomic Regions Enrichment of Annotations Tool) is a bioinformatics method that associates genomic regions with their most probable GO terms.

What can you do with greatpy?

This package is strongly inspired by GREAT and gives Helmholtz a stable, long-lived and up-to-date version of the tool.

Figure: GREAT, taken from the GREAT article

1. Create regulatory domains from TSS

Build regulatory domains from a TSS file and a chromosome size file, both in .bed format:
- The TSS file should have the following columns: chromosome_number \t position \t strand \t gene_name.
- The chromosome size file should have the following columns: chromosome_number \t chromosome_size.

regdom = greatpy.tl.create_regdom(
    tss_file=Input_TSS_path,  # eg : "../data/human/hg38/tss.bed"
    chr_sizes_file=Input_chromosome_size_path,  # eg : "../data/human/hg38/chr_size.bed"
    association_rule="Basalplusextention",
    out_path=path_save_output,
)

The association_rule parameter accepts one of the following values:

Basalplusextention
OneCloset
TwoCloset
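
For intuition, here is a minimal sketch of the basal-plus-extension rule as it is described in the GREAT article: each gene gets a basal domain around its TSS, which is then extended in both directions until it meets the basal domain of another gene or reaches a maximum distance. This is not greatpy's implementation. The Gene class, the function names and the default sizes (5 kb upstream, 1 kb downstream, 1 Mb maximum extension measured from the TSS) are assumptions made for this illustration, and only a single chromosome is handled.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    tss: int      # position of the transcription start site
    strand: str   # "+" or "-"


def basal_domain(gene, chrom_size, upstream=5_000, downstream=1_000):
    """Basal domain around the TSS, oriented by strand (assumed defaults)."""
    if gene.strand == "+":
        start, end = gene.tss - upstream, gene.tss + downstream
    else:
        start, end = gene.tss - downstream, gene.tss + upstream
    return max(0, start), min(chrom_size, end)


def basal_plus_extension(genes, chrom_size, max_extension=1_000_000):
    """Regulatory domain per gene on a single chromosome (simplified sketch)."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g, chrom_size) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        start, end = basal[i]
        # Nearest basal-domain boundaries of the other genes on each side.
        left_limit = max((b[1] for b in basal[:i]), default=0)
        right_limit = min((b[0] for b in basal[i + 1:]), default=chrom_size)
        # Extend from the TSS, capped by max_extension and by the neighbours.
        ext_start = max(gene.tss - max_extension, left_limit, 0)
        ext_end = min(gene.tss + max_extension, right_limit, chrom_size)
        # The basal domain itself is always kept.
        domains[gene.name] = (min(start, ext_start), max(end, ext_end))
    return domains


genes = [Gene("A", 10_000, "+"), Gene("B", 50_000, "+")]
domains = basal_plus_extension(genes, chrom_size=100_000)
print(domains)
```

In this toy example the domain of gene A stops at position 45000, where the basal domain of gene B begins, and the domain of gene B starts at 11000, where the basal domain of gene A ends.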

2. Get the enrichment of GO terms in the test genomic regions

Analyzes the significance of proximal and distal cis-regulatory regions in the genome.
The following input files are required:
- The test file should have the following columns: chr \t chr_start \t chr_end.
- The regulatory domain file should have the following columns: chr \t chr_start \t chr_end \t name \t tss \t strand.
- The chromosome size file should have the following columns: chromosome_number \t chromosome_size.
- The annotation file should have the following columns: ensembl \t id \t name \t ontology.group \t gene.name \t symbol.
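
As a quick sanity check before running the enrichment, the three-column layout of the test file can be verified with the standard library. The helper below is a hypothetical sketch and not part of greatpy; it only assumes the tab-separated chr, chr_start, chr_end layout described above.

```python
import csv
import io


def read_test_regions(handle):
    """Parse tab-separated test regions into (chr, start, end) tuples."""
    regions = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < 3:
            raise ValueError(f"line {line_no}: expected chr, chr_start, chr_end")
        chrom, start, end = row[0], int(row[1]), int(row[2])
        if start > end:
            raise ValueError(f"line {line_no}: chr_start is greater than chr_end")
        regions.append((chrom, start, end))
    return regions


example = "chr1\t1000\t1500\nchr2\t200\t900\n"
regions = read_test_regions(io.StringIO(example))
print(regions)
```

The same function works on a real file by passing an open file handle instead of the in-memory example.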

res = greatpy.tl.enrichment(
    test_file=Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_file=regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    chr_size_file=chromosome_size_path_or_df,  # eg : "../data/human/hg38/chr_size.bed"
    annotation_file=annotation_path_or_df,  # eg : "../data/human/ontologies.csv"
)

This function accepts additional arguments, such as:

binom (default True): should the binomial p-value be calculated?
hypergeom (default True): should the hypergeometric p-value be computed?

It is then possible to apply a Bonferroni and/or FDR correction to the resulting p-values:

res = greatpy.tl.set_fdr(res)
res = greatpy.tl.set_bonferroni(res)
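
To make the two corrections concrete, here is a minimal pure-Python sketch on a plain list of p-values. It is an illustration only: the text above does not say which FDR procedure greatpy uses, so the Benjamini-Hochberg procedure is assumed here, and both function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.5]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the probability of any false positive and is the stricter of the two, while the FDR correction controls the expected share of false positives among the reported terms.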

3. Plot

1 Genomic distribution of the data

Number of associated genes per genomic region
Distance to the associated gene TSS for each genomic region studied
Absolute distance to the associated gene TSS for each genomic region studied

import matplotlib.pyplot as plt

import greatpy

fig, ax = plt.subplots(1, 3, figsize=(30, 8))
# One panel per plot, from left to right
greatpy.pl.graph_nb_asso_per_peaks(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[0],
)
greatpy.pl.graph_dist_tss(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[1],
)
greatpy.pl.graph_absolute_dist_tss(
    Input_path_or_df,  # eg : "../data/tests/test_data/input/10_MAX.bed"
    regdom_path_or_df,  # eg : "../data/human/hg38/regdom.bed"
    ax[2],
)
plt.show()
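
The distances shown in these plots can be understood with a small sketch. It is not greatpy's code: it assumes the distance is measured from the midpoint of the region to the TSS, and that negative values mean the region lies upstream of the TSS with respect to the gene's strand. The function name is hypothetical.

```python
def signed_distance_to_tss(region_start, region_end, tss, strand):
    """Distance from the region midpoint to the TSS, oriented by strand."""
    midpoint = (region_start + region_end) // 2
    offset = midpoint - tss
    # On the minus strand, upstream lies at higher coordinates, so flip the sign.
    return offset if strand == "+" else -offset


dist_plus = signed_distance_to_tss(900, 1100, tss=1500, strand="+")
dist_minus = signed_distance_to_tss(900, 1100, tss=1500, strand="-")
print(dist_plus, dist_minus, abs(dist_plus))
```

The absolute distance is simply the absolute value of the signed distance, which discards the upstream or downstream orientation.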

2 Enrichments by GO terms (dotplot) - one input

plot = enrichment_df.rename(columns={"binom_p_value": "p_value", "go_term": "name"})
plt.figure(figsize=(10, 10))
greatpy.pl.plot_enrich(plot)

3 Enrichments by GO terms (dotplot) - multiple inputs

test = ["name_bindome_biosample_1", "name_bindome_biosample_2", "..."]
tmp_df = greatpy.tl.enrichment_multiple(
    tests=test,
    regdom_file="../data/human/hg38/regulatory_domain.bed",
    chr_size_file="../data/human/hg38/chr_size.bed",
    annotation_file="../data/human/ontologies.csv",
    binom=True,
    hypergeom=True,
)

Note

Both tests (binomial and hypergeometric) are susceptible to certain biases, which should be kept in mind when interpreting the results.

The hypergeometric test may be biased by the size of the regulatory domains of the genes: isolated genes have very large regulatory domains and are therefore more likely to generate false positives.
The binomial test may be biased when a large number of the tested genomic regions are associated with a small set of genes, which can also generate false positives.

These biases partially compensate each other. The binomial test reduces the hypergeometric bias by taking the exact size of the regulatory domains into account, and the hypergeometric test compensates for the binomial bias by counting each gene only once. The two tests are complementary and must be analyzed together before drawing conclusions.
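
The difference between the two tests can be sketched with the standard library alone. The code below follows the general formulation in the GREAT article, where the binomial test counts regions and the hypergeometric test counts genes. It is not greatpy's implementation, and the function and parameter names are assumptions made for this illustration.

```python
from math import comb


def binomial_p_value(n_regions, k_hits, p_term):
    """P(X >= k_hits) for X ~ Binomial(n_regions, p_term).

    p_term is the fraction of the genome covered by the regulatory domains
    of the genes annotated with the term, so domain size is accounted for.
    """
    return sum(
        comb(n_regions, i) * p_term**i * (1 - p_term) ** (n_regions - i)
        for i in range(k_hits, n_regions + 1)
    )


def hypergeometric_p_value(n_genes, n_term_genes, n_hit_genes, k_hit_term_genes):
    """P(X >= k_hit_term_genes) when drawing n_hit_genes out of n_genes.

    Every gene is counted once, however many regions fall in its domain.
    """
    upper = min(n_term_genes, n_hit_genes)
    favourable = sum(
        comb(n_term_genes, i) * comb(n_genes - n_term_genes, n_hit_genes - i)
        for i in range(k_hit_term_genes, upper + 1)
    )
    return favourable / comb(n_genes, n_hit_genes)


# 10 test regions, 3 of them in domains covering 10% of the genome
p_binom = binomial_p_value(n_regions=10, k_hits=3, p_term=0.1)
# 20 genes, 5 annotated with the term, 4 genes hit, 2 of them annotated
p_hyper = hypergeometric_p_value(20, 5, 4, 2)
print(p_binom, p_hyper)
```

A term that is significant in only one of the two tests is a hint that one of the biases described above may be at work.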

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Citation

To cite greatpy:

@software{greatpy,
  author = {Ibarra, Mauger-Birocheau},
  doi = {},
  month = {},
  title = {{greatpy}},
  url = {https://github.com/theislab/greatpy},
  year = {2022}
}

References

References are available on the references page.