Overview of the methods
pyTEnrich compares the overlap between input BED files and transposable element (TE) families using a binomial model. The expected overlap is derived from the genome occupancy of each TE family and compared with the observed overlap to highlight potential over-representation of TE families or subfamilies.
Example
We consider a small genome with one gene composed of two exons, and two TE families. Three ChIP-seq peaks are detected in this genome.
Step 1: Compute genome occupancy
First, pyTEnrich computes the genome occupancy of the TE families and of the input BED files. Genome occupancy is defined as the total number of base pairs (bp) spanned by a set of intervals (for example, a TE family), divided by the genome size. It is therefore a proportion of the genome, between 0 and 1.
Note that TE genome occupancies are pre-computed for the provided TE database. If another TE database is given, or if a genome subset is provided (explained below), they are re-computed, which takes a few minutes.
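As an illustration of the definition above, here is a minimal sketch of an occupancy calculation in plain Python. It is not pyTEnrich's own code: the function name `genome_occupancy`, the toy coordinates, and the choice to merge overlapping intervals so that no base pair is counted twice are all assumptions made for this example.

```python
from collections import defaultdict


def genome_occupancy(intervals, genome_size):
    """Return the fraction of the genome covered by (chrom, start, end) intervals.

    Coordinates follow the BED convention (0-based start, exclusive end).
    Overlapping or adjacent intervals are merged first (an assumption of this
    sketch), so that each base pair is counted at most once.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    covered = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current span and start a new one.
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered / genome_size


# Hypothetical TE family on a toy genome of 1,000 bp.
te_family = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 600, 650)]
occupancy = genome_occupancy(te_family, genome_size=1000)
print(occupancy)
```

In this toy case the first two intervals merge into a single 150 bp span, so 200 bp out of 1,000 bp are covered.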
Step 2: Intersect TE and input BED files and count overlaps
Using bedtools intersect, we compute a stringent overlap between the input BED files and the TE database. The observed overlap can then be compared with the expected overlap.
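The counting step can be pictured with the following sketch, which uses plain Python rather than bedtools so that it runs without external tools. The exact stringency criterion is not specified in this section, so the `min_fraction` threshold below (the minimum share of a peak that must lie inside a single TE interval, similar in spirit to the `-f` option of bedtools intersect) is an assumption, as are the function name and the toy coordinates.

```python
def count_overlapping_peaks(peaks, te_intervals, min_fraction=0.5):
    """Count peaks with at least `min_fraction` of their length inside one TE interval.

    Both inputs are lists of (chrom, start, end) tuples in BED coordinates.
    Each peak is counted at most once, even if it overlaps several TEs.
    """
    hits = 0
    for p_chrom, p_start, p_end in peaks:
        peak_len = p_end - p_start
        for t_chrom, t_start, t_end in te_intervals:
            if p_chrom != t_chrom:
                continue
            # Number of shared base pairs (zero or negative means no overlap).
            shared = min(p_end, t_end) - max(p_start, t_start)
            if shared > 0 and shared >= min_fraction * peak_len:
                hits += 1
                break
    return hits


# Hypothetical peaks and TE family on a toy genome.
peaks = [("chr1", 120, 180), ("chr1", 240, 300), ("chr1", 800, 860)]
te_family = [("chr1", 100, 250), ("chr1", 600, 650)]
observed = count_overlapping_peaks(peaks, te_family)
print(observed)
```

With the default threshold only the first peak is counted, because the second peak shares just 10 of its 60 bp with the TE. Lowering `min_fraction` makes the overlap less stringent and would count that peak as well.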
Step 3: Compute the enrichment of TE subfamilies / families
The enrichment is assessed with a binomial test. The binomial test is an exact test of the statistical significance of deviations from a theoretically expected distribution when each trial has two possible outcomes. In our case, a TE either overlaps a peak (success) or does not overlap (failure). We can calculate the probability of observing at least k successes by summing the probabilities from k successes to n successes:

$$P(X \geq k) = \sum_{i=k}^{n} \binom{n}{i} \, p^{i} \, (1-p)^{n-i}$$
This probability is our p-value: the probability of observing at least k successes, given a probability p of overlap and n trials. The p-values obtained for all TE families are then adjusted with the Benjamini-Hochberg method to correct for multiple testing.
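The test and the correction described above can be sketched with the Python standard library alone. This is an illustrative sketch, not pyTEnrich's implementation: the function names, the family names `TE_A` and `TE_B`, and the counts are hypothetical. It also assumes that n is the number of input peaks and that p is the genome occupancy of the TE family, which this section does not state explicitly.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): sum the probabilities from k to n successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical inputs: 3 peaks; each family maps to (observed overlaps k, occupancy p).
n_peaks = 3
families = {"TE_A": (2, 0.2), "TE_B": (0, 0.1)}

pvals = {name: binomial_upper_tail(k, n_peaks, p) for name, (k, p) in families.items()}
adjusted = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))

for name in families:
    print(name, round(pvals[name], 4), round(adjusted[name], 4))
```

If SciPy is available, `scipy.stats.binom.sf(k - 1, n, p)` computes the same upper tail and is numerically more robust for large n than the explicit sum used here.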