pyTEnrich class & functions¶
pyTEnrich source code on c4science
pyTEnrich.Analyser¶
-
class
pyTEnrich.Analyser.
Analyser
(peak_vs_subfams, peak_vs_fams, db_container)¶ Bases:
object
Launch the analysis (statistical enrichment) on a Genomic_region_container. P-value adjustment is run on each container independently. This object holds all containers and performs the counting and statistics of the overlaps.
- Parameters
peak_vs_subfams (Genomic_region_container) – The container object with all relevant information for the comparison peak_in_te to be done at the TE subfamily level. Contains overlap between groups, probabilities based on genome occupancy, etc…
peak_vs_fams (Genomic_region_container) – same as above, but with TE families instead of TE subfamilies.
db_container (Db_container) – contains loaded databases with associated summaries
-
counting
(intersect_file)¶ Here we open and read the file after bedtools intersection, to increment our containers
- Parameters
intersect_file (str) – path to the intersection file
-
fdr
(p_vals)¶ Compute adjusted p-values using the Benjamini-Hochberg procedure
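For readers unfamiliar with the Benjamini-Hochberg procedure, here is a minimal pure-Python sketch of the adjustment. It is not pyTEnrich's own implementation: the function name `bh_adjust` is hypothetical, and the real `fdr` method may differ in input type and return format.

```python
def bh_adjust(p_vals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Hypothetical helper for illustration only.
    """
    m = len(p_vals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_vals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # keeping a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_vals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

Each raw p-value is scaled by m / rank, then capped by the adjusted value of the next larger p-value, which is why the loop runs from the top rank downwards.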
-
get_pd_stats
(genomic_container, bedname)¶ Main function to generate stats with a binomial model
- Parameters
genomic_container (Genomic_region_container) – containers associated with a specific grouping of TEs (e.g. TE subfamily) to be used for counting and to make stats.
bedname (str) – name of the sample (linked to input bed file) for which we compute the stats.
- Returns
object containing stats results, one line for each TE group
- Return type
pd_stats (pandas.DataFrame)
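As an illustration of what a binomial enrichment model can look like, the sketch below treats each peak as an independent trial that lands in a TE group with probability equal to that group's share of the genome. This is an assumption made for the example: the function names, the returned fields, and the exact probability definition are hypothetical and may not match what `get_pd_stats` computes.

```python
from math import exp, lgamma, log


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for j in range(k, n + 1):
        # Work in log space so large n does not overflow.
        log_pmf = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + j * log(p) + (n - j) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def enrichment_stats(n_overlaps, n_peaks, te_bp, genome_bp):
    """Hypothetical per-group statistics under the simplified model."""
    p = te_bp / genome_bp            # assumed: genome occupancy of the group
    expected = n_peaks * p           # expected number of overlapping peaks
    return {
        "expected": expected,
        "fold_enrichment": n_overlaps / expected,
        "pval": binom_upper_tail(n_overlaps, n_peaks, p),
    }


stats = enrichment_stats(n_overlaps=1, n_peaks=2, te_bp=50, genome_bp=100)
```

In a real run, `genome_bp` would play the role of `size_genome` (or the genome subset size), and one such record per TE group would form one row of the resulting table.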
-
get_significance
(pval)¶ Draw significance as stars according to input p-value
- Parameters
pval (float) – input p-value
- Returns
string containing the significance label (n.s, or stars such as * and **)
- Return type
significance (str)
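A star mapping of this kind is usually a simple threshold lookup. The sketch below uses the conventional cut-offs 0.05, 0.01 and 0.001, which are an assumption here; the thresholds and labels actually used by `get_significance` are not stated in this page, and `significance_label` is a hypothetical name.

```python
def significance_label(pval, thresholds=((0.001, "***"), (0.01, "**"), (0.05, "*"))):
    """Map a p-value to a star label; cut-offs are assumed, not pyTEnrich's."""
    # Thresholds are checked from the most to the least stringent.
    for cutoff, label in thresholds:
        if pval < cutoff:
            return label
    return "n.s"


labels = [significance_label(p) for p in (0.2, 0.03, 0.005, 0.0001)]
```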
-
single_task
(bedname)¶ This function handles the whole analysis (subfam/fam) for a single bed file. It is the function launched by the multiprocessing unit.
- Parameters
bedname (str) – sample name on which we compute stats for each TE grouping type
-
write_stats
(out_dir)¶ Make enrichment statistics and write down results
pyTEnrich.Bedtools_launcher¶
-
class
pyTEnrich.Bedtools_launcher.
Bedtools_launcher
(bedtools_options, db_container=None)¶ Bases:
object
This class handles calls to bedtools to intersect the input bed files with the TE database
- Parameters
bedtools_options – options used for bedtools intersection between TE and bed files
db_container – Db_container object with all database and input/output information
-
clean_up_temp
()¶ Clean up temp files
-
intersect
()¶ Make intersection using multi-intersect bedtools
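To make the idea of a multi-file intersection concrete, the sketch below builds a `bedtools intersect` command line with several `-b` files. It is only an assumption about how such a call could be assembled: which file is passed as `-a`, the chosen flags, and the helper name `build_intersect_cmd` are all hypothetical and may differ from what `Bedtools_launcher` does.

```python
def build_intersect_cmd(te_db, bed_files, extra_options=()):
    """Assemble an argv list for `bedtools intersect` (illustrative only)."""
    if not bed_files:
        raise ValueError("at least one bed file is required")
    # -wa / -wb report the original -a and -b records for every overlap.
    cmd = ["bedtools", "intersect", "-a", te_db, "-b", *bed_files, "-wa", "-wb"]
    if len(bed_files) > 1:
        # With several -b files, report the file name instead of a numeric id.
        cmd.append("-filenames")
    cmd.extend(extra_options)
    return cmd


cmd = build_intersect_cmd("te.bed", ["a.bed", "b.bed"], ["-f", "0.5"])
```

The resulting list can be passed to `subprocess.run(cmd, check=True, stdout=handle)` on a machine where bedtools is installed; building the list separately keeps the command easy to inspect and log.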
-
reformat_intersect
()¶ If only one bed file is provided, reformat intersection file
pyTEnrich.Db_container¶
-
class
pyTEnrich.Db_container.
Db_container
(out_dir, in_dir=None, genome_subset=None, te_db=None, size_genome=2861328253, idx_sfam=7, idx_fam=6)¶ Bases:
object
This class loads / initializes the databases used for enrichment analysis
- Parameters
out_dir (str) – output directory
te_db (str) – path to transposable element database to be loaded (should be bed format)
genome_subset (str) – path to genome subset bed file
in_dir (str) – path to input directory where input bed files should be
size_genome (int) – genome size in bp - used to compute probabilities in Analyser
idx_sfam (int) – index indicating column with TE subfamily names in te_db file
idx_fam (int) – index indicating column with TE family names in te_db file
-
check_te_db
()¶ Check that the TE database exists and has the expected shape
-
clean_up_temp
()¶ Clean up temp files in output directory
-
compute_size_genome
()¶ Here we use the genome subset file to compute the new genome size to consider for enrichment analysis
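One way to derive a genome size from a subset bed file is to sum the base pairs it covers. The sketch below does this while merging overlapping intervals so that no base is counted twice. It is an assumption about the computation, not the actual `compute_size_genome` code, and `bed_covered_bp` is a hypothetical name.

```python
def bed_covered_bp(lines):
    """Return the number of bp covered by BED records, merging overlaps."""
    intervals = []
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    intervals.sort()

    total = 0
    cur_chrom, cur_start, cur_end = None, 0, 0
    for chrom, start, end in intervals:
        if chrom != cur_chrom or start > cur_end:
            # Close the previous merged interval and open a new one.
            total += cur_end - cur_start
            cur_chrom, cur_start, cur_end = chrom, start, end
        else:
            cur_end = max(cur_end, end)
    total += cur_end - cur_start
    return total


subset_size = bed_covered_bp(["chr1\t0\t100", "chr1\t50\t150", "chr2\t10\t20"])
```

With a real file, the same function can be called as `bed_covered_bp(open(path))`, since it only needs an iterable of lines.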
-
get_names
(list_beds)¶ Return names as [str] for each bed file
- Parameters
list_beds (list) – list containing paths to input bed files
-
handle_genome_subset
()¶ If a genome subset is provided, this function launches the generation of new TE summaries
-
make_bed_summary
()¶ Make a summary for input bed files (bp coverage, peak average size)
-
make_peak_subset
()¶ Subset multiple bed files using a genome subset
-
make_single_peak_subset
(bed)¶ Do subset for a single bed file
- Parameters
bed – input bed file to be subset
-
make_te_subset
()¶ Subset the TE database with the genome subset and make new TE genome occupancy summaries. Relies on a predefined perl script, utils/make_ref_TE.pl, to compute genome occupancy
-
make_te_summary
()¶ Make subfam / fam summaries using utils/make_ref_file.pl
-
sort_genome_subset
()¶ If not set to None, sort the genome subset with UNIX sort
pyTEnrich.Genomic_region_container¶
-
class
pyTEnrich.Genomic_region_container.
Genomic_region_container
(peak_summary)¶ Bases:
object
Contains all Genomic_regions objects and controls them, e.g. contains all subfams or all fams (one container per grouping type)
-
increment_peak_n_i
(name1, name2)¶ Add one to the overlap on the bed side
-
increment_te_n_i
(name1, name2)¶ Add one to the overlap between group name1 and group name2
- Parameters
name1 – reference genomic region, it needs to be in summary
name2 – the other region to which we intersect - usually comes from input bed files
-
load_te_summary
(summary_file)¶ Load TE summary information to the right Genomic_region object
pyTEnrich.Genomic_regions¶
-
class
pyTEnrich.Genomic_regions.
Genomic_regions
(name, list_targets)¶ Bases:
object
For a group of regions (TE family, subfamily, or group of peaks from the same TF), define the number of overlaps, the name and the “targets”
- Parameters
name – name of the transposon group (e.g. subfam name)
list_targets – list of targets for a given group
- Attributes
n_T – total number of elements in this transposon group
n_i – dictionary containing the number of intersections between this group and each other group, keyed by that group's name
name – this group’s name (e.g. TE family name)
-
increment_n_i
(name_bed)¶ Add one to an intersection. The name corresponds to the bed file (e.g. TF name); n_i holds the observed intersections with that bed file
- Parameters
name_bed – string with the name associated with the bed file (key to use in the n_i dictionary)
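The counting pattern described here can be pictured as a small class holding one counter per target. The sketch below is an illustration under assumptions: the class name `RegionGroup` is hypothetical, `n_T` is simply initialised to zero, and the real `Genomic_regions` may handle unknown bed names differently.

```python
class RegionGroup:
    """Illustrative stand-in for a group of regions with overlap counters."""

    def __init__(self, name, list_targets):
        self.name = name
        self.n_T = 0  # total number of elements, filled in from a summary
        # One counter per target (e.g. per input bed file), starting at zero.
        self.n_i = {target: 0 for target in list_targets}

    def increment_n_i(self, name_bed):
        # Assumed behaviour: an unseen name starts a new counter at one.
        self.n_i[name_bed] = self.n_i.get(name_bed, 0) + 1


group = RegionGroup("L1PA2", ["CTCF", "ZNF91"])
group.increment_n_i("CTCF")
group.increment_n_i("CTCF")
```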
pyTEnrich.funs¶
-
pyTEnrich.funs.
basen_no_ext
(my_str)¶ Return the name of the file without directory path or extension
- Parameters
my_str (str) – string to convert
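A helper of this kind can be written with the standard library alone. The sketch below is one possible version, with the hypothetical name `basename_no_ext`; note that it strips only the last extension, which may or may not match how `basen_no_ext` treats names such as `sample.bed.gz`.

```python
import os


def basename_no_ext(path):
    """Drop the directory part and the final extension of a path."""
    return os.path.splitext(os.path.basename(path))[0]


name = basename_no_ext("/data/beds/CTCF.bed")
```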
-
pyTEnrich.funs.
create_dir
(d)¶ Function to create directory
-
pyTEnrich.funs.
logger
(comment)¶ Function to print Class::Name - Comments from any class
- Parameters
comment (str) – comment to add to the logger
-
pyTEnrich.funs.
test_gz_file
(filepath)¶ Detect gzip compressed file
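A common way to detect gzip compression is to check the two magic bytes at the start of the file rather than trusting the extension. The sketch below shows that approach; it is an assumption about the technique, and `looks_gzipped` is a hypothetical name, not the body of `test_gz_file`.

```python
import gzip
import os
import tempfile


def looks_gzipped(filepath):
    """Return True if the file starts with the gzip magic bytes 1f 8b."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


# Small demonstration on two temporary files.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "peaks.bed")
    packed = os.path.join(tmp, "peaks.bed.gz")
    with open(plain, "w") as fh:
        fh.write("chr1\t0\t100\n")
    with gzip.open(packed, "wt") as fh:
        fh.write("chr1\t0\t100\n")
    plain_is_gz = looks_gzipped(plain)
    packed_is_gz = looks_gzipped(packed)
```

Checking the content also catches compressed files that were renamed without a `.gz` suffix.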