pyTEnrich classes & functions

pyTEnrich source code on c4science

pyTEnrich.Analyser

class pyTEnrich.Analyser.Analyser(peak_vs_subfams, peak_vs_fams, db_container)

Bases: object

Launch the analysis: statistical enrichment on a Genomic_region_container. P-value adjustment is run on each container independently. This object holds all containers and performs the counting and statistics on the overlaps.

Parameters
  • peak_vs_subfams (Genomic_region_container) – The container object with all relevant information for the comparison peak_in_te to be done at the TE subfamily level. Contains overlap between groups, probabilities based on genome occupancy, etc…

  • peak_vs_fams (Genomic_region_container) – same as above, but with TE families instead of TE subfamilies.

  • db_container (Db_container) – contains loaded databases with associated summaries

counting(intersect_file)

Open and read the file produced by the bedtools intersection, and increment the containers accordingly

Parameters

intersect_file (str) – path to the intersection file

fdr(p_vals)

Compute adjusted p-values using the Benjamini-Hochberg procedure
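As an illustration of the Benjamini-Hochberg adjustment, a minimal pure-Python sketch is given here. The function name bh_adjust is hypothetical and this is not necessarily how fdr is implemented in pyTEnrich; it only shows the standard step-up procedure, returning adjusted p-values in the input order.

```python
def bh_adjust(p_vals):
    """Benjamini-Hochberg adjusted p-values, in the same order as the input.

    Hypothetical helper for illustration; not taken from pyTEnrich.
    """
    m = len(p_vals)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_vals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # keeping a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_vals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

The running minimum also caps every adjusted value at 1.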

get_pd_stats(genomic_container, bedname)

Main function to generate stats with a binomial model

Parameters
  • genomic_container (Genomic_region_container) – container associated with a specific grouping of TEs (e.g. TE subfamily), used for counting and for computing the stats.

  • bedname (str) – name of the sample (linked to input bed file) for which we compute the stats.

Returns

object containing stats results, one line for each TE group

Return type

pd_stats (pandas.DataFrame)
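The binomial model can be sketched as follows. This is a sketch under stated assumptions, not the pyTEnrich implementation: it assumes the null probability p is the fraction of the genome occupied by a TE group, that n is the number of peaks in the sample, and that k is the number of peaks overlapping the group. The function name and the te_bp, n_peaks and n_overlap values are hypothetical; only the genome size is taken from the Db_container default.

```python
from math import comb


def binomial_enrichment_pval(k, n, p):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p).

    Hypothetical helper; exact summation is fine for small n, a library
    routine is preferable for large n.
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


size_genome = 2861328253   # default size_genome of Db_container
te_bp = 50_000_000         # hypothetical: bp covered by one TE subfamily
n_peaks = 200              # hypothetical: peaks in the input bed file
n_overlap = 12             # hypothetical: peaks overlapping the subfamily

p_null = te_bp / size_genome        # assumed genome-occupancy probability
expected = n_peaks * p_null         # expected overlaps under the null
pval = binomial_enrichment_pval(n_overlap, n_peaks, p_null)
```

One p-value of this kind per TE group would then be adjusted for multiple testing across the groups of a container.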

get_significance(pval)

Draw significance as stars according to input p-value

Parameters

pval (float) – input p-value

Returns

string containing the significance label: n.s or stars (e.g. * or **)

Return type

significance (str)
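A possible mapping from p-value to stars is sketched here. The cut-offs 0.05, 0.01 and 0.001 are conventional values chosen for illustration and are an assumption; the thresholds actually used by get_significance are not stated in this page. The function name is hypothetical.

```python
def significance_stars(pval, thresholds=(0.05, 0.01, 0.001)):
    """Return 'n.s' or one star per threshold that pval falls below.

    The thresholds are assumed for illustration, not taken from pyTEnrich.
    """
    n_stars = sum(pval < t for t in thresholds)
    return "*" * n_stars if n_stars else "n.s"


labels = [significance_stars(p) for p in (0.2, 0.03, 0.005, 0.0001)]
```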

single_task(bedname)

This function handles the whole analysis (subfam/fam) for a single bed file. It is the function launched by each multi-processing unit.

Parameters

bedname (str) – sample name on which we compute stats for each TE grouping type

write_stats(out_dir)

Compute enrichment statistics and write the results

pyTEnrich.Bedtools_launcher

class pyTEnrich.Bedtools_launcher.Bedtools_launcher(bedtools_options, db_container=None)

Bases: object

This class handles calls to bedtools to intersect the input bed files with the TE database

Parameters
  • bedtools_options – options used for bedtools intersection between TE and bed files

  • db_container – Db_container object with all database and input/output information

clean_up_temp()

Clean up temp files

intersect()

Make intersection using multi-intersect bedtools

reformat_intersect()

If only one bed file is provided, reformat intersection file

pyTEnrich.Db_container

class pyTEnrich.Db_container.Db_container(out_dir, in_dir=None, genome_subset=None, te_db=None, size_genome=2861328253, idx_sfam=7, idx_fam=6)

Bases: object

This class loads / initializes the databases used for enrichment analysis

Parameters
  • out_dir (str) – output directory

  • te_db (str) – path to transposable element database to be loaded (should be bed format)

  • genome_subset (str) – path to genome subset bed file

  • in_dir (str) – path to input directory where input bed files should be

  • size_genome (int) – genome size in bp - used to compute probabilities in Analyser

  • idx_sfam (int) – index indicating column with TE subfamily names in te_db file

  • idx_fam (int) – index indicating column with TE family names in te_db file

check_te_db()

Check that the TE database exists and has the expected shape

clean_up_temp()

Clean up temp files in output directory

compute_size_genome()

Here we use the genome subset file to compute the new genome size to consider for enrichment analysis
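As a rough illustration of deriving a genome size from a subset file, the sketch below sums the interval lengths of a BED file. It assumes tab-separated, 0-based half-open intervals that do not overlap each other (overlapping intervals would be counted twice); the function name is hypothetical and the actual compute_size_genome may proceed differently.

```python
def bed_covered_bp(lines):
    """Sum (end - start) over BED lines, skipping blanks and header lines.

    Hypothetical helper; assumes the intervals do not overlap.
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        # The first three BED columns are chromosome, start and end.
        _chrom, start, end = line.split("\t")[:3]
        total += int(end) - int(start)
    return total


subset_lines = [
    "# hypothetical genome subset",
    "chr1\t0\t100",
    "chr1\t200\t250",
    "chr2\t10\t20\tregion_name",
]
subset_size = bed_covered_bp(subset_lines)
```

With a file on disk, the open file handle can be passed directly in place of the list.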

get_names(list_beds)

Return the names as a list of str, one for each bed file

Parameters

list_beds (list) – list containing the paths to the input bed files

handle_genome_subset()

If a genome subset is provided, this function launches the generation of new TE summaries

make_bed_summary()

Make a summary for input bed files (bp coverage, peak average size)

make_peak_subset()

Subset multiple bed files using a genome subset

make_single_peak_subset(bed)

Do subset for a single bed file

Parameters

bed – input bed file to be subsetted

make_te_subset()

Subset the TE database with the genome subset and make new TE genome occupancy summaries. Relies on a predefined perl script, utils/make_ref_TE.pl, to compute genome occupancy

make_te_summary()

Make subfam / fam summaries using utils/make_ref_file.pl

sort_genome_subset()

If not set to None, sort the genome subset with UNIX sort

pyTEnrich.Genomic_region_container

class pyTEnrich.Genomic_region_container.Genomic_region_container(peak_summary)

Bases: object

Contains all Genomic_regions objects and controls them, e.g. contains all subfams or all fams (one container per grouping type)

increment_peak_n_i(name1, name2)

Add one to the overlap on the bed side

increment_te_n_i(name1, name2)

Add one to the overlap between group name1 and group name2

Parameters
  • name1 – reference genomic region, it needs to be in summary

  • name2 – the other region to which we intersect - usually comes from input bed files

load_te_summary(summary_file)

Load TE summary information to the right Genomic_region object

pyTEnrich.Genomic_regions

class pyTEnrich.Genomic_regions.Genomic_regions(name, list_targets)

Bases: object

For a group of regions (TE family, subfamily or group of peaks from the same TF), define the number of overlaps, the name and the “targets”

Parameters
  • name – name of the transposon group (e.g. subfam name)

  • list_targets – list of targets for a given group

Attributes
  • n_T – total number of elements in this transposon group

  • n_i – dictionary containing the number of intersections between this group and each target group, keyed by the target group's name

  • name – this group’s name (e.g. TE family name)

increment_n_i(name_bed)

Add one to an intersection. The name corresponds to a bed file (e.g. TF name); n_i holds the observed intersections with that bed file

Parameters

name_bed – string with the name associated with the bed file (key to use in the n_i dictionary)
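The bookkeeping described for this class can be pictured with the small sketch below. The class name RegionGroup is hypothetical and the body is an assumption based only on the descriptions in this page (a name, a total count n_T, and a dictionary n_i keyed by target name); it is not the pyTEnrich source.

```python
class RegionGroup:
    """Hypothetical stand-in for a group of regions (e.g. one TE subfamily)."""

    def __init__(self, name, list_targets):
        self.name = name
        # Total number of elements in the group; assumed to be filled in
        # later from a TE summary.
        self.n_T = 0
        # One overlap counter per target (e.g. per input bed file).
        self.n_i = {target: 0 for target in list_targets}

    def increment_n_i(self, name_bed):
        """Add one observed intersection with the bed file called name_bed."""
        self.n_i[name_bed] += 1


group = RegionGroup("L1PA2", ["CTCF", "ZNF91"])
group.increment_n_i("CTCF")
group.increment_n_i("CTCF")
```

In this sketch, a name that was not listed in list_targets raises a KeyError, which makes an unexpected bed name visible early.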

pyTEnrich.funs

pyTEnrich.funs.basen_no_ext(my_str)

Return the name of the file without its directory path or extension

Parameters

my_str (str) – string to convert
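A behaviour of this kind can be obtained with the standard library, as in the sketch below. The function name is hypothetical, and the handling of double extensions such as .bed.gz is an assumption: this version strips only the last extension, which may differ from basen_no_ext.

```python
import os


def basename_no_ext(path):
    """Drop the directory part and the last extension of a path.

    Hypothetical helper; only the final extension is removed.
    """
    return os.path.splitext(os.path.basename(path))[0]


simple = basename_no_ext("/data/peaks/CTCF.bed")
double = basename_no_ext("/data/peaks/CTCF.bed.gz")
```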

pyTEnrich.funs.create_dir(d)

Function to create directory

pyTEnrich.funs.logger(comment)

Function to print Class::Name - Comments from any class

Parameters

comment (str) – comment to add to the logger

pyTEnrich.funs.test_gz_file(filepath)

Detect gzip compressed file
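One common way to detect gzip compression is to check the two-byte magic number at the start of the file rather than trusting the extension. The sketch below shows that approach; the function name is hypothetical and test_gz_file may use a different check.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(filepath):
    """Return True if the file starts with the gzip magic number."""
    with open(filepath, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Small demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "peaks.bed.gz")
    txt_path = os.path.join(tmp, "peaks.bed")
    with gzip.open(gz_path, "wt") as handle:
        handle.write("chr1\t0\t100\n")
    with open(txt_path, "w") as handle:
        handle.write("chr1\t0\t100\n")
    gz_detected = is_gzip(gz_path)
    txt_detected = is_gzip(txt_path)
```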