Practical II - Footprint calling & Transcription factor prediction

In the second practical, we will perform a footprint analysis with HINT to identify cell specific binding sites from open chromatin data (ATAC-seq). Next, we will combine the footprints with the differential histone peaks detected by histoneHMM (c.f. practical 1). Thereby, we will find tissue specific TF binding sites, which are located in regions with cell specific histone peaks. These regulatory regions are used in a DYNAMITE analysis with the aim of inferring TFs that might be related to gene expression differences between the tissues of interest.

The final version of the practical will be available at 19.07.2017 at the latest.

Step1: Footprint calling

First, we will use HINT to find genomic regions (footprints) with cell specific TF binding sites. For this, HINT requires (1) a sorted bam file containing the aligned reads from the sequencing library (DNase-,ATAC- or histone ChIP-seq) (2) and a bed file including peaks detected in the same sequencing library provided in (1). These peak regions are used by HINT to reduce the search space and can be generated by any peak caller.

Here, we will analyze ATAC-seq data from LSK cells (equivalent to MPP cells), B cells and T CD4 cells obtained from Lara-Astiaso et al 2014. We have performed low level analysis steps including read alignment and peaks calling in chromossome 1, which can be found in this folder /EpigenomicsTutorial-ISMB2017/session2/step1/input. Check here for a script showing commands used to generate these files for B cell ATAC-seq experiments.

1. First, go to the EpigenomicsTutorial-ISMB2017 directory and generate an output folder for the result files:

cd EpigenomicsTutorial-ISMB2017
mkdir session2/step1/output

2. Execute the following commands to call footprints on B, LSK and CD4 cells for chromosome 1:

rgt-hint --atac-footprints --organism=mm10 --output-location=session2/step1/output/ --output-prefix=B_ATAC_chr1_footprints session2/step1/input/B_ATAC_chr1.bam session2/step1/input/B_ATACPeaks_chr1.bed
rgt-hint --atac-footprints --organism=mm10 --output-location=session2/step1/output/ --output-prefix=LSK_ATAC_chr1_footprints session2/step1/input/LSK_ATAC_chr1.bam session2/step1/input/LSK_ATACPeaks_chr1.bed
rgt-hint --atac-footprints --organism=mm10 --output-location=session2/step1/output/ --output-prefix=CD4_ATAC_chr1_footprints session2/step1/input/CD4_ATAC_chr1.bam session2/step1/input/CD4_ATACPeaks_chr1.bed

This will generate an output file, i.e session2/step1/output/B_ATAC_chr1_footprints.bed, containing the genomic locations of the footprints. HINT also produces a file with ending ”.info”, which has general statistics from the analysis as no. of footprints, total number of reads and so on.

We can use IGV to vizualise ATAC-seq signals and footprint predictions in particular loci. First, we can use a special HINT command to generate genomic profiles (bigWig files).

rgt-hint --print-signal --bc-signal --bigWig --organism=mm10 --reads-file=./session2/step1/input/B_ATAC_chr1.bam --regions-file=./session2/step1/input/B_ATACPeaks_chr1.bed --output-location=./session2/step1/output --output-prefix=B_ATAC_chr1
rgt-hint --print-signal --bc-signal --bigWig --organism=mm10 --reads-file=./session2/step1/input/CD4_ATAC_chr1.bam --regions-file=./session2/step1/input/CD4_ATACPeaks_chr1.bed --output-location=./session2/step1/output --output-prefix=CD4_ATAC_chr1
rgt-hint --print-signal --bc-signal --bigWig --organism=mm10 --reads-file=./session2/step1/input/LSK_ATAC_chr1.bam --regions-file=./session2/step1/input/LSK_ATACPeaks_chr1.bed --output-location=./session2/step1/output --output-prefix=LSK_ATAC_chr1

This bigwig file contains number of ATAC-seq (or DNAse-seq) reads at each genomic position as estimated by HINT after signal normalization and cleveage bias correction. This is therefore more accurate than simply looking a coverage profiles of a bam file.

Open all bigwig and footprint files (bed) generated above in IGV. Remember to set the genome version to mm10 beforehand. You can also enrich the data by opening bam files of histone modifications as H3K4me3, H3K4me1 and H3K27ac (provided in session 1). Check for example the genomic profiles around the gene Zp70, which is part of T cell receptor. We observe that this gene has several open chromatin regions and high levels of histone levels in the start of the gene, which are specific to CD4 T cells. There is also some open chromatin level around exon 3, which is supported by H3K4me1 modifications. This potential enhancer region is also present in distinct degrees in B and LSK cells.

3. An important question when doing footprint analysis is to evaluate which TF motifs overlap with footprints and evaluate the ATAC-seq profiles around these motifs. RGT suite also offers a tool for finding motif predicted binding sites (mpbs). For example, we analyze here motifs from factors SPI1 and ELK4, which were discussed in Lara-Astiaso et al. 2014 to be associated respectively associated to LSK and CD4 cells.

Execute the following commands to do motif matching inside footprints for chromosome 1:

rgt-motifanalysis --matching --organism=mm10 --output-location=session2/step1/output/ --use-only-motifs=session2/step1/input/motifs.txt session2/step1/output/B_ATAC_chr1_footprints.bed
rgt-motifanalysis --matching --organism=mm10 --output-location=session2/step1/output/ --use-only-motifs=session2/step1/input/motifs.txt session2/step1/output/CD4_ATAC_chr1_footprints.bed
rgt-motifanalysis --matching --organism=mm10 --output-location=session2/step1/output/ --use-only-motifs=session2/step1/input/motifs.txt session2/step1/output/LSK_ATAC_chr1_footprints.bed

The file session2/step1/motifs.txt contains a list of JASPAR motif ids to be used in the analysis. Ignoring this option will search for all JASPAR motifs. The above commands will generate bed files (i.e. LSK_ATAC_footprints_mpbs.bed) containing mpbs overlapping with distinct footprint regions. The 4th column contains the motif name and the 5th column the bitscore of the motif match. If you open these files in IGV, you will observe that a ELK4 binding site near the 3rd exon of the gene Zp70.

4. Finally, we use HINT to generate average ATAC-seq profiles around binding sites of particular TF. This analysis allow us to inspect the chromatin chromatin accessibility and the underlying sequence. Moreover, by comparing the cut profiles from two ATAC-seq libraries (i.s. LSK vs T CD4 cells ), we can get insights on changes in binding in two cells. For this, execute the following commands:

mkdir session2/step1/output/LSK_B
rgt-hint --diff-footprints --organism=mm10 --mpbs-file=session2/step1/result/LSK_B_ATAC_footprints_mpbs.bed --reads-file1=session2/step1/input/LSK_ATAC.bam --reads-file2=session2/step1/input/B_ATAC.bam --output-location=session2/step1/output/LSK_B --output-prefix=LSK_B

mkdir session2/step1/output/LSK_CD4
rgt-hint --diff-footprints --organism=mm10 --mpbs-file=session2/step1/result/LSK_CD4_ATAC_footprints_mpbs.bed --reads-file1=session2/step1/input/LSK_ATAC.bam --reads-file2=session2/step1/input/CD4_ATAC.bam --output-location=session2/step1/output/LSK_CD4 --output-prefix=LSK_CD4

The above commands will generate eps files with a ATAC-seq profile for each of the motifs founds in the provided mpbs bed files. Let’s check the profiles in the comparison LSK and CD4, you will see that ELK4 has higher number of ATAC-seq counts in CD4 cells, while SPI1 has more ATAC-seq in LSK cells. Higher ATAC counts indicates higher activity of the factor in that particular cell. This fits with the results discussed in Lara-Astiaso that SPI1 are more relevant/active in LSK, while ELK4 in CD4 cells.

Step2: Intersecting footprints with differential histone peaks

To derive candidate regions for TF binding, we combine (1) genome wide footprint calls and (2) genome wide differential histone peak calls using the active chromatin marks H3K4me3 and H3K27ac. In addition to default unix functions we use bedtools to combine the respective bed files.

All input files are available in the folder /EpigenomicsTutorial-ISMB2017/session2/step2/input.

1. Assure that you are in the directory EpigenomicsTutorial-ISMB2017/session2/step2, otherwise cd to that directory.

2. Generate an output folder for the resulting bed files and enter the folder:

mkdir output
cd output

3. Combine the Differential peak calls for H3K4me3 and H3K27ac in one, sorted bed file. This needs to be done for each pairwise comparison and each cell type individually:

cat ../input/Dif_Histone_Peaks/B_H3K27ac-vs-CD4_H3K27ac-B.bed ../input/Dif_Histone_Peaks/B_H3K4me3-vs-CD4_H3K4me3-B.bed | sort -k1,1 -k2,2n > B_vs_CD4_H3K27ac_H3K4me3_B_sorted.bed
cat ../input/Dif_Histone_Peaks/B_H3K27ac-vs-CD4_H3K27ac-CD4.bed ../input/Dif_Histone_Peaks/B_H3K4me3-vs-CD4_H3K4me3-CD4.bed | sort -k1,1 -k2,2n > B_vs_CD4_H3K27ac_H3K4me3_CD4_sorted.bed

cat ../input/Dif_Histone_Peaks/LSK_H3K27ac-vs-B_H3K27ac-LSK.bed ../input/Dif_Histone_Peaks/LSK_H3K4me3-vs-B_H3K4me3-LSK.bed | sort -k1,1 -k2,2n > LSK_vs_B_H3K27ac_H3K4me3_LSK_sorted.bed
cat ../input/Dif_Histone_Peaks/LSK_H3K27ac-vs-B_H3K27ac-B.bed ../input/Dif_Histone_Peaks/LSK_H3K4me3-vs-B_H3K4me3-B.bed | sort -k1,1 -k2,2n > LSK_vs_B_H3K27ac_H3K4me3_B_sorted.bed

cat ../input/Dif_Histone_Peaks/LSK_H3K27ac-vs-CD4_H3K27ac-LSK.bed ../input/Dif_Histone_Peaks/LSK_H3K4me3-vs-CD4_H3K4me3-LSK.bed | sort -k1,1 -k2,2n > LSK_vs_CD4_H3K27ac_H3K4me3_LSK_sorted.bed
cat ../input/Dif_Histone_Peaks/LSK_H3K27ac-vs-CD4_H3K27ac-CD4.bed ../input/Dif_Histone_Peaks/LSK_H3K4me3-vs-CD4_H3K4me3-CD4.bed | sort -k1,1 -k2,2n > LSK_vs_CD4_H3K27ac_H3K4me3_CD4_sorted.bed

The cat command aggregates the input files for H3K27ac and H3K4me3 and pipes them (using the | operator) to a sort function which sorts by chromosome (k1,1) and first genomic coordinate (k2,2n). The result is stored in a specified output bed file (using the > operator).

4. Merge overlapping histone peaks using bedtools merge and intersect the merged regions with HINT-BCs footprint calls using bedtools intersect:

bedtools merge -i B_vs_CD4_H3K27ac_H3K4me3_B_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/B.bed > Footprints_B_vs_CD4_H3K27ac_H3K4me3_B.bed
bedtools merge -i B_vs_CD4_H3K27ac_H3K4me3_CD4_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/CD4.bed > Footprints_B_vs_CD4_H3K27ac_H3K4me3_CD4.bed

bedtools merge -i LSK_vs_CD4_H3K27ac_H3K4me3_LSK_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/LSK.bed > Footprints_LSK_vs_CD4_H3K27ac_H3K4me3_LSK.bed
bedtools merge -i LSK_vs_CD4_H3K27ac_H3K4me3_CD4_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/CD4.bed > Footprints_LSK_vs_CD4_H3K27ac_H3K4me3_CD4.bed

bedtools merge -i LSK_vs_B_H3K27ac_H3K4me3_LSK_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/LSK.bed > Footprints_LSK_vs_B_H3K27ac_H3K4me3_LSK.bed
bedtools merge -i LSK_vs_B_H3K27ac_H3K4me3_B_sorted.bed | bedtools intersect -a stdin -b ../input/Footprints/B.bed > Footprints_LSK_vs_B_H3K27ac_H3K4me3_B.bed

The bedtools merge command combines two overlapping regions into one region. The result of the intersection is piped into the standard input stream (stdin) of the bedtools intersect -a argument, while the -b argument is result of the Footprint calling. The resulting files will contain only footprints that intersect with a differential H3K4me3 and/or H3K27ac peak. In the next step, we will use these regions as candidate regions for TF binding. Precomputed results are stored in /EpigenomicsTutorial-ISMB2017/session2/step2/result.

By combining both footprints and differential peak calls of active chromatin marks we obtain a collection of canidate binding sites for TFs that are unique for expressed genes in one of the two tissues of interest.

Step3: Deriving candidate transcriptional regulators using DYNAMITE

During a DYNAMITE analysis, two main computational tasks are undertaken:

  1. We calculate TF binding affinities for an example data set of 93 TFs and aggregate those to gene-TF scores using TEPIC. TF affinities are a quantitative measure of TF binding to a distinct genomic region.
  2. A logistic regression classifier is learned that uses changes in TF gene scores between two samples to predict which genes are up/down- regulated between them. Investigating the features of the model allows the inference of potentially interesting regulators that are correlated to the observed expression changes.

Please check the documentation for details on the method.

We provide a script that automatically performs steps (1) and (2) as well as necessary data processing and formatting steps (See DYNAMITE documentation for details). All files used in this step are available in /EpigenomicsTutorial-ISMB2017/session2/step3/input. Additionally, we require the mm10 reference genome, which you should have downloaded while installing HINT.

Note that we precomputed the differential gene expression estimates. Computing those is neither part of the actual tutorial nor of the DYNAMITE workflow. However a tool you could use to compute differential gene/transcript expression is Cuffdiff.

1. Assure that you are in the directory EpigenomicsTutorial-ISMB2017/session2/step3, otherwise cd to that directory.

2. Generate an output folder for the resulting files:

mkdir output

3. To run the DYNAMITE script go to the DYNAMITE folder in the TEPIC repository TEPIC/MachineLearningPipelines/DYNAMITE. We provide three configuration files for the DYNAMITE analyses:

  1. DYNAMITE-LSKvsB.cfg
  2. DYNMAITE-LSKvsCD4.cfg
  3. DYNAMITE-BvsCD4.cfg

The configuration files are stored in the directory EpigenomicsTutorial-ISMB2017/session2/step3/input. They list all parameters that are needed for a run of DYNAMITE. To help you customise these files for later usage, we explain the essential parameters here:

  • open_regions_Group1: One ore more files containing candidate transcription factor binding sites for samples belonging to group 1
  • open_regions_Group2: One ore more files containing candidate transcription factor binding sites for samples belonging to group 2
  • differential_Gene_Expression_Data: Differential gene expression data denoted with log2 fold changes
  • outputDirectory: Directory to write the results to
  • referenceGenome: Path to the reference genome that should be used
  • chrPrefix: Flag indicating whether the reference genome uses a chr prefix
  • pwm: Path to the pwms that should be used
  • cores_TEPIC: Number of cores that are used in the TEPIC analysis
  • geneAnnotation: Gene annotation file that should be used
  • window: Size of the window around a genes TSS that is screened for TF binding sites
  • decay: Flag indicating whether TEPIC should be using exponential decay to downweight far away regions while computing gene-TF scores
  • peakFeatures: Flag indicating whether TEPIC should compute features based on peaks, e.g. peak count, peak length, or signal intensity within a peak

In the scope of the tutorial, you do not have to change any of those. A full description of all parameters is provided here.

4. Run the individual pairwise comparisons for LSK vs B:

bash runDYNAMITE.sh $HOME/EpigenomicsTutorial-ISMB2017/session2/step3/input/DYNAMITE-LSKvsB.cfg

LSK vs CD4:

bash runDYNAMITE.sh $HOME/EpigenomicsTutorial-ISMB2017/session2/step3/input/DYNAMITE-LSKvsCD4.cfg

and B vs CD4:

bash runDYNAMITE.sh $HOME/EpigenomicsTutorial-ISMB2017/session2/step3/input/DYNAMITE-BvsCD4.cfg

Note that you have to replace the prefix $HOME with the proper path to the tutorial repository, if you have not cloned it to your home directory. The results of the analysis will be stored separately for each run in EpigenomicsTutorial-ISMB2017/session2/step3/output. There are three subfolders for each comparison:

  1. Affinities
  2. IntegratedData
  3. Learning_Results

The folder Affinities contains TF affinities calculated in the provided regions for both groups, gene TF scores for both groups, and a metadata file that lists all settings used for the TF annotation with TEPIC (subfolders group1 and group2). The subfolder mean contains the mean gene TF scores computed for group1 and group2. This is needed if you analyze more than one biological replicate per group. The folder ratio contains the gene TF score ratios computed between the gene TF scores of group1 and group2.

The folder IntegratedData encloses matrices that are composed of (1) gene TF score ratios and (2) a measure of differential gene expression. In the folder Log2 the differential gene expression is represented as the log2 expression ratio between group1 and group2. In the folder Binary, the differential gene expression is shown in a binary way. Here, a 1 means a gene is upregulated in group 1 compared to group 2, whereas a 0 means it is down-regulated in group1. The binary format is used as input for the classification.

The folder Learning_Results comprises the results of the logistic regression classifier. The following files should be produced if all R dependencies are available:

  1. Performance_overview.txt
  2. Confusion-Matrix_<1..6>_Integrated_Data_For_Classification.txt
  3. Regression_Coefficients_Cross_Validation_Integrated_Data_For_Classification.txt
  4. Regression_Coefficients_Entire_Data_Set_Integrated_Data_For_Classification.txt
  5. Performance_Barplots.png
  6. Regression_Coefficients_Cross_Validation_Heatmap_Integrated_Data_For_Classification.svg
  7. Regression_Coefficients_Entire_Data_SetIntegrated_Data_For_Classification.png
  8. Misclassification_Lambda_<1..6>_Integrated_Data_For_Classification.svg

The file Performance_overview.txt contains accuracy on Test and Training data sets as well as F1 measures. These values are visualized in Performance_Barplots.png. As the name suggests, the files Confusion-Matrix_<1..6>_Integrated_Data_For_Classification.txt contain the confusion matrix computed on the test data sets. They show model performance by reporting True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) in the following layout:

Observed/Predicted Positive Negative
Positive TP FN
Negative FP TN

The heatmap Regression_Coefficients_Cross_Validation_Heatmap_Integrated_Data_For_Classification.svg shows the regression coefficients of all selected features in the outer cross validation. This is very well suited to find features that are stably selected in all outer cross validation folds. The raw data used to generate the figure is stored in Regression_Coefficients_Cross_Validation_Integrated_Data_For_Classification.txt. The stronger a regression coefficient, the more important it is in the model.

In addition to the heatmap showing the regression coefficients during the outer cross validation, we also show the regression coefficients learned on the full data set: Regression_Coefficients_Entire_Data_SetIntegrated_Data_For_Classification.png and Regression_Coefficients_Entire_Data_Set_Integrated_Data_For_Classification.txt.

The figures Misclassification_Lambda_<1..6>_Integrated_Data_For_Classification.svg are of technical nature. They show the relationship between the misclassification error and the lambda parameter of the logistic regression function.

5. In addition to the plots describing model performance and feature selection generated by DYNAMITE (as described here), you can create further Figures for a distinct feature of interest using the script TEPIC/MachineLearningPipelines/DYNAMITE/Scripts/generateFeaturePlots.R. This will provide you with density plots showing the distribution of the feature in the two cell types, scatter plots linking feature value to gene expression changes, and a mini heatmap visualising the features regression coefficients.

To use this script, go to the output folder of step 3 EpigenomicsTutorial-ISMB2017/session2/step3/output and use the command

Rscript $HOME/TEPIC/MachineLearningPipelines/DYNAMITE/Scripts/generateFeaturePlots.R LSK-vs-CD4 HOXA3 LSK CD4

This command will generate a plot comparing HOXA3 in LSK against CD4. Feel free to look at further features as you wish. The figure will be stored in the specified directory that contains the results of the DYNAMITE analysis. Again, note that you have to replace the prefix $HOME with the proper path used on your system, if necessary. Precomputed results are stored in /EpigenomicsTutorial-ISMB2017/session2/step3/result.