High-throughput sequencing data analysis

5.1 General workflow

A typical workflow for the analysis of transcription factor (TF) ChIP-seq data may consist of the following steps. Tools are given in brackets.

Quality assessment of raw reads (fastqc)
Adapter trimming and quality filtering (cutadapt, trimmomatic, bbduk.sh)
Read alignment to reference genome (bowtie2)
Quality assessment of mapped reads (QualiMap)
Read de-duplication (samtools, picard-tools)
ChIP library assessment (deepTools)
Meta-gene analysis (deepTools)
Peak calling (MACS2)

Note that there exist two different major versions: MACS (latest version 1.4.2) and MACS2 (latest version 2.1.0); both have been developed in Shirley Liu’s lab. I recommend using the most up-to-date MACS2 version, which can be downloaded via its GitHub project site.

5.2 Online resources

5.3 The Cistrome project

The cistrome refers to “the set of cis-acting targets of a trans-acting factor on a genome-wide scale, also known as the in vivo genome-wide location of transcription factor binding-sites or histone modifications”.

The cistrome project includes a variety of ChIP-seq data analysis tools developed by the lab of Shirley Liu.

For example, the Cistrome Data Browser is a good resource for identifying publicly available ChIP-seq and DNase-seq data.

5.4 Controls and replicate libraries

It is critical to have controls for the ChIP libraries to estimate non-specific binding across the genome. The two most frequently used methods use (1) DNA input samples, or (2) IgG “mock-ChIP” control samples. Various publications discuss potential benefits of one method over the other, and it appears that the choice of controls is to some extent based on individual experience and personal preference. Overall, the use of input DNA as controls seems to be preferred over IgG controls (see e.g. Kiddler et al., Landt et al., Meyer and Liu). It is important to keep in mind that control libraries are subject to systematic biases due to genomic copy number variations, sequencing biases, read mapping ambiguities, and cell-type specific chromatin structure [Vega et al.].

The original 2012 ENCODE guidelines [Land et al.] state that “more than two replicates did not significantly improve site discovery”, and therefore define a standard for ChIP measurements to “be performed on two independent biological repliates”. Results from more recent publications strongly encourage \(n\geq 3\) replicate libraries to increase the reliability in the identification of binding sites. For example, [Yang et al.] demonstrate that a simple majority-rule-approach “identifies peaks more reliably in all biological replicates than the absolute concordance of peak identification between any two replicates, further demonstrating the utility of increasing replicate numbers in ChIP-seq experiments”.

5.5 Library quality assessment

5.5.1 Alignment stats

Following read alignment, bowtie2’s alignment summaries give a first insight into library quality and sequencing depth. Using an unmasked reference genome, mappability of input samples is commonly around 70-90%. In a repeat-masked reference genome, around 50% of the genomic sequence is identified as interspersed repeats and low-complexity DNA and masked with N’s (see e.g. RepeatMasker); consequently, alignment of input reads against a repeat-masked reference genome will give lower mappability rates around 35-55%. For example, we use a repeat-masked mouse reference genome with an additional copy of the canonical rDNA repeat unit to explore protein binding to the rDNA, and obtain mappability rates of around 40% for high-quality input samples.

5.5.2 Read duplication

Generally, quality control of aligned reads should include an assessment of the duplication percentage of read (pairs). It is generally advisable to remove duplicate read pairs prior to proceeding with any down-stream analysis; for example, Dozmorov et al. conclude that

[b]oth adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data.

Some peak-callers (like MACS2) will indepedently (and by default) remove read duplicates as part of their analysis workflow.

We can identify and mark duplicate reads using picard-tools MarkDuplicates and/or remove duplicates directly using picard-tools MarkDuplicates REMOVE_DUPLICATES=TRUE. Alternatively, we can use samtools rmdup to remove duplicates. The total percentage of duplicate reads and their chromosomal distribution provide insight into library complexity and biases.

Additional notes on samtools rmdup vs. picard-tools MarkDuplicates taken from a discussion on the GATK forum:

MarkDuplicates by default marks properly paired PE reads and SE reads (e.g. mate unmapped but not secondary/supplementary) with the 1024 flag (hexadecimal: 0x400) without removing the read. MarkDuplicates scores primary mapped reads based on the sum of base quality scores for both mates of a pair.
rmdup retains the read pair with the highest mapping quality, and removes any duplicate read pairs from the output BAM/SAM file.

Typical duplication rates:

Library	Duplication rate
IP	30-60%
Input	1-10%

Average sample duplication rates will depend on e.g. the read length, sequencing depth, etc. For example, in the table above, the larger numbers in the ranges may correspond to shorter read lengths, whereas longer reads will give smaller duplication rates. Additionally, increasing the sequencing depth will lead to an increase in duplication rates.

Unexpectedly high duplication rates may be due to:

binding factors interacting predominantly with very few sites,
binding factors interacting predominantly with highly repetitive genomic regions (e.g. a Pol-I associated TF that interacts primarly with the rDNA repeat units)
a failed IP and/or the amount of IP’ed chromatin being too low for library construction resulting in a large amount of PCR cycles giving rise to a large number of PCR duplicates

5.5.3 Fingerprint plot

A powerful tool to assess the quality of libraries from a ChIP-seq experiment is to plot the fingerprint of a library. The term “fingerprint (plot)” has been coined by the deepTools dev’s, but the original idea was introduced by Diaz et al.

From deepTools:

This quality control will most likely be of interest for you if you are dealing with ChIP-seq samples as a pressing question in ChIP-seq experiments is “Did my ChIP work?”, i.e. did the antibody-treatment enrich sufficiently so that the ChIP signal can be separated from the background signal? (After all, around 90% of all DNA fragments in a ChIP experiment will represent the genomic background).

The fingerprint of a ChIP/input library characterises how much of the genome (that is covered by reads) is covered by how large a fraction of the total number of reads. The following figure is taken from deepTools’ documentation. ChIPseq_fingerprint

A simple example on how the fingerprint distribution is calculated from a read distribution is shown in the following diagram.

example_fingerprint

Detour: Fingerprint plots are closely related to Lorenz curves in economics. Usually in a Lorenz curve the fraction of a population is plotted against the fraction of total income, and they are a useful tool for assessing income inequality. If income is perfectly equally distributed, the Lorenz curve will be a straight line (just like an equal non-specific read distribution across the genome will give rise to a straight line with zero offset and unit slope). In the context of Lorenz curves, it is common to summarise income inequality in terms of the Gini coefficient: the larger the Gini coefficient, the larger the income inequality. It is interesting to note that – to my knowledge – the close connection between fingerprint plots and Lorenz curves has not been discussed. In ChIP-seq fingerprint plots, the corresponding Gini coefficient could be used as a critical summary statistic, characterising the degree of specific binding, with a zero Gini coefficient indicating completely uniform non-specific background binding.

5.5.4 Further QC

Further quality control checks typically involve a principle component analysis of read coverage (1) across the full (binned) genome, and (2) within identified common peak regions. Following appropriate centering of read counts, samples should separate along the first principle component according to whether they are input or ChIP libraries. Separation on the second principle component should then occur according to e.g. biological replicates, time-points or ChIP antibody (depending on the experiment design).

Additionally, a correlation analysis based on either counts directly or on their ranks gives a summary statistic characterising within-replicate reproducibility. It is important to keep in mind that Pearson’s product moment correlation coefficient will dependent on the dynamic range of the two input variables; we therefore expect on average larger correlation coefficients in pairwise comparisons of input libraries than of ChIP libraries.

5.6 References

More web resources

Notwithstanding the username, crazyhottommy lists many resources/links associated with ChIP-seq data analysis.

5 ChIP-seq data analysis