(upbeat music) - [Mason Brooks] Greetings everyone. My name is Mason Brooks, and welcome to this installment of the Active Motif webinar series. Today's presentation will be given by Steve Stelman, Active Motif's director of IT and bioinformatics.
Steve will be walking us through the bioinformatics data analysis process that occurs after all the ChIP-Seq work in the lab has been done, that allows scientists to make sense of their results. - [Steve Stelman] Thank you for the introduction. I'm happy to be here, and glad you could join us.
Today I'll be presenting an introduction to bioinformatics pipelines for ChIP-Seq data. The presentation is meant to be a basic primer for those unfamiliar with ChIP-Seq pipelines. We plan to revisit this subject matter in greater depth in future presentations.
Before jumping into the steps for ChIP-Seq data analysis, it's useful to consider the types of questions that might lead one to perform ChIP-Seq. For example, I'm interested in the estrogen receptor alpha transcription factor, and want to find out where in the genome it is acting. What do I do?
Or perhaps I have tissue from a cancer patient, and I want to know where enhancers are located. What do I do? These are both great examples of the type of question that ChIP-Seq experiments are suited to answer.
ChIP-Seq can provide accurate, reproducible identification of histone modifications across the genome. ChIP-Seq can also map binding sites of transcription factors, identify consensus motifs, and map changes in transcription factor binding. ChIP-Seq combines chromatin immunoprecipitation, or ChIP, with next generation DNA sequencing.
Initially DNA is cross-linked to bound proteins fragmented by sonication, and then immunoprecipitated with a specific antibody. The enriched DNA is purified and used to prepare sequencing libraries. Because next generation sequencing produces so much information, an analysis pipeline is required to reduce this massive genome-scale data to something digestible that can be used to interpret answers to scientific questions, such as those posed at the beginning of this presentation.
This diagram represents the ChIP-Seq bioinformatics pipeline. The steps of the ChIP-Seq bioinformatics pipeline which we will review in this presentation are, one, sequencing including quality control, two, mapping including duplicate removal and tagged normalization. Three, peak calling, including blacklist filtering and peak quality control.
Four, calling differential peaks between samples. Five, annotating peaks for both genes and conserved motifs. Six, preparing fragment density bigWig files, and seven, visualizing the data.
The first step begins in the lab with the production of ChIP-Seq libraries, each with a unique DNA index. Multiple libraries are pooled together and then sequenced on an Illumina DNA sequencer. It is important to obtain a sufficient number of sequencing reads necessary for data interpretation.
This is also called depth of sequencing. The required sequencing depth depends on the assay and the tissue. But for mammalian cells and tissues, we typically recommend around 30 million reads to get a good picture of transcription factor binding, or histone mark locations.
After sequencing, the Illumina base called data files need to be converted from bcl format to readable DNA sequence. Illumina's bcl2fastq software performs this conversion, while simultaneously demultiplexing the pooled samples using sequence of the unique indices. The resulting FASTQ format files contain information about each sequencing read, including a unique identifier, the sequence of the index, the sequence of the insert, and information about the quality of the sequence data encoded in Phred+33 form.
Before continuing with further analysis, evaluating sequence quality is important to ensure high quality data has been obtained. The FASTQC software provides a quick evaluation of a number of metrics. Note that with ChIP-Seq data, it is common for not all metrics to have passing quality, but the following metrics are important to review.
Per base sequence quality shows the values of the FASTQ quality string across the entire sample. High quality across the entire sequence is important for subsequent steps. Per sequence GC content is shown as two curves representing the actual and theoretical values.
These two curves should roughly match. A large difference in these curves can also indicate problems with the data. Finally, the frequency of overrepresented sequences, shows whether unexpected remnants such as untrimmed adapters, or large stretches of homopolymers, due to poor library prep, may exist.
A second useful software package to run for QC of FASTQC data is the FASTQ-Screen software. This sub-samples each FASTQ sequence for 100,000 reads, and maps them to user defined genomes. This allows a quick check for the presence of sequences from other organisms, which might indicate sample mix up or cross contamination.
In this example, most of the sequence maps accurately to the mouse genome. This would be a red flag, if you expected your samples to be a human origin, for example. The next stage of our bioinformatics pipeline is to map the FASTQ data.
Each Sequence read, also called a tag for ChIP-Seq data, is mapped to the reference genome of the organism from which the starting library was constructed. This produces an alignment file in SAM, or preferably the more compact, BAM binary format. Popular software packages for mapping Illumina DNA data are BWA and Bowtie, both of which are based on the Burrows-Wheeler transform.
The SAM tool software is useful for SAM to BAM format conversion if necessary, as well as sorting the BAM file by the mapping position of each tag. Producing a position sorted BAM alignment file, is the goal of this step. Before proceeding further, it is good practice to remove potential PCR duplicates which can arise if multiple PCR products from the same template molecule bind on the flow cell, and can occur due to over amplification of the starting material.
In the simplified diagram shown here for single-line sequencing, potential PCR duplicates are marked by the red arrow, and have the same five prime locations on the reference genome. Removing these duplicates is important to avoid biasing the results with artificially enriched regions. Note that for paired end sequencing, the five prime map coordinates of both mate pairs would be considered.
While the Picard software mark duplicates tool is more accurate for paired end reads, The SAM tools remove dupe tool may be faster for single end reads. To more accurately remove PCR duplicates, and to avoid removing real biological duplicates, unique molecular identifiers, also called molecular IDs can be used. This is accomplished through the use of an additional sequencing index in the library, which consists of a unique randomer to identify each molecule.
In the simplified diagram here, the duplicate tags are shown under the red arrow, but this time each is distinguished by different colored dots representing the UMIs, while the two red dots are PCR duplicates with identical UMI sequences. The blue, green and pink dots, represent true biological duplicates with unique UMI sequences, and are not removed from the BAM alignment. This approach, while more accurate, requires protocol modifications, during the library construction, Illumina sequencing and deduplication steps.
To adjust for differences in sequencing depth and mapping efficiency between samples, normalization is performed. Normalization is essential to detect site specific as well as most global differences in target enrichments between samples. To achieve this tag numbers of all samples within a comparison group are reduced by random sampling to the number of tags present in the smallest sample.
Spike-in adjustment can be useful for rare essays where standard normalization does not result in the expected global differences in enrichments between test samples. If spike-in of Drsophila chromatin was performed, the number of test tags are adjusted again by downsampling, by a factor that would result in the same number of spike-in Drsophila tags for each sample. The third step of our ChIP-Seq pipeline is peak calling.
An interval peak is a genomic region where ChIP-enriched tags are around the site where the protein of interest is bound. There are two general types of peaks, and peak calling algorithms, narrow and broad. Transcription factor binding sites, as well as many active histone marks and methyl-C enriched regions are examples of narrow peaks.
Histone modifications covering entire gene bodies such as repressive histone marks, and RNA polymerase II are examples have broad peaks. As far as software used for peak calling, the MACS2 software has modes for both narrow and broad peak calling. Another package SICER, is used specifically for broad peaks.
Input data, which is non-IP DNA, is typically used as control files to increase specificity of peak calling, by comparing to background. The input can be pooled among all or specific groups of samples depending on the experimental design. The output peaks are stored in a BED file, which is a tab-delimited text format.
The ENCODE Consortium produces lists of organism-specific regions where signal is anomalous, unstructured, or unusually high. Since these high signal regions or black lists are independent of cell line or experimental conditions, it is essential to remove these regions from predicted peaks when analyzing ChIP-Seq data. This is done by filtering out interval peaks that intersect with ENCODE blacklist regions using a tool, such as BEDtools intersect.
Typically, a minority of reads in ChIP-Seq experiments occur in significantly enriched genomic regions or peaks. The remainder of the reads represent background. Therefore, the FRiP score, or fraction of reads in peaks, is a useful first pass quality metric for the success of the immunoprecipitation.
It is calculated by dividing usable reads in significantly enriched peaks, by the total of all usable reads. The resulting FRiP score allows comparison of results from IPs with the same antibody across different cell lines, or with different antibodies against the same factor. This, of course assumes that the same peak-calling algorithm and parameters were used.
With exception for known outliers, the ENCODE Consortium recommends scrutinizing experiments where the FRiP falls below 1%. The fourth stage of our pipeline is differential analysis. In order to perform differential analysis between samples, first, the interval peak data for each sample, needs to be combined into what we refer to as merged regions.
Merged regions are defined by the coordinate of the most upstream overlapping interval and the end coordinate of the most downstream. For each merged region contributing peak statistics for each sample are recorded, as well as a binary flag for the presence of the merged region within each sample. There are two main ways to identify peaks within the same merged region that have differential signal.
In the first method, the fold change between pairs of conditions can be calculated, and a threshold can be set to identify regions with a strong change in signal. In the second method, DESeq2 can be used. If replicates are available, this method can add statistical rigor to the differential analysis, and each site will have an associated p-value.
The fifth stage of our pipeline relates to annotation. Locations of genomic annotations that fall within, for example, 10 kB upstream or downstream of a peak, are added to that peaks list of annotations. These annotations, including gene lists, can provide targets for future downstream analysis.
Predicting motifs near peaks is useful to confirm the specificity of the antibody used, as well as to identify candidate proteins that bind to the same site, such as cofactors for your protein of interest. Software packages such as HOMER and MEME, allows sequences surrounding peaks to be analyzed for known or de-novo motifs. The sixth step of our pipeline is to create a signal map in bigWig format, which can be used to display fragment density in a genome browser.
This histogram or signal map, shows the normalized tag coverage as the number of reads, for example, per 32 base pair bin. This provides a useful visual way to compare the enriched fragment density across different ChIP-Seq samples. This brings us to the seventh and final step of our pipeline, visualization.
One benefit of creating bigWig embed files is their utility with visual genome browsers. These files can be loaded as visual tracks along the genome and genomic annotations. Here, the fragment densities in the bigWig are displayed in the top purple track.
The peak interval is displayed in orange, and the individual aligned reads or tags, are displayed in red. At the bottom are genomic annotations for this region, which can be inspected to determine the effect of different experimental conditions as represented by the tracks. IGB and IGV are two useful visualization tools that can be downloaded and run locally on your own computer.
The advantage of local desktop genome browsers is the ease of visualizing any custom data tracks you have produced. If you prefer a web based tool, the UCSC Genome Browser provides the ability to upload custom tracks for display, along with their hosted annotations. Unlike with desktop browsers, custom bigWig files must be hosted on an FTP or web server in order to view in the UCSC browser.
To review, we've now completed our discussion of the seven primary steps of the ChIP-Seq Bioinformatics pipeline. Sequencing, mapping, peak calling, differential analysis, annotations, fragment density and visualization. While such pipelines undoubtedly, require some programming knowledge to construct and customize, some of the open-source software tools we have discussed in this presentation, are freely available for download, as noted in the links shown on this slide.
At the beginning of this presentation, we posed a couple questions that we will now review. If I'm interested in the ER alpha transcription factor, and want to find out where in the genome it is acting, I can use transcription factor ChIP-Seq. If I have a tissue from a cancer patient and want to know where enhancers are located, I can use histone mark ChIP-Seq.
So, in conclusion, a ChIP-Seq bioinformatics analysis pipeline, reduces genome-scale next-generation sequence data into digestible answers to questions, such as where proteins bind to DNA, and which gene may be regulated, or where histone modifications are located and what genes may be under epigenetic control. Now, before we say goodbye, it's worth noting that ChIP-Seq isn't always the end of the search for answers. Questions may persist such as, how does one go about validating potential targets discovered with ChIP-Seq, or how do other features of the epigenome relate.
For these we may need to look at integration of ChIP-Seq with other technologies. Examples may include RNA-Seq, which measures transcriptome-wide expression patterns, ATAC-Seq which stands for, Assay for Transposase-Accessible Chromatin using sequencing, and which measures genome-wide chromatin accessability, and RIME, which stands for, Rapid Immunoprecipitation Mass-spectrometry of Endogenous proteins. And deduces transcriptional cofactors and chromatin associated proteins.
These in addition to ChIP-Seq and many others, are all available services offered by Active Motif. More information on any of these can be found on the Active Motif website. I'd also like to acknowledge the contributions to the presentation, of members of our Bioinformatics team at our Epigenetic Services Group.
Doctors Paul Labhart, Wayne Doyle and Felizza Gunderson. Thank you very much for taking the time to listen to our webinar on an introduction to bioinformatics pipelines for ChIP-Seq. If you have any questions, or would like additional information, please visit our website www.
active motif. com, or contact us at, tech_service@activemotif. com.