In the Transcriptomics 1 course we cover key concepts needed to understand the analysis of gene expression data. We've already discussed how genes are expressed in the cell and what methods are available to detect and quantify expression. In this video we will focus on the analytical steps used to process next-generation sequencing data. Next-generation sequencing technology allows for advanced studies of gene expression because it captures a snapshot of the whole transcriptome rather than the predetermined subset of genes that was previously possible with RT-PCR and microarray technologies. Whole-transcriptome sequencing provides a comprehensive view of the cellular transcriptional profile at a given biological moment and greatly enhances the power of RNA discovery methods. To study this data we have to understand and run processing pipelines that turn raw reads into structured data ready for analysis. Next-generation sequencing is a technology that captures DNA sequences and generates digital data in several steps. First, RNA that was extracted from cells is converted to cDNA, which is then sheared and placed into a flow cell. Inside the flow cell, the cDNA fragments are amplified using bridge PCR amplification, and the flow cell is then inserted into the sequencer.
This machine uses image analysis to capture each letter in the flow cell fragments, analyzing visual patterns and converting them into a sequence of letters. Typically, NGS reads are between 30 and 300 base pairs long, and they consist of a series of the letters A, T, C, and G. The main question now is how to analyze these sequences to find which genes they came from. To better understand these steps and how they influence analysis, let's take a look at the sequence elements that we'll be capturing in the process. This is the reference genome, where you have gene A, gene B, and gene C. Not everything is transcribed: for example, gene A and gene C are transcribed, but gene B is not. The product that is transcribed is an RNA isoform, or transcript, of the gene as it is encoded in DNA. This transcript gets fragmented as it is converted into cDNA fragments, and adapters are attached at the beginning and the end of each fragment. The nucleotide information is amplified using PCR and is only then read by the sequencer. As a result, we get reads, which are short sequences of genetic code that are stored in FASTQ files.
Sequencing can be single-end or paired-end. Paired-end sequencing combines two reads that were sequenced from opposite ends of a DNA fragment; because the reads contain overlapping sequences, they can be merged into a single read that is more reliable. However, the reads are stored in separate files: in the case of paired-end sequencing, each sample will have two FASTQ files. You can see this in the file names; file names ending with _1 and _2 are used to name paired-end sequence data. Inside each FASTQ file, labels are used to name each read, followed by the sequence of that read and a quality score for each nucleotide letter.
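As a minimal sketch of that layout, assuming a pair of hypothetical files named sample_1.fastq and sample_2.fastq, the four-line FASTQ records could be read in Python like this:

```python
# Minimal sketch: read paired FASTQ files record by record.
# File names are hypothetical; real data is often gzip-compressed,
# in which case gzip.open(path, "rt") would be used instead of open().

def fastq_records(path):
    """Yield (label, sequence, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            label = handle.readline().rstrip()     # line 1: read name, starts with '@'
            sequence = handle.readline().rstrip()  # line 2: nucleotide sequence
            handle.readline()                      # line 3: separator line, starts with '+'
            quality = handle.readline().rstrip()   # line 4: per-base quality string
            if not quality:
                break
            yield label, sequence, quality

# Paired-end data: the _1 and _2 files are read in parallel, one record per mate.
for mate1, mate2 in zip(fastq_records("sample_1.fastq"),
                        fastq_records("sample_2.fastq")):
    pass  # each pair of records comes from opposite ends of the same fragment
```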
The reads have to be aligned to the reference genome to find where they came from: they will be aligned, using match quality, to the sequences of the genes, and then the number of reads across the gene sequence will be quantified to give us a level of expression. The alignment process is also known as mapping, and it is a procedure that determines a match score for each read. Even for a short read, finding its place in the genome can be computationally expensive: the human genome is over 3 billion base pairs long.
The best match is determined by how closely the sequence of the read matches the sequence of the reference genome. Let's take a look at the alignment of reads to the reference genome by running our first pipeline. Once the files are loaded, in the start function we can use Bowtie 2 to align the reads to the reference genome and visualize alignment coverage and alignment quality using the visualization method. Bowtie 2 is an aligner that uses the Burrows-Wheeler transform to align short reads to the long sequence of the reference genome.
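As a hedged sketch of what such an alignment step could look like outside the course interface, Bowtie 2 might be called from Python roughly like this (the index prefix and file names are hypothetical):

```python
import subprocess

# Sketch of a Bowtie 2 alignment step; index prefix and file names are hypothetical.
# bowtie2-build creates the Burrows-Wheeler index of the reference genome once.
subprocess.run(["bowtie2-build", "genome.fa", "genome_index"], check=True)

# Align a paired-end sample: -x is the index prefix, -1/-2 the two FASTQ files,
# and -S the output alignment in SAM format.
subprocess.run(
    ["bowtie2", "-x", "genome_index",
     "-1", "sample_1.fastq", "-2", "sample_2.fastq",
     "-S", "sample.sam"],
    check=True,
)
```

Note that Bowtie 2 itself is not splice-aware; the splice-aware pipelines built on top of it, such as TopHat and HISAT, are discussed later in this video.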
As a result, we will see the visualization of reads aligned to the reference genome. The sample name is mentioned here and can carry multiple layers of information. Alignment coverage shows how many reads align to a region, and the alignment track visualizes the types of reads and any changes that might be observed relative to the reference genome. This visualization takes its information from the alignment file, which is stored in SAM or BAM format and is used in many subsequent steps of analysis. Let's take a look at the different files that are used to compile this visualization.
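For example, the per-region read counts behind such a coverage track could be pulled from a BAM file with the pysam library; this is only a sketch, under the assumption that the BAM file is coordinate-sorted and indexed, and the coordinates are made up:

```python
import pysam

# Sketch: count reads overlapping a region in a sorted, indexed BAM file.
# The file name, chromosome name, and coordinates are hypothetical.
with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    n_reads = bam.count("15", 40000000, 40005000)  # reads overlapping chr15:40,000,000-40,005,000
    print("reads overlapping the region:", n_reads)
```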
Sequencing data is stored in several different file formats. For example, FASTQ files provide us with the reads from the sequencer. FASTA is a text-based format for representing DNA and protein sequences; typically this is the reference file for the genome. Then we have the gene transfer format, or GTF, file, which contains the genome annotation: for example, the GTF will store the start and end of each genomic element, such as exons, isoforms, and genes. To process next-generation sequencing data we must map reads onto the reference genome, and to do so we have to make sure we know where on the genome the exons are.
This information is in the GTF file. Then we take the sequences from the FASTQ file, where we have short reads, and align them to annotated positions on the genome. The genome reference gives us information on where the gene is located and what exons are on that gene. Then, as different isoforms are created from the single gene reference, different reads are mapped onto specific exons, and they are mapped in different quantities, as the FASTQ files provide different read sequences.
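A GTF file is tab-separated with nine columns; a minimal parsing sketch, with a made-up example line, might look like this:

```python
# Sketch: parse one (hypothetical) GTF line into its nine tab-separated fields.
line = ('15\thavana\texon\t40000000\t40001200\t.\t+\t.\t'
        'gene_id "GENE_A"; transcript_id "GENE_A-201"; exon_number "1";')

seqname, source, feature, start, end, score, strand, frame, attributes = line.split("\t")

# The attribute column holds key/value pairs such as gene_id and transcript_id.
attrs = {}
for field in attributes.strip().strip(";").split(";"):
    key, _, value = field.strip().partition(" ")
    attrs[key] = value.strip('"')

print(feature, seqname, start, end, attrs["gene_id"], attrs["transcript_id"])
```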
As you can see from a GTF example like this, on chromosome 15 we have an exon that starts and ends at a specific place and belongs to a given gene and transcript ID, and so on. A typical RNA-seq pipeline includes three main steps: pre-processing, mapping, and quantification. Pre-processing is needed to clean up our data by removing the adapters, trimming some of the reads, and removing the PCR duplicates; this is important because PCR amplification is not uniform across all reads. Then we take the cleaned-up read sequences and map them onto a reference, using a FASTA file for the genome sequence and GTF files for annotation.
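Put together, these three steps can be thought of as a small pipeline. The skeleton below is purely illustrative; the function bodies stand in for whichever tools a given pipeline uses, and all names are placeholders:

```python
# Illustrative skeleton of a three-step RNA-seq pipeline (all names are placeholders).

def preprocess(fastq_pairs):
    """Remove adapters, trim low-quality bases, drop PCR duplicates."""
    return fastq_pairs  # placeholder: cleaned FASTQ files

def map_reads(fastq_pairs, genome_fasta, annotation_gtf):
    """Align cleaned reads to the reference genome using the GTF annotation."""
    return ["sample.bam"]  # placeholder: alignment files

def quantify(bam_files, annotation_gtf):
    """Assign reads to genes/isoforms and produce an expression table."""
    return {"GENE_A": 0}  # placeholder: expression values per gene

bams = map_reads(preprocess([("sample_1.fastq", "sample_2.fastq")]),
                 "genome.fa", "annotation.gtf")
expression = quantify(bams, "annotation.gtf")
```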
Based on the quality of annotation, we can use various strategies for mapping. Once this step is complete, we have to quantify the expression levels and give each expressed element a number. In the beginning, the uploaded files have to be unzipped, the read pairs have to be grouped together with the right reference, and the pipeline folder structure has to be built; the start button runs these scripts at the beginning of the pipeline. Cleaning reads of unwanted variation can make a large difference to the speed and quality of mapping and assembly results. These steps include removing reads and/or bases that come from unwanted sequences,
such as poly-A tails in RNA-seq data; bases that were artificially added onto the sequence of primary interest, such as vectors, adapters, and primers; bases that join short overlapping paired-end reads; low-quality bases; and bases that originated from PCR duplication. Pre-processing also produces a number of statistics that are used to evaluate experimental consistency. Bowtie 2, STAR, and BWA are alignment algorithms that look for the optimal mapping position of a read on the reference genome. A variety of algorithms have been introduced to improve the speed, efficiency, and flexibility of alignment when considering standard genomic variations such as insertions, deletions, and rearrangements.
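To make the pre-processing step above concrete, adapter and quality trimming is often delegated to a dedicated tool. The sketch below assumes cutadapt is available; the tool choice, adapter sequence, and file names are assumptions rather than something prescribed by the course:

```python
import subprocess

# Sketch: adapter and quality trimming of a paired-end sample with cutadapt.
# The adapter sequence and file names are hypothetical.
subprocess.run(
    ["cutadapt",
     "-a", "AGATCGGAAGAGC",            # adapter to remove from read 1
     "-A", "AGATCGGAAGAGC",            # adapter to remove from read 2
     "-q", "20",                       # trim low-quality bases from the 3' end
     "-o", "sample_1.trimmed.fastq",
     "-p", "sample_2.trimmed.fastq",
     "sample_1.fastq", "sample_2.fastq"],
    check=True,
)
```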
Isoform construction and merging of isoforms in the GTF file occur once the reads are aligned to the reference genome, with regions of abundant coverage considered to be exons. Reads are then analyzed to find those that contain the end of one exon and the beginning of another, identifying them as junctions. Once all junctions are identified, exons spliced together are assigned to an isoform, and isoforms are recorded in the updated GTF file. Cuffmerge is responsible for updating the GTF file, where the positions of exons and their order in each isoform are stored.
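As an illustration of that merging step, Cuffmerge is typically given a text file listing the per-sample assembled GTFs together with the reference annotation and genome. A rough sketch, with hypothetical file names, could be:

```python
import subprocess

# Sketch: merge per-sample assembled transcripts into one updated GTF with Cuffmerge.
# File names are hypothetical; assemblies.txt lists one assembled GTF path per line.
with open("assemblies.txt", "w") as listing:
    listing.write("sample1/transcripts.gtf\n")
    listing.write("sample2/transcripts.gtf\n")

subprocess.run(
    ["cuffmerge",
     "-g", "annotation.gtf",   # reference annotation to guide the merge
     "-s", "genome.fa",        # reference genome sequence
     "-o", "merged_asm",       # output directory containing the merged GTF
     "assemblies.txt"],
    check=True,
)
```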
TopHat and HISAT are versions of the same algorithm that combine several steps, such as exon detection, isoform construction, and the merging of the GTF file. Once the annotation of alignment patterns and genomic elements is available, reads have to be realigned to this information for quantification. A typical pipeline will combine these methods into a sequence of scripts or algorithms that can produce certain outputs useful for further steps. For example, after the files are loaded, they can be aligned to the genome using Bowtie 2 and then visualized. An alternative route can use alignment to known transcripts
and ignore any transcription variation we can find in a given set of samples. We can also run more detailed pipelines that will map reads onto the genome, identify exon junctions, and link them into isoforms; this creates a transcriptome, and the pipeline then maps reads to that transcriptome and quantifies all found transcripts during the quantification step. Different algorithms generate specific files that can be useful in some subsequent step, so the pipeline will limit connections between steps that do not have the appropriate file outputs, to make the pipeline function properly. One major difference between these pipelines is the mapping step.
We can use known transcripts to find the ones transcribed in a given dataset by using different mapping strategies; the difference is in the output and in the time it will take to process a dataset. After the mapping is complete, we have to quantify RNA expression. There are three ways this can be done: raw counts, RPKM/FPKM, and TPM. These are normalization techniques that are typically used to eliminate the variation caused by differences in coverage and sample quality. FPKM stands for fragments per kilobase of transcript per million mapped reads.
This approach quantifies expression normalized for the total number of reads available, with the read counts also divided by gene length. RPKM is the same as FPKM but applied to single-end reads. TPM likewise accounts for the length of the gene: read counts are first divided by gene length and then rescaled so that the values in each sample sum to one million.
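As a minimal sketch of how these units relate to raw counts (the gene names, counts, and lengths below are made up):

```python
# Sketch: compute FPKM and TPM from raw counts and gene lengths (all values hypothetical).
counts = {"GENE_A": 1200, "GENE_B": 300, "GENE_C": 4500}    # raw mapped fragments per gene
lengths_kb = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 3.0}  # gene length in kilobases

total_fragments = sum(counts.values())

# FPKM: fragments per kilobase of transcript per million mapped fragments.
fpkm = {g: counts[g] / lengths_kb[g] / (total_fragments / 1e6) for g in counts}

# TPM: length-normalized counts rescaled so that each sample sums to one million.
per_kb = {g: counts[g] / lengths_kb[g] for g in counts}
scale = sum(per_kb.values()) / 1e6
tpm = {g: per_kb[g] / scale for g in counts}

print(fpkm)
print(tpm)
```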
Now that we have reviewed the theoretical part, let us run the pipelines to practice what we have learned. The first pipeline is the simplest: we will load the data and quantify it using alignment-free quantification. The result will be quick, but we might find that, while all possible transcripts are included, we will have many with zero expression levels; that means we will have to work more with the output table. Another issue we'll notice is that Sailfish will quantify transcripts, not genes; in some cases this will require additional steps to make gene expression tables. The second pipeline we can build will include pre-processing steps and then use the GTF file with RSEM quantification to produce a set of standard outputs: filtered gene and isoform expression tables in raw count and FPKM units. We can leverage the TopHat or HISAT
option to build a pipeline that will map reads onto the genome and construct new isoforms from this sample set. The resulting tables will be different from the previous pipeline, because we mapped onto the genome and possible new isoforms can be found in the output file. The same pipeline can contain pre-processing steps; typically, the effect of pre-processing is most visible in the genes that are highly expressed in the dataset. After HISAT or TopHat 2, an alternative method for quantification can be used: HTSeq is a Python library that was designed to facilitate parsing of the common data formats
used in high-throughput sequencing projects, as well as to provide classes to represent data such as genomic coordinates, sequencing reads, alignments, gene model information, and variant calls, along with data structures that allow querying via genomic coordinates. Finally, we can use a more detailed approach to analyze the way reads align to the reference genome: Augustus is a method that starts with exon detection and then looks for junctions between the identified exons. This method will have the most accurate and specific results for non-standard gene isoform variations; oftentimes this pipeline will be reserved for non-model organisms, where a well-annotated GTF file is not available.
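To illustrate the HTSeq library mentioned above, a per-gene read-counting sketch in the spirit of its documentation might look like this; the file names are hypothetical, and real pipelines usually just call the bundled htseq-count script instead:

```python
import collections
import HTSeq

# Sketch: count reads per gene from a BAM file using an exon model built from a GTF.
# File names are hypothetical.
exons = HTSeq.GenomicArrayOfSets("auto", stranded=False)
for feature in HTSeq.GFF_Reader("annotation.gtf"):
    if feature.type == "exon":
        exons[feature.iv] += feature.attr["gene_id"]

counts = collections.Counter()
for aln in HTSeq.BAM_Reader("sample.bam"):
    if not aln.aligned:
        continue
    gene_ids = set()
    for iv, genes_here in exons[aln.iv].steps():
        gene_ids |= genes_here
    if len(gene_ids) == 1:           # count only reads that map unambiguously to one gene
        counts[gene_ids.pop()] += 1

print(counts.most_common(5))
```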
A typical output from these pipelines will vary, but for the most part the results will be clearly organized by element, such as exon, isoform, and gene. We will also see normalized and filtered tables of gene or isoform expression, and we can check the quality of our mapping by looking at the mapping statistics file. We've processed the raw read sequences and quantified them, and we now have structured data in a table format where the columns are sample names and the rows are gene or isoform IDs; inside each cell we will find the quantified expression level.
Depending on the quantification method, this number can be in FPKM, RPKM, TPM, or raw count units. Looking through the resulting table, you will notice that some gene IDs start with the letters ENSG and some start with ENSMUS; these prefixes mark human and mouse genes and isoforms, respectively. Since we used a concatenated human-mouse genome, we can separate the stroma from the tumor expression. The bars on the bar chart show an expression profile. Now we can work with this table and further explore the expression profiles of specific genes and gene sets, and annotate them by their biological functions.
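A sketch of that separation step, assuming the table has been loaded into pandas with gene IDs as the row index (the file name is hypothetical):

```python
import pandas as pd

# Sketch: split a combined human/mouse expression table by Ensembl gene ID prefix.
# The file name is hypothetical; rows are gene IDs, columns are sample names.
expression = pd.read_csv("expression_table.csv", index_col=0)

human = expression[expression.index.str.startswith("ENSG")]    # human gene IDs
mouse = expression[expression.index.str.startswith("ENSMUS")]  # mouse gene IDs

print(human.shape, mouse.shape)
```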
In the Transcriptomics 1 course we covered key concepts needed to understand the analysis of gene expression data; in this video we focused on the analytical steps needed to process next-generation sequencing data.