Video Transcript:
Hello everyone, and thank you for taking the time to join our webinar. Today's topic is next-generation sequencing data analysis, covering the basics as well as whole genome sequencing, to give you a better idea of what you can do with your NGS data. Our webinars are designed to provide helpful guidelines and information that will keep you up to date on important areas of life science research. After the webinar we will be sharing a copy of the slides and the recorded webinar. If you are signed into a Google account, please use the chat box on the right side of the screen to submit any questions you may have, and if you did not register for the webinar earlier, there is a registration link in the video description.

Today we will begin with a brief overview of what next-generation sequencing data looks like and how the data can be processed, and then work through a few examples of the different types of analysis that can be done and the tools that can be used to accomplish them. Because we have a lot to cover in this webinar, we will do a briefer intro than in our previous NGS webinars.

Now I'd like to take a moment to introduce my colleague Dr. Christopher, an NGS specialist at ABM with nearly 10 years of experience in research and experimental design; he can assist with nearly any project, from initial setup to post-sequencing data analysis. Alongside him you have myself, Baozhi, a product specialist with over six years of R&D experience in molecular biology and biochemistry. Our main goal is to provide researchers with the support that ABM has been renowned for.

For those of you who are new to ABM, I'd like to provide a bit of background on our company. ABM was founded in 2004 in Vancouver, Canada, and we have been working hard to catalyze scientific discoveries in the fields of life science and drug development for the last 15 years. Due to our success, we've been able to expand our reach with a global distribution network, a branch in China, and a soon-to-be-opened branch in the United States. These expansions have allowed us to better serve researchers around the globe with world-class support, wherever you may be. With our team of dedicated scientists, ABM is committed to empowering researchers with the latest innovations for all their scientific needs.

With this introduction complete, here is a brief overview of the topics for today's webinar. First, I'll provide a brief review of NGS sequencing before passing things over to Christopher, who will go over the finer details of NGS analysis. With NGS data, regardless of your project type, there are some important things to consider. Whether you are interested in whole genome sequencing to compare mutations between samples, using RNA-Seq to learn about differentially expressed genes, or studying microbial diversity as part of a metagenomics study, NGS sequencing data is the starting point for each of these. There is a simple workflow where you begin with starting material, either DNA or RNA, prepare libraries, sequence each sample, and then begin data analysis. As an Illumina-certified
service provider, ABM uses several different Illumina technologies for our NGS services. The main advantages of Illumina sequencing are the ability to run many samples together at the same time, the large amount of data that can be generated from a single sequencing run, and the unparalleled accuracy and speed of the Illumina platform. However, there are a few things to consider alongside this. For example, while large amounts of data can be generated, this amount of data can be overwhelming for those who are new to NGS. In addition, bioinformatic analyses often require a highly skilled bioinformatician. And finally, larger data sets and certain types of analysis can take a significant amount of time and computing power to run.

With that being said, I'll hand things over to Christopher, who will walk you through some of the basics of NGS data and different analysis workflows. And just a quick reminder to please post any questions you have in the chat box during the course of the webinar for our Q&A at the end. Thank you.

Thank you, Baozhi, for that lovely introduction. Next, I'm going to go over how to understand the data outputs from next-generation sequencing. There is a basic workflow that we'll go through, first for whole genome sequencing and then for RNA-Seq. The starting point is always the raw sequencing data. Next, this is processed and associated with quality control information. The sequencing data is then aligned relative to a reference genome or reference sequence. You can then identify specific mutations in your samples, use different software to visualize them, and finally verify that the mutations you identified are in fact present.

The basic workflow is that you input DNA, prepare a sequencing library from a lot of samples, and then sequence it. One of the powerful benefits of next-generation sequencing is that you can run many samples together, but how do you know which data set belongs to which sample after sequencing? To separate these out, there's a specific process we'll discuss that can correlate each data set back to the original sample you were working with.

The raw output for all next-generation sequencing runs is BCL files, or binary base call files, which are generated by the Illumina sequencers during the sequencing process. Because you can sequence multiple samples together with NGS, you have to take the pooled sequencing data, separate it afterwards, and assign each data set to a specific sample. This process is called demultiplexing: essentially, it's the process of separating sequencing data into separate files for each individual sample. This converts the data from the BCL format we previously discussed to the FASTQ format. FASTQ is basically a text-based format for storing nucleotide sequences and their associated quality scores. The BCL files contain all of the data from all samples in a sequencing run, so when you start with a BCL file it contains nucleotides from every sample, and after demultiplexing you'll be able to tell which specific sample each set of sequencing data belongs to.

After you convert the BCL files to FASTQ, you might ask, what does the data look like now? FASTQ data appears as four distinct lines, which we'll go through one by one. First is the sequence ID, which includes information about the specific sample as well as the sequencing platform. Next is the raw nucleotide sequence from that sample. Third, there's a plus symbol, which is used simply as a spacer between lines two and four. And the fourth line contains the quality values for each nucleotide shown in line two. Together, this is considered the FASTQ data for a given read. But what does the quality score line actually mean? Essentially, it's a way of using numbers, letters, and symbols to represent the quality, or accuracy, of each base call identified during the sequencing process.
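The four-line FASTQ record and its quality encoding can be illustrated in a few lines of Python. This is a minimal sketch with an invented read, assuming the standard Phred+33 encoding (quality score = the character's ASCII code minus 33) used by modern Illumina output:

```python
# A minimal sketch of one FASTQ record; the read itself is invented here,
# whereas real records come from demultiplexing the BCL files.
record = (
    "@SEQ_1 instrument:run:flowcell:lane\n"  # line 1: sequence identifier
    "GATTACAGAT\n"                           # line 2: raw nucleotide sequence
    "+\n"                                    # line 3: separator line
    "IIIIHHH###\n"                           # line 4: per-base quality string
)
header, seq, sep, qual = record.rstrip("\n").split("\n")

# Phred+33 encoding: quality score Q = ASCII code of the character minus 33,
# and the error probability is p = 10 ** (-Q / 10), so Q30 means 1 error
# in 1,000 base calls (99.9% accuracy).
scores = [ord(c) - 33 for c in qual]
probs = [10 ** (-q / 10) for q in scores]

print(scores)  # [40, 40, 40, 40, 39, 39, 39, 2, 2, 2]
print(f"{sum(q >= 30 for q in scores) / len(scores):.0%} of bases are Q30 or better")
```

The "% of bases ≥ Q30" column in a sequencing report is this last percentage, computed over every read in the sample rather than a single record.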
What this boils down to is that symbols represent the lowest quality scores, numbers represent better quality scores, and letters represent the best quality scores. When we discuss quality, we're actually talking about the Phred quality score. Q10 represents a probability of an incorrect base call of one in ten bases, which corresponds to a base call accuracy of 90%. Similarly, Q20 represents the probability of an incorrect base call in one out of every 100 bases, which means 99% accuracy in base calling. Finally, Q30 represents 99.9% accuracy, or the probability that only a single base out of every 1,000 will be incorrectly called. When the sequencing quality reaches Q30, almost all of the reads can be considered perfect, with no errors or ambiguities, hence Q30 is considered a quality benchmark in next-generation sequencing.

How would this look in a sequencing report, though? In all sequencing reports that ABM provides, you generally see the sample name in the first column and the number of reads associated with it; in this example there are about 72 million reads. Then there is the percentage of bases with a Q30 score, representing greater than 99.9% accuracy. In this specific example, you can see that read 1 has a Q30 percentage of 87.9 and read 2 has a value of 83.5. There's an additional column in the table that shows you the percentage of guanine and cytosine in the sequencing reads. This can often be used to infer why there may have been an issue with sequencing, but you can't really adjust the GC percentage to improve the sequencing results. For a second sample, you can see that the number of reads changes, as does the Q30 for both read 1 and read 2, as well as the GC percentage. From this you can understand that many variables can affect the Q30 value, including read length, the GC content of a given stretch of DNA, the sequencer it was run on (whether MiSeq, NextSeq, HiSeq, or another platform), variations in the samples, or other factors.

Understanding the data output from next-generation sequencing is only the first step. We've gone through processing of raw data and data with QC info; the second step is performing the actual data analysis, which includes the alignment, identifying mutations, verifying them, and visualizing the results. Next, we'll go into whole genome sequencing data analysis to give you a bit of an idea of the tools that can be used for this process.

All analysis begins with assembly or alignment, where the raw data is converted to FASTQ before being aligned to a reference genome or reference sequence. There are many different tools you can use for aligning next-generation sequencing data, and they all perform quite similarly, so we'll go over one example: the Burrows-Wheeler Aligner, or BWA. This alignment tool is effectively a software package for mapping sequences against a reference genome. It uses three different algorithms to do this, each of which has a different use. The first is BWA-backtrack, which is used for very short reads, such as those less than 50 base pairs. The second is BWA-SW, which has improved
sensitivity when there are gaps in the alignment. And the third is BWA-MEM, which is the preferred algorithm for standard Illumina sequencing and our default at ABM. BWA also has the benefit of being the most commonly used software package; it can handle very long reads, and it's one of the quickest of the popular alignment tools.

Now, you may ask, do I need a control for my sample, or can I just use the reference genome for comparison? Even with a reference genome, you still need a control sample. We'll work through an example here, where you can see the reference shown in orange, the control shown in purple, and the experimental sample shown in green. Often the control will match the reference genome: you can see that the reference genome has an A nucleotide at the position indicated here, as does the control sample, whereas the experimental sample has a C nucleotide, which may represent a mutation. Sometimes, though, the background genotype of your sample might be different from the reference genome. At the second nucleotide, you can see that the reference indicates a G, whereas both the control and experimental samples show a T, which might be a difference in the actual genotype. In the third example, the reference represents a common nucleotide, but this doesn't necessarily mean a healthy or wild-type individual: you can see that the reference and the experimental sample both have a G nucleotide, whereas the control shows a T here.

If you don't have a reference genome available, you must first perform de novo assembly, which can often be quite challenging. De novo assembly effectively combines overlapping paired reads into contiguous sequences; eventually this will generate a contig. Next, the contigs are assembled into a scaffold to try to generate as complete an assembly as possible. Often, though, scaffolds can have gaps between contigs where the nucleotide sequence isn't known. The scaffolds can then be used for alignment afterwards.

The alignment information is stored in sequence alignment map files, or SAM files, which are the universal file format for mapped sequence reads. These contain the sequence as well as the quality scores of each read, and provide more detailed information than the FASTQ file. An alternative to this are BAM files, which are the binary version of a SAM file; this is the preferred format for some downstream software applications.

There are many different tools available for comparing your samples, including for identifying single nucleotide polymorphisms, insertions and deletions, or copy number variations. Next, I'll go over two major variant calling programs: the first is the Genome Analysis Toolkit, or GATK, and the second is VarScan 2, for variant detection in massively parallel sequencing data. GATK was developed at the Broad Institute, and it includes a wide variety of tools that can be used for variant detection and genotyping. VarScan 2, on the other hand, was developed by the Genome Institute at Washington University, and it uses a more limited number of tools for SNP detection, indel calling, and copy number variation analysis. GATK uses more stringent criteria and as a result has a longer run time; this increases the chance of false negatives, or possibly missing some variations in your sample analysis. VarScan 2, on the other hand, uses less stringent criteria and as a result has a shorter run time, but this increases the chance of false positives, where it may over-call some variations, including ones that have lower confidence values. Both of these tools will generate a list of mutations or variations that you can then follow up on and do additional analysis with, and you can also filter out variations based on high or low confidence in those calls. Both programs will highlight these variations relative to the reference genome. In this example, you can see that the reference genome has a G nucleotide, a T nucleotide, and an A nucleotide at the positions we're indicating; with this software, you'll be able to see in your reads whether there is actually any variation between the reference and your given sample.
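The reference-versus-control logic in the worked example above can be captured as a small decision rule. To be clear, this is a toy sketch of the comparison described in the slides, not how GATK or VarScan 2 call variants internally; the base triples are the three positions from the example:

```python
# A toy sketch of the reference / control / experimental comparison described
# above -- just the decision rule from the slides, not a real variant caller.
def classify(ref: str, control: str, experimental: str) -> str:
    """Classify one genomic position from its three observed bases."""
    if control == ref and experimental != ref:
        return "candidate mutation"              # control matches the reference
    if control == experimental and control != ref:
        return "background genotype difference"  # both samples differ from reference
    if experimental == ref and control != ref:
        return "variant in control only"         # matching the reference is not proof of wild type
    return "no variant"

# The three positions from the worked example (reference, control, experimental):
for bases in [("A", "A", "C"), ("G", "T", "T"), ("G", "T", "G")]:
    print(bases, "->", classify(*bases))
```

Real callers work from pileups of many reads with quality scores rather than single bases, which is where the confidence values mentioned below come from.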
Additionally, these tools can highlight copy number variation, which will be shown as a graph where a peak represents an increased number of copies of that specific sequence relative to the reference. This is particularly important for cancer research and studying gene duplication events.

Next, there is different software that can be used for visualizing these variations. All of them serve the same basic function: you'll be able to see an image output similar to this, where you can see the mapped reads from your sample, in this case shown in blue or yellow, and variations in your sample relative to the reference genome shown in red. There are three popular tools you can use for this visualization: the first is the Integrative Genomics Viewer, the second is GBrowse, and the third is JBrowse, a successor to GBrowse.

The Integrative Genomics Viewer, or IGV, is easy to use: you can just drag and drop your sequencing data into the program, it's fast, and it has full functionality offline. Some of the considerations, though, are that you need a lot of computing power for larger data sets, including large genomes, and it can only be run on one computer at a time. The output is a little bit simpler, but easier for many researchers to understand. GBrowse, on the other hand, creates a website for a project and allows people to work on the same project simultaneously. It is a bit more difficult to use than IGV, and it must be connected to the Internet for real-time collaboration. The overall look of it, though, is a bit more polished than IGV, and many researchers still use GBrowse as their default. JBrowse was developed as an advanced version of GBrowse, to be faster and fix many of the bugs or issues in GBrowse. It is still a bit more difficult to use than IGV, and you must still be connected to the Internet for real-time collaboration, but the overall output is much cleaner and can provide a lot more information than GBrowse.

To summarize these three popular tools: IGV is among the easiest of the three to use; both GBrowse and JBrowse are best if you want to collaborate on a project with other people; and JBrowse is the most updated version of the software, which people are slowly transitioning to. In summary of what we've gone over for whole genome sequencing: converting BCL files to FASTQ through demultiplexing; converting FASTQ files to SAM or BAM files using BWA as an alignment tool; variant calling using GATK or VarScan 2; and visualizing your data using IGV, GBrowse, or JBrowse.

Next, we'll go over how to work with RNA-Seq data. Similar to whole genome sequencing, the analysis begins with assembly or alignment. Once the reads are aligned to the reference genome, though, you must first normalize them relative to the gene length. This is important, as we'll demonstrate with the example below: you can see that for gene A there are two mapped sequencing reads; gene B is larger and has a greater number of mapped sequencing reads; and gene C is an intermediate length and has fewer sequencing reads. You want to make sure that the number of reads assigned to a gene is independent of the gene length and instead represents changes in gene expression, whether higher or lower. The way to do this is to normalize using something called FPKM, or fragments per kilobase of transcript per million mapped reads. Now, when you want to normalize expression using FPKM, you can do this in more roundabout ways, or you can use software to do it. The software we use is StringTie, a free software package for transcript assembly and quantification for RNA-Seq. It's easy to use, it can automatically normalize to FPKM, and it can do standard analysis, but it cannot do differential gene expression analysis, which many researchers are interested in for their projects. Effectively, what this gives you is a value for gene expression.
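The FPKM normalization just described is simple enough to compute by hand. Here is a minimal sketch, with invented gene lengths and counts (not values from the webinar), showing how dividing by gene length in kilobases and library size in millions makes counts comparable across genes:

```python
# A minimal sketch of FPKM: fragments per kilobase of transcript per million
# mapped fragments. The gene lengths and counts below are invented for
# illustration; in practice StringTie computes these values for you.
def fpkm(fragments: int, gene_length_bp: int, total_fragments: int) -> float:
    """Normalize a raw fragment count by gene length (kb) and library size (millions)."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_fragments / 1_000_000
    return fragments / (length_kb * library_millions)

# Two genes with the same raw count but different lengths: after
# normalization, the 4 kb gene gets a quarter of the 1 kb gene's FPKM.
total = 20_000_000  # total mapped fragments in the library
print(fpkm(500, 1_000, total))  # 1 kb gene -> 25.0
print(fpkm(500, 4_000, total))  # 4 kb gene -> 6.25
```

This is exactly the point of the gene A/B/C example: the raw counts differ with length, but the normalized values reflect expression.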
In this example, gene A has an FPKM of 24.8. If you have several replicates, you'll see slight differences in FPKM values between them for a given gene: here, the three replicates all have different FPKM values, and the mean FPKM value across the three is 24.8. In a second sample, you can see that there are lower FPKM values overall for all three replicates, representing a lower overall mean FPKM, which indicates a decrease in gene expression in the experimental sample. Or, to summarize this…
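The replicate averaging described here can be sketched as follows. The individual replicate values are hypothetical, chosen only so that the first sample's mean matches the 24.8 from the example; StringTie or downstream tools would normally produce these numbers:

```python
# Sketch of the replicate comparison just described: average the FPKM values
# per sample, then compare samples. Replicate values are hypothetical.
from statistics import mean

gene_a = {
    "control":      [25.1, 24.3, 25.0],  # three replicates; mean is 24.8
    "experimental": [12.4, 11.8, 12.9],  # lower across all three replicates
}

means = {sample: mean(vals) for sample, vals in gene_a.items()}
fold_change = means["experimental"] / means["control"]

print(means)
print(f"fold change: {fold_change:.2f}")  # < 1 indicates decreased expression
```

A ratio of the means like this is the starting point for the differential expression analysis that StringTie itself does not perform.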