So this is a big beginner's guide to RNA-seq analysis. I'm going to go over some of the tools and procedures that we use in our lab and what our experience has been over the past few years. I'm not going to give an in-depth comparison of different methods for doing this, but instead show you what we use and tell you why we think these are good tools for it. And just to remind you where this part fits into the overall schema of how we get from QTL analyses of complex traits, which eventually identify a genomic region, to genes, and eventually to the biological pathway or mechanism: this particular webinar is going to cover, in a broad sense, how we get from that genomic region to genes, and RNA expression is one of our tools for doing that. So as an outline for today, we'll talk about why RNA-seq (just a couple of slides on that), a technical overview of the Illumina method for sequencing, which is
one of the most popular technologies for doing this, some of the considerations when it comes to experimental design, and then finally how we get from RNA-seq reads to quantitative expression measurements, and a common tool that's used for differential expression analysis. To start us off with why RNA-seq: RNA is one of my favorite things to study because it's one of the first quantitative links between DNA sequence and phenotype. It's that first place of translation from a variant that we may have identified in a QTL along the possible pathway to affecting that behavioral or physiological phenotype, so it really is that first step. And then from a statistics point of view, or a systems genetics point of view, we also get really excited about the RNA dimension because we have all these fantastic tools, including graph theory, to dig deeper into this data now that it's quantitative, to talk about interactions and possible hypothesized mechanisms to test. So that was why RNA; now, why RNA-seq? It's super popular now, so there's not a lot of convincing people
that it's valuable, but one of the things that has had a profound impact on what we do is that not only can we do quantitation of protein-coding genes, like we've been able to do in the past with a microarray, but we can also do it for a variety of non-coding genes: anything from microRNAs to long non-coding RNAs to snoRNAs to even circular RNAs. We can get data for the full transcriptome, not just the items in the transcriptome that are translated into proteins. We can also do discovery, because we're realizing there's a lot we still don't know about the transcriptome; despite some really in-depth work, we're finding new things and new roles of RNA all the time. So with RNA-seq we're not relegated to only studying things that we know are there; we can also identify what else is there. We can look at novel protein-coding genes, novel splice variants, and novel non-coding transcripts that might not be in the database for your particular species, or for any species at all. And so that
brings us to the technical overview. This is meant to be a beginner's guide, so if you are a complete novice to RNA-seq, this is the general overview of what happens. You take a sample of interest and you isolate or extract the RNA. The next step is to decide what fraction of RNA you're most interested in and select that out of your pool of RNAs. Then comes the library prep step, where we take those RNAs, break them up into small pieces, and add all the adapters that enable them to be sequenced on a sequencing machine. Then we do the sequencing: in this next step we put them on a flow cell and run them through the machine, and what we get out are these reads that we eventually have to analyze and piece back together. The library prep is that first step where, in this example, we've taken the poly(A) fraction, fragmented and primed it, and then done a lot of synthesizing to get it down to double-stranded cDNA, so that we can add all our barcodes and eventually it gets amplified on the system. The Illumina system for sequencing, and that includes your HiSeq, your NovaSeq, your MiSeq, uses this technology, and this is what we're describing when we're talking about a flow cell. You can see that within this flow cell there are eight lanes for sequencing, and what we're going to do is add our processed, library-prepped RNA to these flow cells, and this is what is going to get imaged and
where we're going to get the reads sequenced from. As a statistician, this is a really fascinating process for me to understand, and I highly recommend this YouTube video: it's a five-to-six-minute video from the Illumina group, a cartoon that takes you step by step through this whole process and talks about how these RNA pieces are tethered to the flow cell, and then how they go through this amplification process so that you get clusters of RNA transcripts that are identical. Then we can use this high-throughput sequencing. They call it sequencing by synthesis: as the matching base pair is added, there is a fluorescence signal, and we read the color of the fluorescence as being related to the base that's being detected. It starts at the beginning of the fragment and just looks for flashes of different colors. That's a very broad overview from a non-technical expert, but the idea is that in the end we know the sequence of nucleotides matching our originally tethered piece
of RNA, or processed DNA. One of the common things when people are first starting to think about an RNA-seq project is the study design, and there are a lot more options than I think people first realize when setting up an RNA-seq experiment. The main elements that you need to make a decision on right away, when talking to your core or to a for-profit business that would give you these read counts, are: which RNA fraction you're going to look at (we'll go over that in a little more detail), the library type, the sequencing length, the sequencing depth, and the number of replicates. So let's break each one of these down individually. RNA fraction: I tried to look for updated numbers, and these are a little old, coming from a few older publications, but when I look at this there's a lot of controversy about what exactly is being transcribed from the genome, how much of that is useful and biologically meaningful, and how that differs from tissue to tissue or sample to sample depending
on developmental stage, environmental determinants, things like that. But in general, if we think of this light blue part of the square as being the full genome, then within this bigger circle are all the areas of the genome that can be, or that we have some evidence of being, transcribed, and this smaller part here is, I believe, the part that's being transcribed from both strands of the DNA. When we look at protein coding, the coding sequences really make up a small portion of what's being transcribed from the DNA, not to mention the untranslated regions (that's what UTR stands for), and then you also have some of these popular smaller non-coding transcripts coming from it. So the gist of this is that there are a lot of things being transcribed, and what we want to talk about, what we want to hypothesize about, is going to drive which fraction we choose to take for our RNA-seq experiment. We probably wouldn't want to take everything, and a lot of the reason why is
that if we break down this total RNA distribution a bit more, only about 5 percent of the total RNA is coding, meaning it's translated into a protein, whereas about 80 percent is ribosomal RNA and about 15 percent is transfer RNA. So there are a few different ways to decide on which fraction you want. In a typical, traditional RNA-seq study we go in and pull anything out of our RNA pool that has a poly(A) tail, and the reason we do this is that we're really trying to target those protein-coding genes. Obviously we're also getting RNA transcripts that have a poly(A) tail even if they're not protein coding, but that's the general idea behind this poly(A) selection. If we're also interested in the long non-coding RNAs, then instead of pulling out everything that has a poly(A) tail, what we can do is go into our RNA pool and clear out some of that ribosomal RNA. What that does is leave room for us to measure everything else, because if we didn't do this ribosomal RNA depletion, then 80 percent of our reads would align to ribosomal RNAs. So it targets both the messenger RNAs and the long non-coding RNAs. And then finally, the other tool that we can use in combination with either one of these is a size selection. What a size selection does is say: I only want RNAs that are less than, say, 30 base pairs, and that will enrich our RNA for small RNAs and microRNAs that we can then run through our study. For the library type, you typically have a
choice between paired-end and single-end reads. For a lot of studies, and I think especially with the NovaSeq, a lot of the high-throughput facilities now won't even give you the option of a single-end read; everybody just gets paired-end reads. But the history behind it is that in a single-end experiment you're just getting one read, at whatever read length, and that's all you're getting, and for a lot of situations that's good enough. In a paired-end read, what we're doing is taking that piece of RNA and reading it from both ends. What that does is that not only do we see how well read one aligns to the genome and how well read two aligns to the genome, but we can also double-check that alignment and see if it makes sense where these two reads are aligning, because they came from the same piece of RNA. If the reads are aligning to two different chromosomes, then there may be something wrong with the alignment of that read pair, or, if a read aligns to multiple places, it may help you attribute it to the area of the genome that makes the pair what we call concordant. In the beginning paired-end could be more expensive, but like I said, a lot of times now we don't get an option and everything is paired. It gives us a little bit more power for getting unambiguous alignment of the reads, and especially when we come to do things like transcript discovery and isoform-level expression, these types of reads are helpful. The other library type option
that you'll often get is stranded versus unstranded, and again, more often than not we see stranded libraries nowadays, just because the pricing and the technology have gotten to the point where there's really not a whole lot of reason to do an unstranded library. In an unstranded library, during the reading of the sequences we don't really pay attention to whether we're reading the complementary DNA or the original RNA direction, so if you have two transcripts, like we do up here in the corner, that overlap but are on different strands, we wouldn't be able to distinguish which reads belong to which gene, and we'd have to deconvolute that later in the analysis steps. Whereas with stranded libraries we can tell precisely what's coming from which transcript, even though they overlap in the genomic region. One of the next choices that you always get is sequencing length. Again, the longer you go the more expensive it gets, and with some of the cores and some of the machines you just get 150 and you don't get a choice. But in general, longer reads tend to improve mappability and improve
the accuracy of transcript identification. When you're doing that small RNA fraction, though, these longer reads are definitely overkill, because you're really only interested in that roughly 30 base pair region that you were targeting. The next question is sequencing depth, and this does vary from experiment to experiment. Usually when we say sequencing depth, what we mean is how many reads you are getting per sample, or per library, and really, the deeper you go, the more precise the estimates you can get on more genes. So if we look at, for example, general transcriptome coverage, meaning the number of genes that have some kind of reads, we're going to get curves that look like this; the problem is that we don't know if we're on the purple line or the green line necessarily. In general, the more reads, the more information you get, but at some point there are diminishing returns. One of the things I like to recommend when you're thinking about sequencing depth: if what you're really interested in is a gene-level analysis of protein-coding genes, and you're looking at a relatively homogeneous tissue, for example a gene-level analysis with a known transcriptome, meaning that databases like Ensembl and RefSeq have done a really good job of annotating genes for this particular species, then we can do poly(A)-selected RNA in the liver at 20 to 30 million reads per sample and get pretty good coverage of most genes, with reasonable power for differential expression. Sorry, there's a question for you from Malad about the definition of strandedness: whether that refers to haplotypes, or to the plus or the minus strand in terms
of five prime and three prime UTRs. So it is with respect to the sense or antisense strand, right? It's which strand that particular RNA is coded from, so it doesn't necessarily give us haplotype structure. Does that help? Okay, perfect, thank you. All right, so we talked a little bit about one of the most basic analyses, where we can get away with maybe 20 million reads per sample. If instead we're looking at an isoform-level analysis, we're interested in discovering splice variants, we're interested in discovering non-coding RNAs, or we have a poor transcriptome to begin with, we usually need to go deeper. What that means is that if we want to do a really good job of defining our transcriptome, that often can mean 50 million reads, 70 million reads, something like that, to get really deep. Obviously most people aren't going to want to do that, and most experiments I see are really in that 20 to 30 million range unless you're doing a deep dive. And unfortunately, and we'll talk a little bit about this along with how many samples to
do: in any kind of omics study, and especially in RNA-seq, a traditional sample size or power calculation is really hard, because even at a low sequencing depth you're going to have enough coverage for many genes, you may even have enough coverage for most genes, but you're not going to have enough coverage for all genes without doing millions and millions and millions of reads. And we don't go in knowing what our gene of interest is, or what level of coverage our gene of interest will have. So it's very much a global power analysis; with these global decisions we're never going to catch everything, but our hope is to catch many things. The quote from the paper is that the number of reads required in an experiment is determined by the least abundant RNA species of interest, and obviously we don't know what our RNA species of interest is before doing the experiment. The same thing is true about the number of replicates. Again, power is a little bit more difficult here, because not
only do you have read coverage contributing to it, but you also have differences in within-group variances and the desired detectable effect size. I just wanted to show this table out of the Conesa paper to outline the few things that you can move around to decide on the number of replicates per group that you need. A rule of thumb: for inbred strains, which we do a lot of experiments on and talk a lot about during these webinars, so really homogeneous groups, you can get away with four replicates per group if you're doing a gene-level analysis. The more heterogeneous your groups or your treatments or your outcomes are, the more samples you're going to need. And again, you will not have 80 percent power for all genes, but your hope is to design it in such a way as to have 80 percent power for most genes. (If you want to get a feel for that, there's a little simulation sketch below.)
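To make the replicate question a bit more concrete, here is a minimal, hypothetical simulation sketch in R (not from the webinar) that estimates power for a single gene under a negative binomial model. The mean count, fold change, and dispersion are illustrative assumptions, not recommendations.

```r
# Hypothetical sketch: simulation-based power for one gene under a
# negative binomial model. All parameter values are illustrative.
set.seed(1)

power_one_gene <- function(n_per_group, mu = 100, fold_change = 2,
                           dispersion = 0.05, alpha = 0.05, n_sim = 2000) {
  hits <- replicate(n_sim, {
    # size = 1/dispersion is the usual RNA-seq parameterization
    g1 <- rnbinom(n_per_group, mu = mu, size = 1 / dispersion)
    g2 <- rnbinom(n_per_group, mu = mu * fold_change, size = 1 / dispersion)
    # quick-and-dirty test on log counts; DESeq2/edgeR would do better
    t.test(log2(g1 + 1), log2(g2 + 1))$p.value < alpha
  })
  mean(hits)
}

sapply(c(3, 4, 6, 8), power_one_gene)   # power at several replicate numbers
```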
Other things can go into the sequencing part of it. We've talked a little bit about the experimental design, and now we're actually coming to the part where we're making the libraries and putting them on the sequencer. There are synthetic spike-ins available, and the goal of a synthetic spike-in is to give you an idea of technical effects and to let you talk about absolute quantitation. These are transcripts that are not found in most species, and you can spike them in; they come as two mixes. This always sounds great in theory; in our experience we have always put in these external controls, the ERCCs, and have really struggled to get a significant amount of information from them. But I'm not giving up. I think they're a good idea in theory, and we're working with some different statistical methods on how to take advantage of these to their full potential, really getting rid of some of our technical effects and understanding some of the issues that come up in RNA-seq. The biggest thing that I can preach about, and that will make the biggest difference, is that if you have enough samples that you have to process them in batches, be really conscientious about making sure that you are randomizing
at that library prep step so there's lots of steps in doing an rna-seq experiment but by far the library preparation step is the source of the most profound batch effects so making sure that you are if you have a two two um group comparison that if you have three batches you have one from each group in that all the way through to um the sequencing part you can do randomization at the sequencing room and that helps a lot so most with the luminaid we can do now what we call multiplexing so what that means is
each sample gets a specific tag and then we can pool all the samples and spread them across as many lanes as we want that way if a single lane fails which i haven't ever seen the whole time i've been doing rna sequencing that you would just you lose a portion of your reads for all samples versus losing um a few samples completely so this technology has really helped a lot on um getting rid of any kind of batch effects that may be due to lane or sequencing run but we typically don't see a lot of
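Here is a minimal sketch, in R, of what a balanced (blocked) randomization of samples to library-prep batches might look like. The sample sheet, group labels, and batch size are hypothetical, not from the example data set.

```r
# Hypothetical sketch: assign samples to library-prep batches so that
# each batch contains samples from every experimental group.
set.seed(42)

samples <- data.frame(
  id    = paste0("S", 1:12),
  group = rep(c("WT", "Het", "KO"), each = 4)   # illustrative groups
)

batch_size <- 6
n_batches  <- nrow(samples) / batch_size

# shuffle within each group, then deal samples out across batches round-robin
samples <- samples[order(samples$group, runif(nrow(samples))), ]
samples$batch <- ave(seq_len(nrow(samples)), samples$group,
                     FUN = function(i) ((seq_along(i) - 1) %% n_batches) + 1)

table(samples$group, samples$batch)   # check the design is balanced
```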
We typically don't see a lot of variation in the sequencing part itself: if we prep the same library and run it on multiple days, or on two different lanes, we don't see many differences or batch effects there. So, finally getting to the transcriptome profiling part, to actually getting to the analysis: thank you for bearing with me as we set the stage for the main topic of interest. How I'm going to break it down today is that we'll go over some of the things that we typically do to look at the quality of our raw reads, to make sure that we're getting good information out and we know what we're getting. Then we can do an alignment to the genome and transcriptome discovery; if you notice, these two bars are in green, which I'm using to indicate that this is an optional step that we don't necessarily need to do, and that we can go straight from those raw reads to alignment to a transcriptome and quantification if we're confident in our transcriptome. So we'll start off with quality control of the raw reads.
Just to let you know, I am going to use this example data set throughout. It's available on GEO; it's a CRISPR-Cas9 knockout of Lrap, and my intent is to put some example code up with these lecture notes and this recording within the next week or so. It's three groups (wild-type, heterozygotes, homozygotes), we've got three biological replicates per group, we did a whole-brain sample, we did ribosomal RNA depletion of total RNA, and we're looking at paired-end reads. If you've never seen a FASTQ file: whenever we get paired-end reads, they come to us in two separate FASTQ files. You can notice here that the names of the files are identical except for this R1 right here, and what that indicates is that this file has read one from each pair and that one has read two from each pair. They're kept in what we call a FASTQ file, and the .gz, if you're not familiar with it, just means that they're zipped, so that makes them smaller. A FASTQ file, here below, is an example of that,
and each file has four rows per read. This first row right here that starts with the at sign is just giving us identification information for that read; for example, the sequencing for these particular reads was done at GENEWIZ, so that's their little symbol here, this GW, and then it was run on their HiSeq. The next line is the actual base pair calls, then a little bit of a spacer, and then these letters actually correspond to the quality of the call for each of these base pairs. We typically don't use that for alignment, but we can use this quality information to check on the general quality of the experiment, and we can even chop off some of these base pairs, or nucleotides, if they're of poor quality. In general, when we look at quality at the base pair level, because of the nature of the technology the quality of the base pair calls tends to drop off as the read gets longer; the machine does a really good job at the beginning and slowly gets worse, but in general the reads tend to look good.
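To make the four-line record and those quality letters concrete, here's a minimal sketch in base R. The file name is hypothetical, and the decoding assumes the standard Phred+33 encoding that modern Illumina FASTQ files use.

```r
# Minimal sketch: read one FASTQ record and decode its quality string.
# "sample_R1.fastq.gz" is a hypothetical file name.
con <- gzfile("sample_R1.fastq.gz", "r")
rec <- readLines(con, n = 4)
close(con)

read_id  <- rec[1]        # line 1: identifier, starts with "@"
sequence <- rec[2]        # line 2: the base calls
# rec[3] is the "+" spacer line
qual_chr <- rec[4]        # line 4: one quality character per base

# Phred+33: quality score = ASCII code of the character minus 33
phred <- utf8ToInt(qual_chr) - 33
summary(phred)            # rough look at per-base call quality
```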
One of the things I always recommend is counting the number of raw reads and making sure that you have an equal number of reads between the two files for each read pair, making sure the reads are the length you're expecting, and in general comparing across samples to make sure that all of them are, for instance, between 20 million and 40 million. When we have a particular sample that is much different from the other samples as far as library size goes, it tends to create problems down the road, so we need to be made aware of that early on, to be able to track that sample throughout the quality control process. In general, if you get low read counts for a sample, much lower than everyone else, then usually something went wrong with that sample, and that's reason for concern. So if everybody else has 30 million and you have 5 million for a single sample, that usually means something is wrong, not that you should go ahead and use those 5 million reads, because they tend to look different than the other samples. And then finally, there are programs out there like FastQC that will look more in depth into the quality of these raw reads, looking at things like the drop-off in quality as the read gets longer. One of the things that we do in our lab is trimming: we go into the reads and chop off the adapters, in case the sequencer read through into those adapters, and we also chop off base pairs that are
of low quality. In general this is a little bit controversial; there was a Williams group (not Rob Williams, but another group) that came out against trimming. The reason that we like trimming is that it's a good quality measure for us: we can see, for example, if RNA is degraded in a sample, because what you'll see is that you're trimming a lot from those reads, since the RNA went in too short and you're reading into a lot of the adapter sequences. So this is a really good point to double-check that, and we have seen samples where it looks like we have a great number of reads and they look like all the other samples, but when we look at the trimming, they're having way more base pairs trimmed off and a lot of reads eliminated because after trimming they're so short. But in general, when we do trimming, hardly any base pairs are trimmed off: our average read length, if we start off with 100, is usually 99, and we get rid of very, very few reads in the trimming process. So we're not
as concerned about what it does downstream for our alignment as much as we are using it as a quality control measure. If you're going down the path of transcript discovery, our next step is to align to the genome. We're going to talk about aligning to the genome here first because that's what you have to do for a de novo transcriptome, but like we said, you can go right to alignment to the transcriptome if you prefer. It's also come up in rodent populations like rats: if you have inbreds, or if you have heterogeneous stocks, and you have their genotype information, you could align to a sample-specific genome. We haven't seen huge differences with this, because a lot of our aligners are pretty robust to single nucleotide variants, so the read will still align there and the aligner will just note that one of the base pairs was mismatched. But those are the options when it comes to a sample-specific versus a reference genome. This is one of the points where there are a lot of read alignment tools. We've looked at STAR, which is one that we've compared to HISAT2 before, and the generation before HISAT2 was Bowtie, if you've heard of that one. We found HISAT2 to be a good balance of being fast and being accurate, and what it does is make a first couple of passes over the data. The reason we need these special aligners for RNA-seq, even more than when we do something like DNA sequencing, is that we have to be prepared for gaps in alignment due to splice junctions, and we have to have specialized aligners
that allow for these really large gaps between where the start of the read aligns and where the end of the read aligns. It's really important that we don't just throw out those reads, because they give us a ton of information about splicing in our data set, so we want to keep those in. This is just an example of the quality control that comes out when you align to the genome with something like HISAT2. This is a simulated data set, so there are only ten thousand reads, just to make it a little bit easier to look at and comprehend, but what it does is tell us that yes, you've paired all of your reads, and these are your pairs that aligned concordantly, meaning that the two sides of that paired-end read aligned in reasonable places, and they aligned only once. But then we also get several that aligned to multiple places, and some that were not concordant at all. As we break these down, these are kind of our best-case scenarios, but we do have a lot of reads where just one of the pair aligned, or where a pair aligned to multiple places, and things like that. These general stats break this down in much more detail and then give you an overall alignment rate at the end. In the example here it's about 96.7 percent, and that is generally what we see with a good data set and a good genome. When we're talking about rat rn6, that's what we're seeing; when we do it with rn7, the new version of the rat genome, we get a little bit of a boost, like one or two percentage points, but we get a huge boost in the concordance rate.
What you get out of that genome alignment is what we call a BAM file. A BAM and a SAM file contain the same information; BAM is the binary version, so it takes up less space on your computer, and you typically wouldn't save anything in a SAM file because it would take up so much space. There's a whole suite of tools called samtools that are super easy to use and can do a lot of manipulation with these data files: we can sort them, we can filter them, we can flag them, we can get summary information from them, and we can merge two together. In this data file we get one row per alignment, so one read may have multiple alignments and will then appear in multiple rows. Again, this first field is our unique identifier for that read, and the flag is telling us how well it aligns, giving us lots of qualities about the pairing and the concordance and things like that. (If you want to see how that flag works, there's a little decoding sketch below.)
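The flag is just a bit field defined by the SAM specification; here's a small sketch in R that pulls out a few of the standard bits from a flag value. The example flag of 99 is illustrative.

```r
# Small sketch: decode a few standard SAM flag bits (from the SAM specification).
decode_sam_flag <- function(flag) {
  c(
    paired         = bitwAnd(flag, 0x1)   > 0,  # read is paired
    proper_pair    = bitwAnd(flag, 0x2)   > 0,  # both ends aligned concordantly
    unmapped       = bitwAnd(flag, 0x4)   > 0,  # this read did not align
    mate_unmapped  = bitwAnd(flag, 0x8)   > 0,  # its mate did not align
    reverse_strand = bitwAnd(flag, 0x10)  > 0,  # aligned to the reverse strand
    first_in_pair  = bitwAnd(flag, 0x40)  > 0,  # read 1 of the pair
    secondary      = bitwAnd(flag, 0x100) > 0,  # not the primary alignment
    supplementary  = bitwAnd(flag, 0x800) > 0   # chimeric/supplementary alignment
  )
}

decode_sam_flag(99)   # a common flag for read 1 of a properly aligned pair
```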
I have also included, on the GitHub for this particular lecture, a website where you can plug in these flag numbers and it will translate what they all mean. Then this field is the chromosome, the position, the mapping quality, and then the CIGAR string, which is interesting in that, for this particular read, it's 100 base pairs and all 100 base pairs match smoothly. So once we have it aligned to the genome, then we can do transcript discovery. When and why would you want to do transcript discovery
on your data set? Here, for example, I have looked at the latest version of Ensembl for three species: human, mouse, and rat. What you can see is that when it comes to coding genes, they're pretty comparable across the three species, meaning we have about 20,000 coding genes each, so we're feeling pretty confident that mouse and rat are well annotated at the gene level. When it comes to non-coding genes, we see much more of a discrepancy between the three species, and our general assumption here is that this is not due to evolution or anything like that; it's due to lack of annotation. For example, there are only about 3,000 non-coding genes annotated in rat, where there are almost 24,000 in human. So if we are targeting non-coding transcripts and are super interested in the non-coding transcriptome in rat, then we definitely want to consider a transcript discovery type of algorithm. And then this final column is one of the more drastic ones too: here we've gone from protein-coding genes to transcripts, so this is splice variants per protein-coding gene, in addition
to the non-coding genes. What this is telling me is that in human we tend to have about 10 splice variants annotated for each gene, in mouse we've got something like seven, whereas in rat we have one or two. So again, if you're interested in splicing, then transcript discovery is something that would be of interest to do, whereas if you're only interested in a gene-level quantitation, it may not be necessary to go through the transcript discovery phase. Again, there are lots of tools out there; we tend to use StringTie. It's got some pluses and minuses, but it's simple to use, and there are some really good pipeline papers about how to use it effectively. The general idea is that it looks at all these reads and builds what is called a splice graph, and then uses an algorithm to determine how to put these exons together into a single transcript, based both on the coverage of those exon junctions, or splicing junctions, and on the read depth at different places. Like I said, this is one of the good protocol papers that came out in Nature that really does a step-by-step guide on how to run through it to get this transcriptome, but in general their recommendation is that you build these sample-specific transcriptomes, doing transcript discovery in each individual sample, and then merge those assemblies later on. There are some caveats to this whole process, but in our hands we do see a significant improvement in alignment rates going from alignment to an Ensembl transcriptome to a reconstructed, or de novo, transcriptome.
So this is just an example of what you get out of the transcript discovery: what's called a GTF-formatted file. It's basically a map of where all these transcripts are, and one of the best things about the GTF is that you can simply upload it into the UCSC Genome Browser for visualization. You can see here that these first five transcripts are ones that we derived from our transcriptome data; the RefSeq gene, this is one of our favorite genes, Gnb1, is right here, so we can see that there's only one isoform annotated for it in our species, but we're finding five. And if we look at non-rat RefSeq genes, we see lots of evidence that these are probably valid splice variants, because they've been identified in other species. So this has been a good example of information that may not be in the database, but we have other sources indicating that we're probably on the right track and it's worthwhile. I do want to point out a few of the caveats, just because it's not perfect: the accuracy of the reconstruction is still not optimal, and we do get some false signals in there.
It's also super terrible at the start and stop sites, but that's really not what it was designed for; it was designed more for what we call exon chaining. Just be aware that, for instance, alternative polyadenylation is not identified using StringTie. And finally, from a statistics point of view, our uncertainty in the transcript structure is not propagated into the differential expression analysis, so when we take all of these transcripts on to do statistical tests, we're not treating the ones that we're super confident in any differently than the ones that were novel and may not really be there. So then, the last part of this transcriptome profiling is alignment to the transcriptome and quantification. Like we've talked about before, we can do this as a gene-level analysis or at the isoform level. At the gene level we're super confident, because we don't have any of these ambiguous reads that align to multiple genes, or we have very few reads that align to multiple genes, but then you miss some of the information that may be important about different isoforms or splice variants having different functions. The isoform-level analysis is interesting, but there's more ambiguity in the actual read counts there. Again, one of our favorite methods for quantification is RSEM, which is RNA-Seq by Expectation-Maximization. In this procedure we align the reads directly to the transcriptome, and those that align to multiple transcripts get divvied up based on a probability model. Back in 2016, one group compared several of these quantification methods, and RSEM in the real data set was near the top.
It wasn't first, but it was second and very close, and in the simulations it well outperformed the other methods. To give you a general idea of the concept behind RSEM, this is a really oversimplified toy example of two isoforms that each have three exons; they share two of them perfectly, and the third exon is different between the two isoforms. The top row is the number of reads that align to each region. In our first pass we take the reads that align to the shared regions and just split them 50/50 between the isoforms, but all 10 reads in this exon go to this isoform, because that exon is exclusive to it, and all 90 in that exon go to the other one, because that exon is exclusive to it. We add up the number of reads, so it's 190 for one isoform and 110 for the other, and then we readjust our estimated ratio: instead of splitting the shared reads 50/50, we split them 63 to 37 based on that 190-to-110 ratio. We do the whole exercise again, calculate our number of reads per isoform, adjust our ratio once more, and we keep iterating like that until the numbers stabilize. (There's a small sketch of that iteration below.)
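Here's a tiny sketch of that expectation-maximization style iteration in R. The read counts are my reconstruction of the toy example (a pool of shared-exon reads plus the 10 and 90 exclusive reads), so treat the exact numbers as illustrative rather than as the slide's.

```r
# Toy EM-style iteration in the spirit of RSEM (illustrative numbers).
shared    <- 200                        # reads in the exons shared by both isoforms
exclusive <- c(iso1 = 10, iso2 = 90)    # reads unique to each isoform

theta <- c(iso1 = 0.5, iso2 = 0.5)      # start by splitting shared reads 50/50
for (i in 1:20) {
  counts <- shared * theta + exclusive  # E-step: allocate shared reads by theta
  theta  <- counts / sum(counts)        # M-step: update the isoform proportions
}

round(counts)     # stabilized estimated read counts per isoform
round(theta, 2)   # stabilized isoform proportions
```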
So when we do this with RSEM, what we get out are estimated read counts, FPKM (fragments per kilobase of transcript per million mapped reads), and TPM (transcripts per million). The estimated read count is not adjusted for library size, which we'll talk a little bit about later. FPKM is adjusted both for the kilobases of transcript and for the millions of mapped reads, but although it's popular, even the people who first coined it say not to use it; there are definitely some biases in FPKM, and they promote TPM over FPKM. (A quick sketch of how both are computed from counts and lengths is below.)
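Since the definitions can be confusing, here is a minimal sketch of computing FPKM and TPM from a vector of counts and transcript lengths. The toy numbers are made up, and for paired-end data you'd be counting fragments rather than single reads.

```r
# Minimal sketch: FPKM and TPM from counts and transcript lengths (toy numbers).
counts <- c(geneA = 500, geneB = 1500, geneC = 200)   # fragments per transcript
len_kb <- c(geneA = 2.0, geneB = 6.0, geneC = 0.5)    # transcript length in kb

# FPKM: normalize by library size (per million) and by transcript length (per kb)
fpkm <- counts / (sum(counts) / 1e6) / len_kb

# TPM: normalize by length first, then rescale so the values sum to one million
rate <- counts / len_kb
tpm  <- rate / sum(rate) * 1e6

round(rbind(fpkm, tpm), 1)
```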
When we're talking about quality control of this quantification, we really want to do a lot of exploratory data analysis: principal component analysis, hierarchical clustering, relative log expression plots. With all of these tools, what we're trying to do is ask whether these samples look similar, and to identify an individual sample that may look like an outlier, maybe something that had a problem somewhere along the process. The other key observation is whether we're getting the expected correlations: are samples from the same group more similar to each other than they are to samples from other groups? One thing that no one pointed out to us at first, but that we've found extremely useful in some of our data sets, is another metric beyond the simple exploratory data analysis and visualizations: looking at the proportion of reads that are attributed to the top 10 transcripts, and the proportion of transcripts or genes within a sample that have a zero read count. The top-10 check is a lot of times really interesting, especially in the context of the total RNA fraction, where you can get the majority of the reads aligning to something non-coding, or even to ribosomal RNA that wasn't washed away by the depletion, and so that has been a good quality check for us. (A small sketch of these kinds of checks is below.)
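As a sketch of those exploratory checks, here's what they might look like in R on a count matrix with genes in rows and samples in columns; counts_mat and group are hypothetical placeholders for your own data.

```r
# Sketch of exploratory QC on a count matrix (genes x samples).
# counts_mat and group are hypothetical placeholders.
log_counts <- log2(counts_mat + 1)

# Principal component analysis on samples
pca <- prcomp(t(log_counts))
plot(pca$x[, 1:2], col = as.factor(group), pch = 19, main = "PCA of log2 counts")

# Hierarchical clustering of samples
plot(hclust(dist(t(log_counts))), main = "Sample clustering")

# Proportion of reads captured by the top 10 transcripts, per sample
top10_prop <- apply(counts_mat, 2,
                    function(x) sum(sort(x, decreasing = TRUE)[1:10]) / sum(x))

# Proportion of transcripts with zero counts, per sample
zero_prop <- colMeans(counts_mat == 0)

round(cbind(top10_prop, zero_prop), 3)
```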
This is just an example, and I know I'm brushing over it since I'm way over time, so some of these details you can feel free to read later, but I want to make sure we get through differential expression. Here's just an example of a hierarchical clustering map and some of the things that we're particularly looking for. So as we move to differential expression: we've gotten our reads, we've designed our experiment, we've got them aligned, and we now have that almighty thing we call a count matrix, with genes or isoforms along the rows and samples across the columns, and we're ready to get into the nitty-gritty of our actual statistical hypothesis with this data set. But we do have to do some more pre-processing first, and one of the pre-processing steps that we often do is getting rid of genes that have really low read counts. The gist of all of this text is that we don't really know the most appropriate threshold; it depends on your tolerance for noise. The best advice here is to pick some kind of cutoff that you feel comfortable with and apply it to everything. And, to get on my soapbox a little, my general philosophy is to never perform a statistical test that you don't intend on interpreting. Meaning: if you got a result where a gene was differentially expressed between two groups, and it came out really significant, but you only had three reads per sample in one group and none in the other, would you really spend your time and money following that up or interpreting it? So think about what you yourself would need to feel like these results were biologically relevant. There's also definitely a library size bias, so we never compare just raw reads to each other. In this example, if we compared the first sample to the second sample just looking at gene A, we'd say that there's twice as much expression for gene A in the second sample; in reality, it just had more reads. RNA sequencing, just like microarrays, is more of a relative abundance question than an absolute abundance question for these genes, so we not only have to look at the read count for a gene, but also at how many reads we got for that entire sample. (A small sketch of low-count filtering and a library-size adjustment is below.)
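Here is a minimal sketch of both ideas: dropping genes with very low counts and putting samples on a comparable scale with counts per million. The cutoffs are arbitrary placeholders, not recommendations, and tools like DESeq2 or edgeR apply their own, more careful normalization.

```r
# Sketch: low-count filtering and a simple library-size adjustment (CPM).
# counts_mat is a hypothetical genes x samples count matrix; cutoffs are arbitrary.
lib_size <- colSums(counts_mat)

# counts per million: put every sample on the same library-size scale
cpm <- t(t(counts_mat) / lib_size * 1e6)

# keep genes with at least 1 CPM in at least 3 samples (placeholder thresholds)
keep <- rowSums(cpm >= 1) >= 3
counts_filtered <- counts_mat[keep, ]

dim(counts_mat); dim(counts_filtered)
```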
If you're doing a bigger study, something that's going across batches, I guarantee there will be batch effects, and they'll be really strong batch effects. One method among many for getting rid of these, though I'll admit it's slowly becoming less of a favorite of mine, is the remove unwanted variation, or RUV, method. What it does is try to identify latent factors in your data set and then eliminate those latent factors from your actual statistical analysis. There's a package in R, RUVSeq, that you can use to identify these factors and then either adjust your read counts or put the factors directly into your differential expression model. The one thing that I did want to highlight before moving on too quickly is that in our hands we have tried all three ways of identifying the latent factors: control samples, negative control genes, and empirically derived negative control genes. We've tried the first two because we have the spike-in controls and we do a technical replicate that's shared across all batches, but the empirically derived negative control genes tend to work much, much better than the other two methods for identifying these latent factors, even when the control samples or the negative control genes are present. (A small sketch of that empirical-control approach is below.)
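Here is a minimal sketch assuming the RUVSeq Bioconductor package, where empirical_controls is a set of genes you have already decided are not associated with your factor of interest (for example, the least significant genes from a first-pass analysis). This is modeled on the RUVg approach; the object names are placeholders, and the package vignette has the full workflow.

```r
# Sketch of RUVg with empirically derived negative control genes (RUVSeq package).
# counts_filtered: genes x samples count matrix; empirical_controls: gene names
# believed to be unaffected by the factor of interest.
library(RUVSeq)

ruv_fit <- RUVg(x = as.matrix(counts_filtered),
                cIdx = empirical_controls,   # negative control genes
                k = 1)                       # number of latent factors to estimate

head(ruv_fit$W)                        # estimated latent factor(s), one row per sample
adjusted <- ruv_fit$normalizedCounts   # or carry W into the DE model as a covariate
```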
So, finally getting into the framework of what's the appropriate statistical model for differential expression, I wanted to show you these two samples across a couple of different scenarios to get your intuition and your instinct going. If we look at each row as a gene, in this first row sample B has twice as many reads as sample A, and in the second row sample B also has twice as many reads as sample A, but instinctually we know that the second gene is much more likely to be differentially expressed than the one that goes simply from one read to two; we're much more confident in this difference than we are in that difference. Laura, there's a question from David that I think we might want to handle right now. He asked: is RUVSeq, or the equivalent, still possible if you don't have a binary phenotype or a binary categorization of samples? I think that's a good question. Hypothetically it is still possible, because part of the issue with these latent factor analyses is that you don't want to adjust for your factor of interest, right? So you need to tell the model: don't eliminate effects due to strain, don't eliminate effects due to genotype. You would have to put that quantitative value into the RUV model, and the reason it often works with those empirical negative controls is that they're called empirical negative controls precisely because they don't associate with your quantitative trait. So as long as you're doing that to identify those empirical control genes, it shouldn't matter what your phenotype of interest is. So, just getting back to the instincts
here: we'd also recognize that this 100-to-200 change is more dramatic than a difference of 100 between 1100 and 1200, and that 0 to 2 is not very impressive but 0 to 2000 probably is. As we start thinking through these examples and developing this instinct, we realize that some of our traditional linear-model-based tools aren't going to work for this data. What's typically used for RNA-seq data instead of a linear model is a negative binomial model. For a negative binomial model, the idea is: how many failures do you have before a given number of successes? I don't know if this example is spurred by missing being in the office, but say we want to find five people who have seen the movie Office Space; in the negative binomial model we're asking how many people we will have to ask before we find those five people. Just like our instincts from before, if we have to ask 20 people to find five people who have seen it, then we're guessing that about 25 percent of our population has seen Office Space; but if we instead switch it from five people to 50 people, so our numbers get bigger, our confidence and the precision of our estimate get better. So when it comes to RNA-seq, we're going to use this type of model that deals with discrete observations, because we're talking about discrete reads, to identify genes that are expressed in different proportions across the groups. (A small sketch of that intuition is below.)
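Here's a small sketch of that precision intuition using R's built-in negative binomial functions. The proportion and target counts are just the Office Space toy numbers; the point is only that the estimate tightens as the target count grows.

```r
# Toy illustration: estimate a proportion from a negative binomial
# "ask until you find r successes" experiment; precision improves as r grows.
set.seed(7)
true_prop <- 0.25   # true fraction of people who have seen Office Space

estimate_prop <- function(r, n_sim = 5000) {
  # rnbinom gives the number of failures before the r-th success
  failures <- rnbinom(n_sim, size = r, prob = true_prop)
  est <- r / (r + failures)          # estimated proportion per simulated survey
  c(mean = mean(est), sd = sd(est))
}

rbind(r5 = estimate_prop(5), r50 = estimate_prop(50))   # sd shrinks with larger r
```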
One of the most popular tools, again not the only tool, but one that we've had some success applying in our lab, is DESeq2. It uses this negative binomial model, but it also uses a concept of shrinkage, which helps stabilize the dispersion estimate. A dispersion estimate in a negative binomial model is, conceptually, kind of like the variance in a linear model, and what we're doing here is using information from all our genes to say that our estimate of the dispersion for this particular gene is a little bit off, so we're going to pull it in and make it a little bit better. This isn't anything new as far as empirical Bayes methods for handling dispersion go; if you were a microarray person, you probably used the same idea in the limma package. It's just a way to handle smaller sample sizes and not being able to estimate that variance component of your model very precisely. DESeq2 is easy to implement, it can handle multiple covariates, and it can even compare between models, so we can do what's called an omnibus test: if you think about a one-way ANOVA with three groups, you get that first test that asks whether there is any difference among the three, and if there is, then you can break it down. We can do those types of omnibus tests in DESeq2. One of the big drawbacks, as RNA-seq becomes more accessible and we can do bigger studies, is that it doesn't handle random effects, so we can't do longitudinal studies, we can't do replicates within a strain if strain is not our factor of interest, things like that. So the more complex statistical models we can't do with the current implementation of DESeq2. (A minimal sketch of a DESeq2 run, including an omnibus-style test, is below.)
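Here is a minimal sketch of what that looks like with the DESeq2 Bioconductor package, assuming a filtered count matrix and a data frame of sample information with a three-level genotype factor (the object names are placeholders). The LRT call at the end is the omnibus-style, ANOVA-like comparison mentioned above.

```r
# Minimal DESeq2 sketch (placeholder object names; see the DESeq2 vignette).
library(DESeq2)

# counts_filtered: genes x samples integer matrix
# sample_info: data.frame with a 'genotype' factor (e.g., WT / Het / KO)
dds <- DESeqDataSetFromMatrix(countData = counts_filtered,
                              colData   = sample_info,
                              design    = ~ genotype)

# Standard pairwise (Wald) analysis
dds <- DESeq(dds)
res <- results(dds, contrast = c("genotype", "KO", "WT"))
head(res[order(res$padj), ])

# Omnibus-style test: any difference among the three genotype groups
dds_lrt <- DESeq(dds, test = "LRT", reduced = ~ 1)
res_lrt <- results(dds_lrt)
```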
But luckily there are ways to transform our data to make it look somewhat normal, so that we can use some of the other tools in our statistics toolbox. False discovery rates: I'm not going to read all of this because I'm over time, but in general we're doing lots of comparisons, so we've got to do some kind of multiple testing correction, and in most omics studies we use something like a false discovery rate. (A one-line sketch of that is below.)
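As a tiny sketch, the Benjamini-Hochberg false discovery rate adjustment is built into base R (DESeq2 already reports it as the padj column), so on a vector of per-gene p-values it's essentially one line; pvals here is a placeholder, for example res$pvalue from the DESeq2 sketch above.

```r
# Benjamini-Hochberg FDR adjustment on a vector of per-gene p-values.
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05, na.rm = TRUE)   # number of genes passing a 5% FDR threshold
```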
And finally, there are some other tools; these are just resources for looking at alternative splicing and alternative polyadenylation. So, jumping right to the conclusions, I'll let you read those on your own since we're out of time, and we'll probably have more fun talking about them in the discussion later. But I did want to point out that this is sponsored by our center, we're trying to do some outreach here and some training, and then finally, thanks to the grants and all the wonderful people in my lab and my collaborators' labs.