How Ray Tracing (Modern CGI) Works And How To Do It 600x Faster

Josh's Channel
In which we explore ray tracing, the reason modern CGI can look so convincing, and ReSTIR, a recent ...
Video Transcript:
Looking back just 30 years ago, it was almost unimaginable the heights modern CGI would eventually reach. We've gone from clearly artificial, plasticky-looking abominations to beautiful, nearly photorealistic artworks. Now, if you're thinking "well, I can still tell that's CGI," you're wrong, because this is a photo and this is CGI. How did we get here? What algorithms enable such photorealistic results? Today we'll look at ray tracing, the dominant technique used for movies and artworks, and infrequently but increasingly in video games. We'll also look at ReSTIR, a recent innovation that substantially increases speed without sacrificing much quality.
At its core, ray tracing is able to replicate reality because it almost exactly replicates how light works. Light is generated by light sources, bounces around a scene, and eventually hits the retina or sensor of an observer. As it bounces around, the light becomes tinted by the colors of the surfaces it encounters; this is because a surface always absorbs a certain amount of the incoming light, and what's left over gets reflected. Ray tracing replicates this: it creates many mathematical objects called rays, emits them from light sources, and sees where they intersect objects in the scene. When they do, other rays are randomly fired from the points of intersection, hitting other points where yet more rays are emitted, repeating until the rays hit the simulated camera. The original color of the light and the colors of the surfaces the rays hit determine what color of light arrives at the camera, and the exact spot on the camera that a ray hits determines where on the image that color goes. If multiple rays hit the same spot, their colors are averaged together, and the more rays we fire, the brighter and less grainy the image gets. If you've taken pictures with an old film camera, you may be familiar with this effect, where long exposure times are required to get an image that isn't noticeably grainy.

Let's dig a bit deeper into the details of how this all works. Every ray of light starts its life imbued with a particular color given by the light source it's emitted from. In computer graphics we often represent colors using amounts of red, green, and blue light; we do this because it's enough to replicate most of the colors visible to the average person. Here are some examples of how to make different colors by combining different amounts of red, green, and blue light. What this means mathematically is that every color in computer graphics is represented by three numbers. Let's say our light source gives every ray it emits a color notated as (10, 10, 10), a very bright white. The next step is to shoot that ray to see which surface it first interacts with. Looks like it hit this blue surface. We multiply the color of the light with the color of the surface, because the latter represents what percentage of the light gets reflected instead of absorbed. The result is the color of the light once it bounces off the surface.
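For a concrete example with made-up numbers: if the surface's color is a blue like (0.1, 0.3, 0.9), the multiplication happens per channel,

(10, 10, 10) × (0.1, 0.3, 0.9) = (1, 3, 9)

so the reflected light keeps most of its blue and loses most of its red.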
Notice that all the colors of these objects have components less than one. This is because no real-life object reflects more light than is shone on it, respecting the all-important conservation of energy.

Another real-life detail that we have to take into account is that light is tinted differently depending on the angle it comes in at and the angle it leaves at. For example, if you look at a phone screen straight on, it's mostly black with just a hint of reflection, but when you tilt it flat, it reflects the surrounding environment almost perfectly, like a well-polished mirror. This discrepancy is encoded in something called the bidirectional reflectance distribution function. Now, this doesn't quite sound like English, so I'll break it down real quick. "Reflectance" means that this function tells us how light is tinted when it's reflected by a surface. "Bidirectional" tells us that the tinting can change based on an incoming direction, where the light is coming from, and an outgoing direction, where the light is headed. "Distribution" is kind of redundant in this phrase. What this all means is that we can plug an incoming direction and an outgoing direction into the BRDF to get the tinting of light passing through that path. A BRDF for a phone screen would show darker tinting at steep angles and brighter tinting at shallow angles.

We can graph a BRDF like this. The visualization shows us how bright the tinting is for all possible outgoing directions given a particular incoming direction; so, for example, a lot of light is getting reflected in this direction, but not much is getting reflected in this direction. Notice that some amount of the incoming light gets scattered in all directions; this is called the diffuse component of the BRDF. Some light is instead directed sharply towards a particular output direction; this is called the specular component.
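As a rough illustration, here's what a toy diffuse-plus-specular BRDF could look like in code. The function and all material numbers are made up, and this is not the specific BRDF model shown in the video:

```python
import numpy as np

def simple_brdf(w_in, w_out, normal, albedo, specular, shininess):
    """Toy BRDF: a Lambertian diffuse lobe plus a Phong-style specular lobe.

    w_in points from the surface towards the light, w_out towards the camera,
    and all three direction vectors are assumed to be unit length.
    """
    # Diffuse: the surface color, scattered equally in all directions.
    diffuse = albedo / np.pi

    # Specular: bright only when w_out lines up with the mirror reflection of w_in.
    reflected = 2.0 * np.dot(normal, w_in) * normal - w_in
    spec = specular * max(np.dot(reflected, w_out), 0.0) ** shininess

    return diffuse + spec  # an RGB tint for this in/out direction pair

# Example: a reddish, fairly glossy material (made-up numbers).
tint = simple_brdf(
    w_in=np.array([0.0, 0.7071, 0.7071]),
    w_out=np.array([0.0, -0.7071, 0.7071]),
    normal=np.array([0.0, 0.0, 1.0]),
    albedo=np.array([0.8, 0.2, 0.2]),
    specular=0.5,
    shininess=32,
)  # roughly [0.75, 0.56, 0.56]: a specular highlight on top of a reddish base
```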
Different materials can have radically different BRDFs. For example, here's what cardboard looks like, and here's what a mirror looks like. This BRDF numerically reaches 70, suggesting that light reflected in that direction is 70 times more intense than the light coming in. This may seem like it's violating the conservation of energy; when I first saw this, I spent the better part of a day making sure I got the BRDF equations correct. It turns out that this does still conserve energy; it all depends on how we use the BRDF.

We start at each light source, creating a bunch of rays initialized with the red, green, and blue values of that light source. We then find out where these rays first intersect a surface. We pick an outgoing direction and use the BRDF of the surface we hit to see how we should tint the light. We then tint it accordingly and determine which surface we hit if we go in the outgoing direction, repeating the process when we get there. Eventually we'll reach the camera, in which case the final color carried by the ray is averaged into the image.

This last step is ultimately why our mirror BRDF conserves energy. Rays coming in from a handful of directions will produce concerningly high results, but these will ultimately be balanced out by other rays: some of the rays will hit the same spot in the image but pass through a path where the BRDF is zero. When this is averaged with the rays that bring lots of light into the camera, we get a result that conserves energy. Ultimately, the fact that the BRDF peaks at a concerningly high value doesn't matter, because the BRDF averages to a value less than 1.

Like most first attempts at computer algorithms, this exact technique is very slow. Let's look at why this is the case and how we can fix it. First, notice that many rays will have to bounce many times before hitting the camera. The core issue here is that the camera sensor is tiny and only accepts rays from a limited range of angles; this makes it very unlikely for any individual ray to hit the camera. We can solve this problem by doing the whole process in reverse.
Instead of shooting rays out of light sources and into the camera, we shoot rays out of the camera and into light sources. We then perform our tinting calculations in reverse, starting with the light source hit at the end and progressively multiplying by the BRDF results of earlier and earlier hit surfaces. This technique works well because typically more of the scene is occupied by lights than by cameras. However, doing things in reverse introduces some inaccuracies we'll need to fix.

For example, let's say we're rendering an image of a plane. With our old technique, light would be emitted by the light source and would bounce into the camera. Notice how, if we rotate this plane, the light reflected into the camera gets more spread out, resulting in a darker image. Now let's switch to shooting rays out of the camera. Notice how, no matter what angle we rotate the plane to, the rays can always hit the light source. There's nothing to indicate that a plane at a shallow angle should be darker than a plane at a sharp angle, since the light source emits the same brightness at all points. In general, this means that any part of a surface that isn't pointing directly towards a light source will show up brighter than it should in real life. We need to correct for this to get a realistic result.

What we want is a way to darken light that hits surfaces at low angles while preserving light that hits surfaces straight on. A useful construct here is the normal vector, which points one unit in the direction the surface is facing. The normal vectors of a sphere look like this, and on a cube they look like this. Now, what we want to find is how close this normal vector is to the direction we choose to simulate light coming from. We can determine this using a dot product, which, when used on two vectors of length 1, gives us a value of 1 when they point in the same direction and a progressively lower value as they drift apart, eventually reaching 0 when they are perpendicular. Tinting our light by this amount produces correct results.
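Here's a minimal sketch of that correction; the vectors and numbers are made up for illustration, and both directions are assumed to be unit length:

```python
import numpy as np

def cosine_tint(normal, light_dir):
    """Dot product of two unit vectors: 1 when aligned, 0 when perpendicular.

    Clamping at 0 keeps surfaces facing away from the light from going negative.
    """
    return max(np.dot(normal, light_dir), 0.0)

normal = np.array([0.0, 0.0, 1.0])          # surface facing straight up
head_on = np.array([0.0, 0.0, 1.0])         # light coming straight down onto it
grazing = np.array([0.9806, 0.0, 0.1961])   # light coming in at a shallow angle

print(cosine_tint(normal, head_on))   # 1.0  -> full brightness
print(cosine_tint(normal, grazing))   # ~0.2 -> heavily darkened
```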
Now's a good time to look at what's called the rendering equation. Don't run away, I know it looks scary, but it's actually just fancy notation for all the things we've talked about so far. This equation tells us how much light is emitted by a particular surface in a particular direction. The direction the light leaves in is called omega_o, o for outgoing. This squiggle and this d omega_i form an integral; all it's saying is to take the average of the inner expression over all possible incoming light directions. It says that the amount of light emitted by a surface is affected by the light that comes in from any direction, called omega_i. The expression inside the integral says what color of light is coming from each direction. It does this by multiplying a few terms; remember that multiplication tints the color of light. This L term is the light coming in from a particular direction. This f term is the BRDF, telling us how much of the light specified by L is reflected in the direction we're currently talking about. And finally, this dot product tints the light according to the angle between the incoming light and the surface normal, so that light hitting at a shallower angle results in less light being emitted.
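Written out, the common form of the rendering equation looks like this (the on-screen notation may differ slightly, and the full version also adds an emission term for surfaces that are themselves light sources, which the video doesn't dwell on):

$$L_o(x, \omega_o) = \int_{\Omega} f(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, d\omega_i$$

with $L_i$ the incoming light, $f$ the BRDF, and $(\omega_i \cdot n)$ the dot product with the surface normal.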
Once you know what it means, this equation serves as a handy reminder of all the things that go into rendering. One point of difference between this equation and what we're actually doing is that the equation says to average over all possible directions, but we're just picking a single direction and running with it. This ultimately works the same as averaging over all possible directions, because if we shoot enough rays in enough random directions, we'll get close enough to the actual result that no one will be able to tell the difference. This kind of technique is actually a way to solve complex integrals in general; it's referred to as a Monte Carlo method because it relies on random sampling to get close to a complex result.

A more involved way to speed up our algorithm is to sample more often the directions that contribute more light to the final image. In this scene, we can sample more often the directions that lead immediately to this large light source, and sample other, dimmer directions less often. Simply dividing the result to account for the fact that these few directions are technically unlikely and make up only a small portion of the possible directions is all that's needed to produce a correct result from this kind of technique. This approach is called importance sampling, because we're favoring more important paths through the scene over less important ones. This is one of the stepping stones to understanding ReSTIR, so it's worth getting into some more of the details.

In ReSTIR, we're solving a slightly different problem than the one we've been looking at. ReSTIR solves the problem of direct lighting: how light bounces at most once off an object and into the camera. We're not dealing with indirect lighting here, where light bounces more than once, though we'll later generalize the algorithm to work with such cases. In ReSTIR, instead of using our old integral, we use this newer, shinier one. The major difference is that instead of averaging over all possible directions, we're now averaging over all possible points on all possible light sources. This, however, introduces a couple of problems, just like going backwards did earlier. For one, we need this new V term that excludes light sources that aren't visible. The value x being plugged in is the ray from the current point to the considered light source, and the value of V is zero whenever said ray is blocked by an object. Another difference is that we have to introduce something called the inverse-square law. With our old system, if an object were farther away from a light source, fewer rays would hit it, resulting in it looking darker. With our new system, we're averaging over every light-emitting point all the time, so we always have the same number of rays going from the surface to the light source, meaning it would always appear to be the same brightness. This inverse-square term corrects for that; I'll be skipping over the explanation for brevity, but you can pause if you want the details.
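For reference, the area form of this direct-lighting integral, roughly as it appears in the ReSTIR paper (my notation may differ slightly from the on-screen version), is:

$$L_o(x, \omega_o) = \int_{A} f(x, \omega_i, \omega_o)\, L_e(y)\, V(x, y)\, \frac{(\omega_i \cdot n_x)(-\omega_i \cdot n_y)}{\|x - y\|^2}\, dA_y$$

where $y$ ranges over points on the light sources, $\omega_i$ is the direction from $x$ to $y$, $V(x, y)$ is 1 when $y$ is visible from $x$ and 0 when it's blocked, and the $\|x - y\|^2$ in the denominator is the inverse-square falloff.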
Anyway, to render an image, we want to find the average value of this inner expression. We're going to be exploring ways of shooting our rays more efficiently so that we get faster results. However, when talking about rays, it can be hard to tell what's going on; for example, the image on the left is actually a more optimal way to shoot rays than the image on the right, but you can't really tell because there's so much going on. Let's instead make a simpler function as an analogy for the part of the rendering equation we're trying to find the average of. Since there are many directions where little light comes in, our analogous function should have a low value in most places; since there are only a few directions where a lot of light comes from, our analogous function should have a high value in only a few places. We'll now explore better ways to find the average of this analogous function, occasionally looking back to see how what we find applies to rendering.

When we were looking at rendering earlier, we picked random points and sampled the function at those points, taking the average of the results. We could do the same thing here, and our result would eventually converge on the correct answer of one. But notice how most of the samples end up sampling places where the function is close to zero; this is like shooting most of our rays in directions where light doesn't exist. What we really care about are the lit parts of the scene, or the high parts of the function. We've already assumed that most of the function will be close to zero; we don't need hundreds of samples to confirm this fact. It would be good to take only a few samples of the low parts; we just have to take into consideration that these few samples will count for a large part of the function. We'll generate more samples at the high parts and fewer samples at the low parts. Formally, what we're doing is sampling according to a probability density function, or PDF for short. The graph we're trying to find the average of doubles as a graph of which points we should sample more often if we want a faster result; it is both the sampled function and the PDF we're using to sample it.

However, if we just take the average of these new samples, we'd end up with an unnaturally high result, since all the samples are clustered near the high parts. We instead want to use something called a weighted average: we give samples at the low parts of the function high weights, and samples at the high parts of the function low weights. This means that low samples will count for more, making up for the fact that there are fewer of them. Taking an average while considering these weights produces a correct result; we'll eventually arrive at the correct answer of 1 while needing fewer samples to actually get to that point.
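In symbols, this is the standard importance-sampling estimator (my notation, not necessarily what's shown on screen): if the samples $x_i$ are drawn according to a PDF $p$, then

$$\frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}$$

estimates the integral of $f$ (which, over a domain of size 1 like our analogy, is the same as its average). Each sample is weighted by $1/p(x_i)$, down-weighting regions that are sampled often and up-weighting regions that are sampled rarely.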
There is, however, one problem with this technique: we're assuming that it's possible to easily draw samples proportional to our PDF, that it's possible to make samples that are more tightly concentrated towards the higher regions. However, this usually requires that we can write the function as a simple mathematical expression. The light coming in from any particular direction is dependent on the rendering equation, which, as we've seen, is far from simple. So that means it's not exactly possible to draw samples such that brightly lit regions are sampled more often than dimly lit ones. Importance sampling is unfortunately a bit of a dead end right now. What would make importance sampling substantially better is if there were some kind of magical machine that could take our uniform samples and distribute them according to some complicated PDF. Too bad that hasn't been invented yet.

So, it turns out that in 2005 a technique called resampled importance sampling (RIS) was invented that does exactly what we want. It's a technique for turning bad samples into good samples and estimating the average of a function with those good samples. You start by generating a bunch of bad samples, say evenly distributed ones; these samples need to be distributed according to a simple PDF. Then you compute how much weight they have in the more complicated PDF over how much weight they have in the simpler PDF. Since our simple PDF has everything equally likely, the denominator is always 1, so all the weights are exactly as they are in the more complicated PDF. The way we would apply this to rendering would be to make the simple PDF pick random directions to shoot rays in, while the complex PDF computes how much light is coming from each direction; this results in brighter directions having a higher weight than darker ones. If one sample has a weight of 1 but another has a weight of 4, that's saying the latter should appear 4 times as often if the samples really were drawn from our more complicated PDF.

The next step is to take all these samples and pick one according to their weights, higher-weighted samples being more likely to be selected. This sample can then be treated as if it was drawn from the more complex PDF. We can plug it into the function we want to find the average of and give it a weight of one over the complex PDF at that point, just like we did with importance sampling; remember that this weight compensates for certain regions being sampled more often than others. The result is an estimate of the average value of the function. In the case of our rendering algorithm, we can use it to efficiently find the average of the rendering equation, because this whole procedure we just did prefers samples pointing towards light over samples pointing towards darkness.
Unfortunately, this procedure has an issue: if we take only a small number of samples in the first step, the sample we pick won't actually conform to the more complex distribution like we wanted. For example, if we start with two samples from a flat PDF and pick from them with weights according to this more complicated PDF, the sample we get won't actually conform to the more complicated PDF; instead it conforms to a PDF sort of halfway in between. In other words, the sample we pick is still somewhat distributed according to a flat distribution. The way we fix this problem is by taking the average weight of all the bad samples and using it to weight our estimate of the average of the function. This means that if we happen to draw a bunch of bad samples, the result will be given a lower weight. It's not exactly intuitive, but this procedure perfectly cancels out the effect of only picking a few samples. If you're curious, I've linked the paper that introduces this technique in the description; it gives a full proof for why this last step actually works.

Like before with the rendering equation, let's write down all the steps involved in RIS using mathematical symbols. First, x1 through xm are values from the simple, bad PDF. We then compute a weight for each sample, which is how likely it is in the more complex PDF over how likely it is in the simpler PDF; if the simple PDF is one everywhere, like when drawing uniformly distributed samples, then this just reduces to the complex PDF. We then draw a sample y proportional to these weights, which is roughly like drawing a sample from the more complex PDF; this is our good sample. We plug the sample into the function f we're trying to find the average of and give it a weight of one over the complex PDF at that point; remember that this weight compensates for certain regions being sampled more often than others. If we were doing regular importance sampling, this would be our estimate, but since we're doing RIS, we need to do one final step: we need to multiply the weight by the average weight of all the bad samples. Remember that this term mysteriously corrects for the inaccuracies introduced by only picking a few samples at the start.
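Here's a minimal sketch of RIS on a toy 1D function; everything here, including the functions and the candidate count, is made up for illustration:

```python
import random
import math

# The "hard" function whose average over [0, 1] we want. It's near zero almost
# everywhere, with one tall, narrow peak (like a small bright light source).
def f(x):
    return 25.0 * math.exp(-((x - 0.7) / 0.02) ** 2)

# A cheap approximation of f used as the complex target PDF: it knows roughly
# where the peak is but not its exact shape.
def target_pdf(x):
    return 25.0 * math.exp(-((x - 0.7) / 0.03) ** 2)

def ris_estimate(num_candidates):
    # Step 1: bad samples from a simple PDF (uniform on [0, 1], so p(x) = 1).
    candidates = [random.random() for _ in range(num_candidates)]

    # Step 2: weight = target_pdf / simple_pdf (the denominator is 1 here).
    weights = [target_pdf(x) for x in candidates]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every candidate missed the peak entirely

    # Step 3: pick one candidate proportionally to its weight.
    y = random.choices(candidates, weights=weights, k=1)[0]

    # Step 4: evaluate f at the chosen sample, weight it by 1 / target_pdf(y),
    # then multiply by the average candidate weight to undo the error from
    # only having a few candidates.
    return f(y) / target_pdf(y) * (total / num_candidates)

# Averaging many RIS estimates converges to the true average of f (~0.886 here).
print(sum(ris_estimate(32) for _ in range(20000)) / 20000)
```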
Now that we know how it works, we can use RIS in our rendering algorithm. In ReSTIR, we start by generating a bunch of bad samples by picking random points on random light sources; an equal weight for all light-emitting points in the scene is our original PDF. We then weight these samples according to something called the unshadowed light contribution, which basically says how much light would bounce off the given surface from the given light point, assuming there's nothing blocking it. It takes into account how bright the light source is, how far away it is, what angle it hits at, and the BRDF of the surface. Higher weights are assigned to sample points that are bright, close, and hit the surface straight on; lower weights are given to sample points that are dark, far away, and hit the surface at shallow angles. This is our complex PDF. We don't include the visibility term at this stage, because computing it takes by far the most time out of all the parts of the rendering equation; this way we can filter out a bunch of bad candidates before ever needing to do this more intensive computation.
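In symbols (my shorthand, following the idea rather than the paper's exact notation), the target weight for a candidate light point $y$ as seen from the shading point $x$ is roughly

$$\hat{p}(y) \approx f(x, \omega_i, \omega_o)\, L_e(y)\, \frac{(\omega_i \cdot n_x)(-\omega_i \cdot n_y)}{\|x - y\|^2}$$

which is the integrand from before with the expensive visibility term $V$ left out.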
So now that we have our samples and our weights, we pick one of the samples according to the weights. Looks like this one got selected this time; it's not the brightest candidate we could have selected, but hey, that's randomness for you. We then do the full computation of the rendering equation, including the visibility term. We then weight this value to correct for the fact that some regions are being sampled more densely than others, and to correct for the error introduced by RIS, and we're done. We can repeat this process as many times as we'd like and average the results to get a progressively clearer picture.

Here are some results from the original ReSTIR paper. In their notation, the value M means how many bad samples were generated to pick from, and the value N is how many times they picked a sample to then plug into the rendering function. Notice how many of the pixels are dark in the earlier images: by generating only a very small number of bad samples, only a handful of pixels luck out and get good ones. But by generating loads of bad samples and picking, on average, only the best ones, we get a much better result.

Now, we've skipped over how exactly we're supposed to draw samples such that higher-weighted samples are more likely. One way is to arrange all the samples on a number line, with higher weights taking up larger space; we then pick a random number and see which part of the number line it corresponds to. While this works, it's not ideal: computationally, we have to check about half of the segments on average to determine which one the number is in, and we also have to make room to store each of the samples.
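A quick sketch of that number-line approach (the function name and structure are mine, purely illustrative):

```python
import random

def pick_weighted(samples, weights):
    """Lay the weights out on a number line and find where a random point lands."""
    r = random.uniform(0.0, sum(weights))
    running = 0.0
    for sample, weight in zip(samples, weights):
        running += weight
        if r <= running:
            return sample
    return samples[-1]  # guard against floating-point round-off
```

Note that this needs every sample kept in memory and a linear scan through the segments, which is exactly the complaint above.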
It would be nice if there were some kind of magic bucket we could throw all the samples into and extract a valid result out of at the end. Reservoir sampling is that magic bucket: you can throw samples into it and get one out at the end. It works by keeping a single sample that will be our result, as well as a running total of all the weights it's seen so far. When we throw a new sample into this bucket, we flip a coin, where heads is weighted the same as the incoming item and tails is weighted with the total weight of the reservoir. If it comes up heads, the incoming item replaces the current candidate; if not, the existing item stays and the new item is discarded. No matter the result of the coin flip, the weight of the item gets added to the total. If we also keep track of the weight of the current sample and the total number of items we've seen, then it becomes easy to compute the correction weight we saw earlier. All this means that when we're performing RIS, we can throw all the candidates into this magic bucket and afterwards get out the numbers we need for the rest of the procedure.
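Here's a minimal sketch of such a reservoir (the names and structure are mine; the paper's pseudocode differs in details). Each candidate is thrown in with its RIS weight, and at the end the reservoir can report the correction weight described earlier:

```python
import random

class Reservoir:
    """Weighted reservoir holding one sample; a sketch, not the paper's exact code."""

    def __init__(self):
        self.sample = None       # the single candidate currently held
        self.total_weight = 0.0  # running total of every weight thrown in
        self.count = 0           # how many candidates have been thrown in

    def add(self, sample, weight):
        # The weighted coin flip described above: the new item wins with
        # probability weight / total_weight, otherwise the held item stays.
        self.total_weight += weight
        self.count += 1
        if self.total_weight > 0.0 and random.random() < weight / self.total_weight:
            self.sample = sample

    def correction_weight(self, target_pdf_at_sample):
        # The factor RIS multiplies in: 1 over the target PDF at the chosen
        # sample, times the average weight of all the candidates thrown in.
        if self.sample is None or target_pdf_at_sample == 0.0:
            return 0.0
        return (1.0 / target_pdf_at_sample) * (self.total_weight / self.count)
```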
There's a useful property of reservoirs that we can take advantage of. Since multiple reservoirs each pick a pretty good candidate from what's been thrown into them, we can combine them into one super-reservoir that picks a pretty good candidate from everything that has been thrown into any of them. Notably, we don't actually have to throw in each of the items we threw into the other reservoirs to make this work; we just need to throw in the single sample picked by each reservoir.
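And a sketch of that combining step, under the assumption that the reservoirs being merged all used the same target PDF (reusing across frames or pixels needs the re-weighting described below):

```python
def combine(reservoirs):
    """Merge reservoirs by re-inserting only their chosen samples, each weighted
    by the total weight that reservoir has seen, then fixing the combined count."""
    merged = Reservoir()
    for r in reservoirs:
        if r.sample is not None:
            merged.add(r.sample, r.total_weight)
    # The merged reservoir has effectively seen every original candidate.
    merged.count = sum(r.count for r in reservoirs)
    return merged
```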
One of the ways we can leverage this is to reuse candidates from past frames. For example, if this pixel picked from 10 candidates on the previous frame and it picks 10 candidates on the current frame, we can combine both results very quickly to effectively select from 20 candidates; and if we use this for the next frame, which picks another 10 candidates, we'll effectively be picking from 30. This procedure is called temporal reuse, because we're reusing things across time. We do have to be careful of a couple of things, though. For one, both objects in the scene and the camera viewing the scene may move between frames. If you've seen my previous video, or just have experience with computer graphics in general, you'll know that we store the position, rotation, and scale of objects using matrices. By using the matrices for both the current frame and the previous frame, we're able to calculate, for every pixel in the current frame, the position it must have appeared at on the previous frame, allowing us to reuse the correct pixels. The other thing is that the lighting conditions may change, due to moving objects or to changing materials. This means we need to re-weight each sample to account for any changes that occur. Doing so is pretty straightforward: we first recompute the unshadowed light contribution to see how much weight we should give the sample under the new conditions; then the final weight we give each sample is the correction weight computed in RIS times the new, recomputed weight. What we're doing is effectively resampling the samples a second time: the samples start off in one PDF, the light contribution during the previous frame, and end up in another one, the light contribution during the current frame.
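In symbols (again my shorthand): a sample $y$ carried over from the previous frame is thrown back into the reservoir with weight

$$w_{\text{reuse}} = \hat{p}_{\text{current}}(y) \cdot W_{\text{previous}}$$

where $W_{\text{previous}}$ is the RIS correction weight it had under the old lighting conditions and $\hat{p}_{\text{current}}(y)$ is its unshadowed contribution recomputed under the new ones.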
Unfortunately, there is a drawback to resampling like this. The RIS weights we calculated are really only designed to compensate for the specific situation they were made for: generating a bunch of samples from one PDF and picking from those. It's not built for combining multiple samples from different PDFs, like samples from different frames with different lighting conditions. This inaccuracy is not present in regular Monte Carlo sampling, which has the nice property that all you need to do to get rid of its inaccuracies is run it for longer. The way we're abusing RIS breaks this characteristic: no matter how long you run our new algorithm for, the result will always be noticeably inaccurate; there's a limit to how close it can get to an accurate result. Algorithms that are inaccurate in a way that cannot be solved by running them for longer are referred to as biased. The ReSTIR paper offers a way to correct for this, but it perceptually looks worse than the biased technique due to being noticeably slower, and this video is getting long enough as is, so I'll skip over it for brevity.

Since we've come up with a technique for using samples that came from different situations, let's try applying it a bit more aggressively. We can add a step to our algorithm after temporal reuse to combine results from neighboring pixels: we can have each pixel look at its own result and the results of its eight neighbors and combine them all into a new super-reservoir. This technique is called spatial reuse, because we're reusing data across different parts of the image. This kind of reuse is helpful because neighboring pixels are often close to each other in the scene, meaning that they will want similar kinds of candidates; what's likely a good candidate for one pixel is likely a good candidate for its neighbors. We can then repeat spatial reuse, but since each pixel has already incorporated the best candidates from its immediate neighbors, we can instead tell every pixel to select candidates from its neighbors 3 pixels over. This effectively incorporates candidates from a total of 81 pixels while only doing calculations on 9 pieces of data. Repeating this process again with neighbors 9 pixels over yields a total area of 729 pixels, and again with neighbors 27 pixels over yields 6561 pixels.
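A sketch of those passes, again reusing the hypothetical Reservoir and combine helpers from above (the paper's actual passes differ in details such as rejecting dissimilar neighbors):

```python
def spatial_reuse(reservoirs, width, height, passes=4):
    """Each pass merges every pixel's reservoir with its 8 neighbors at a
    growing stride (1, 3, 9, 27 pixels), so good candidates spread quickly."""
    stride = 1
    for _ in range(passes):
        merged = {}
        for y in range(height):
            for x in range(width):
                neighbors = [reservoirs[(x, y)]]
                for dy in (-stride, 0, stride):
                    for dx in (-stride, 0, stride):
                        if (dx or dy) and (x + dx, y + dy) in reservoirs:
                            neighbors.append(reservoirs[(x + dx, y + dy)])
                merged[(x, y)] = combine(neighbors)
        reservoirs = merged
        stride *= 3
    return reservoirs
```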
With this technique, even if only one pixel in a thousand randomly happens upon a good candidate, by the end of the process the whole image will be filled with good candidates. You can imagine how this technique results in images being rendered hundreds of times faster than with the typical Monte Carlo technique. Here are some more results from the paper. We're back to only generating 32 candidates per pixel; notice how sparse everything looks on the left because of it. But as we progressively apply more and more spatial reuse, we get a progressively fuller image, as the rare but powerful candidates found by a few pixels propagate to their neighbors.

One final thing to wrap up is how to generalize this algorithm to work with indirect lighting, where light hits multiple surfaces before entering the camera. The procedure is not too different. First, instead of generating candidates on random light sources, we generate candidates by shooting rays in random directions and seeing where they hit. We then simulate multiple bounces for each of these points to get an estimate of how much light reaches that point. These are our samples, and our complex PDF is how much of this light reaches and bounces off of the original surface. We can then select from and reuse these samples just like we did before, and that's pretty much it. If you want more details, I've linked both the original ReSTIR paper and the paper that applies it to indirect lighting in the description.

And with that, we're done. We've covered how ray tracing generates realistic images by simulating how light works in the real world. We've looked at how this can be improved by sampling bright sources more regularly. We've looked at resampled importance sampling, a practical technique for accomplishing such a thing. We checked out reservoir sampling, a neat way to implement RIS that also lets us easily combine samples from different domains. We used this to build two techniques, spatial and temporal reuse, to reuse samples across space and time respectively. Finally, we covered how to generalize these techniques to work with global illumination and not just direct lighting. Now that you have an understanding of these techniques, I would encourage anyone who has some coding experience to take a shot at implementing these algorithms; it's very rewarding to see how effective they are at improving the quality of rendering. All this is the power behind modern CGI, which, when used well, can pass for photography. I hope you learned something interesting, and thanks for watching.