Video games have spectacular graphics, capable of transporting you to incredibly detailed cities, heart-racing battlegrounds, magical worlds, and breathtaking environments. While this may look like an old western train station and locomotive from Red Dead Redemption 2, it’s actually composed of 2.1 million vertices assembled into 3.5 million triangles, with 976 colors and textures assigned to the various surfaces, all with a virtual sun illuminating the scene below. But perhaps the most impressive fact is that these vertices, textures, and lights are entirely ones and zeroes, continuously being processed inside your computer’s graphics card or video game console. So then, how does your computer take billions of ones and zeroes and turn them into realistic 3D graphics?
Well, let’s jump right in. The video game graphics rendering pipeline has three key steps: Vertex Shading, Rasterization, and Fragment Shading. While additional steps are used in many modern video games, these three core steps have been used for decades in thousands of video games for both computers and consoles and are still the backbone of the video game graphics algorithm for pretty much every game you play.
Let’s begin with the first step called vertex shading. The basic idea in this step is to take all the objects’ geometries and meshes in a 3D space and use the field of view of the camera to calculate where each object falls in a 2D window called the view screen, which is the 2D image that’s sent to the display. In this train station scene, there are 1,100 different models and the camera’s field of view sections off what the player sees, reducing the number of objects that need to be rendered to 600.
Let’s focus on the locomotive as an example. Although this engine has rounded surfaces and some rather complex shapes, it’s actually assembled from 762 thousand flat triangles built on 382 thousand vertices, with 9 different materials or colors applied to the surfaces of the triangles. Conceptually, the entire train is moved as one piece onto the view screen, but in practice, each of the train’s hundreds of thousands of vertices is moved one at a time.
So, let’s focus on a single vertex. The process of moving a vertex, and by extension the triangles and the train, from the 3D world onto the 2D view screen is done using 3 transformations: first moving the vertex from model space to world space, then from world space to camera space, and finally, using a perspective projection based on the camera’s field of view, from camera space onto the view screen.
To perform this transformation we use the X, Y, and Z coordinates of that vertex in model space, the position, scale, and rotation of the model in world space, and finally the position and rotation of the camera along with its field of view. We plug all these numbers into different transformation matrices and multiply them together, resulting in the X and Y values of the vertex on the view screen as well as a Z value, or depth, which we’ll use later to determine which objects block others. After three vertices of the train are transformed using the same matrix math, we get a single triangle moved onto the view screen.
Then the rest of the 382 thousand vertices of the train and the 2.1 million vertices of all 600 objects in the camera’s field of view undergo a similar set of transformations, thereby moving all 3.5 million triangles onto the 2D view screen.
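To make this chain of transformations concrete, here’s a minimal Python sketch of moving one vertex from model space all the way to the view screen. The matrix values, field of view, and vertex position are illustrative stand-ins, not numbers from any real game or engine:

```python
import numpy as np

def perspective(fov_y_deg, aspect, near, far):
    """Perspective projection matrix (OpenGL-style), built from the camera's field of view."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0,  0.0,                          0.0],
        [0.0,        f,    0.0,                          0.0],
        [0.0,        0.0,  (far + near) / (near - far),  2 * far * near / (near - far)],
        [0.0,        0.0, -1.0,                          0.0],
    ])

# Illustrative matrices: the model sits 10 units in front of a camera at the origin.
model_to_world = np.eye(4)
model_to_world[:3, 3] = [0.0, 0.0, -10.0]      # position (translation) of the model in the world
world_to_camera = np.eye(4)                     # camera at the origin, looking down -Z
projection = perspective(fov_y_deg=60.0, aspect=16 / 9, near=0.1, far=1000.0)

# A single vertex in model space, written as a homogeneous (x, y, z, 1) vector.
vertex_model = np.array([1.0, 2.0, 0.0, 1.0])

# Multiply the transformation matrices together, then apply them to the vertex.
clip = projection @ world_to_camera @ model_to_world @ vertex_model

# Perspective divide gives normalized device coordinates in the range [-1, 1].
ndc = clip[:3] / clip[3]

# Map onto a 3840 x 2160 view screen, keeping ndc[2] as the depth for the Z-buffer later.
screen_x = (ndc[0] * 0.5 + 0.5) * 3840
screen_y = (1.0 - (ndc[1] * 0.5 + 0.5)) * 2160
depth = ndc[2]
print(screen_x, screen_y, depth)
```

Typically the combined matrix is built once per object per frame and then reused for every one of that object’s vertices, which is exactly the kind of repetitive math a GPU parallelizes well.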
This is an incredible amount of matrix math, but GPUs in graphics cards and video game consoles are designed to be triangle-mesh-rendering monsters and have evolved over decades to handle millions of triangles every few milliseconds. For example, this GPU has roughly 10,000 cores designed to efficiently execute up to 35 trillion 32-bit multiply and add operations every second, and, by distributing the vertex coordinates and transformation data among the cores, the GPU can easily render the scene at 120 or more frames a second. Now that we have all the vertices moved onto a 2D plane, the next step is to take the 3 vertices of a single triangle and figure out which specific pixels on your display are covered by that triangle.
This process is called rasterization. A 4K monitor or TV has a resolution of 3840 by 2160, yielding around 8.3 million pixels.
Using the X and Y coordinates of the vertices of a given triangle on the view screen, your GPU calculates where it falls within this massive grid and which of the pixels are covered by that particular triangle. Next, those pixels are shaded using the texture or color assigned to that triangle. Thus, with rasterization, we turn triangles into fragments which are groups of pixels that come from the same triangle and share the same texture or color.
Then we move on to the next triangle and shade in the pixels that are covered by it, and we continue to do this for each of the 3.5 million triangles that were previously moved onto the view screen. By applying the red, green, and blue color values of each triangle to the appropriate pixels, a 4K image is formed in the frame buffer and sent to the display.
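As a rough illustration of how a rasterizer decides which pixel centers a triangle covers, here’s a simplified Python sketch using edge functions over the triangle’s bounding box. Real GPU rasterizers are massively parallel and far more optimized; the function names, the triangle, and the tiny 16x16 grid are just for illustration:

```python
def edge(a, b, p):
    """Signed-area test: which side of the edge a->b the point p falls on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Yield the (x, y) pixels whose centers are covered by the triangle v0-v1-v2."""
    min_x = max(int(min(v0[0], v1[0], v2[0])), 0)
    max_x = min(int(max(v0[0], v1[0], v2[0])), width - 1)
    min_y = max(int(min(v0[1], v1[1], v2[1])), 0)
    max_y = min(int(max(v0[1], v1[1], v2[1])), height - 1)
    area = edge(v0, v1, v2)
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)                      # sample at the pixel's center
            w0, w1, w2 = edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) if area > 0 else \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                yield x, y                               # this pixel belongs to the fragment

# Example: one triangle on a small 16x16 grid, shaded with a single flat color.
color = (200, 30, 30)                                    # red, green, blue
fragment = {(x, y): color for x, y in rasterize((2, 2), (13, 4), (5, 14), 16, 16)}
print(len(fragment), "pixels covered by this triangle")
```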
You’re probably wondering how we account for triangles that overlap or block other triangles. For example, the train is blocking the view of much of the train station. Additionally, the train has hundreds of thousands of triangles on its backside that are sent through the rendering pipeline, but obviously don’t appear in the final image.
Determining which triangles are in front is called the visibility problem and is solved by using a Z-buffer or depth buffer. A Z-buffer adds an extra value to each of the 8.3 million pixels corresponding to the distance, or depth, that each pixel is from the camera.
In the previous step, when we did the vertex transformations, we ended up with X and Y coordinates, but we also got a Z value that corresponds to the distance from the transformed vertex to the camera. When a triangle is rasterized, it covers a set of pixels, and the Z value or depth of the triangle is compared with the values stored in the Z-buffer. If the triangle’s depth values are lower than those in the Z-buffer, meaning the triangle is closer to the camera, then we paint in those pixels using the triangle’s color and replace the Z-buffer’s values with that triangle’s Z values.
However, let’s say a second triangle comes along with Z values that are higher than those in the Z-buffer, meaning the triangle is further away. We just throw it out and keep the pixels from the triangle that was previously painted with lower Z-values. Using this method, only the closest triangles to the camera with the lowest Z-values will be displayed on the screen.
By the way, here’s the image of the Z or Depth buffer, wherein black is close and white is far. Note that because these triangles are in 3D space, the vertices often have 3 different Z values, and thus each individual pixel of the triangle needs its Z value computed using the vertex coordinates. This allows intersecting triangles to properly render out their intersections pixel by pixel.
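Here’s a minimal sketch of that depth test in Python. The buffer dimensions match a 4K frame, but the pixel coordinates, depths, and colors are made-up values purely for illustration:

```python
import numpy as np

WIDTH, HEIGHT = 3840, 2160

# One depth value and one color per pixel; start "infinitely far away" and black.
z_buffer = np.full((HEIGHT, WIDTH), np.inf, dtype=np.float32)
frame_buffer = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)

def shade_pixel(x, y, depth, color):
    """Keep this fragment only if it's closer to the camera than whatever is already there."""
    if depth < z_buffer[y, x]:           # a lower Z value means closer to the camera
        z_buffer[y, x] = depth           # remember the new closest depth
        frame_buffer[y, x] = color       # paint the pixel with this triangle's color

# At the same pixel, the locomotive (closer) should win over the station wall (farther).
shade_pixel(1920, 1080, depth=0.60, color=(120, 120, 130))   # station wall fragment
shade_pixel(1920, 1080, depth=0.25, color=(20, 20, 20))      # locomotive fragment, closer
print(frame_buffer[1080, 1920], z_buffer[1080, 1920])         # locomotive color and depth remain
```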
One issue with rasterization is that if a triangle’s edge cuts across a pixel but still covers the pixel’s center, the entire pixel is painted with that triangle’s color, resulting in jagged, pixelated edges. To reduce the appearance of these jagged edges, graphics processors implement a technique called Super Sampling Anti-Aliasing. With SSAA, 16 sampling points are distributed across a single pixel, and when a triangle cuts through a pixel, a fractional shade of the triangle’s color is applied to the pixel in proportion to how many of the 16 sampling points the triangle covers, resulting in faded edges and significantly less noticeable pixelization.
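The coverage idea can be sketched in a few lines of Python: sample 16 points inside a pixel, count how many fall inside the triangle, and blend the triangle’s color with the background by that fraction. This is a simplified illustration of the concept described above; the triangle, colors, and pixel chosen here are arbitrary, and real anti-aliasing hardware works quite differently under the hood:

```python
def inside(v0, v1, v2, p):
    """True if point p lies inside the triangle (vertices given in a consistent winding order)."""
    def edge(a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return edge(v0, v1) >= 0 and edge(v1, v2) >= 0 and edge(v2, v0) >= 0

def pixel_coverage(v0, v1, v2, px, py, grid=4):
    """Fraction of a 4x4 grid of sample points (16 total) inside the triangle for pixel (px, py)."""
    hits = 0
    for j in range(grid):
        for i in range(grid):
            sample = (px + (i + 0.5) / grid, py + (j + 0.5) / grid)  # evenly spaced sub-pixel samples
            hits += inside(v0, v1, v2, sample)
    return hits / (grid * grid)

# A triangle edge slices through pixel (7, 5): blend its color with the background by the covered fraction.
coverage = pixel_coverage((0, 0), (10, 2), (4, 10), 7, 5)
triangle_color, background = (200, 30, 30), (90, 140, 220)
blended = tuple(round(t * coverage + b * (1 - coverage)) for t, b in zip(triangle_color, background))
print(f"{coverage * 100:.0f}% covered ->", blended)
```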
One thing to remember is that when you’re playing a video game, your character’s camera view as well as the objects in the scene are continuously moving around. As a result, the process and calculations within vertex shading, rasterization, and fragment shading are recalculated for every single frame, once every 8.3 milliseconds for a game running at 120 frames a second.
Let’s move on to the next step, which is Fragment Shading. Now that we have a set of pixels corresponding to each triangle, it’s not enough to simply paint by number to color the pixels. Rather, to make the scene realistic, we have to account for the direction and strength of the light or illumination, the position of the camera, reflections, and shadows cast by other objects.
Fragment shading is therefore used to shade in each pixel with accurate illumination to make the scene realistic. As a reminder, fragments are groups of pixels formed from a single rasterized triangle. Let’s see the fragment shader in action.
This train engine is mostly made of black metal, and if we apply the same color to each of its pixel fragments, we get a horribly inaccurate train. But once we apply proper shading, such as making the bottom darker and the top lighter, and by adding in specular highlights or shininess where the light bounces off the surface, we get a realistic black metal train. Additionally, as the sun moves in the sky, the shading on the train reflects the passage of time throughout the day, and, if it’s night, the materials and colors of all the objects are darker and illuminated by the light of the fire.
Even video games such as Super Mario 64, which is almost 30 years old, have some simple shading where the colors of surfaces are changed by the lighting and shadows in the scene. So, let’s see how fragment shading works. The basic idea is that if a surface is pointing directly at a light source such as the sun, it’s shaded brighter, whereas if a surface is facing perpendicular to, or away from, the light, it’s shaded darker.
In order to calculate a triangle’s shading, there are two key details we need to know. First, the direction of the light and second, the direction the triangle’s surface is facing. Let’s continue to use the locomotive as an example and paint it bright red instead of black.
As you already know, this train is made of 762 thousand flat triangles, many of which face in different directions. The direction that an individual triangle is facing is called its surface normal, which is simply the direction perpendicular to the plane of the triangle, kind of like a flagpole sticking out of the ground. To calculate a triangle’s shading, we take the cosine of the angle, or theta, between the two directions.
The cosine theta value is 1 when the surface faces the light directly and 0 when the surface is perpendicular to the light. Next, we multiply cosine theta by the intensity of the light and then by the color of the material to get the properly shaded color of that triangle. This process adjusts each triangle’s RGB values, and as a result, we get a range from light to dark across a surface depending on how its individual triangles face the light.
However, if the surface is perpendicular to the light or facing away from it, cosine theta is 0 or negative, which on its own would result in a pitch-black surface. Therefore, we clamp the value to a minimum of 0 and add in an ambient light intensity times the surface color, and we adjust this ambient light so that it’s higher in daytime scenes and closer to 0 at night. Finally, when there are multiple light sources in a scene, we perform this calculation multiple times with different light directions and intensities and then add the individual contributions together.
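Putting the whole shading formula together, here’s a small Python sketch of the clamped cosine-theta calculation with an ambient term and multiple lights. The light directions, intensities, and the bright red base color are made-up example values:

```python
import numpy as np

def shade(surface_normal, base_color, lights, ambient=0.15):
    """Diffuse shading: clamped cos(theta) * intensity per light, plus an ambient term."""
    n = surface_normal / np.linalg.norm(surface_normal)
    color = ambient * base_color                       # ambient keeps unlit surfaces from going pitch black
    for light_dir, intensity in lights:
        l = light_dir / np.linalg.norm(light_dir)
        cos_theta = max(np.dot(n, l), 0.0)             # clamp so surfaces facing away contribute nothing
        color = color + cos_theta * intensity * base_color
    return np.clip(color, 0, 255).astype(int)

bright_red = np.array([200.0, 30.0, 30.0])             # the locomotive painted bright red
sun = (np.array([0.3, 1.0, 0.2]), 1.0)                 # direction toward the light, and its intensity
campfire = (np.array([-1.0, 0.2, 0.0]), 0.4)

# A triangle facing up catches the sun; one facing sideways relies on the campfire and ambient light.
print(shade(np.array([0.0, 1.0, 0.0]), bright_red, [sun, campfire]))
print(shade(np.array([-1.0, 0.0, 0.0]), bright_red, [sun, campfire]))
```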
Having more than a few light sources is computationally intense for your GPU, and thus scenes limit the number of individual light sources and sometimes limit the range of influence for the lights so that triangles will ignore distant lights. The vector and matrix math used in rendering video game graphics is rather complicated, but luckily there’s a free and easy way to learn it, and that’s with Brilliant.org.
Brilliant is a multidisciplinary online interactive education platform and is the best way to learn math, computer science, and many other fields of science and engineering. Thus far we’ve been simplifying the math behind video game graphics considerably. For example, vectors are used to find the value of cosine theta between the direction of the light and the surface normal, and the GPU calculates it using the dot product of the two vectors divided by the product of their norms.
Additionally, we skipped a lot of detail when it came to 3D shapes and transformations from one coordinate system to another using matrices. Rather fittingly, Brilliant.org has entire courses on vector calculus, trigonometry, and 3D geometry, as well as courses on linear algebra and matrix math.
All of these have direct applications to this video and are needed to fully understand graphics algorithms. Alternatively, if you’re all set with math, we recommend their course on Thinking in Code, which will help you build a solid foundation in computational problem solving. Brilliant is offering a free 30-day trial with full access to their thousands of lessons.
It’s incredibly easy to sign up, try out some of their lessons for free and, if you like them, which we’re sure you will, you can sign up for an annual subscription. To the viewers of this channel, Brilliant is offering 20% off an annual subscription to the first 200 people who sign up. Just go to brilliant.org/brancheducation. The link is in the description below. Let’s get back to exploring fragment shading.
One key problem is that each triangle within an object has only a single normal, and thus the triangle shares the same color across its entire surface. This is called flat shading and is rather unrealistic when viewed on curved surfaces such as the body of this steam engine. So, in order to produce smooth shading, instead of using surface normals, we use one normal for each vertex, calculated as the average of the normals of the adjacent triangles.
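As a quick sketch of how those vertex normals could be computed, here’s a tiny Python example that takes the cross product of each triangle’s edges to get its face normal and then averages the face normals around a shared vertex. The two-triangle mini-mesh is purely illustrative:

```python
import numpy as np

def face_normal(a, b, c):
    """Surface normal of one triangle: the cross product of two of its edges, normalized."""
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

# Illustrative mini-mesh: four vertices, two triangles that both touch vertex 0.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
triangles = [(0, 1, 2), (0, 3, 1)]

# The vertex normal is the re-normalized average of the face normals of every adjacent triangle.
adjacent = [face_normal(*vertices[list(tri)]) for tri in triangles if 0 in tri]
vertex_normal = np.mean(adjacent, axis=0)
vertex_normal /= np.linalg.norm(vertex_normal)
print(vertex_normal)
```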
Next, we use a method called barycentric coordinates to produce a smooth gradient of normals across the surface of a triangle. Visually it’s like mixing 3 different colors across a triangle, but instead we’re using the three vertex normal directions. For a given fragment we take the center of each pixel and use the vertex normals and coordinates of the pre-rasterized triangle to calculate the barycentric normal of that particular pixel.
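Here’s a simplified Python sketch of that calculation: compute the barycentric weights of a pixel center inside a screen-space triangle, then use them to blend the three vertex normals. The triangle coordinates, normals, and pixel position are made-up values for illustration:

```python
import numpy as np

def barycentric_weights(v0, v1, v2, p):
    """Barycentric coordinates of point p with respect to the screen-space triangle (v0, v1, v2)."""
    def double_area(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    area = double_area(v0, v1, v2)
    w0 = double_area(v1, v2, p) / area    # weight of v0: grows as p approaches that corner
    w1 = double_area(v2, v0, p) / area
    w2 = double_area(v0, v1, p) / area
    return w0, w1, w2

# Illustrative screen-space triangle and one normal per vertex (averaged from adjacent triangles).
v0, v1, v2 = (100, 100), (300, 120), (180, 340)
n0 = np.array([0.0, 0.0, 1.0])
n1 = np.array([0.7, 0.0, 0.7])
n2 = np.array([0.0, 0.7, 0.7])

# Blend the three vertex normals at one pixel center, just like mixing three colors across the triangle.
w0, w1, w2 = barycentric_weights(v0, v1, v2, p=(200.5, 180.5))
pixel_normal = w0 * n0 + w1 * n1 + w2 * n2
pixel_normal /= np.linalg.norm(pixel_normal)    # re-normalize so cos(theta) stays meaningful in shading
print(round(w0, 2), round(w1, 2), round(w2, 2), pixel_normal)
```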
Just like mixing the three colors across a triangle, this pixel’s normal will be a proportional mix of the three vertex normals of the triangle. As a result, when a set of triangles is used to form a curved surface, each pixel will be part of a gradient of normals, resulting in a gradient of angles facing the light, with pixel-by-pixel coloring and smooth shading across the surface. We want to say that this has been one of the most enjoyable videos to make, simply because we love playing video games, and seeing the algorithm that makes these incredible graphics has been a joy.
We spent over 540 hours researching, writing, modelling this scene from RDR2, and animating. If you could take a few seconds to hit that like button, subscribe, share this video with a friend, and write a comment below it would help us more than you think, so thank you. Thus far we’ve covered the core steps for the graphics rendering pipeline, however, there are many more steps and advanced topics.
For example, you might be wondering where ray tracing and DLSS, or Deep Learning Super Sampling, fit into this pipeline. Ray tracing is predominantly used to create highly detailed scenes with accurate lighting and reflections, typically found in TV and film, where a single frame can take tens of minutes or more to render. For video games, the primary visibility and shading of objects are calculated using the graphics rendering pipeline we discussed, but in certain video games ray tracing is used to calculate shadows, reflections, and improved lighting.
On the other hand, DLSS is an algorithm for taking a low-resolution frame and upscaling it to a 4K frame using a convolutional neural network. Therefore, DLSS is executed after ray tracing and the graphics pipeline have generated a low-resolution frame. One interesting note is that the latest generation of GPUs has 3 entirely separate architectures of computational resources, or cores.
CUDA or Shading cores execute the graphics rendering pipeline. Ray tracing cores are self-explanatory. And then DLSS is run on the Tensor cores.
Therefore, when you’re playing a high-end video game with Ray Tracing and DLSS, your GPU utilizes all of its computational resources at the same time, allowing you to play 4K games and render frames in less than 10 milliseconds each. Whereas if you were to solely rely on the CUDA or shading cores, then a single frame would take around 50 milliseconds. With that in mind, Ray Tracing and DLSS are entirely different topics with their own equally complicated algorithms, and therefore we’re planning separate videos that will explore each of these topics in detail.
Furthermore, when it comes to video game graphics, there are advanced topics such as Shadows, Reflections, UVs, Normal Maps and more. Therefore, we’re considering making an additional video on these advanced topics. If you’re interested in such a video let us know in the comments.
We believe the future will require a strong emphasis on engineering education and we’re thankful to all our Patreon and YouTube Membership Sponsors for supporting this dream. If you want to support us on YouTube Memberships, or Patreon, you can find the links in the description. This is Branch Education, and we create 3D animations that dive deeply into the technology that drives our modern world.
Watch another Branch video by clicking one of these cards or click here to subscribe. Thanks for watching to the end!