Video games have spectacular graphics, capable of transporting you to incredibly detailed cities, heart-racing battlegrounds, magical worlds, and breathtaking environments. While this may look like an old western train station and locomotive from Red Dead Redemption 2, it’s actually composed of 2.1 million vertices assembled into 3.5 million triangles with 976 colors and textures assigned to the various surfaces, all with a virtual sun illuminating the scene below.
But perhaps the most impressive fact is that these vertices, textures, and lights are entirely composed of ones and zeroes that are continuously being processed inside your computer’s graphics card or a video game console. So then, how does your computer take billions of ones and zeroes and turn them into realistic 3D graphics?
Well, let’s jump right in. The video game graphics rendering pipeline has three key steps: Vertex Shading, Rasterization,  and Fragment Shading. While additional steps are used in many modern video games, these three core  steps have been used for decades in thousands of video games for both computers and consoles and  are still the backbone of the video game graphics algorithm for pretty much every game you play.
Let’s begin with the first step called vertex shading. The basic idea in this step is to  take all the objects’ geometries and meshes in a 3D space and use the field of view of the  camera to calculate where each object falls in a 2D window called the view screen, which  is the 2D image that’s sent to the display. In this train station scene, there are 1,100  different models and the camera’s field of view sections off what the player sees, reducing the  number of objects that need to be rendered to 600.
Let’s focus on the locomotive as an example. Although this engine has rounded surfaces and some rather complex shapes, it’s actually assembled from 762 thousand flat triangles using 382 thousand vertices and 9 different materials or colors applied to the surfaces of the triangles. Conceptually, the entire train is moved as one piece onto the view screen, but actually, each of the train’s hundreds of thousands of vertices is moved one at a time.
So, let’s focus on a single vertex. The  process of moving a vertex, and by extension, the triangles and the train, from a 3D world onto  a 2D view screen is done using 3 transformations. First moving a vertex from model space to world  space, then from world space to camera space, and finally from the perspective field of view onto  the view screen.
To perform this transformation we use the X, Y, and Z coordinates of that vertex in model space, then the position, scale, and rotation of the model in world space, and finally the coordinates and rotation of the camera and its field of view. We plug all these numbers into different transformation matrices and multiply them together, resulting in the X and Y values of the vertex on the view screen as well as a Z value or depth, which we’ll use later to determine object blocking. After three vertices of the train are transformed using similar matrix math, we get a single triangle moved onto the view screen.
Then the rest of the 382 thousand vertices of the train and the 2.1 million vertices of all the 600 objects in the camera’s field of view undergo a similar set of transformations, thereby moving all 3.5 million triangles onto a 2D view screen.
This is an incredible amount of matrix math, but GPUs in graphics cards and video game consoles are designed to be triangle mesh rendering monsters and thus have evolved over decades to handle millions of triangles every few milliseconds. For example, this GPU has 10,000ish cores designed to efficiently execute up to 35 trillion operations of 32-bit multiplication and addition every second, and, by distributing the vertex coordinates and transformation data among the cores, the GPU can easily render the scene, resulting in 120 or more frames a second.
Now that we have all the vertices moved onto a 2D plane, the next step is to use the 3 vertices of a single triangle and figure out which specific pixels on your display are covered by that triangle.
This process is called rasterization. A 4K monitor or TV has a resolution of 3840 by 2160, yielding around 8.3 million pixels.
Using the X and Y coordinates of the vertices  of a given triangle on the view screen, your GPU calculates where it falls within this  massive grid and which of the pixels are covered by that particular triangle. Next, those pixels  are shaded using the texture or color assigned to that triangle. Thus, with rasterization,  we turn triangles into fragments which are groups of pixels that come from the same  triangle and share the same texture or color.
Then we move on to the next triangle and shade in the pixels that are covered by it and continue to do this for each of the 3.5 million triangles that were previously moved onto the view screen. By applying the red, green, and blue color values of each triangle to the appropriate pixels, a 4K image is formed in the frame buffer and sent to the display.
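As a rough sketch of the rasterization step, here’s one common way to decide which pixels a triangle covers: test each pixel center against the triangle’s three edges. The function names, the tiny 8 by 8 grid, and the choice to count pixels whose centers land exactly on an edge are assumptions for illustration, not how any particular GPU is wired.

```python
import math

def edge(a, b, p):
    """Signed area term: its sign says which side of edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covered_pixels(v0, v1, v2, width, height):
    """Return the (x, y) pixels whose centers fall inside the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return []  # degenerate triangle covers nothing
    xs, ys = (v0[0], v1[0], v2[0]), (v0[1], v1[1], v2[1])
    # Only visit pixels inside the triangle's bounding box, clipped to the screen.
    min_x = max(int(math.floor(min(xs))), 0)
    max_x = min(int(math.ceil(max(xs))), width - 1)
    min_y = max(int(math.floor(min(ys))), 0)
    max_y = min(int(math.ceil(max(ys))), height - 1)
    pixels = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            center = (x + 0.5, y + 0.5)
            # Dividing by the area makes the test independent of winding order.
            w0 = edge(v1, v2, center) / area
            w1 = edge(v2, v0, center) / area
            w2 = edge(v0, v1, center) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                pixels.append((x, y))
    return pixels

def fill_triangle(frame, v0, v1, v2, color):
    """Paint every covered pixel of the frame buffer with the triangle's color."""
    height, width = len(frame), len(frame[0])
    for x, y in covered_pixels(v0, v1, v2, width, height):
        frame[y][x] = color

# A tiny 8x8 frame buffer standing in for the 4K grid.
frame = [[(0, 0, 0) for _ in range(8)] for _ in range(8)]
fill_triangle(frame, (0, 0), (4, 0), (0, 4), (255, 0, 0))
for row in frame:
    print("".join("#" if pixel != (0, 0, 0) else "." for pixel in row))
```

Restricting the loop to the bounding box is the key saving: a small triangle only touches a handful of the 8.3 million pixels.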
You’re probably wondering how we account  for triangles that overlap or block other triangles. For example, the train is blocking the  view of much of the train station. Additionally, the train has hundreds of thousands of triangles  on its backside that are sent through the rendering pipeline, but obviously don’t appear in  the final image.
Determining which triangles are in front is called the visibility problem and is solved by using a Z-buffer or Depth Buffer. A Z-Buffer adds an extra value to each of the 8.3 million pixels corresponding to the distance or depth of each pixel from the camera.
In the previous step, when we did the vertex transformations, we ended up with X and Y coordinates, but then also got a Z value that corresponds to the distance from the transformed vertex to the camera. When a triangle is rasterized, it covers a set of pixels and the Z value or depth of the triangle is compared with the values stored in the Z-Buffer. If the triangle’s depth values are lower than those in the Z-buffer, meaning the triangle is closer to the camera, then we paint in those pixels using the triangle’s color and replace the Z-buffer’s values with that triangle’s Z-values.
However, let’s say a second triangle comes along  with Z values that are higher than those in the Z-buffer, meaning the triangle is further away.  We just throw it out and keep the pixels from the triangle that was previously painted  with lower Z-values. Using this method, only the closest triangles to the camera with the  lowest Z-values will be displayed on the screen.
By the way, here’s the image of the Z or Depth  buffer, wherein black is close and white is far. Note that because these triangles are in 3D space,  the vertices often have 3 different Z values, and thus each individual pixel of the triangle needs  its Z value computed using the vertex coordinates. This allows intersecting triangles to properly  render out their intersections pixel by pixel.
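Here’s a small sketch of the Z-buffer test with a per-pixel depth computed from the three vertex Z values. The function names, the 4 by 4 buffers, and the colors are made up for the example, and it assumes the depth values can be blended linearly across the triangle on the view screen.

```python
def edge(a, b, p):
    """Signed area term used to test which side of edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def draw_triangle(frame, zbuffer, verts, color):
    """Rasterize one triangle of (x, y, z) vertices with a depth test per pixel."""
    height, width = len(frame), len(frame[0])
    v0, v1, v2 = verts
    area = edge(v0, v1, v2)
    if area == 0:
        return
    for y in range(height):
        for x in range(width):
            center = (x + 0.5, y + 0.5)
            w0 = edge(v1, v2, center) / area
            w1 = edge(v2, v0, center) / area
            w2 = edge(v0, v1, center) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue  # pixel center is outside this triangle
            # Each pixel gets its own depth, mixed from the three vertex depths.
            z = w0 * v0[2] + w1 * v1[2] + w2 * v2[2]
            if z < zbuffer[y][x]:
                # Closer than what is stored: paint the pixel, record the depth.
                zbuffer[y][x] = z
                frame[y][x] = color
            # Otherwise the fragment is hidden and simply thrown out.

SIZE = 4
RED, BLUE = (255, 0, 0), (0, 0, 255)
frame = [[(0, 0, 0) for _ in range(SIZE)] for _ in range(SIZE)]
zbuffer = [[float("inf") for _ in range(SIZE)] for _ in range(SIZE)]

# A near red triangle is drawn first, then a far blue one that overlaps it.
draw_triangle(frame, zbuffer, [(0, 0, 0.2), (4, 0, 0.2), (0, 4, 0.2)], RED)
draw_triangle(frame, zbuffer, [(0, 0, 0.9), (8, 0, 0.9), (0, 8, 0.9)], BLUE)
print(frame[0][0], frame[3][3])
```

Because the comparison happens pixel by pixel, the drawing order of the triangles doesn’t change the final image, and intersecting triangles sort themselves out along their line of intersection.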
One issue with rasterization and these pixels is  that if the triangle cuts at an angle and passes through the center of the pixel, then the  entire pixel is painted with that triangle’s color resulting in jagged and pixelated edges. To reduce the appearance of these jagged edges, graphics processors implement a technique  called Super Sampling Anti-Aliasing. With SSAA, 16 sampling points are distributed across a single  pixel, and when a triangle cuts through a pixel, depending on how many of the 16 sampling  points the triangle covers, a corresponding fractional shade of that color is applied to  the pixel, resulting in faded edges in the image and significantly less noticeable pixelization.
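To illustrate the idea of sampling points, here’s a minimal sketch that counts how many of 16 points inside one pixel are covered by a triangle and blends the color accordingly. The regular 4 by 4 sample layout, the function names, and the background color are assumptions for the example; actual hardware may place its samples differently.

```python
def edge(a, b, p):
    """Signed area term used to test which side of edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(v0, v1, v2, p):
    """True if point p lies inside (or on the boundary of) the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return False
    return all(edge(a, b, p) / area >= 0
               for a, b in ((v1, v2), (v2, v0), (v0, v1)))

def coverage(v0, v1, v2, pixel_x, pixel_y, grid=4):
    """Fraction of the grid*grid sample points in one pixel that the triangle covers."""
    hits = 0
    for j in range(grid):
        for i in range(grid):
            sample = (pixel_x + (i + 0.5) / grid, pixel_y + (j + 0.5) / grid)
            if inside(v0, v1, v2, sample):
                hits += 1
    return hits / (grid * grid)

def blend(triangle_color, background, fraction):
    """Mix the triangle color over the background by the covered fraction."""
    return tuple(round(fraction * t + (1.0 - fraction) * b)
                 for t, b in zip(triangle_color, background))

triangle = ((0, 0), (4, 0), (0, 4))
for pixel in ((0, 0), (2, 1), (3, 3)):
    fraction = coverage(*triangle, pixel[0], pixel[1])
    print(pixel, fraction, blend((255, 0, 0), (0, 0, 0), fraction))
```

Pixels fully inside the triangle keep the full color, pixels fully outside keep the background, and pixels along the slanted edge land somewhere in between, which is what softens the jagged edges.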
One thing to remember is that when you’re playing a video game, your character’s camera view as well as the objects in the scene are continuously moving around. As a result, the process and calculations within vertex shading, rasterization, and fragment shading are recalculated for every single frame, once every 8.3 milliseconds for a game running at 120 frames a second.
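The numbers mentioned so far fit together with a bit of arithmetic. This short sketch just multiplies out the figures from the scene above; the variable names are made up for the example.

```python
width, height = 3840, 2160
pixels_per_frame = width * height                  # pixels in one 4K frame
frames_per_second = 120
frame_budget_ms = 1000 / frames_per_second         # time available per frame
triangles_per_frame = 3_500_000                    # triangles in the train station scene
triangles_per_second = triangles_per_frame * frames_per_second

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{frame_budget_ms:.1f} ms per frame")
print(f"{triangles_per_second:,} triangles per second")
```

So at 120 frames a second, the whole pipeline has roughly 8.3 milliseconds to push millions of triangles onto about 8.3 million pixels.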
Let’s move on to the next step, which is Fragment Shading. Now that we have a set of pixels corresponding to each triangle, it’s not enough to simply paint by number to color the pixels. Rather, to make the scene realistic, we have to account for the direction and strength of the light or illumination, the position of the camera, reflections, and shadows cast by other objects.
Fragment shading is therefore used to shade in each pixel  with accurate illumination to make the scene realistic. As a reminder, fragments are groups of  pixels formed from a single rasterized triangle. Let’s see the fragment shader in action.
This  train engine is mostly made of black metal, and if we apply the same color to each of its  pixel fragments, we get a horribly inaccurate train. But once we apply proper shading, such  as making the bottom darker and the top lighter, and by adding in specular highlights or shininess  where the light bounces off the surface, we get a realistic black metal train. Additionally, as the  sun moves in the sky, the shading on the train reflects the passage of time throughout the day,  and, if it’s night, the materials and colors of all the objects are darker and illuminated from  the light of the fire.
Even video games such as Super Mario 64 which is almost 30 years old have  some simple shading where the colors of surfaces are changed by the lighting and shadows in the  scene. So, let’s see how fragment shading works. The basic idea is that if a surface is pointing  directly at a light source such as the sun, it’s shaded brighter whereas if a  surface is facing perpendicular to, or away from the light, it’s shaded darker.
In order to calculate a triangle’s shading, there are two key details we need to know.  First, the direction of the light and second, the direction the triangle’s surface is facing.  Let’s continue to use the locomotive as an example and paint it bright red instead of black.
As you already know, this train is made of 762 thousand flat triangles, many of which face in different directions. The direction that an individual triangle is facing is called its surface normal, which is simply the direction perpendicular to the plane of the triangle, kind of like a flagpole sticking out of the ground. To calculate a triangle’s shading, we take the cosine of the angle or theta between the two directions.
The cosine theta value is 1 when the  surface is facing the light and when the surface is perpendicular to the light it’s 0. Next, we  multiply cosine theta by the intensity of the light and then by the color of the material to  get the properly shaded color of that triangle. This process adjusts the triangles’ RGB values  and as a result, we get a range of lightness to darkness of a surface depending on how its  individual triangles are facing the light.
However, if the surface is perpendicular or facing away, we don’t want a cosine theta value of 0 or a negative number because this would result in a pitch-black surface. Therefore, we set the minimum to 0 and add in an ambient light intensity times the surface color, and adjust this ambient light so that it’s higher in daytime scenes, and closer to 0 at night. Finally, when there are multiple light sources in a scene, we perform this calculation multiple times with different light directions and intensities, and then add the individual contributions together.
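The shading calculation described above can be sketched in a few lines of Python. Cosine theta is computed as the dot product of two unit-length directions; the function names, the 0 to 1 color range, the clamp at 1.0, and the light values are assumptions made for the example.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)

def surface_normal(p0, p1, p2):
    """Unit direction perpendicular to the plane of the triangle."""
    return normalize(cross(sub(p1, p0), sub(p2, p0)))

def shade(normal, material_color, lights, ambient):
    """Shade one triangle. lights is a list of (direction_to_light, intensity)."""
    total = ambient  # ambient term keeps unlit surfaces from going pitch black
    for direction_to_light, intensity in lights:
        # For unit vectors the dot product is cos(theta); clamp negatives to 0.
        cos_theta = max(0.0, dot(normal, normalize(direction_to_light)))
        total += cos_theta * intensity
    # Scale the material color by the summed light, capped at full brightness.
    return tuple(min(1.0, total * channel) for channel in material_color)

red = (1.0, 0.0, 0.0)
normal = surface_normal((0, 0, 0), (1, 0, 0), (0, 0, -1))  # faces straight up
sun_overhead = [((0, 1, 0), 0.8)]
sun_below = [((0, -1, 0), 0.8)]
print(shade(normal, red, sun_overhead, ambient=0.1))
print(shade(normal, red, sun_below, ambient=0.1))
```

The upward-facing triangle comes out bright red under the overhead light and falls back to the dim ambient red when the light is behind it.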
One key problem with this approach is that the triangles within an object each have only a single normal, and thus each triangle will share the same color throughout the triangle’s surface. This is called flat shading and is rather unrealistic when viewed on curved surfaces such as the body of this steam engine. So, in order to produce smooth shading, instead of using surface normals, we use one normal for each vertex calculated using the average of the normals of the adjacent triangles.
Next, we  use a method called barycentric coordinates to produce a smooth gradient of normals across the  surface of a triangle. Visually it’s like mixing 3 different colors across a triangle, but instead  we’re using the three vertex normal directions. For a given fragment we take the center of each  pixel and use the vertex normals and coordinates of the pre-rasterized triangle to calculate the  barycentric normal of that particular pixel.
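As a rough sketch of that per-pixel calculation, the barycentric weights of a pixel center can be found from the triangle’s edges and then used to mix the three vertex normals. The function names and the example triangle are assumptions for illustration, and the sketch assumes the triangle is not degenerate (its area is not zero).

```python
import math

def edge(a, b, p):
    """Signed area term for the edge a->b and the point p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def barycentric_weights(v0, v1, v2, p):
    """Weights (w0, w1, w2) that sum to 1 and say how close p is to each vertex."""
    area = edge(v0, v1, v2)
    return (edge(v1, v2, p) / area,
            edge(v2, v0, p) / area,
            edge(v0, v1, p) / area)

def interpolated_normal(v0, v1, v2, n0, n1, n2, p):
    """Proportional mix of the three vertex normals at point p, rescaled to unit length."""
    w0, w1, w2 = barycentric_weights(v0, v1, v2, p)
    mixed = tuple(w0 * a + w1 * b + w2 * c for a, b, c in zip(n0, n1, n2))
    return normalize(mixed)

# A triangle on the view screen with three different vertex normals.
v0, v1, v2 = (0, 0), (3, 0), (0, 3)
n0, n1, n2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(interpolated_normal(v0, v1, v2, n0, n1, n2, (1.0, 1.0)))  # middle of the triangle
print(interpolated_normal(v0, v1, v2, n0, n1, n2, (0.0, 0.0)))  # exactly at v0
```

The resulting normal can then go through the same cosine theta shading as before, only now once per pixel instead of once per triangle.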
Just like mixing the three colors across a triangle, this pixel’s normal will be a proportional mix of the three vertex normals of the triangle. As a result, when a set of triangles is used to form a curved surface, each pixel will be part of a gradient of normals resulting in a gradient of angles facing the light with pixel-by-pixel coloring and smooth shading across the surface.
We want to say that this has been one of the most enjoyable videos to make simply because we love playing video games and seeing the algorithm that makes these incredible graphics has been a joy.
For example, you might be wondering where ray tracing and DLSS or deep learning super sampling fit into this pipeline. Ray tracing is predominately used to create highly detailed scenes with accurate lighting and reflections typically found in TV and movies, and a single frame can take dozens of minutes or more to render. For video games, the primary visibility and shading of the objects are calculated using the graphics rendering pipeline we discussed, but in certain video games ray tracing is used to calculate shadows, reflections, and improved lighting.
On the other hand, DLSS is an algorithm for taking a low resolution frame and upscaling it to a 4K frame using a convolutional neural network. Therefore DLSS is executed after ray tracing and the graphics pipeline generate a low-resolution frame. One interesting note is that the latest generation of GPUs has 3 entirely separate architectures of computational resources or cores.
CUDA or Shading  cores execute the graphics rendering pipeline. Ray tracing cores are self-explanatory. And then  DLSS is run on the Tensor cores.
Therefore, when you’re playing a high-end video game with  Ray Tracing and DLSS, your GPU utilizes all of its computational resources at the same time,  allowing you to play 4K games and render frames in less than 10 milliseconds each. Whereas if you  were to solely rely on the CUDA or shading cores, then a single frame would take around  50 milliseconds. With that in mind, Ray Tracing and DLSS are entirely different topics  with their own equally complicated algorithms, and therefore we’re planning separate videos  that will explore each of these topics in detail.
Furthermore, when it comes to video game graphics,  there are advanced topics such as Shadows, Reflections, UVs, Normal Maps and more. Therefore,  we’re considering making an additional video on these advanced topics. If you’re interested  in such a video let us know in the comments.
