How do Graphics Cards Work? Exploring GPU Architecture

Branch Education
Video Transcript:
How many calculations do you think your  graphics card performs every second while running video games with incredibly  realistic graphics? Maybe 100 million? Well, 100 million calculations a second is what’s  required to run Mario 64 from 1996.
We need more power. Maybe 100 billion calculations a  second? Well, then you would have a computer that could run Minecraft back in 2011.
In order  to run the most realistic video games such as Cyberpunk 2077 you need a graphics card that can  perform around 36 trillion calculations a second. This is an unimaginably large number, so let’s  take a second to try to conceptualize it. Imagine doing a long multiplication problem once every  second.
Now let’s say everyone on the planet does a similar type of calculation but with different  numbers. To reach the equivalent computational power of this graphics card and its 36 trillion  calculations a second we would need about 4,400 Earths filled with people, all working together  and completing one calculation each every second. It’s rather mind boggling to think that a  device can manage all these calculations, so in this video we’ll see how graphics cards work  in two parts.
First, we’ll open up this graphics card and explore the different components inside,  as well as the physical design and architecture of the GPU or graphics processing unit. Second,  we’ll explore the computational architecture and see how GPUs process mountains of data, and why  they’re ideal for running video game graphics, Bitcoin mining, neural networks and AI. So, stick around and let’s jump right in.
This video is sponsored by  Micron which manufactures the graphics memory inside this graphics card. Before we dive into all the parts of the GPU, let’s first understand the differences between  GPUs and CPUs. Inside this graphics card, the Graphics Processing Unit or GPU has over  10,000 cores.
However, when we look at the CPU or Central Processing Unit that’s mounted to the motherboard, we find an integrated circuit or chip with only 24 cores. So, which one is more powerful? 10,000 is a lot more than 24, so you would think the GPU is more powerful; however, it’s more complicated than that.
A useful analogy is to think of a GPU as a massive cargo ship and a CPU as a jumbo jet airplane. The cargo capacity is the amount of calculations and data that can be processed, and the speed of the ship or airplane is the rate at which those calculations and data are processed. Essentially, it’s a trade-off between a massive number of calculations executed at a slower rate versus a few calculations performed at a much faster rate.
Another key difference is that airplanes are a lot more flexible since they can carry passengers,  packages, or containers and can take off and land at any one of tens of thousands of airports.  Likewise CPUs are flexible in that they can run a variety of programs and instructions. However,  giant cargo ships carry only containers with bulk contents inside and are limited to traveling  between ports.
Similarly, GPUs are a lot less flexible than CPUs and can only run simple instructions like basic arithmetic. Additionally, GPUs can’t run operating systems or interface with input devices or networks. This analogy isn’t perfect, but it helps to answer the question of “which is faster, a CPU or a GPU?” Essentially, if you want to perform a set of calculations across mountains of data, then a GPU will be faster at completing the task. However, if you have a lot less data that needs to be evaluated quickly, then a CPU will be faster.
Furthermore, if you need to run an operating system or support network connections and a wide range of different applications and hardware, then you’ll want a CPU. We’re planning a separate video on CPU architecture, so make sure to subscribe so you don’t miss it, but let’s now dive into this graphics card and see how it works. In the center of this graphics card is the printed circuit board or PCB, with all the various components mounted on it, and we’ll start by exploring the brains, which is the graphics processing unit or GPU. When we open it up, we find a large chip, or die, named GA102, built from 28.3 billion transistors.
The majority of the area of the chip is taken up by the processing cores, which have a hierarchical organization. Specifically, the chip is divided into 7 Graphics Processing Clusters or GPCs, and within each processing cluster are 12 streaming multiprocessors or SMs. Next, inside each of these streaming multiprocessors are 4 warps and 1 ray tracing core, and then, inside each warp are 32 CUDA or shading cores and 1 tensor core.
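To see how the totals in the next sentence fall out of that hierarchy, here is a small tally in CUDA-style host code; it is just a sketch that multiplies out the counts quoted above.

```cuda
#include <cstdio>

int main() {
    // Core hierarchy of the GA102 die, using the counts from the transcript.
    const int gpcs            = 7;   // graphics processing clusters
    const int sms_per_gpc     = 12;  // streaming multiprocessors per cluster
    const int warps_per_sm    = 4;   // partitions ("warps") per SM
    const int cuda_per_warp   = 32;  // CUDA / shading cores per partition
    const int tensor_per_warp = 1;   // tensor cores per partition
    const int rt_per_sm       = 1;   // ray tracing cores per SM

    const int sms          = gpcs * sms_per_gpc;                    // 84
    const int cuda_cores   = sms * warps_per_sm * cuda_per_warp;    // 10,752
    const int tensor_cores = sms * warps_per_sm * tensor_per_warp;  // 336
    const int rt_cores     = sms * rt_per_sm;                       // 84

    printf("SMs: %d, CUDA cores: %d, tensor cores: %d, RT cores: %d\n",
           sms, cuda_cores, tensor_cores, rt_cores);
    return 0;
}
```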
Across the entire GPU are 10,752 CUDA cores, 336 tensor cores, and 84 ray tracing cores. These three types of cores execute all the calculations of the GPU, and each has a different function. CUDA cores can be thought of as simple binary calculators with an addition button, a multiply button, and a few others, and are used the most when running video games.
Tensor cores are matrix multiplication and addition calculators and are used for geometric transformations and working with neural networks and AI. And ray tracing cores are the largest but the fewest and are used to execute ray tracing algorithms. Now that we understand the computational resources inside this chip, one rather interesting fact is that the 3080, 3090, 3080 Ti, and 3090 Ti graphics cards all use the same GA102 chip design for their GPU.
This might be counterintuitive  because they have different prices and were released in different years, but it’s true.  So, why is this? Well, during the manufacturing process sometimes patterning errors, dust  particles, or other manufacturing issues cause damage and create defective areas of the  circuit.
Instead of throwing out the entire chip because of a small defect, engineers find the  defective region and permanently isolate and deactivate the nearby circuitry. By having a GPU  with a highly repetitive design, a small defect in one core only damages that particular streaming  multiprocessor circuit and doesn’t affect the other areas of the chip. As a result, these chips  are tested and categorized or binned according to the number of defects.
The 3090 Ti graphics cards have flawless GA102 chips with all 10,752 CUDA cores working properly, the 3090 has 10,496 working cores, the 3080 Ti has 10,240, and the 3080 has 8,704 CUDA cores working, which is equivalent to having 16 damaged and deactivated streaming multiprocessors. Additionally, different graphics cards differ by their maximum clock speed and by the quantity and generation of graphics memory that supports the GPU, which we’ll explore in a little bit. Because we’ve been focusing on the physical architecture of this GA102 GPU chip, let’s zoom into one of these CUDA cores and see what it looks like.
Inside this simple calculator is a layout of approximately 410 thousand transistors. This  section of 50 thousand transistors performs the operation of A times B plus C which is called  fused multiply and add or FMA and is the most common operation performed by graphics cards.  Half of the CUDA cores execute FMA using 32-bit floating-point numbers, which is essentially  scientific notation, and the other half of the cores use either 32-bit integers or 32-bit  floating point numbers.
Other sections of this core accommodate negative numbers and perform other simple functions like bit-shifting and bit-masking, as well as collecting and queueing the incoming instructions and operands, and then accumulating and outputting the results. As a result, this single core is just a simple calculator with a limited number of functions. This calculator completes one multiply and one add operation each clock cycle, and therefore, with this 3090 graphics card and its 10,496 cores and 1.7 gigahertz clock, we get about 35.6 trillion calculations a second. However, if you’re wondering how the GPU handles more complicated operations like division, square root, and trigonometric functions, well, these operations are performed by the special function units, which are far fewer, as only 4 of them can be found in each streaming multiprocessor.
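As a minimal sketch of what that looks like in code (the kernel name and array layout here are illustrative, not from the video), each CUDA thread performs the fused multiply-add that a core executes in hardware, and the host code multiplies cores by clock speed by two operations per FMA to reproduce the throughput estimate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread performs the same fused multiply-add (FMA) that a CUDA core
// executes in hardware: d = a * b + c, in 32-bit floating point.
__global__ void fma_kernel(const float* a, const float* b, const float* c,
                           float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d[i] = fmaf(a[i], b[i], c[i]);
    }
}

int main() {
    // Throughput estimate using the transcript's figures: one FMA counts as
    // two operations (a multiply and an add) per core per clock cycle.
    double cores = 10496, clock_hz = 1.7e9, ops_per_fma = 2.0;
    printf("~%.1f trillion operations per second\n",
           cores * clock_hz * ops_per_fma / 1e12);
    // Prints ~35.7 trillion, which lines up with the ~35.6 trillion quoted
    // above once the clock rounding is accounted for.
    return 0;
}
```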
Now that we have an understanding of what’s inside a single core, let’s zoom out  and take a look at the other sections of the GA102 chip. Around the edge we find 12 graphics memory  controllers, the NVLink Controllers and the PCIe interface. On the bottom is a 6-megabyte Level  2 SRAM Memory Cache, and here’s the Gigathread Engine which manages all the graphics processing  clusters and streaming multiprocessors inside.
Now that we’ve explored this GA102 GPU’s physical architecture, let’s zoom out and take a look at the other parts inside the graphics card. On this side are the various ports for the displays to be plugged into, on the other side is the incoming 12-volt power connector, and then here are the PCIe pins that plug into the motherboard. On the PCB, the majority of the smaller components constitute the voltage regulator module, which takes the incoming 12 volts, converts it to 1.1 volts, and supplies hundreds of watts of power to the GPU.
Because all this power heats up the GPU, most of the weight  of the graphics card is in the form of a heat sink with 4 heat pipes that carry heat from the  GPU and memory chips to the radiator fins where fans then help to remove the heat. Perhaps some of  the most important components, aside from the GPU, are the 24 gigabytes of graphics memory chips  which are technically called GDDR6X SDRAM and were manufactured by Micron which is the sponsor  of this video. Whenever you start up a video game or wait for a loading screen, the time it takes  to load is mostly spent moving all the 3D models of a particular scene or environment from the  solid-state drive into these graphics memory chips.
As mentioned earlier, the GPU has a small  amount of data storage in its 6-megabyte shared Level 2 cache which can hold the equivalent of  about this much of the video game’s environment. Therefore in order to render a video game,  different chunks of scene are continuously being transferred between the graphics memory and the  GPU. Because the cores are constantly performing tens of trillions of calculations a second,  GPUs are data hungry machines and need to be continuously fed terabytes upon terabytes of data,  and thus these graphics memory chips are designed kind of like multiple cranes loading a cargo ship  at the same time.
Specifically, these 24 chips transfer a combined 384 bits at a time, which is called the bus width, and the total data that can be transferred per second, or the bandwidth, is about 1.15 terabytes a second. In contrast, the sticks of DRAM that support the CPU only have a 64-bit bus width and a maximum bandwidth closer to 64 gigabytes a second.
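As a back-of-envelope sketch (a consistency check on the two numbers above, not a quoted specification), bandwidth is roughly the bus width times the per-pin signalling rate divided by 8 bits per byte:

```cuda
#include <cstdio>

int main() {
    // Back-of-envelope check: bandwidth = bus width * per-pin data rate / 8.
    const double bus_width_bits  = 384.0;    // combined width of the 24 chips
    const double bandwidth_bytes = 1.15e12;  // ~1.15 TB/s, from the transcript

    // Implied per-pin signalling rate (an inference from the two figures
    // above, not a quoted spec).
    double gbit_per_pin = bandwidth_bytes * 8.0 / bus_width_bits / 1e9;
    printf("implied per-pin rate: ~%.0f Gbit/s\n", gbit_per_pin);  // ~24 Gbit/s
    return 0;
}
```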
One rather interesting thing is that you may think computers only work using binary ones and zeros. However, in order to increase data transfer rates, GDDR6X and the latest graphics memory, GDDR7, send and receive data across the bus wires using multiple voltage levels beyond just 0 and 1. For example, GDDR7 combines binary bits into ternary digits, or PAM-3 symbols, sent using voltages of 0, 1, and negative 1. Here’s the encoding scheme for how 3 binary bits are encoded into 2 ternary digits, and this scheme is combined with an 11-bit to 7-ternary-digit encoding scheme, resulting in sending 276 binary bits using only 176 ternary digits. The previous generation, GDDR6X, which is the memory in this 3090 graphics card, used a different encoding scheme, called PAM-4, to send 2 bits of data using 4 different voltage levels; however, engineers and the graphics memory industry agreed to switch to PAM-3 for future generations of graphics chips in order to reduce encoder complexity, improve the signal-to-noise ratio, and improve power efficiency. Micron delivers consistent innovation to push the boundaries on how much data can be transferred every second and to design cutting-edge memory chips.
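A quick way to see why those packings work is to compare how many states each side can represent; the short check below also confirms that the 276-bit/176-digit figure is consistent with mixing the two schemes. The 4-and-24 split is inferred from the numbers above, not a quoted packing format.

```cuda
#include <cstdio>
#include <cmath>

int main() {
    // k ternary digits can represent 3^k states, so they can carry n binary
    // bits whenever 3^k >= 2^n.
    printf("3^2 = %.0f >= 2^3  = %.0f\n", pow(3.0, 2), pow(2.0, 3));   //    9 >=    8
    printf("3^7 = %.0f >= 2^11 = %.0f\n", pow(3.0, 7), pow(2.0, 11));  // 2187 >= 2048

    // The 276-bit / 176-symbol figure matches mixing the two schemes as
    // 4 blocks of (3 bits -> 2 symbols) plus 24 blocks of (11 bits -> 7
    // symbols) -- a consistency check, not a quoted packing format.
    int bits    = 4 * 3 + 24 * 11;  // 276
    int symbols = 4 * 2 + 24 * 7;   // 176
    printf("%d bits carried in %d ternary symbols\n", bits, symbols);
    return 0;
}
```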
Another advancement by Micron is the development of HBM, or high bandwidth memory, that surrounds AI chips. HBM is built from stacks of DRAM memory chips and uses TSVs, or through-silicon vias, to connect each stack into a single chip, essentially forming a cube of AI memory. For the latest generation of high bandwidth memory, which is HBM3E, a single cube can hold 24 to 36 gigabytes of memory, and the set of cubes around the AI chip yields 192 gigabytes of high-speed memory.
Next time you buy an AI accelerator system, make sure it uses Micron’s HBM3E, which uses 30% less power than competing products. However, unless you’re building an AI data center, you’re likely not in the market to buy one of these systems, which cost between 25 and 40 thousand dollars and are on backorder for a few years. If you’re curious about high bandwidth memory, or Micron’s next generation of graphics memory, take a look at one of these links in the description.
Alternatively, if designing the next  generation of memory chips interests you, Micron is always looking for talented scientists and  engineers to help innovate on cutting edge chips and you can find out more about working for Micron  using this link. Now that we’ve explored many of the physical components inside this graphics card  and GPU, let’s next explore the computational architecture and see how applications like video  game graphics and bitcoin mining run what’s called “embarrassingly” parallel operations. Although  it may sound like a silly name, embarrassingly parallel is actually a technical classification  of computer problems where little or no effort is needed to divide the problem into parallel tasks,  and video game rendering and bitcoin mining easily fall into this category.
Essentially, GPUs solve  embarrassingly parallel problems using a principle called SIMD, which stands for single instruction  multiple data where the same instructions or steps are repeated across thousands to millions of  different numbers. Let’s see an example of how SIMD or single instruction multiple data is used  to create this 3D video game environment. As you may know already, this cowboy hat on the table is  composed of approximately 28 thousand triangles built by connecting together around 14,000  vertices, each with X, Y, and Z coordinates.
These vertex coordinates are built using a coordinate system called model space, with the origin of 0,0,0 being at the center of the hat. To build a 3D world, we place hundreds of objects, each with its own model space, into the world environment, and, in order for the camera to be able to tell where each object is relative to other objects, we have to convert or transform all the vertices from each separate model space into the shared world coordinate system, or world space. So, as an example, how do we convert the 14 thousand vertices of the cowboy hat from model space into world space?
Well, we use a single instruction which adds the position of the origin of the hat in world space to the corresponding X, Y, and Z coordinates of a single vertex in model space. Next, we apply this same instruction to multiple data, which is all the remaining X, Y, and Z coordinates of the other thousands of vertices that are used to build the hat. Next, we do the same for the table and the rest of the hundreds of other objects in the scene, each time using the same instructions but with each object’s coordinates in world space and each object’s thousands of vertices in model space.
As a result, all the vertices and triangles of all the objects are converted to a common world space coordinate system, and the camera can now determine which objects are in front and which are behind. This example illustrates the power of SIMD or single instruction multiple data, and how a single instruction is applied to 5,629 different objects with a total of 8.3 million vertices within the scene, resulting in about 25 million addition calculations.
The key to  SIMD and embarrassingly parallel programs is that every one of these millions of calculations  has no dependency on any other calculation, and thus all these calculations can be distributed  to the thousands of cores of the GPU and completed in parallel with one another. It's important to  note that vertex transformation from model space to world space is just one of the first steps of  a rather complicated video game graphics rendering pipeline and we have a separate video that delves  deeper into each of these other steps. Also, we skipped over the transformations for the  rotation and scale of each object, but factoring in these values is a similar process that requires  additional SIMD calculations.
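As a minimal sketch of that model-space-to-world-space step, assuming a simple array-of-vertices layout (the kernel and variable names are illustrative, and rotation and scale are left out just as they are above), a CUDA kernel might look like this:

```cuda
#include <cuda_runtime.h>

struct Vec3 { float x, y, z; };

// Single instruction, multiple data: every thread runs the same addition,
// each on a different vertex. Rotation and scale are omitted here, just as
// they are set aside in the transcript.
__global__ void model_to_world(const Vec3* model_verts, Vec3* world_verts,
                               Vec3 object_origin_world, int vertex_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < vertex_count) {
        world_verts[i].x = model_verts[i].x + object_origin_world.x;
        world_verts[i].y = model_verts[i].y + object_origin_world.y;
        world_verts[i].z = model_verts[i].z + object_origin_world.z;
    }
}

// Example launch for the ~14,000-vertex hat: one thread per vertex. The same
// kernel is then reused for the table and every other object, each with its
// own origin in world space.
//   int threads = 256;
//   int blocks  = (14000 + threads - 1) / threads;
//   model_to_world<<<blocks, threads>>>(d_hat_model, d_hat_world, hat_origin, 14000);
```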
Now that we have a simple understanding of SIMD, let’s discuss how  this computational architecture matches up with the physical architecture. Essentially, each  instruction is completed by a thread and this thread is matched to a single CUDA core. Threads  are bundled into groups of 32 called warps, and the same sequence of instructions is issued to  all the threads in a warp.
Next, warps are grouped into thread blocks, which are handled by the streaming multiprocessor. And then finally, thread blocks are grouped into grids, which are computed across the overall GPU. All these computations are managed, or scheduled, by the Gigathread Engine, which efficiently maps thread blocks to the available streaming multiprocessors.
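A hedged sketch of how that hierarchy shows up in code: the host queries the physical side (SM count, warp size) and then chooses a grid of thread blocks for a launch. The specific block and grid sizes below are arbitrary examples, not values from the video.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    // Physical side of the hierarchy, as reported by the CUDA driver.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, warp size: %d, max threads per block: %d\n",
           prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerBlock);

    // Computational side: threads -> warps -> thread blocks -> grid.
    dim3 block(256);          // 256 threads per block = 8 warps of 32 threads
    dim3 grid(84);            // 84 blocks in the grid (an arbitrary example)
    noop<<<grid, block>>>();  // the Gigathread Engine maps blocks onto free SMs
    cudaDeviceSynchronize();
    return 0;
}
```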
One important distinction is that within a SIMD architecture, all 32 threads in a warp follow the same instructions and are in lockstep with each other, kind of like a phalanx of soldiers moving together. This lockstep execution applied to GPUs up until around 2016. However, newer GPUs follow a SIMT architecture, or single instruction multiple threads.
The difference between SIMD and SIMT is  that while both send the same set of instructions to each thread, with SIMT, the individual  threads don’t need to be in lockstep with each other and can progress at different rates.  In technical jargon, each thread is given its own program counter. Additionally, with SIMT all the  threads within a streaming multiprocessor use a shared 128 kilobyte L1 cache and thus data that’s  output by one thread can be subsequently used by a separate thread.
This improvement from SIMD to SIMT allows for more flexibility when encountering warp divergence caused by data-dependent conditional branching, and for easier reconvergence of the threads at a barrier synchronization. Essentially, newer GPU architectures are more flexible and efficient, especially when encountering branches in code. One additional note is that although you may think the term warp is derived from warp drives, it actually comes from weaving, and specifically the Jacquard loom.
This loom from  1804 used programmable punch cards to select specific threads out of a set to weave together  intricate patterns. As fascinating as looms are, let’s move on. The final topics we’ll explore are  bitcoin mining, tensor cores and neural networks.
But first we’d like to ask you to ‘like’ this video, write a quick comment below, share it with a colleague, friend or on social media, and subscribe if you haven’t already. The dream of Branch Education is to make free, accessible, visually engaging educational videos that dive deeply into a variety of topics on science, engineering, and how technology works, and then to combine multiple videos into an entirely free engineering curriculum for high school and college students. Taking a few seconds to like, subscribe, and comment below helps us a ton!
Additionally, we have a Patreon page  with AMAs and behind the scenes footage, and, if you find what we do useful, we would appreciate  any support. Thank you. So now that we’ve explored how single instruction multiple threads is used  in video games, let’s briefly discuss why GPUs were initially used for mining bitcoin.
We’re not  going to get too far into the algorithm behind the blockchain and will save it for a separate  episode, but essentially, to create a block on the blockchain, the SHA-256 hashing algorithm is  run on a set of data that includes transactions, a time stamp, additional data, and a random number  called a nonce. After feeding these values through the SHA-256 hashing algorithm a random 256-bit  value is output. You can kind of think of this algorithm as a lottery ticket generator where you  can’t pick the lottery number, but based on the input data, the SHA-256 algorithm generates  a random lottery ticket number.
Therefore, if you change the nonce value and keep the rest of the transaction data the same, you’ll generate a new random lottery ticket number. The winner of this bitcoin mining lottery is the first randomly generated lottery number to have the first 80 bits all zeroes, while the remaining 176 bits don’t matter. Once a winning bitcoin lottery ticket is found, the reward is 3 bitcoin, and the lottery resets with a new set of transactions and input values. So, why were graphics cards used?
Well, GPUs ran thousands of iterations of the SHA-256 algorithm with the same transactions, timestamp, and other data, but with different nonce values. As a result, a graphics card like this one could generate around 95 million SHA-256 hashes, or 95 million randomly numbered lottery tickets, every second, and hopefully one of those lottery numbers would have the first 80 bits all zeros. However, nowadays computers filled with ASICs, or application-specific integrated circuits, perform 250 trillion hashes a second, or the equivalent of 2,600 graphics cards, thereby making a graphics card look like a spoon next to the excavator that is an ASIC mining computer when mining bitcoin.
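As a sketch of why this search is embarrassingly parallel, here is a simplified CUDA mining kernel: each thread tries a different nonce and checks the leading zero bits. The hash function below is only a stand-in so the example is self-contained; a real miner would run the full SHA-256 algorithm, which is omitted here.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Stand-in for SHA-256 so this sketch compiles on its own -- it is NOT the
// real algorithm. A real miner runs the full SHA-256 compression function.
__device__ void fake_sha256(const uint8_t* header80, uint32_t nonce,
                            uint8_t out[32]) {
    uint32_t h = nonce * 2654435761u;
    for (int i = 0; i < 80; ++i) h = (h ^ header80[i]) * 16777619u;
    for (int i = 0; i < 32; ++i) out[i] = (uint8_t)(h >> ((i % 4) * 8));
}

// Does the 256-bit hash start with `zero_bits` zero bits?
__device__ bool has_leading_zero_bits(const uint8_t hash[32], int zero_bits) {
    for (int i = 0; i < zero_bits; ++i) {
        if (hash[i / 8] & (0x80u >> (i % 8))) return false;
    }
    return true;
}

// Embarrassingly parallel: every thread hashes the same block data with a
// different nonce -- a different "lottery ticket" per thread.
__global__ void mine(const uint8_t* header80, uint32_t nonce_base,
                     int target_zero_bits, unsigned int* winning_nonce) {
    uint32_t nonce = nonce_base + blockIdx.x * blockDim.x + threadIdx.x;
    uint8_t hash[32];
    fake_sha256(header80, nonce, hash);
    if (has_leading_zero_bits(hash, target_zero_bits)) {
        atomicMin(winning_nonce, nonce);  // keep the lowest winning nonce found
    }
}
```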
Let’s next discuss the design of the tensor cores.  It’ll take multiple full-length videos to cover generative AI, and neural networks, so we’ll focus  on the exact matrix math that tensor cores solve. Essentially, tensor cores take three matrices and  multiply the first two, add in the third and then output the result.
Let’s look at one value of the  output. This value is equal to the sum of values of the first row of the first matrix multiplied  by the values from the first column of the second matrix, and then the corresponding  value of the third matrix is added in. Because all the values of the 3 input  matrices are ready at the same time, the tensor cores complete all of the matrix  multiplication and addition calculations concurrently.
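To make that arithmetic concrete, here is a plain scalar version of the matrix multiply-and-add, D = A × B + C, for small 4×4 matrices; it shows which numbers get combined for each output value, whereas the tensor core hardware performs all of these multiply-adds at once.

```cuda
#include <cstdio>

const int N = 4;

// Scalar version of the tensor-core operation D = A * B + C.
void matrix_fma(const float A[N][N], const float B[N][N],
                const float C[N][N], float D[N][N]) {
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            float acc = C[row][col];           // start from the third matrix
            for (int k = 0; k < N; ++k) {
                acc += A[row][k] * B[k][col];  // row of A times column of B
            }
            D[row][col] = acc;
        }
    }
}

int main() {
    float A[N][N] = {}, B[N][N] = {}, C[N][N] = {}, D[N][N];
    A[0][0] = 2.0f; B[0][0] = 3.0f; C[0][0] = 1.0f;  // D[0][0] = 2*3 + 1 = 7
    matrix_fma(A, B, C, D);
    printf("D[0][0] = %.1f\n", D[0][0]);
    return 0;
}
```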
Neural networks and generative AI require trillions to quadrillions of matrix multiplication and addition operations and typically use much larger matrices. Finally, there are the ray tracing cores, which we explored in a separate video that’s already been released. That’s pretty much it for graphics cards.
We’re thankful to all our Patreon and YouTube Membership Sponsors for supporting our videos.  If you want to financially support our work, you can find the links in the description below. This is Branch Education, and we create 3D animations that dive deeply into the technology  that drives our modern world.
Watch another Branch video by clicking one of these cards or click  here to subscribe. Thanks for watching to the end!