Have you ever wondered what’s happening inside your computer when you load a program or video game? Well, millions of operations are happening, but perhaps the most common is simply just copying data from a solid-state drive or SSD into dynamic random-access memory or DRAM. An SSD stores all the programs and data for long-term storage, but when your computer wants to use that data, it has to first move the appropriate files into DRAM, which takes time, hence the loading bar.
Because your CPU works only with data after it’s been moved to DRAM, DRAM is also called working memory or main memory. Your desktop uses both SSDs and DRAM because solid-state drives permanently store data in massive 3D arrays composed of a trillion or so memory cells, yielding terabytes of storage, whereas DRAM temporarily stores data in 2D arrays composed of billions of tiny capacitor memory cells, yielding gigabytes of working memory. Accessing any section of cells in the massive SSD array and reading or writing data takes about 50 microseconds, whereas reading or writing from any DRAM capacitor memory cell takes about 17 nanoseconds, which is about 3000 times faster.
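As a quick check of that ratio, here is a minimal sketch that divides the two latencies quoted above. The constant names are made up for this illustration, and both latencies are the rounded figures from the narration, not measurements.

```python
# Approximate latencies quoted in the narration (rounded figures, not measurements).
SSD_ACCESS_SECONDS = 50e-6    # about 50 microseconds
DRAM_ACCESS_SECONDS = 17e-9   # about 17 nanoseconds

# How many DRAM accesses fit into the time of one SSD access.
speedup = SSD_ACCESS_SECONDS / DRAM_ACCESS_SECONDS
print(f"DRAM is roughly {speedup:.0f} times faster than the SSD")
```

The result lands a little under 3000, which is why the narration rounds it to 3000.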
For comparison, a supersonic jet going at Mach 3 is around 3000 times faster than a moving tortoise. So, the speed of 17 nanosecond DRAM versus 50 microsecond SSD is like comparing a supersonic jet to a tortoise. However, speed is just one factor.
DRAM is limited to a 2D array and temporarily stores one bit per memory cell. For example, this stick of DRAM with 8 chips holds 16 gigabytes of data, whereas a solid-state drive of a smaller size can hold 2 terabytes of data, more than 100 times that of DRAM. Additionally, DRAM requires power to continuously store and refresh the data held in its capacitors.
Therefore, computers use both SSDs and DRAM and, by spending a few seconds of loading time to copy data from the SSD to the DRAM, and then prefetching, which is the process of moving data before it’s needed, your computer can store terabytes of data on the SSD and then access the data from programs that were preemptively copied into the DRAM in a few nanoseconds. For example, many video games have a loading time to start up the game itself, and then a separate loading time to load a save file. During the process of loading a save file, all the 3D models, textures, and the environment of your game state are moved from the SSD into DRAM so any of it can be accessed in a few nanoseconds, which is why video games have DRAM capacity requirements.
Just imagine, without DRAM, playing a game would be 3,000 times slower. We covered solid-state drives in other videos, so in this video, we’re going to take a deep dive into this 16-gigabyte stick of DRAM. First, we’ll see exactly how the CPU communicates and moves data from an SSD to DRAM.
Then we’ll open up a DRAM microchip and see how billions of memory cells are organized into banks and how data is written to and read from groups of memory cells. In the process, we’ll dive into the nanoscopic structures inside individual memory cells and see how each capacitor physically stores 1 bit of data. Finally, we’ll explore some breakthroughs and optimizations such as the burst buffer and folded DRAM layouts that enable DRAM to move data around at incredible speeds.
A few quick notes. First, you can find similar DRAM chips inside GPUs, smartphones, and many other devices, but with different optimizations. As examples, GPU DRAM or VRAM, located all around the GPU chip, has a larger bandwidth and can read and write simultaneously, but operates at a lower frequency, and DRAM in your smartphone is stacked on top of the CPU and is optimized for smaller packaging and lower power consumption.
Second, this video is sponsored by Crucial. Although they gave me this stick of DRAM to model and use in the video, the content was independently researched and not influenced by them. Third, there are faster memory structures in your CPU called cache memory and even faster registers.
All these types of memory create a memory hierarchy, with the main trade-off being speed versus capacity while keeping prices affordable to consumers and optimizing the size of each microchip for manufacturing. Fourth, you can see how much of your DRAM is being utilized by each program by opening your computer’s resource monitor and clicking on memory. Fifth, there are different generations of DRAM, and we’ll explore DDR5.
Many of the key concepts that we explain apply to prior generations, although the numbers may be different. Sixth, 17 nanoseconds is incredibly fast! Electricity travels at around 1 foot per nanosecond, and 17 nanoseconds is about the time it takes for light to travel across a room.
Finally, this video is rather long as it covers a lot of what there is to know about DRAM. We recommend watching it first at 1.25 times speed, and then a second time at 1.5 times speed to fully comprehend this complex technology. Stick around because this is going to be an incredibly detailed video.
To start, a stick of DRAM is also called a Dual Inline Memory Module or DIMM and there are 8 DRAM chips on this particular DIMM. On the motherboard, there are 4 DRAM slots, and when plugged in, the DRAM is directly connected to the CPU via 2 memory channels that run through the motherboard. Note that the left two DRAM slots share these memory channels, and the right two share a separate channel.
Let’s move to look inside the CPU at the processor. Along with numerous cores and many other elements, we find the memory controller which manages and communicates with the DRAM. There’s also a separate section for communicating with SSDs plugged into the M.2 slots and with SSDs and hard drives plugged into SATA connectors.
Using these sections, along with data mapping tables, the CPU manages the flow of data from the SSD to DRAM, as well as from DRAM to cache memory for processing by the cores. Let’s move back to see the memory channels. For DDR5 each memory channel is divided into two parts, Channel A and Channel B.
These two memory channels A and B independently transfer 32 bits at a time using 32 data wires. Using 21 additional wires each memory channel carries an address specifying where to read or write data and, using 7 control signal wires, commands are relayed. The addresses and commands are sent to and shared by all 4 chips on the memory channel which work in parallel.
However, the 32-bit data lines are divided among the chips and thus each chip only reads or writes 8 bits at a time. Additionally, power for DRAM is supplied by the motherboard and managed by these chips on the stick itself. Next, let’s open and look inside one of these DRAM microchips.
Inside the exterior packaging, we find an interconnection matrix that connects the ball grid array at the bottom with the die which is the main part of this microchip. This 2 gigabyte DRAM die is organized into 8 bank groups composed of 4 banks each, totaling 32 banks. Within each bank is a massive array, 65,536 memory cells tall by 8192 cells across, essentially rows and columns in a grid, with tens of thousands of wires, and supporting circuitry running outside each bank.
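To see how those bank dimensions add up to a 2 gigabyte die, here is a small sketch of the arithmetic. The constant names are invented for this example, and it assumes one bit per memory cell and binary gigabytes (2 to the power of 30 bytes).

```python
# Die organization as described in the narration.
BANK_GROUPS = 8
BANKS_PER_GROUP = 4
ROWS_PER_BANK = 65_536
COLUMNS_PER_BANK = 8_192

banks = BANK_GROUPS * BANKS_PER_GROUP

# Each memory cell stores one bit, so the cell count is also the bit count.
cells = banks * ROWS_PER_BANK * COLUMNS_PER_BANK

# Convert bits to bytes, then to binary gigabytes (assumed unit).
gigabytes = cells / 8 / 2**30

print(banks, cells, gigabytes)
```

This works out to 32 banks and a little over 17 billion memory cells on the die.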
Instead of looking at this die, we’re going to transition to a functional diagram, and then reorganize the banks and bank groups. In order to access 17 billion memory cells, we need a 31-bit address. 3 bits are used to select the appropriate bank group, then 2 bits to select the bank.
Next 16 bits of the address are used to determine the exact row out of 65 thousand. Because this chip reads or writes 8 bits at a time, the 8192 columns are grouped by 8 memory cells, all read or written at a time, or ‘by 8’, and thus only 10 bits are needed for the column address. One optimization is that this 31-bit address is separated into two parts and sent using only 21 wires.
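The address breakdown can be sketched in a few lines of Python. This is only an illustration: the field order inside the 31-bit number (bank group in the highest bits, column in the lowest) and the names `DramAddress` and `split_address` are assumptions made for this example, not something the chip's documentation is quoted on here.

```python
from typing import NamedTuple

BANK_GROUP_BITS = 3
BANK_BITS = 2
ROW_BITS = 16
COLUMN_BITS = 10
ADDRESS_BITS = BANK_GROUP_BITS + BANK_BITS + ROW_BITS + COLUMN_BITS  # 31


class DramAddress(NamedTuple):
    bank_group: int
    bank: int
    row: int
    column: int


def split_address(address: int) -> DramAddress:
    """Split a 31-bit address into its four fields (assumed bit order)."""
    if not 0 <= address < 2**ADDRESS_BITS:
        raise ValueError("address must fit in 31 bits")
    # Peel fields off from the lowest bits upward.
    column = address & (2**COLUMN_BITS - 1)
    address >>= COLUMN_BITS
    row = address & (2**ROW_BITS - 1)
    address >>= ROW_BITS
    bank = address & (2**BANK_BITS - 1)
    address >>= BANK_BITS
    bank_group = address
    return DramAddress(bank_group, bank, row, column)


# Arbitrary example values packed into one 31-bit number.
example = (5 << 28) | (2 << 26) | (27_524 << 10) | 598
print(split_address(example))

# The first part of the address (bank group, bank, row) needs this many wires.
first_part_bits = BANK_GROUP_BITS + BANK_BITS + ROW_BITS
print(first_part_bits)
```

Note that the bank group, bank, and row fields together take 21 bits, which lines up with the 21 address wires, and the 10-bit column part easily fits on the same wires afterward.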
First, the bank group, bank, and row address are sent, and then after that the column address. Next, we’ll look inside these physical memory cells, but first, let’s briefly talk about how these structures are manufactured as well as this video’s sponsor. This incredibly complicated die, also called an integrated circuit, is manufactured on 300-millimeter silicon wafers, 2500ish dies at a time.
On each die are billions of nanoscopic memory cells that are fabricated using dozens of tools and hundreds of steps in a semiconductor fabrication plant or fab. This one was made by Micron, which manufactures around a quarter of the world’s DRAM, including both Nvidia’s and AMD’s VRAM in their GPUs. Micron also has its own product line of DRAM and SSDs under the brand Crucial which, as mentioned earlier, is the sponsor of this video. In addition to DRAM, Micron is one of the world’s leading suppliers of solid-state drives such as this Crucial P5+ M.2 NVMe SSD.
By installing your operating system and video games on a Crucial NVMe solid-state drive, you’ll be sure to have incredibly fast loading times and smooth gameplay, and if you do video editing, make sure all those files are on a fast SSD like this one as well. This is because loading speed is predominantly limited by the speed of the SSD or hard drive where the files are stored. For example, this hard drive can only transfer data at around 150 megabytes a second whereas this Crucial NVMe SSD can transfer data at a rate of up to 6,600 megabytes a second, which, for comparison, is the speed of a moving tortoise versus a galloping horse.
By using a Crucial NVMe SSD, loading a video game that requires gigabytes of DRAM is reduced from a minute or more down to a couple seconds. Check out the Crucial NVMe SSDs using the link in the description below. Let’s get back to the details of how DRAM works and zoom in to explore a single memory cell situated in a massive array.
This memory cell is called a 1T1C cell and is a few dozen nanometers in size. It has two parts, a capacitor to store one bit of data in the form of electrical charges or electrons and a transistor to access and read or write data. The capacitor is shaped like a deep trench dug into silicon and is composed of two conductive surfaces separated by a dielectric insulator or barrier just a few atoms thick, which stops the flow of electrons but allows electric fields to pass through.
If this capacitor is charged up with electrons to 1 volt, it’s a binary 1, and if no charges are present and it’s at 0 volts, it’s a binary 0, and thus this cell only holds one bit of data. Designs of capacitors are constantly evolving but in this trench capacitor, the depth of the silicon is utilized to allow for larger capacitive storage, while taking up as little area as possible. Next let’s look at the access transistor and add in two wires.
The wordline wire connects to the gate of the transistor while the bitline wire connects to the other side of the transistor’s channel. Applying a voltage to the wordline turns on the transistor, and, while it’s on, electrons can flow through the channel thus connecting the capacitor to the bitline. This allows us to access and charge up the capacitor to write a 1 or discharge the capacitor to write a 0.
Additionally, we can read the stored value in the capacitor by measuring the amount of charge. However, when the wordline is off, the transistor is turned off, and the capacitor is isolated from the bitline thus saving the data or charge that was previously written. Note that because this transistor is incredibly small, only a few dozen nanometers wide, electrons slowly leak across the channel, and thus over time the capacitor needs to be refreshed to recharge the leaked electrons.
We’ll cover exactly how refreshing memory cells works a little later. As mentioned earlier, this 1T1C memory cell is one of 17 billion inside this single die and is organized into massive arrays called banks. So, let’s build a small array for illustrative purposes.
In our array, each of the wordlines is connected in rows, and then the bitlines are connected in columns. Wordlines and bitlines are on different vertical layers so one can cross over the other, and they never touch. Let’s simplify the visual and use symbols for the capacitors and the transistors.
Just as before, the wordlines connect to each transistor’s control gate in rows, and then all the bitlines in columns connect to the channel opposite each capacitor. As a result, when a wordline is active, all the capacitors in only that row are connected to their corresponding bitlines, thereby activating all the memory cells in that row. At any given time only one wordline is active because, if more than one wordline were active, then multiple capacitors in a column would be connected to the bitline and the data storage functionalities of these capacitors would interfere with one another, making them useless.
As mentioned earlier, within a single bank there are 65,536 rows and 8,192 columns and the 31-bit address is used to activate a group of just 8 memory cells. The first 5 bits select the bank, and the next 16 bits are sent to a row decoder to activate a single row. For example, this binary number turns on the wordline row 27,524, thus turning on all transistors in that row and connecting the 8,192 capacitors to their bitlines, while at the same time the other 65 thousandish wordlines are all off.
Here’s the logic diagram for a simple decoder. The remaining 10 bits of the address are sent to the column multiplexer. This multiplexer takes in the 8192 bitlines on the top, and, depending on the 10-bit address, connects a specific group of 8 bitlines to the 8 input and output IO wires at the bottom.
For example, if the 10-bit address were this, then only the bitlines 4,784 through 4,791 would be connected to the IO wires, and the rest of the 8000ish bitlines would be connected to nothing. Here’s the logic diagram for a simple multiplexer. We now have the means of accessing any memory cell in this massive array; however, to understand the three basic operations, reading, writing, and refreshing, let’s add two elements to our layout: a sense amplifier at the bottom of each bitline, and a read and write driver outside of the column multiplexer.
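The row decoder and column multiplexer described above can be modeled in software as a rough sketch. The function names are hypothetical, and the column address value 598 is an assumption: it is simply the group number that maps to bitlines 4,784 through 4,791 if groups of 8 are numbered from 0.

```python
from typing import List


def row_decoder(row_address: int, rows: int = 65_536) -> List[bool]:
    """Return one on/off flag per wordline, with exactly one wordline on."""
    if not 0 <= row_address < rows:
        raise ValueError("row address out of range")
    return [index == row_address for index in range(rows)]


def column_multiplexer(bitlines: List[int], column_address: int,
                       width: int = 8) -> List[int]:
    """Connect one group of `width` adjacent bitlines to the IO wires."""
    start = column_address * width
    if not 0 <= start <= len(bitlines) - width:
        raise ValueError("column address out of range")
    return bitlines[start:start + width]


# Turn on wordline 27,524; every other wordline stays off.
wordlines = row_decoder(27_524)
print(sum(wordlines), wordlines[27_524])

# Label each bitline with its own index so the selection is easy to see.
bitline_labels = list(range(8_192))
print(column_multiplexer(bitline_labels, 598))
```

A real decoder and multiplexer are built from logic gates and pass transistors, so this only mirrors their behavior, not their structure.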
Let’s look at reading from a group of memory cells. First the read command and 31-bit address are sent from the CPU to the DRAM. The first 5 bits select a specific bank.
The next step is to turn off all the wordlines in that bank, thereby isolating all the capacitors, and then precharge all 8000ish bitlines to 0.5 volts. Next the 16-bit row address turns on a row, and all the capacitors in that row are connected to their bitlines.
If an individual capacitor holds a 1 and is charged to 1 volt, then some charge flows from the capacitor onto the 0.5-volt bitline, and the voltage on the bitline increases. The sense amplifier then detects this slight change or perturbation of voltage on the bitline, amplifies the change, and pushes the voltage on the bitline all the way up to 1 volt.
However, if a 0 is stored in the capacitor, charge flows from the bitline into the capacitor, and the 0.5-volt bitline decreases in voltage. The sense amplifier then sees this change, amplifies it and drives the bitline voltage down to 0 volts or ground.
The sense amplifier is necessary because the capacitor is so small, and the bitline is rather long, and thus the capacitor needs to have an additional component to sense and amplify whatever value is stored. Now, all 8000ish bitlines are driven to 1 volt or 0 volts corresponding to the stored charge in the capacitors of the activated row, and this row is now considered open. Next, the column select multiplexer uses the 10-bit column address to connect the corresponding 8 bitlines to the read driver which then sends these 8 values and voltages over the 8 data wires to the CPU.
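Here is a simplified sketch of that read sequence: charge sharing between a cell and its precharged bitline, followed by sense amplification and column selection. The capacitance values are hypothetical and chosen only so the bitline is ten times larger than the cell; the function names are also made up for this example.

```python
PRECHARGE_VOLTS = 0.5
CELL_CAPACITANCE = 1.0      # hypothetical, arbitrary units
BITLINE_CAPACITANCE = 10.0  # hypothetical, the long bitline dominates


def share_charge(cell_volts: float) -> float:
    """Bitline voltage after the cell is connected to the precharged bitline."""
    total_charge = (CELL_CAPACITANCE * cell_volts
                    + BITLINE_CAPACITANCE * PRECHARGE_VOLTS)
    return total_charge / (CELL_CAPACITANCE + BITLINE_CAPACITANCE)


def sense_amplify(bitline_volts: float) -> float:
    """Drive the bitline fully to 1 volt or 0 volts based on the small change."""
    return 1.0 if bitline_volts > PRECHARGE_VOLTS else 0.0


def read_row(row_cell_volts, column_address, width=8):
    """Open a row, then pick `width` bitlines with the column address."""
    bitlines = [sense_amplify(share_charge(volts)) for volts in row_cell_volts]
    start = column_address * width
    return [int(volts) for volts in bitlines[start:start + width]]


# A stored 1 nudges the bitline slightly up, a stored 0 slightly down.
print(share_charge(1.0), share_charge(0.0))

# A tiny 16-cell row holding alternating 1s and 0s; read the second group of 8.
tiny_row = [1.0, 0.0] * 8
print(read_row(tiny_row, column_address=1))
```

With these made-up capacitances the bitline only moves by about 0.045 volts in either direction, which illustrates why a sense amplifier is needed to turn that small change into a full 1 or 0.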
Writing data to these memory cells is similar to reading, however with a few key differences. First the write command, address, and 8 bits to be written are sent to the DRAM chip. Next, just like before the bank is selected, the capacitors are isolated, and the bitlines are precharged to 0.5 volts.
Then, using a 16-bit address, a single row is activated, the capacitors perturb the bitline, and the sense amplifiers sense this and drive the bitlines to a 1 or 0 thus opening the row. Next the column address goes to the multiplexer, but, this time, because a write command was sent, the multiplexer connects the specific 8 bitlines to the write driver which contains the 8 bits that the CPU had sent along the data wires and requested to write.
These write drivers are much stronger than the sense amplifier and thus they override whatever voltage was previously on the bitline, and drive each of the 8 bitlines to 1 volt for a 1 to be written, or 0 volts for a 0. This new bitline voltage overrides the previously stored charges or values in each of the 8 capacitors in the open row, thereby writing 8 bits of data to the memory cells corresponding to the 31-bit address. Three quick notes.
First, as a reminder, writing and reading happen concurrently across all 4 chips in the shared memory channel, using the same 31-bit address and command wires, but with different data wires for each chip. Second, with DDR5 for a binary 1 the voltage is actually 1.1 volts, for DDR4 it’s 1.2 volts, and prior generations had even higher voltages, with the bitline precharge voltages being half of these voltages.
However, for DDR5, when writing or refreshing, a higher voltage of around 1.4 volts is applied and stored in each capacitor for a binary 1 because charge leaks out over time.
However, for simplicity, we’re going to stick with 1 and 0. Third, the number of bank groups, banks, bitlines and wordlines varies widely between different generations and capacities but is always in powers of 2. Let’s move on and discuss the third operation which is refreshing the memory cells in a bank.
As mentioned earlier, the transistors used to isolate the capacitors are incredibly small, and thus charges leak across the channel. The refresh operation is rather simple and is a sequence of closing all the rows, precharging the bitlines to 0.5 volts, and opening a row.
To refresh, just as before, the capacitors perturb the bitlines and then the sense amplifiers drive the bitlines and capacitors of the open row fully up to 1 volt or down to 0 volts depending on the stored value of the capacitor, thereby refilling the leaked charge. This process of row closing, precharging, opening, and sense amplifying happens row after row, taking 50 nanoseconds for each row, until all 65 thousandish rows are refreshed taking a total of 3 milliseconds or so to complete. The refresh operation occurs once every 64 milliseconds for each bank, because that’s statistically below the worst-case time it takes for a memory cell to leak too much charge to make a stored 1 turn into a 0, thus resulting in a loss of data.
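The refresh numbers above can be checked with a short calculation. This sketch uses the rounded figures from the narration, and the variable names are invented for the example.

```python
ROWS_PER_BANK = 65_536
REFRESH_SECONDS_PER_ROW = 50e-9    # about 50 nanoseconds per row
REFRESH_INTERVAL_SECONDS = 64e-3   # every row is refreshed once per 64 ms

# Time to walk through every row of one bank once.
bank_refresh_seconds = ROWS_PER_BANK * REFRESH_SECONDS_PER_ROW

# How often each row gets refreshed per second.
refreshes_per_second = 1 / REFRESH_INTERVAL_SECONDS

# Share of each 64 ms window that a bank spends refreshing.
busy_fraction = bank_refresh_seconds / REFRESH_INTERVAL_SECONDS

print(f"{bank_refresh_seconds * 1e3:.2f} ms per full bank refresh")
print(f"{refreshes_per_second:.1f} refreshes per second")
print(f"{busy_fraction:.1%} of the time spent refreshing")
```

Under these figures a full pass takes a bit over 3 milliseconds, each cell is refreshed between 15 and 16 times a second, and a bank spends roughly 5 percent of its time refreshing.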
Let’s take a step back and consider the incredible amount of data that is moved through DRAM memory cells. These banks of memory cells handle up to 4,800 million requests to read and write data every second while refreshing every memory cell in each bank row by row around 16 times a second. That’s a staggering amount of data movement and illustrates the true strength of computers.
Yes, they do simple things like comparisons, arithmetic, and moving data around, but at a rate of billions of times a second. Now, you might wonder why computers need to do so much data movement. Well, take this video game for example.
You have obvious calculations like the movement of your character and the horse. But then there are individual grasses, trees, rocks, and animals whose positions and geometries are stored in DRAM. And then the environment such as the lighting and shadows change the colors and textures of the environment in order to create a realistic world.
Next, we’re going to explore breakthroughs and optimizations that allow DRAM to be incredibly fast. But, before we get into all those details, we would greatly appreciate it if you could take a second to hit that like button, subscribe if you haven’t already, and type up a quick comment below, as it helps get this video out to others. Also, we have a Patreon and would appreciate any support.
This is our longest and most detailed video by far, and we’re planning more videos that get into the inner details of how computers work. We can’t do it without your help, so thank you for watching and doing these three quick things. It helps a ton.
The first complex topic which we’ll explore is why there are 32 banks, as well as what the parameters on the packaging of DRAM are. After that, we’ll explore burst buffers, sub-arrays, and folded DRAM architecture and what’s inside the sense amplifier. Let’s take a look at the banks.
As mentioned earlier opening a single row within a bank requires all these steps and this process takes time. However, if a row were already open, we could read or write to any section of 8 memory cells using only the 10-bit column address and the column select multiplexer. When the CPU sends a read or write command to a row that’s already open, it’s called a row hit or page hit, and this can happen over and over.
With a row hit, we skip all the steps required to open a row, and just use the 10-bit column address to multiplex a different set of 8 columns or bitlines, connecting them to the read or write driver, thereby saving a considerable amount of time. A row miss is when the next address is for a different row, which requires the DRAM to close and isolate the currently open row, and then open the new row. On a package of DRAM there are typically 4 numbers specifying timing parameters regarding row hits, precharging, and row misses.
The first number refers to the time it takes between sending an address with a row open, thus a row hit, to receiving the data stored in those columns. The next number is the time it takes to open a row if all the lines are isolated and the bitlines are precharged. Then the next number is the time it takes to precharge the bitlines before opening a row, and the last number is the time it takes between a row activation and the following precharge.
Note that these numbers are measured in clock cycles. Row hits are also the reason why the address is sent in two sections, first the bank selection and row address called RAS and then the column address called CAS. If the first part, the bank selection and row address, matches a currently open row, then it’s a row hit, and all the DRAM needs is the column address and the new command, and then the multiplexer simply moves around the open row.
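To get a feel for why row hits matter, here is a toy model of a single bank that tracks which row is open and adds up clock cycles. The three timing values are hypothetical examples, not the numbers from any particular package, and the model ignores the fourth timing number as well as any overlap between commands.

```python
# Hypothetical timing values, in clock cycles.
COLUMN_READ_CYCLES = 40   # row already open (row hit)
ROW_OPEN_CYCLES = 39      # open a row when the bitlines are precharged
PRECHARGE_CYCLES = 39     # close the open row and precharge the bitlines


def access_cycles(rows_accessed):
    """Return (total cycles, row hits) for a sequence of row numbers."""
    open_row = None  # assume the bank starts closed and precharged
    total = 0
    hits = 0
    for row in rows_accessed:
        if row == open_row:
            # Row hit: only the column address is needed.
            hits += 1
            total += COLUMN_READ_CYCLES
        elif open_row is None:
            # Nothing open yet: open the row, then read.
            total += ROW_OPEN_CYCLES + COLUMN_READ_CYCLES
        else:
            # Row miss: close the old row, open the new one, then read.
            total += PRECHARGE_CYCLES + ROW_OPEN_CYCLES + COLUMN_READ_CYCLES
        open_row = row
    return total, hits


# Four reads from the same row versus four reads alternating between two rows.
print(access_cycles([7, 7, 7, 7]))
print(access_cycles([7, 9, 7, 9]))
```

In this simplified model, staying in one row costs less than half the cycles of bouncing between two rows, even though both sequences perform four reads.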
Because of the time saving in accessing an open row, the CPU memory controller, programs, and compilers are optimized for increasing the number of subsequent row hits. The opposite, called thrashing, is when a program jumps around from one row to a different row over and over, and is obviously incredibly inefficient both in terms of energy and time. Additionally, DDR5 DRAM has 32 banks for this reason.
Each bank’s rows, columns, sense amplifiers and row decoders operate independently of one another, and thus multiple rows from different banks can be open all at the same time, increasing the likelihood of a row hit, and reducing the average time it takes for the CPU to access data. Furthermore, by having multiple bank groups, the CPU can refresh one bank in each bank group at a time while using the other three, thus reducing the impact of refreshing. A question you may have had earlier is why are banks significantly taller than they are wide?
Well, by combining all the banks together one next to the other you can think of this chip as actually being 65 thousand rows tall by 262 thousand columns wide. And, by adding 31 equally spaced divisions between the columns, thus creating banks, we allow for much more flexibility and efficiency in reading, writing and refreshing. Also, note that on the DRAM packaging are its capacity in Gigabytes, the number of millions of data transfers per second, which is two times the clock frequency, and the peak data transfer rate in Megabytes per second.
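The relationship between those packaging numbers can be worked through as a small sketch. The 2,400 megahertz clock is a hypothetical example value, and the calculation assumes both 32-bit channels of the DIMM transfer data on every transfer.

```python
CLOCK_MHZ = 2_400              # hypothetical example clock frequency
DATA_WIRES_PER_CHANNEL = 32    # Channel A and Channel B each carry 32 bits
CHANNELS_PER_DIMM = 2

# Two data transfers happen per clock cycle.
million_transfers_per_second = 2 * CLOCK_MHZ

# Bits moved per transfer across both channels, converted to bytes.
bytes_per_transfer = DATA_WIRES_PER_CHANNEL * CHANNELS_PER_DIMM // 8

peak_megabytes_per_second = million_transfers_per_second * bytes_per_transfer

print(million_transfers_per_second, bytes_per_transfer, peak_megabytes_per_second)
```

This is a peak figure only; real transfer rates are lower because of refreshing, row misses, and other overhead.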
The next design optimization we’ll explore is the burst buffer and burst length. Let’s add a 128-bit read and write temporary storage location, called a burst buffer to our functional diagram. Instead of 8 wires coming out of the multiplexer, we’re going to have 128 wires that connect to these 128-bit buffer locations.
Next the 10-bit column address is broken into two parts, 6 bits are used for the multiplexer, and 4 bits are for the burst buffer. Let’s explore a reading command. With our burst buffer in place, 128 memory cells and bitlines are connected to the burst buffer using the 6 column bits, thereby temporarily loading, or caching 128 values into the burst buffer.
Using the 4 bits for the buffer, 8 quickly accessed data locations in the burst buffer are connected to the read drivers and the data is sent to the CPU. By cycling through these 4 bits, all 16 sets of 8 bits are read out, and thus the burst length is 16. After that a new set of 128 bitlines and values are connected and loaded into the burst buffer.
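The burst read can be sketched as follows. This is an illustration only: it assumes the upper 6 bits of the column address drive the multiplexer and the lower 4 bits pick a set inside the buffer, and the function names are hypothetical.

```python
from typing import List

BURST_LENGTH = 16
BITS_PER_SET = 8
BUFFER_BITS = BURST_LENGTH * BITS_PER_SET  # 128
BUFFER_SELECT_BITS = 4


def load_burst_buffer(open_row: List[int], column_address: int) -> List[int]:
    """Copy 128 adjacent bitline values into the burst buffer."""
    mux_select = column_address >> BUFFER_SELECT_BITS  # upper 6 bits (assumed)
    start = mux_select * BUFFER_BITS
    return open_row[start:start + BUFFER_BITS]


def read_set(burst_buffer: List[int], buffer_select: int) -> List[int]:
    """Connect one set of 8 buffered values to the read drivers."""
    start = buffer_select * BITS_PER_SET
    return burst_buffer[start:start + BITS_PER_SET]


def read_burst(open_row: List[int], column_address: int) -> List[List[int]]:
    """Cycle through all 16 values of the 4 buffer bits."""
    burst_buffer = load_burst_buffer(open_row, column_address)
    return [read_set(burst_buffer, select) for select in range(BURST_LENGTH)]


# Label each bitline of an open row with its own index.
open_row = list(range(8_192))
burst = read_burst(open_row, column_address=598)
print(len(burst), burst[0], burst[-1])
```

Real DDR5 chips define a specific order for the sets within a burst; this sketch simply reads them from first to last.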
There’s also a write burst buffer which operates in a similar way. The benefit of this design is that 16 sets of 8 bits per microchip, or 128 bits, totaling 1024 bits across the 8 chips, can be accessed and read or written extremely quickly, as long as the data is all next to one another, but at the same time we still have the granularity and ability to access any set of 8 bits if our data requests jump around. The next design optimization is that this bank of 65,536 rows by 8,192 columns is rather massive, and results in extremely long wordlines and bitlines, especially when compared to the size of each trench capacitor memory cell.
Therefore, the massive array is broken up into smaller blocks 1,024 by 1,024, with intermediate sense amplifiers below each subarray, and subdividing wordlines and using a hierarchical row decoding scheme. By subdividing the bitlines, the distance and amount of wire that each tiny capacitor is connected to as it perturbs the bitline to the sense amplifier is reduced, and thus the capacitor doesn’t have to be as big. By subdividing the wordlines the capacitive load from eight thousandish transistor gates and channels is decreased, and thus the time it takes to turn on all the access transistors in a row is decreased.
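A quick calculation shows how much shorter the wires become. This sketch assumes the bank divides evenly into 1,024 by 1,024 blocks, and the names are made up for this example.

```python
ROWS_PER_BANK = 65_536
COLUMNS_PER_BANK = 8_192
SUBARRAY_ROWS = 1_024
SUBARRAY_COLUMNS = 1_024

# Each bitline segment now spans 1,024 cells instead of 65,536.
subarrays_tall = ROWS_PER_BANK // SUBARRAY_ROWS

# Each wordline segment now drives 1,024 gates instead of 8,192.
subarrays_wide = COLUMNS_PER_BANK // SUBARRAY_COLUMNS

subarrays_per_bank = subarrays_tall * subarrays_wide

print(subarrays_tall, subarrays_wide, subarrays_per_bank)
```

So each bitline segment is 64 times shorter and each wordline segment 8 times shorter than a wire running the full height or width of the bank.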
The final topic we’re going to talk about is the most complicated. Remember how we had a sense amplifier connected to the bottom of each bitline? Well, this optimization has two bitlines per column going to each sense amplifier and alternating rows of memory cells connected to the left and right bitlines, thus doubling the number of bitlines.
When one row is active, half of the bitlines are active while the other half are passive and vice versa when the next row is active. Moving down to see inside the sense amplifier we find a cross-coupled inverter. How does this work?
Well, when the active bitline is a 1, the passive bitline will be driven by this cross-coupled inverter to the opposite value of 0, and when the active is a 0, the passive becomes a 1. Note that the inverted passive bitline isn’t connected to any memory cells, and thus it doesn’t mess up any stored data. The cross-coupled inverter makes it such that these two bitlines are always going to be opposite one another, and they’re called a differential pair.
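The behavior of the cross-coupled inverter can be mimicked with a very small function. This is a behavioral sketch, not a circuit simulation, and it assumes the passive bitline sits at the 0.5-volt precharge level before sensing; the function name is hypothetical.

```python
def cross_coupled_sense(active_volts: float, passive_volts: float = 0.5):
    """Drive the two bitlines of a differential pair to opposite values.

    Returns (active, passive) voltages after amplification. A tie is resolved
    toward 0 on the active bitline, which is an arbitrary choice in this sketch.
    """
    if active_volts > passive_volts:
        return 1.0, 0.0
    return 0.0, 1.0


# A stored 1 nudges the active bitline slightly above the passive one.
print(cross_coupled_sense(0.55))

# A stored 0 nudges it slightly below.
print(cross_coupled_sense(0.45))
```

Whichever bitline starts slightly higher wins, and the other is pushed to the opposite value.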
There are three benefits to this design. First, during the precharge step, we want to bring all the bitlines to 0.5 volts and, by having a differential pair of active and passive bitlines, the easiest solution is to disconnect the cross-coupled inverters and open a channel between the two using a transistor.
The charge easily flows from the 1 bitline to the 0, and they both average out and settle at 0.5 volts. The other two benefits are noise immunity, and a reduction in parasitic capacitance of the bitline.
These benefits are related to the fact that, by creating two oppositely charged electric wires with electric fields going from one to the other, we reduce the amount of electric fields emitted in stray directions and relatedly increase the ability of the sense amplifier to amplify one bitline to 1 volt and the other to 0 volts. One final note is that when discussing DRAM, one major topic is the timing of addresses, command signals and data, and the related acronyms DDR or double data rate, and SDRAM, or Synchronous DRAM. These topics were omitted from this video because it would have taken an additional 15 minutes to properly explore.
That’s pretty much it for the DRAM, and we are grateful you made it this far into the video. We believe the future will require a strong emphasis on engineering education and we’re thankful to all our Patreon and YouTube Membership Sponsors for supporting this dream. If you want to support us on YouTube Memberships, or Patreon, you can find the links in the description.
A huge thanks goes to Nathan, Peter, and Jacob, who are doctoral students at the Florida Institute for Cybersecurity Research, for helping to research and review this video’s content! They do foundational research on finding the weak points in device security and whether hardware is compromised. If you want to learn more about the FICS graduate program or their work, check out the website using the link in the description.
This is Branch Education, and we create 3D animations that dive deep into the technology that drives our modern world. Watch another Branch video by clicking one of these cards or click here to subscribe. Thanks for watching to the end!