How does Computer Memory Work? 💻🛠

Branch Education
Check out Crucial NVMe SSDs Here: http://crucial.com/ Have you ever wondered why it takes time for c...
Video Transcript:
Have you ever wondered what’s happening inside  your computer when you load a program or video game?  Well, millions of operations are happening,  but perhaps the most common is simply just copying data from a solid-state drive or SSD into dynamic  random-access memory or DRAM.   An SSD stores all the programs and data for long-term storage,  but when your computer wants to use that data, it has to first move the appropriate  files into DRAM, which takes time, hence the loading bar. 
Because your CPU works only with data after it’s been moved to DRAM, it’s also called working memory or main memory. The reason why your desktop uses both SSDs and DRAM is that solid-state drives permanently store data in massive 3D arrays composed of a trillion or so memory cells, yielding terabytes of storage, whereas DRAM temporarily stores data in 2D arrays composed of billions of tiny capacitor memory cells, yielding gigabytes of working memory. Accessing any section of cells in the massive SSD array and reading or writing data takes about 50 microseconds, whereas reading or writing from any DRAM capacitor memory cell takes about 17 nanoseconds, which is roughly 3,000 times faster.
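As a quick check of that ratio, here is a minimal sketch that uses only the two round figures from the narration. The variable names are illustrative and not from the video.

```python
# Approximate access times quoted in the narration.
ssd_access_s = 50e-6    # about 50 microseconds per SSD access
dram_access_s = 17e-9   # about 17 nanoseconds per DRAM access

# How many times faster a DRAM access is than an SSD access.
speedup = ssd_access_s / dram_access_s
print(f"DRAM is roughly {speedup:,.0f} times faster than the SSD")
```

The result lands a little under 3,000, which is why the narration rounds to that figure.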
For comparison, a supersonic jet going at Mach 3 is around 3000 times faster  than a moving tortoise.  So, the speed of 17 nanosecond DRAM versus 50 microsecond SSD is  like comparing a supersonic jet to a tortoise. However, speed is just one factor. 
DRAM is  limited to a 2D array and temporarily stores one bit per memory cell. For example, this stick  of DRAM with 8 chips holds 16 gigabytes of data, whereas a solid-state drive of a smaller  size can hold 2 terabytes of data, more than 100 times that of DRAM.  Additionally,  DRAM requires power to continuously store and refresh the data held in its capacitors.  
Therefore, computers use both SSDs and DRAM and, by spending a few seconds of loading time  to copy data from the SSD to the DRAM, and then prefetching, which is the process of  moving data before it’s needed, your computer can store terabytes of data on the SSD and then access  the data from programs that were preemptively copied into the DRAM in a few nanoseconds.  For example, many video games have a loading time to start up the game itself, and then a  separate loading time to load a save file. During the process of loading a save file, all  the 3D models, textures, and the environment of your game state are moved from the SSD into DRAM  so any of it can be accessed in a few nanoseconds, which is why video games have DRAM capacity  requirements. 
Just imagine, without DRAM, playing a game would be 3,000 times slower.   We covered solid-state drives in other videos, so in this video, we’re going to take a deep  dive into this 16-gigabyte stick of DRAM.  First, we’ll see exactly how the CPU communicates  and moves data from an SSD to DRAM. 
Then we’ll open up a DRAM microchip and see how  billions of memory cells are organized into banks and how data is written to and read from  groups of memory cells.  In the process, we’ll dive into the nanoscopic structures inside  individual memory cells and see how each capacitor physically stores 1 bit of data.   Finally, we’ll explore some breakthroughs and optimizations such as the burst buffer and folded  DRAM layouts that enable DRAM to move data around at incredible speeds.
A few quick notes.   First, you can find similar DRAM chips inside GPUs, Smartphones, and many other devices, but  with different optimizations.  As examples, GPU DRAM or VRAM, located all around the  GPU chip, has a larger bandwidth and can read and write simultaneously, but operates at  a lower frequency, and DRAM in your smartphone is stacked on top of the CPU and is optimized for  smaller packaging and lower power consumption.
Second, this video is sponsored by  Crucial.  Although they gave me this stick of DRAM to model and use in the  video, the content was independently researched and not influenced by them.   Third, there are faster memory structures in your CPU called cache memory and even faster  registers. 
All these types of memory create a memory hierarchy, with the main trade-off  being speed versus capacity while keeping prices affordable to consumers and optimizing  the size of each microchip for manufacturing. Fourth, you can see how much of  your DRAM is being utilized by each program by opening your computer’s  resource monitor and clicking on memory. Fifth, there are different generations of DRAM,  and we’ll explore DDR5. 
Many of the key concepts that we explain apply to prior generations,  although the numbers may be different. Sixth, 17 nanoseconds is incredibly fast!   Electricity travels at around 1 foot per nanosecond, and 17 nanoseconds is about the  time it takes for light to travel across a room.
Finally, this video is rather long as it covers  a lot of what there is to know around DRAM.  We recommend watching it first at one point two  five times speed, and then a second time at one and a half speed to fully comprehend this  complex technology.  Stick around because this is going to be an incredibly detailed video.  
To start, a stick of DRAM is also called a Dual Inline Memory Module or DIMM and there are 8  DRAM chips on this particular DIMM.  On the motherboard, there are 4 DRAM slots, and when  plugged in, the DRAM is directly connected to the CPU via 2 memory channels that run through  the motherboard.  Note that the left two DRAM slots share these memory channels, and the right  two share a separate channel. 
Let’s move on to look inside the CPU at the processor. Along with numerous cores and many other elements, we find the memory controller, which manages and communicates with the DRAM. There’s also a separate section for communicating with SSDs plugged into the M.2 slots and with SSDs and hard drives plugged into SATA connectors.
Using  these sections, along with data mapping tables, the CPU manages the flow of data from  the SSD to DRAM, as well as from DRAM to cache memory for processing by the cores. Let’s move back to see the memory channels. For DDR5 each memory channel is divided into two  parts, Channel A and Channel B.
These two memory channels A and B independently transfer 32 bits at  a time using 32 data wires.   Using 21 additional wires each memory channel carries an address  specifying where to read or write data and, using 7 control signal wires, commands are relayed. The addresses and commands are sent to and shared by all 4 chips on the memory channel which  work in parallel. 
However, the 32-bit data lines are divided among the chips and thus each  chip only reads or writes 8 bits at a time. Additionally, power for DRAM is  supplied by the motherboard and managed by these chips on the stick itself. Next, let’s open and look inside one of these DRAM microchips. 
Inside the exterior packaging,  we find an interconnection matrix that connects the ball grid array at the bottom with the die  which is the main part of this microchip.  This 2 gigabyte DRAM die is organized into 8 bank groups  composed of 4 banks each, totaling 32 banks. Within each bank is a massive array, 65,536 memory  cells tall by 8192 cells across, essentially rows and columns in a grid, with tens of thousands of  wires, and supporting circuitry running outside each bank. 
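The capacity of the die follows directly from those numbers. The sketch below multiplies them out, assuming one bit per cell as described earlier and treating "2 gigabytes" in the binary sense (2^31 bytes). The constant names are illustrative.

```python
# Organization of the die as quoted in the narration.
BANK_GROUPS = 8
BANKS_PER_GROUP = 4
ROWS_PER_BANK = 65_536
COLUMNS_PER_BANK = 8_192

banks = BANK_GROUPS * BANKS_PER_GROUP
cells_per_bank = ROWS_PER_BANK * COLUMNS_PER_BANK
total_cells = banks * cells_per_bank       # one bit stored per cell
total_gib = total_cells / 8 / 2**30        # bits -> bytes -> binary gigabytes

print(f"{banks} banks, {total_cells:,} cells, {total_gib} GiB")
```

The total is a little over 17 billion cells, which is the figure used in the rest of the video.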
Instead of looking at this die, we’re  going to transition to a functional diagram, and then reorganize the banks and bank groups. In order to access 17 billion memory cells, we need a 31-bit address.  3 bits are used to  select the appropriate bank group, then 2 bits to select the bank. 
Next 16 bits of the address  are used to determine the exact row out of 65 thousand.  Because this chip reads or writes 8  bits at a time, the 8192 columns are grouped by 8 memory cells, all read or written at a time,  or ‘by 8’, and thus only 10 bits are needed for the column address.  One optimization is that  this 31-bit address is separated into two parts and sent using only 21 wires. 
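To make the address fields concrete, here is a small sketch that packs and unpacks a 31-bit address using the field widths from the narration. The order of the fields inside the integer, the helper names, and the example values are assumptions for illustration; real DRAM receives these fields over its address wires rather than as one packed number.

```python
from typing import NamedTuple

# Field widths quoted in the narration: 3 + 2 + 16 + 10 = 31 bits.
BANK_GROUP_BITS = 3
BANK_BITS = 2
ROW_BITS = 16
COLUMN_BITS = 10


class DramAddress(NamedTuple):
    bank_group: int
    bank: int
    row: int
    column: int


def join_address(fields: DramAddress) -> int:
    """Pack the four fields into one integer (bit order is an assumption)."""
    value = fields.bank_group
    value = (value << BANK_BITS) | fields.bank
    value = (value << ROW_BITS) | fields.row
    value = (value << COLUMN_BITS) | fields.column
    return value


def split_address(address: int) -> DramAddress:
    """Unpack a 31-bit integer back into its four fields."""
    column = address & ((1 << COLUMN_BITS) - 1)
    address >>= COLUMN_BITS
    row = address & ((1 << ROW_BITS) - 1)
    address >>= ROW_BITS
    bank = address & ((1 << BANK_BITS) - 1)
    bank_group = address >> BANK_BITS
    return DramAddress(bank_group, bank, row, column)


# Arbitrary example values chosen for illustration.
example = DramAddress(bank_group=5, bank=2, row=27_524, column=598)
packed = join_address(example)
print(f"{packed:031b}", split_address(packed))
```

Because the column field addresses groups of 8 cells, 2^31 addresses times 8 bits covers all 17 billion cells.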
First, the bank  group, bank, and row address are sent, and then after that the column address.  Next, we’ll look  inside these physical memory cells, but first, let’s briefly talk about how these structures are  manufactured as well as this video’s sponsor. This incredibly complicated die,  also called an integrated circuit, is manufactured on 300-millimeter silicon wafers,  2500ish dies at a time. 
On each die are billions of nanoscopic memory cells that are fabricated using dozens of tools and hundreds of steps in a semiconductor fabrication plant or fab. This one was made by Micron, which manufactures around a quarter of the world’s DRAM, including both Nvidia’s and AMD’s VRAM in their GPUs. Micron also has its own product line of DRAM and SSDs under the brand Crucial which, as mentioned earlier, is the sponsor of this video. In addition to DRAM, Micron is one of the world’s leading suppliers of solid-state drives such as this Crucial P5+ M.2 NVMe SSD.
By installing your operating system and video games on a Crucial  NVMe solid-state drive, you’ll be sure to have incredibly fast loading times and smooth gameplay,  and if you do video editing, make sure all those files are on a fast SSD like this one as well.   This is because the main speed bottleneck for loading is predominantly limited by the speed of  the SSD or hard drive where the files are stored. For example, this hard drive can only transfer  data at around 150 megabytes a second whereas this Crucial NVMe SSD can transfer data at a  rate of up to 6,600 megabytes a second, which, for comparison is the speed of a moving tortoise  versus a galloping horse. 
By using a Crucial NVMe SSD, loading a video game that requires gigabytes  of DRAM is reduced from a minute or more down to a couple seconds.  Check out the Crucial NVMe  SSDs using the link in the description below. Let’s get back to the details of how DRAM works  and zoom in to explore a single memory cell situated in a massive array.
This memory cell is  called a 1T1C cell and is a few dozen nanometers in size.  It has two parts, a capacitor to store  one bit of data in the form of electrical charges or electrons and a transistor to access and read  or write data.  The capacitor is shaped like a deep trench dug into silicon and is composed of  two conductive surfaces separated by a dielectric insulator or barrier just a few atoms thick, which  stops the flow of electrons but allows electric fields to pass through. 
If this capacitor  is charged up with electrons to 1 volt, it’s a binary 1, and if no charges are present  and it’s at 0 volts, it’s a binary 0, and thus this cell only holds one bit of data.  Designs  of capacitors are constantly evolving but in this trench capacitor, the depth of the silicon is  utilized to allow for larger capacitive storage, while taking up as little area as possible. Next let’s look at the access transistor and add in two wires. 
The wordline wire connects to  the gate of the transistor while the bitline wire connects to the other side of the transistor’s  channel.  Applying a voltage to the wordline turns on the transistor, and, while it’s on,  electrons can flow through the channel thus connecting the capacitor to the bitline.  This  allows us to access and charge up the capacitor to write a 1 or discharge the capacitor to write  a 0. 
Additionally, we can read the stored value in the capacitor by measuring the amount of  charge.  However, when the wordline is off, the transistor is turned off, and the capacitor  is isolated from the bitline thus saving the data or charge that was previously written.  Note  that because this transistor is incredibly small, only a few dozen nanometers wide, electrons slowly  leak across the channel, and thus over time the capacitor needs to be refreshed to recharge  the leaked electrons.
We’ll cover exactly how refreshing memory cells works a little later. As mentioned earlier, this 1T1C memory cell is one of 17 billion inside this single die and is  organized into massive arrays called banks.  So, let’s build a small array for illustrative  purposes. 
In our array, each of the wordlines is connected in rows, and then the bitlines are  connected in columns.  Wordlines and bitlines are on different vertical layers so one can  cross over the other, and they never touch. Let’s simplify the visual and use symbols for the  capacitors and the transistors. 
Just as before, the wordlines connect to each transistor’s control  gate in rows, and then all the bitlines in columns connect to the channel opposite each capacitor.  As a result, when a wordline is active, all the capacitors in only that row are  connected to their corresponding bitlines, thereby activating all the memory cells in that  row.  At any given time only one wordline is active because, if more than one wordline were  active, then multiple capacitors in a column would be connected to the bitline and the data  storage functionalities of these capacitors would interfere with one another, making them useless.  
As mentioned earlier, within a single bank there are 65,536 rows and 8,192 columns and the 31-bit  address is used to activate a group of just 8 memory cells.  The first 5 bits select the bank,  and the next 16-bits are sent to a row decoder to activate a single row.  For example, this  binary number turns on the wordline row 27,524, thus turning on all transistors in that row and  connecting the 8,192 capacitors to their bitlines, while at the same time the other 65  thousandish wordlines are all off.
Here’s the logic diagram for a simple decoder. The remaining 10 bits of the address are sent to the column multiplexer.  This multiplexer  takes in the 8192 bitlines on the top, and, depending on the 10-bit address, connects a  specific group of 8 bitlines to the 8 input and output IO wires at the bottom. 
For example, if the 10-bit address were this, then only the bitlines 4,784 through 4,791 would be connected to the IO wires, and the rest of the 8000ish bitlines would be connected to nothing. Here’s the logic diagram for a simple multiplexer. We now have the means of accessing any memory cell in this massive array; however, to understand the three basic operations, reading, writing, and refreshing, let’s add two elements to our layout: a sense amplifier at the bottom of each bitline, and a read and write driver outside of the column multiplexer.
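The row decoder and column multiplexer can be sketched in a few lines. This is a behavioral model only, not the gate-level logic shown in the video, and the function names are made up for illustration. The column address 598 is inferred from the bitline numbers quoted above, since 4,784 divided by 8 is 598.

```python
ROW_ADDRESS_BITS = 16
BITLINES = 8_192
BITLINES_PER_GROUP = 8


def decode_row(row_address):
    """Return one-hot wordline states: exactly one wordline is switched on."""
    wordline_count = 2 ** ROW_ADDRESS_BITS
    if not 0 <= row_address < wordline_count:
        raise ValueError("row address out of range")
    return [1 if index == row_address else 0 for index in range(wordline_count)]


def select_bitlines(column_address):
    """Return the 8 bitlines that the multiplexer connects to the IO wires."""
    group_count = BITLINES // BITLINES_PER_GROUP  # 1,024 groups, hence 10 bits
    if not 0 <= column_address < group_count:
        raise ValueError("column address out of range")
    first = column_address * BITLINES_PER_GROUP
    return range(first, first + BITLINES_PER_GROUP)


wordlines = decode_row(27_524)
print(sum(wordlines), wordlines.index(1))
print(list(select_bitlines(598)))
```

Only one wordline is ever on, and only one group of 8 bitlines reaches the IO wires; everything else stays disconnected.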
Let’s look at reading from a group of memory cells.  First the read command and 31-bit address  are sent from the CPU to the DRAM.  The first 5 bits select a specific bank.
The next step is to turn off all the wordlines in that bank, thereby isolating all the capacitors, and then precharge all 8000ish bitlines to 0.5 volts. Next the 16-bit row address turns on a row, and all the capacitors in that row are connected to their bitlines.
If an individual capacitor holds a 1 and is charged to 1 volt, then some charge flows from the capacitor onto the 0.5-volt bitline, and the voltage on the bitline increases. The sense amplifier then detects this slight change or perturbation of voltage on the bitline, amplifies the change, and pushes the voltage on the bitline all the way up to 1 volt.
However, if a 0 is stored in the capacitor, charge flows from the bitline into the capacitor, and the 0.5-volt bitline decreases in voltage. The sense amplifier then sees this change, amplifies it and drives the bitline voltage down to 0 volts or ground.
The sense amplifier is necessary because the capacitor is so small,  and the bitline is rather long, and thus the capacitor needs to have an additional component  to sense and amplify whatever value is stored. Now, all 8000ish bitlines are driven to 1  volt or 0 volts corresponding to the stored charge in the capacitors of the activated  row, and this row is now considered open. Next, the column select multiplexer uses  the 10-bit column address to connect the corresponding 8 bitlines to the read  driver which then sends these 8 values and voltages over the 8 data wires to the CPU. 
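The precharge, perturb, and amplify steps can be modeled as simple charge sharing between two capacitors. In the sketch below the capacitance values are invented purely for illustration (the video gives none), and the sense amplifier is reduced to a comparison against the precharge voltage.

```python
# Assumed, illustrative capacitances; the video does not give real values.
CELL_CAPACITANCE_F = 10e-15
BITLINE_CAPACITANCE_F = 100e-15

PRECHARGE_V = 0.5
FULL_V = 1.0


def share_charge(cell_voltage):
    """Bitline voltage after the cell capacitor is connected to it."""
    total_charge = (CELL_CAPACITANCE_F * cell_voltage
                    + BITLINE_CAPACITANCE_F * PRECHARGE_V)
    return total_charge / (CELL_CAPACITANCE_F + BITLINE_CAPACITANCE_F)


def sense_amplify(bitline_voltage):
    """Drive the bitline fully to 1 volt or 0 volts based on the perturbation."""
    return FULL_V if bitline_voltage > PRECHARGE_V else 0.0


def read_cell(stored_bit):
    """Return (perturbed bitline voltage, bit read back after amplification)."""
    cell_voltage = FULL_V if stored_bit else 0.0
    perturbed = share_charge(cell_voltage)
    restored = sense_amplify(perturbed)
    return perturbed, int(restored == FULL_V)


for bit in (1, 0):
    perturbed, value = read_cell(bit)
    print(f"stored {bit}: bitline moves to {perturbed:.3f} V, read back {value}")
```

With these assumed values the bitline only moves by a few hundredths of a volt, which shows why the tiny perturbation has to be amplified before it can be used.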
Writing data to these memory cells is similar to reading, however with a few key differences. First the write command, address, and 8 bits to be written are sent to the DRAM chip. Next, just like before the bank is selected, the capacitors are isolated, and the bitlines are precharged to 0.5 volts.
Then, using a 16-bit address, a single row is activated, the capacitors perturb the bitline, and the sense amplifiers sense this and drive the bitlines to a 1 or 0 thus opening the row. Next the column address goes to the multiplexer, but, this time, because a write command was sent, the multiplexer connects the specific 8 bitlines to the write driver which contains the 8 bits that the CPU had sent along the data wires and requested to write.
These  write drivers are much stronger than the sense amplifier and thus they override whatever voltage  was previously on the bitline, and drive each of the 8 bitlines to 1 volt for a 1 to be written,  or 0 volts for a 0.  This new bitline voltage overrides the previously stored charges or values  in each of the 8 capacitors in the open row, thereby writing 8 bits of data to the memory  cells corresponding to the 31-bit address. Three quick notes. 
First, as a reminder, writing and reading happen concurrently with all the 4 chips in the shared memory channel, using the same 31-bit address and command wires, but with different data wires for each chip. Second, with DDR5 for a binary 1 the voltage is actually 1.1 volts, for DDR4 it’s 1.2 volts,
and prior generations had even higher voltages, with the bitline precharge voltages being half of these voltages. However, for DDR5, when writing or refreshing, a higher voltage of around 1.4 volts is applied and stored in each capacitor for a binary 1 because charge leaks out over time.
However, for simplicity, we’re going to stick with 1 and 0.  Third, the number  of bank groups, banks, bitlines and wordlines varies widely between different generations  and capacities but is always in powers of 2. Let’s move on and discuss the third operation  which is refreshing the memory cells in a bank.
As mentioned earlier, the transistors used to isolate the capacitors are incredibly small, and thus charges leak across the channel. The refresh operation is rather simple and is a sequence of closing all the rows, precharging the bitlines to 0.5 volts, and opening a row.
To refresh, just as before, the capacitors perturb  the bitlines and then the sense amplifiers drive the bitlines and capacitors of the open row fully  up to 1 volt or down to 0 volts depending on the stored value of the capacitor, thereby refilling  the leaked charge.  This process of row closing, precharging, opening, and sense amplifying happens  row after row, taking 50 nanoseconds for each row, until all 65 thousandish rows are refreshed  taking a total of 3 milliseconds or so to complete.  The refresh operation occurs  once every 64 milliseconds for each bank, because that’s statistically below the  worst-case time it takes for a memory cell to leak too much charge to make a stored 1  turn into a 0, thus resulting in a loss of data.
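The refresh figures can be checked with a little arithmetic. This sketch uses only the numbers quoted in the narration; the variable names are illustrative.

```python
ROWS_PER_BANK = 65_536
REFRESH_TIME_PER_ROW_S = 50e-9    # 50 nanoseconds per row
REFRESH_INTERVAL_S = 64e-3        # every row refreshed once per 64 milliseconds

# Time to walk through every row of one bank once.
total_refresh_s = ROWS_PER_BANK * REFRESH_TIME_PER_ROW_S

# Share of each 64 ms window that a bank spends refreshing.
busy_fraction = total_refresh_s / REFRESH_INTERVAL_S

# How many full refresh passes happen each second.
refreshes_per_second = 1 / REFRESH_INTERVAL_S

print(f"{total_refresh_s * 1e3:.2f} ms per pass, "
      f"{busy_fraction:.1%} of the time, "
      f"{refreshes_per_second:.1f} passes per second")
```

So a bank spends only a small slice of each 64-millisecond window refreshing, and the rest is available for reads and writes.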
Let’s take a step back and consider the  incredible amount of data that is moved through DRAM memory cells. These banks of memory  cells handle up to 4 thousand 8 hundred million requests to read and write data every second  while refreshing every memory cell in each bank row by row around 16 times a second.  That’s a staggering amount of data movement and illustrates the true strength of computers. 
Yes, they do simple things like comparisons, arithmetic, and moving data around, but  at a rate of billions of times a second. Now, you might wonder why computers  need to do so much data movement. Well, take this video game for example.
You have obvious  calculations like the movement of your character and the horse. But then there are individual  grasses, trees, rocks, and animals whose positions and geometries are stored in DRAM.  And then the environment such as the lighting and shadows change the colors and textures of the  environment in order to create a realistic world.
Next, we’re going to explore breakthroughs and  optimizations that allow DRAM to be incredibly fast. But, before we get into all those  details, we would greatly appreciate it if you could take a second to hit that like  button, subscribe if you haven’t already, and type up a quick comment below, as it helps get  this video out to others.  Also, we have a Patreon and would appreciate any support. 
This is our  longest and most detailed video by far, and we’re planning more videos that get into the inner  details of how computers work.  We can’t do it without your help, so thank you for watching and  doing these three quick things. It helps a ton.
The first complex topic which we’ll explore  is why there are 32 banks, as well as what the parameters on the packaging of DRAM are.   After that, we’ll explore burst buffers, sub-arrays, and folded DRAM architecture  and what’s inside the sense amplifier. Let’s take a look at the banks. 
As  mentioned earlier opening a single row within a bank requires all these  steps and this process takes time. However, if a row were already open, we  could read or write to any section of 8 memory cells using only the 10-bit  column address and the column select multiplexer.   When the CPU sends a read or  write command to a row that’s already open, it’s called a row hit or page hit, and this  can happen over and over. 
With a row hit, we skip all the steps required to open a row, and  just use the 10-bit column address to multiplex a different set of 8 columns or bitlines, connecting  them to the read or write driver, thereby saving a considerable amount of time.  A row miss is  when the next address is for a different row, which requires the DRAM to close and isolate the  currently open row, and then open the new row. On a package of DRAM there are typically 4 numbers  specifying timing parameters regarding row hits, precharging, and row misses. 
The first number  refers to the time it takes between sending an address with a row open, thus a row hit, to  receiving the data stored in those columns. The next number is the time it takes to open  a row if all the lines are isolated and the bitlines are precharged.  Then the next number  is the time it takes to precharge the bitlines before opening a row, and the last number is  the time it takes between a row activation and the following precharge. 
Note that these  numbers are measured in clock cycles. Row hits are also the reason why the address is  sent in two sections, first the bank selection and row address called RAS and then the column address  called CAS. If the first part, the bank selection and row address, matches a currently open row,  then it’s a row hit, and all the DRAM needs is the column address and the new command, and then the  multiplexer simply moves around the open row.
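Here is a minimal sketch of how row hits and row misses add up for a stream of accesses. It tracks one open row per bank and is a simplification: the first access to a bank with no row open is counted as a miss, and no timing is modeled. The function name and access patterns are made up for illustration.

```python
def count_row_hits(accesses):
    """Count (hits, misses) for an iterable of (bank, row) accesses."""
    open_rows = {}          # bank -> row currently open in that bank
    hits = misses = 0
    for bank, row in accesses:
        if open_rows.get(bank) == row:
            hits += 1       # row already open: only the column address is needed
        else:
            misses += 1     # close the old row (if any), then open the new one
            open_rows[bank] = row
    return hits, misses


same_row = [(0, 7)] * 8              # eight accesses to one row of bank 0
alternating = [(0, 7), (0, 9)] * 4   # bouncing between two rows of bank 0

print(count_row_hits(same_row))      # (7, 1)
print(count_row_hits(alternating))   # (0, 8)
```

The first pattern pays the cost of opening a row once, while the second pays it on every single access.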
Because of the time saving in accessing an  open row, the CPU memory controller, programs, and compilers are optimized for increasing the  number of subsequent row hits. The opposite, called thrashing, is when a program jumps around  from one row to a different row over and over, and is obviously incredibly inefficient  both in terms of energy and time. Additionally, DDR5 DRAM has 32 banks for  this reason. 
Each bank’s rows, columns, sense amplifiers and row decoders operate  independently of one another, and thus multiple rows from different banks can be open all at the  same time, increasing the likelihood of a row hit, and reducing the average time it takes for the CPU  to access data.  Furthermore, by having multiple bank groups, the CPU can refresh one bank in each  bank group at a time while using the other three, thus reducing the impact of refreshing.  A question you may have had earlier is why are banks significantly taller than they are  wide?
Well, by combining all the banks together one next to the other you can think of this chip  as actually being 65 thousand rows tall by 262 thousand columns wide. And, by adding 31 equally  spaced divisions between the columns, thus creating banks, we allow for much more flexibility  and efficiency in reading, writing and refreshing. Also, note that on the DRAM packaging are  its capacity in Gigabytes, the number of millions of data transfers per second, which  is two times the clock frequency, and the peak data transfer rate in Megabytes per second.
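These packaging numbers are related by simple arithmetic. The sketch below assumes a 2,400 MHz clock, chosen so that the transfer rate matches the 4,800 million per second mentioned earlier, and assumes the two 32-bit channels described above are both in use.

```python
clock_mhz = 2_400                    # assumed example clock frequency
transfers_per_clock = 2              # data moves twice per clock cycle
channels = 2                         # Channel A and Channel B
bits_per_channel = 32

# Millions of transfers per second.
megatransfers_per_second = clock_mhz * transfers_per_clock

# Bytes moved per transfer across both channels.
bytes_per_transfer = channels * bits_per_channel // 8

# Peak data rate in megabytes per second.
peak_mb_per_second = megatransfers_per_second * bytes_per_transfer

print(megatransfers_per_second, "MT/s ->", peak_mb_per_second, "MB/s peak")
```

This is a peak figure only; row misses and refreshes keep real-world throughput below it.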
The next design optimization we’ll explore is the burst buffer and burst length.  Let’s add a  128-bit read and write temporary storage location, called a burst buffer to our functional diagram.   Instead of 8 wires coming out of the multiplexer, we’re going to have 128 wires that connect  to these 128-bit buffer locations. 
Next the 10-bit column address is broken into two  parts, 6 bits are used for the multiplexer, and 4 bits are for the burst buffer.  Let’s explore a reading command.  With our burst buffer in place, 128 memory cells and  bitlines are connected to the burst buffer using the 6 column bits, thereby temporarily loading,  or caching 128 values into the burst buffer.
Using the 4 bits for the buffer, 8 quickly  accessed data locations in the burst buffer are connected to the read drivers and the data is  sent to the CPU.  By cycling through these 4 bits, all 16 sets of 8 bits are read out, and thus the  burst length is 16.  After that a new set of 128 bitlines and values are connected and loaded  into the burst buffer. 
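A read burst can be sketched as follows. Which of the 10 column bits go to the multiplexer and which go to the burst buffer is an assumption here (upper 6 and lower 4), and the open row is filled with stand-in values so the selected bitlines are easy to see.

```python
BITLINES = 8_192
BUFFER_BITS = 128
CHUNK_BITS = 8
BURST_LENGTH = BUFFER_BITS // CHUNK_BITS   # 16 chunks of 8 bits


def split_column_address(column_address):
    """Split the 10-bit column address into (6-bit mux select, 4-bit burst index)."""
    if not 0 <= column_address < 1_024:
        raise ValueError("column address out of range")
    return column_address >> 4, column_address & 0b1111


def read_burst(open_row, mux_select):
    """Load 128 values into the burst buffer and return them as 16 chunks of 8."""
    start = mux_select * BUFFER_BITS
    burst_buffer = open_row[start:start + BUFFER_BITS]
    return [burst_buffer[i * CHUNK_BITS:(i + 1) * CHUNK_BITS]
            for i in range(BURST_LENGTH)]


# Stand-in data: each cell simply holds its own bitline number.
open_row = list(range(BITLINES))

mux_select, burst_index = split_column_address(598)
chunks = read_burst(open_row, mux_select)
print(mux_select, burst_index, chunks[burst_index])
```

Under this assumed bit split, column address 598 still lands on bitlines 4,784 through 4,791, and the other 15 chunks in the buffer are its neighbors.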
There’s also a write burst buffer which operates in a similar way. The benefit of this design is that 16 sets of 8 bits per microchip, totaling 1,024 bits across the 8 chips on the DIMM, can be accessed and read or written extremely quickly, as long as the data is all next to one another, but at the same time we still have the granularity and ability to access any set of 8 bits if our data requests jump around. The next design optimization addresses the fact that this bank of 65,536 rows by 8,192 columns is rather massive, and results in extremely long wordlines and bitlines, especially when compared to the size of each trench capacitor memory cell.
Therefore,  the massive array is broken up into smaller blocks 1,024 by 1,024, with intermediate  sense amplifiers below each subarray, and subdividing wordlines and using a hierarchical  row decoding scheme.  By subdividing the bitlines, the distance and amount of wire that each tiny  capacitor is connected to as it perturbs the bitline to the sense amplifier is reduced, and  thus the capacitor doesn’t have to be as big.  By subdividing the wordlines the capacitive load from  eight thousandish transistor gates and channels is decreased, and thus the time it takes to turn on  all the access transistors in a row is decreased.
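To get a feel for the subdivision, this short calculation counts how many 1,024 by 1,024 blocks fit in one bank. It assumes the bank divides evenly into blocks of exactly that size, as the narration suggests; real layouts may differ.

```python
ROWS_PER_BANK = 65_536
COLUMNS_PER_BANK = 8_192
SUBARRAY_ROWS = 1_024
SUBARRAY_COLUMNS = 1_024

subarrays_tall = ROWS_PER_BANK // SUBARRAY_ROWS        # blocks stacked vertically
subarrays_wide = COLUMNS_PER_BANK // SUBARRAY_COLUMNS  # blocks side by side
subarrays_per_bank = subarrays_tall * subarrays_wide

print(subarrays_tall, "x", subarrays_wide, "=", subarrays_per_bank, "subarrays per bank")
```

Each capacitor then only has to perturb a bitline that spans 1,024 cells instead of 65,536.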
The final topic we’re going to talk about is  the most complicated.  Remember how we had a sense amplifier connected to the bottom of  each bitline?  Well, this optimization has two bitlines per column going to each sense amplifier  and alternating rows of memory cells connected to the left and right bitlines, thus doubling the  number of bitlines. 
When one row is active, half of the bitlines are active while the other  half are passive and vice versa when the next row is active.   Moving down to see inside the sense  amplifier we find a cross-coupled inverter.  How does this work? 
Well, when the active bitline is  a 1, the passive bitline will be driven by this cross-coupled inverter to the opposite value  of 0, and when the active is a 0, the passive becomes a 1.  Note that the inverted passive  bitline isn’t connected to any memory cells, and thus it doesn’t mess up any stored data.  The  cross-coupled inverter makes it such that these two bitlines are always going to be opposite  one another, and they’re called a differential pair. 
There are three benefits to this design. First, during the precharge step, we want to bring all the bitlines to 0.5 volts and, by having a differential pair of active and passive bitlines, the easiest solution is to disconnect the cross-coupled inverters and open a channel between the two using a transistor.
The charge easily flows from the 1 bitline to the 0, and they both average out and settle at 0.5 volts. The other two benefits are noise immunity, and a reduction in parasitic capacitance of the bitline.
These benefits are related to the fact that by creating two oppositely charged electric wires with electric fields going from one to the other we reduce the amount of electric fields emitted in stray directions and relatedly increase the ability of the sense amplifier to amplify one bitline to 1 volt and the other to 0 volts. One final note is that when discussing DRAM, one major topic is the timing of addresses, command signals and data, and the related acronyms DDR or double data rate, and SDRAM, or Synchronous DRAM. These topics were omitted from this video because it would have taken an additional 15 minutes to properly explore.
That’s pretty much it for the DRAM, and we are grateful  you made it this far into the video.  We believe the future will require a strong emphasis on  engineering education and we’re thankful to all our Patreon and YouTube Membership Sponsors  for supporting this dream.  If you want to support us on YouTube Memberships, or Patreon,  you can find the links in the description.
A huge thanks goes to Nathan, Peter, and Jacob, who are doctoral students at the Florida Institute for Cybersecurity Research, for helping to research and review this video’s content! They do foundational research on finding the weak points in device security and whether hardware is compromised. If you want to learn more about the FICS graduate program or their work, check out the website using the link in the description.
This is Branch Education, and we create 3D animations that dive deep into the technology that  drives our modern world.  Watch another Branch video by clicking one of these cards or click here  to subscribe.  Thanks for watching to the end!