AWS re:Invent 2024 - Dive deep on Amazon S3 (STG302)

AWS Events
Amazon S3 provides developers and IT teams with cloud object storage that delivers industry-leading ...
Video Transcript:
Welcome, everybody. My name is Seth Markle, and I'm here today with James Bornholt, who will be coming out in a bit. This is the deep dive on Amazon S3, and I think we have a really interesting one this year. To give a little background on myself: I'm a senior principal engineer in S3, and I've been in S3 about 15 years; it'll be 15 years this coming February. I've worked across the S3 stack: our object indexing, our disk fleet, and our front-end systems. If you saw Matt Garman's keynote today, I worked on the S3 Metadata system that we previewed. James Bornholt, who will be out here in a bit, is a principal engineer. He's been working on our disk fleet and with the open source community on connectors, including Mountpoint and PyTorch, and most recently he was involved with the S3 Tables launch that we announced today. So today we're going to be talking about using scale to our advantage, and yours.
What does that mean? You've often heard technical talks about scale, and they're usually about how to build for scale: how to avoid having your software fall over when it's taking too much traffic, or holding too much data, or too much metadata. In other words, how do you tolerate scale? Today we're going to look at scale from a different perspective. We're not going to talk about how we design to tolerate scale, but rather how that scale actually makes S3 work better for customers. Before we begin, let's take a look at S3's scale. Here are some statistics. S3, as folks know, is an object store, and we hold more than 400 trillion objects, which amounts to exabytes of data. We average over 150 million requests per second; daily, we send over 200 billion event notifications; and in total, at peak, we serve over one petabyte per second of traffic worldwide through our front-end fleet. One petabyte per second. This is what we're referring to when we talk about scale.
We're going to illustrate this concept of scale to your advantage in three ways today. First, we're going to talk about the physics of data. When we talk about physics, we mean the real-world properties of hardware and software and how they dictate system behavior. If any of you saw Andy Warfield's keynote right before this, you'll have heard him refer to how we think about our disk fleet as a large thermodynamic system; we'll be talking about how those properties actually improve as our system grows. Then James will look at how the scale of S3 helps us deliver consistent performance and a consistent operational experience, even under real-world failure scenarios. And then we'll close out talking about how our scale allows us to design for failure in a way that actually helps us deliver faster on your behalf. Okay, let's get into it: the physics of data. As a software engineer, I'm often thinking about algorithmic efficiency, testing, documentation, things like that. After some of the Q Developer announcements this morning, I might have to personally think about testing and documentation less often, but it's still very top of mind for me.
As a storage engineer, I'm often thinking about the real-world physical properties of the underlying hardware, properties that can't just be solved with better software. We're going to walk through those physical properties, and I'm going to show how the storage system actually improves as it gets bigger. Like I said earlier, we often think about scale as "how do I tolerate getting larger," but today we're going to talk about how the system improves because it's larger. S3 has grown to tens of millions of drives, and they run really hot in terms of their workload. As we'll get into, there's only so much work a drive can absorb before it's at its limit, so if we can't spread traffic across those tens of millions of drives, we're limiting what our customers can do despite having the capacity on hand to serve that traffic. But it turns out that at scale, some pretty useful patterns emerge that help us balance things out. We'll look at how that works through three actors in our system. The first and most important actor is you, the customer. There's also the hardware in our system, and the software in our system, and those three together will paint a picture of what I mean here. Let's start with the physics of the hardware. Here we have a hard drive. Hard drives are mechanical components with two primary movements required to read data. The platter spins, as you can see in this illustration, around a central component called a spindle.
You'll often hear people refer to spindles when they really mean drive count; we talk about spindle counts in the system. Aside from that, there's a mechanical arm called an actuator that moves back and forth. It has a read/write head on its tip, and it has to move to access the particular part of the platter where the data resides. When you read data, you provide an address, really an offset from zero to the capacity of the drive, called a logical block address, or LBA. That tells the drive which platter the data you want lives on and where on the platter it is, and then the actuator has to move into position. The actuator swings back and forth; it's an actual physical component, and this does not happen instantaneously. And once it's on the right track, the odds are the data it needs is not directly under it, so you have to wait for the platter to rotate around until the data you want to read is underneath the head. On average that's going to be about half a rotation, because sometimes you'll have just missed it and sometimes it's right in front of you. The act of the actuator moving is called seeking, and the time it takes is called seek time; the time you wait for rotation is called rotational latency. These are two physical, real-world factors that cause drives to take time when you're reading data, and they're measurable amounts of time.
This is what we mean when we say hard drives have physical limitations: there's only so much swinging back and forth an actuator can do in a fixed period of time. Let's look at our software next. I won't go into too much detail here, but when you store an object in S3, we break it apart into pieces we call shards, and we store those shards across the storage nodes in our disk fleet using an algorithm called erasure coding. James will get into more detail about erasure coding in a bit, but the important thing is that we chunk up the data you send us and store it redundantly across a set of drives. Those drives run a piece of software called ShardStore. ShardStore is the file system we run on our storage node hosts; it's something we've written ourselves, and there are public papers on it, which James will mention in his section. At its core, it's a log-structured file system. What does that mean? It means that when new shards are written, we put them sequentially, next to each other. If you think back to how I described drives working, part of the expense, part of the latency of using drives is that actuator moving back and forth, so by putting data next to itself on the drive I can eliminate some of that cost. Writes actually become more efficient with a log-structured file system. From a read perspective, though, I don't have that luxury. I can't control when a customer wants to read which piece of data, so for reads I'm left jumping to wherever on the platter the data happens to be. Reads are kind of the pessimal workload, where we have to pay for both the seek and the rotational latency, hopping around the drive. This type of IO is often called random IO, and it usually involves a seek and then some rotation of the drive. So we have the hardware and the software; let's take a look at some workloads.
Storage workloads individually tend to be really, really bursty. What that means is that they're idle for long periods of time and then suddenly become really demanding. A great example of this is FINRA. FINRA is a big customer of AWS and S3: they regulate member broker-dealers and the securities markets, and they're responsible for collecting, processing, and analyzing every single transaction across the equity and options markets to look for improper behavior. That involves ingesting hundreds of billions of records each day; I think in August they had a peak day of 96 billion records processed, two days in a row. On average that's about 30 terabytes of data they have to process in a day, and they have a strict 4-hour SLA for processing it. Their workload is what we refer to as bursty because they're accessing their data really hard for a relatively short period of time, that 4-hour window, and then they're relatively idle for the other 20 hours of the day. Not purely idle, they're still ingesting new records, but all of their bulk processing happens in a 4-hour window, and they have a 20-hour window of relatively light activity. So let's do some math to see how many drives an example workload like FINRA's would need to operate this way if you were building your own storage system. In this example, and these are not FINRA's numbers, I'm just using them to illustrate the point, we have a customer that wants to hold a petabyte of data and process that data in a one-hour period.
If I have to access a petabyte over a one-hour period, the intended access rate is about 275 gigabytes per second across that data, and at about a megabyte per object that's about a billion objects, so it's actually a pretty reasonably sized workload. We started by talking about hard drives, and if you remember, there are two primary operations: seeking and rotation. A 7,200 RPM drive spins around 7,200 times a minute, which works out to about 8 milliseconds per rotation, and since on average you have to wait half a rotation to find your data, that's about 4 milliseconds of rotational latency on your average read. A seek on average has to move half the radius of the platter, so that's another 4 milliseconds. And remember how objects are chunked into shards for redundant storage: let's say in this example the one-megabyte objects are chunked into half-megabyte shards, so to read that half megabyte we have to transfer about half a megabyte off the disk; let's add another 2 milliseconds for that. In total we get about 10 milliseconds for this random read: the seek time, the rotational latency, and the transfer time. A second has 1,000 milliseconds, so if a read takes 10 milliseconds, I can do about 100 reads per second, and these are half-megabyte reads, so 100 of them per second is about 50 megabytes per second from this drive. The random-IO nature of this workload really adds a lot of overhead: 50 megabytes per second per drive. Now let's go back to the original numbers and tie this together. I have a petabyte of data that I want to access at 275 gigabytes per second at peak. If I were just storing that data and not accessing it, I would need 50 drives at 20 terabytes per drive, which is the drive size we land nowadays. But to access it at 275 gigabytes per second, at 50 megabytes per second per drive, I need over 5,000 drives in my system to support that workload. That's roughly a 100x amplification in drive count when you consider the access rate rather than just the storage footprint, and those drives end up idle 23 hours a day in this example, because I'm only using that IO capacity for one hour a day. That's a huge amount of inefficiency: you're paying for 100x the capacity, and you're not using it about 95% of the time.
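To make the arithmetic above concrete, here's a small back-of-the-envelope sketch in Python. The numbers (7,200 RPM, half-megabyte shards, 20 TB drives, a petabyte accessed in one hour) are the illustrative figures from the talk, not measurements of any real drive or workload.

```python
# Back-of-the-envelope math for the example workload in the talk.
# All numbers are illustrative, matching the figures used on stage.

DATA_BYTES = 1e15              # 1 PB of data
ACCESS_WINDOW_S = 3600         # processed in a 1-hour burst
DRIVE_CAPACITY_BYTES = 20e12   # 20 TB drives

# Per-read latency for a 7,200 RPM drive doing random IO
rotation_ms = 60_000 / 7200               # ~8.3 ms per full rotation
rotational_latency_ms = rotation_ms / 2   # wait half a rotation on average
seek_ms = 4                               # ~half the platter radius
transfer_ms = 2                           # ~0.5 MB shard off the platter
read_ms = rotational_latency_ms + seek_ms + transfer_ms   # ~10 ms

reads_per_s = 1000 / read_ms                  # ~100 random reads/second
drive_throughput_Bps = reads_per_s * 0.5e6    # ~50 MB/s of half-MB reads

required_Bps = DATA_BYTES / ACCESS_WINDOW_S   # ~278 GB/s during the burst

drives_for_capacity = DATA_BYTES / DRIVE_CAPACITY_BYTES      # 50 drives
drives_for_throughput = required_Bps / drive_throughput_Bps  # ~5,500 drives

print(f"throughput needed: {required_Bps / 1e9:.0f} GB/s")
print(f"drives for capacity alone: {drives_for_capacity:.0f}")
print(f"drives for the burst:      {drives_for_throughput:.0f}")
print(f"amplification:             {drives_for_throughput / drives_for_capacity:.0f}x")
```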
Some of you might be wondering: if this is what a single workload can look like, and S3 has millions of active customers, isn't this problem actually a million times worse for S3 as a service? This is the really cool part of operating at scale. Individual workloads are really bursty, but independent workloads that have nothing to do with each other don't burst together, so at the aggregate level, storage temperature becomes surprisingly predictable. Yes, younger data tends to be hot, smaller objects tend to be hot, and we have a lot of young and small objects, we have over 400 trillion objects, but we also have a lot of old and large objects. When you layer these workloads together, the system as a whole becomes much more predictable. As we aggregate them, even though our peak demand keeps growing, the peak-to-mean ratio collapses, and the workloads become more predictable. That's really good news for the physics equation we just walked through, because we can spread customers over many more drives than their storage alone would require, and we can do that because the peaks of some workloads line up with quiet periods for others. This is the first part of how our scale works to your advantage: we have tens of thousands of customers, probably some in this room, who each have data and workload spread across over a million drives, even though their storage footprint doesn't require nearly that much. There are two advantages to this. Because you're spread over a vast number of spindles (remember, the spindle is the spinning unit of the hard drive), you can burst to the throughput offered in aggregate by those spindles, even though the space you take up doesn't fill the drives. And because any single customer is a very small portion of any given disk, we get the nice workload isolation we saw on the last slide, where the peaks and valleys of different customers are decorrelated. And this is all possible precisely because we operate at such massive scale.
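The claim that the peak-to-mean ratio collapses as independent bursty workloads are layered together is easy to check with a toy simulation. This is a hypothetical sketch, not S3 data: each workload is modeled as idle most of the time and bursting for one hour a day at a random offset.

```python
import random

# Toy model: each workload is bursty (active 1 hour out of 24, at a random
# offset) and independent of the others. Measure peak-to-mean of the sum.
HOURS = 24

def bursty_workload():
    start = random.randrange(HOURS)
    return [1.0 if h == start else 0.0 for h in range(HOURS)]

for n in (1, 10, 100, 10_000):
    total = [0.0] * HOURS
    for _ in range(n):
        total = [t + x for t, x in zip(total, bursty_workload())]
    peak, mean = max(total), sum(total) / HOURS
    print(f"{n:>6} workloads: peak/mean = {peak / mean:4.1f}")

# A single workload has peak/mean = 24; by ~10,000 independent workloads the
# aggregate is nearly flat, which is what makes spreading everyone across the
# whole fleet work.
```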
I just spent a fair amount of time illustrating how customers are able to utilize throughput from far more drives than their data alone would require. But this predictability of workloads in aggregate is only part of the story; we still have to pay attention to how traffic distributes across the fleet so that we can prevent hot and cold spots across our disks, and that doesn't happen automatically. Objects follow some predictable patterns in aggregate, but over time, well, I mentioned younger data is hotter data. If I'm writing to a drive, the data I'm writing is young, but a year from now it's less young, and two years from now even less so, so drives tend to cool off as they get older. Sure, deletes come in and poke holes in that data, and I put more young data in there, so it heats up a little, but not nearly as much as a new drive. So left uncorrected, our system tends toward coldness. Remember, we want to aggregate and balance traffic across all of the spindles; it does no one any good to have a bunch of spindles that are full but not serving the throughput we need. Uncorrected, we'd end up with pockets of high utilization and pockets of low utilization across the fleet, which isn't ideal, because, like I said, we'd be leaving IO on the table. But our scale means we have storage racks of every traffic profile in the system, so there's a really large surface area for rebalancing and re-encoding when we get things wrong or when we see things cooling off. We do constantly move data around to balance storage temperatures across our racks and our hard drives, and it's our chance to revisit our initial placement decisions. Maybe the bytes were hotter or colder than expected, maybe things just cooled off because they got older, maybe they need to be moved around for intelligent tiering; there are lots of reasons why we might want to do this.
One useful opportunity for us to do this rebalancing is when we land new storage racks, so let's talk about storage racks for a quick moment. When it comes down to it, S3 is just a collection of these really massive racks of disks. If you saw Dave Brown's talk last night, some of the rack types we used to have weighed upwards of 4,500 pounds, which is over two tons. Our rack types today aren't quite that heavy, but they're still thousands of pounds; they weigh more than a car. They're so big that we have to consider their weight when we design our facilities, because when we land them at the loading dock, they have to be wheeled to their final location without collapsing the floor along the way, so we have reinforced flooring to support the weight of these things moving around. These are actual pictures of what we call JBODs; you'll hear different expansions of the acronym, "just a bunch of disks," "just a bunch of drives," "just a box of disks," but it's our endearing term for our racks of disks. There are about a thousand disks in these racks, at about 20 terabytes per disk, so each rack holds about 20 petabytes of physical capacity, and we're constantly landing racks like this to add more space to S3, similar to how you might buy a new drive and plug it into your laptop. But we can't just start sending all the new traffic to those drives, because, like I mentioned, young data is the hot data in the system, and it would overwhelm the thousand spindles in the rack. A thousand sounds like a lot, but it's actually not that much when you consider 50 megabytes per second per spindle; you can't get that much throughput from one rack, and if it held all of that young, hot data, you would melt down the rack. This is an artist's rendition of melting down a rack. So instead, we bring the rack online and start by diffusing existing S3 data onto it, a process we call migration. We fill it to ballpark around 80% full with existing, cooler data and let it mix in with the rest of the fleet. What this does is free up capacity, in aggregate, across the pre-existing spindles, so that there's space across a lot of spindles for hot data to land, not just in the new spindles that have arrived, because remember, we want as many spindles as possible participating in the traffic. Landing new racks gives us a great opportunity to revisit our past placement decisions, and our scale becomes a powerful tool for avoiding hotspots: because of that scale we have a diverse profile of racks in the system, so there are different places for us to select different traffic profiles from, depending on what we need to migrate. The bottom line here is that storage systems are grounded in the physics of real-world hardware and access patterns, and when you run your workloads on S3, you're leveraging the scale and the physics of our system to your advantage. Now James is going to come out and talk about how we design for decorrelation.
Thank you. I'm so excited to be here; this is kind of my favorite talk at re:Invent, because it's a chance to talk about the really awesome work our teams do to architect our systems, hopefully to give you some ideas for how to architect your systems, and also just some really cool stuff to talk about. I'm going to pick up pretty much where Seth left off. He was telling you about the benefits of spreading data across many drives, millions of drives, and how that works to your advantage. I want to talk a little more about that, because it's not a one-off; it's not the only place we do this. As we've been building S3, we've realized it's an architectural pattern we use, and it's this idea of decorrelation. The idea is that we engineer every part of S3 to decorrelate workloads. If you have two customers running workloads, those two workloads should be decorrelated from each other, but even as a single customer, your workload should be decorrelated from itself. That's kind of a weird idea: how do you decorrelate from yourself? Let's talk about what that actually means. S3 has a lot of really awesome features, and we just launched some more this morning, but pretty fundamentally S3 is about puts and gets: you put some bytes into S3, and you come back later and get them back. When you do one of those puts, we have a pretty important decision to make: we have to decide where to put your bytes.
Here's a really simple approach. You do a put, and we assign you a drive: here's your hard drive. Every time your bucket takes more puts, we keep putting them on that same drive, over and over, and eventually, if you have enough puts, your drive fills up. That's okay, we have more drives, so we assign you a second drive and start filling that one with data from your bucket. You might infer from the way I'm talking about this that this is not how we do it, and it's not a particularly great way to do it, but I don't want to dismiss it out of hand. This is actually a pretty reasonable way to do allocation, and in particular it has one key advantage: it's simple. Simple is great; it means it's easy for us to operate, easy to reason about, easy to plan for. But it does have a couple of problems. One of them, if you've ever thought about storage systems before, is probably pretty obvious: you can't store data on just one drive. Hard drives fail, and they fail at a fairly high rate, so of course we have to store your data in more than one place. We'll talk more about that later; for now, just remember that we store the data on more than one drive. But there are more nuanced problems with this approach, and they're a little different depending on your scale. If you have a small bucket, one that's not large enough to fill a drive, it's not cost efficient for us or for you to assign you an entire drive. Seth was talking about these drives being 20 terabytes; if you have 100 kilobytes of data, we're not going to give you an entire drive to yourself, it just doesn't make sense. So if you're a small customer, you're going to be contending with other customers for space: you're sharing one of these drives, with many different workloads running on it. Do you see the correlation? You have multiple workloads hitting the same drive, contending for the same scarce resources Seth was telling you about, those rotations and those seeks. If you're a large customer, that's not a problem: you have the entire drive, you have enough data to fill it. But now you have a different problem, which is that your performance is ultimately constrained by your storage footprint. This is the idea Seth was describing: if I have a petabyte of data, I get 50 drives, that's how much space I need, but I might want a lot more performance than I can get out of 50 drives. These are both correlations. If you're a small customer, you're correlated with other customers, because you're sharing a resource, the drive. If you're a large customer, you're correlated with yourself: you're bound by your own capacity, and you can't burst beyond it.
These correlations make the system really difficult to use, and they also make it very difficult for us to scale, because we have to think about what happens as customers become unbalanced over time. So we do things a little differently in S3, and it's an example of a pattern we talk about often at AWS called shuffle sharding. The idea behind shuffle sharding is actually pretty simple: rather than statically assigning a workload to a drive, or to any other kind of resource, a CPU, a GPU, what have you, we randomly spread workloads across the fleet, across our drives. When you do a put, we pick a random set of drives to put those bytes on; maybe we pick these two. The next time you do a put, even to the same bucket, even to the same key, it doesn't matter, we pick a different set of drives. They might overlap, but we make a new random choice about which drives to use for that put. This is intentionally engineering decorrelation into the system: these objects are from the same workload, the same customer, the same bucket, but we're forcing them to be decorrelated within your bucket. That means you're not constrained by those static resources anymore. If you have a petabyte of data, you can actually spread it across many more than one petabyte's worth of drives; you can get the 5,500 drives Seth said you needed for your workload. And if you're a small customer, you're insulated against the noisy-neighbor effect, because if there's one customer that's very hot, retrieving an object a large number of times, that object is probably spread across a different set of drives than yours. Even if you overlap on one of the drives, there are other drives you can go to that also have your data, so we get to decorrelate you from the hot workloads you'd otherwise be contending with. This idea of shuffle sharding is really cool, and it's pretty fundamental to how the team thinks about S3's resources. Our mental model is that when we talk about S3 being elastic, what we mean is that any workload should be able to burst to use every drive in our fleet, the tens of millions of drives that we have, as long as the workloads aren't interfering with each other.
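Here's a minimal sketch of the difference between static assignment and shuffle sharding, using made-up names and a tiny fleet; it isn't S3's placement code, just an illustration of the random per-put choice described above.

```python
import random

FLEET = [f"drive-{i:02d}" for i in range(20)]   # a toy fleet of 20 drives
SHARDS_PER_PUT = 3                              # pretend each put lands 3 shards

def static_placement(bucket: str) -> list[str]:
    # Naive scheme: a bucket is pinned to the same drives forever,
    # so its load and its neighbors' load are permanently correlated.
    rng = random.Random(bucket)                 # deterministic per bucket
    return rng.sample(FLEET, SHARDS_PER_PUT)

def shuffle_sharded_placement(bucket: str) -> list[str]:
    # Shuffle sharding: every put gets a fresh random set of drives,
    # regardless of bucket or key.
    return random.sample(FLEET, SHARDS_PER_PUT)

for _ in range(3):
    print("static :", static_placement("my-bucket"))
    print("shuffle:", shuffle_sharded_placement("my-bucket"))
# The static scheme prints the same 3 drives every time; the shuffle-sharded
# scheme spreads successive puts across the whole fleet.
```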
That means if you're FINRA, you can burst for that one hour a day to get terabytes per second of bandwidth if you need it, far beyond what your storage footprint, your capacity, might otherwise allow. So shuffle sharding is a really powerful mechanism for sharing resources efficiently and for intentionally creating decorrelation between workloads. Once you know this idea, it starts to pop up all over the place; it's not just drives. Let me give you another example of shuffle sharding in S3: DNS. When you send a request to S3, you might know you're actually sending it to a DNS address, your-bucket-name.s3.amazonaws.com, or something else if you're in a different region. When you resolve one of these names, you get back a list of IPs. You used to get back one IP; a really cool launch we had last year added multi-answer DNS support to S3, so now when you do one of these lookups you get back eight IPs. Why does this matter, and why am I telling you about all these IPs? Each of these IPs points to a different set of servers in our fleet, and this is how S3 does load balancing: when you do a DNS lookup, that's when we decide where we think we should send your requests, and it's going to be different from some other customer's answer. This has a couple of advantages for you. The first is that, as a single customer, you can spread your requests across many hosts, across many of these servers. That minimizes the effect of hotspotting if you're sharing one of those servers with other customers, but it also means you can burst to much larger amounts of bandwidth than you could if you only had one place to go for your data. There's also shuffle sharding going on here: if Seth and I do a DNS resolution for the same host at the same time, the same bucket name, we're actually going to get back different sets of front-end hosts, different sets of IPs. So even if one of us is a very hot customer, sending a lot of traffic to whatever IPs we got back, we're not going to contend with each other, because we're almost certainly going to different sets of hosts. Importantly, we engineer these front-end hosts, the IPs you get back, to be stateless, so it doesn't matter which one you send requests to: you'll be able to access all your bytes, all those random drives we picked, no matter which server you go to. Again, it's that idea of decorrelation.
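You can see the multi-answer behavior from any client. Below is a small, hedged sketch that resolves a bucket-style hostname and spreads requests across whatever IPs come back; `my-example-bucket` is a placeholder, and the exact number of answers you receive can vary.

```python
import random
import socket

# Resolve a virtual-hosted-style S3 endpoint. The bucket name here is a
# placeholder; substitute a bucket you own. Multi-answer DNS typically
# returns several A records rather than a single IP.
host = "my-example-bucket.s3.amazonaws.com"

infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
ips = sorted({info[4][0] for info in infos})
print(f"{host} resolved to {len(ips)} addresses: {ips}")

# A client that wants to spread load (as the SDKs and CRT do internally)
# can simply pick among the answers per request or per connection.
def pick_endpoint() -> str:
    return random.choice(ips)

print("sending this request to:", pick_endpoint())
```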
DNS shuffle sharding is especially important for fault tolerance. We have a lot of these servers, and sometimes servers fail; that's a totally normal thing. If there were some correlation between which servers you could use to access your data, that wouldn't be great for you, because one of those servers failing would look like a systematic failure in your bucket: you wouldn't be able to access those bytes anymore. The statelessness lets us limit the impact of one of these failures. These kinds of failures are also why retries are so important: a request that hits a single failure will almost always succeed the second time, because it will almost always go to a different host. All the AWS SDKs automatically implement retries for exactly this reason; we want to make sure that if one of these hosts fails, you go somewhere else. In fact, the SDKs are pretty smart about this. It's not just "if it fails, try again": they know which kinds of request failures, which HTTP error codes, correspond to server issues. They know the difference between a 403 authentication problem and an actual server error, and when a server error happens they can intelligently steer the retry to a different host; they can forget about the one that was failing and deliberately send the request to a different IP. And again, they can do this because of the statelessness: it doesn't matter which host they go to for any particular request, because your buckets aren't mapped to specific servers. The bottom line is that shuffle sharding is an awesome way to engineer for decorrelation. We do it across S3, on the client SDKs, on our front-end servers, on the drives. Decorrelation is how we give you resilience and stability, and it's only possible because of the scale we're working at. So shuffle sharding is a great technique, and there's one more thing I want to tell you about it. It's a little bit in the weeds, but when the team told me about it, it was so awesome that I couldn't help but share it. The AWS Common Runtime (CRT) is a low-level software library that we use for working with S3. It implements all of S3's best practices, things like retries and automatic parallelization of requests, everything you need to get really high throughput out of S3, hundreds of gigabits from a single host. Most customers will never interact with the CRT directly; they'll get it from the Java SDK or the Python SDK, which include it as a library, or through one of our open source connectors like Mountpoint for Amazon S3. Now, the CRT does one really awesome thing to take advantage of shuffle sharding, not just for fault tolerance like all our SDKs do, but to actually improve performance when you talk to S3.
It goes like this. Here's a graph of request latencies to S3: latency on the x-axis, and the distribution of latencies on the y-axis. Most of these requests are pretty fast, but you can see down at the end there are a few outliers, a few pretty slow requests. That happens in a distributed system; it's totally normal. The CRT does a really cool thing with this: it dynamically tracks this distribution, keeping a live estimate of the tail latency of the requests you're making. Let's say it's tracking the P95, the 95th percentile of the latencies you're getting back from S3. Once it knows this, the CRT starts doing something interesting: it starts intentionally cancelling requests. Requests that go past that P95, that are slower than the P95, the CRT cancels and retries. On the surface that might seem surprising: why would you cancel a request, especially a slow one that's probably pretty close to responding? It seems like you're throwing away work. But shuffle sharding is exactly what makes this work: the new request you send will probably go to a different front-end host, and probably to a different set of drives, because your requests are spread across many drives, so it's very likely that the retry will be quite fast. You end up with a distribution that looks like this: the left-hand side hasn't really changed, but everything past that P95, where we started cancelling and retrying, has been pulled in; most of those requests became very fast on the retry, so we've pulled in the tail of the latency distribution. Now, to be clear, this is gambling: we're taking a bet that the request will be faster the second time around, and it's not always going to pay off. But in practice it's a ridiculously effective technique for getting not just fault tolerance out of shuffle sharding, but performance as well.
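Here's a minimal asyncio sketch of that cancel-and-retry idea, sometimes called request hedging. It isn't the CRT's implementation; the fetch function and the P95 tracking are simplified stand-ins so the mechanism is visible.

```python
import asyncio
import random

LATENCIES: list[float] = []   # observed latencies, for a rough live P95

def p95() -> float:
    if len(LATENCIES) < 20:
        return 0.5                      # generous default until we have data
    return sorted(LATENCIES)[int(len(LATENCIES) * 0.95)]

async def fetch_once(key: str) -> str:
    # Stand-in for a GET: usually fast, occasionally very slow (the tail).
    delay = random.choice([0.02] * 95 + [1.0] * 5)
    await asyncio.sleep(delay)
    LATENCIES.append(delay)
    return f"data for {key}"

async def hedged_fetch(key: str) -> str:
    deadline = p95()
    try:
        # First attempt, cancelled if it runs past our live tail estimate.
        return await asyncio.wait_for(fetch_once(key), timeout=deadline)
    except asyncio.TimeoutError:
        # Retry; shuffle sharding means this almost certainly lands on a
        # different front-end host and different drives, so it's likely fast.
        return await fetch_once(key)

async def main():
    results = await asyncio.gather(*(hedged_fetch(f"k{i}") for i in range(200)))
    print(len(results), "requests completed; live P95 estimate:", round(p95(), 3))

asyncio.run(main())
```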
So shuffle sharding is kind of everywhere; it's an awesome design pattern. Let me tell you a little about how it actually works. We like the idea of spreading data across the fleet, across as many drives as we possibly can, but how do we actually do it? When you do one of these puts, we have to pick a set of drives; we have to decide that this drive, this drive, and this drive are going to hold your bytes. How do we make that decision? One thing we could do is look at all the drives: examine every drive and figure out the optimal set for you, maybe the least-full drives or the least-contended drives. We could make an intelligent decision by looking at all of our drives on every put. Unfortunately, we have a lot of drives, like, a lot of drives, so we can't actually look at all of them. We have tens of millions of drives, and we can't scan tens of millions of drives for every put, because there isn't just one put: we're doing many millions of requests per second in S3, many millions of puts all trying to make this calculation at the same time. There's also a coordination problem: if we're all looking at the same data at the same time, how do we know we're all making the right decision? So we can't do this perfectly. What about totally randomly? Totally random is actually a decent strategy: don't even try to get it right, just pick drives at random and put the bytes there. That works reasonably well, and it's in the spirit of shuffle sharding. There is a problem with pure randomness, though; let me show you what I mean. What I'm going to show you is a simulation of capacity usage, the amount of space used on each drive. Here's a graph: the x-axis is a percentage, showing how balanced the capacity is across our fleet of drives if you do totally random allocations, where every put picks a completely random set of drives. What you see is that if we did this totally randomly, then over time some of our drives end up 10%, maybe even 20%, less full than other drives.
That isn't great: it means we're wasting a bunch of capacity, because these random choices never consider which drive is the most full (and shouldn't take more data) or the least full (which would be a good place to put it). So totally random is okay, but it risks imbalance, and imbalance is ultimately a cost problem: we'd have to buy spare capacity to make up for it. So let me tell you about another really cool magic trick. It's called the power of two random choices, and here's the idea. When you're trying to optimize some metric, here, capacity usage, instead of choosing totally at random, do a very slightly different thing: pick two drives. Not all of them, just two random drives. Look at them, figure out which one is less used, and use that one. Don't look at all tens of millions of drives; pick two at random and take whichever is less full. Now, it doesn't seem like you've gained much. Why would picking two random drives help when there are tens of millions of them? But it turns out to be incredibly effective. This is the distribution of capacity when we use that technique: rather than a totally random choice, randomly pick two drives, or ten, or twenty, it doesn't matter much, and use whichever one is better on the metric you're optimizing for, here capacity. This is the distribution you get, and it's ridiculously better. It's a really dramatic effect; you kind of get the best of both worlds. You get very close to perfect, almost as good as you could do by looking at every drive every time, but you keep the performance, the efficiency, and the lack of coordination of randomness, by looking at just a couple of drives. In other words, just a little bit of knowledge about the system, just looking at two drives, goes a long way toward making your placement better.
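This is easy to reproduce with a toy simulation. The sketch below compares purely random placement with the two-choice rule over a hypothetical fleet; the numbers are made up, but the gap in imbalance between the two strategies is the effect described above.

```python
import random

DRIVES = 1000         # toy fleet size
PUTS = 200_000        # number of placements to simulate

def random_placement():
    used = [0] * DRIVES
    for _ in range(PUTS):
        used[random.randrange(DRIVES)] += 1
    return used

def two_choice_placement():
    used = [0] * DRIVES
    for _ in range(PUTS):
        a, b = random.randrange(DRIVES), random.randrange(DRIVES)
        used[a if used[a] <= used[b] else b] += 1   # pick the less-full drive
    return used

def imbalance(used):
    # Spread between the fullest and emptiest drive, as a % of the average.
    avg = sum(used) / len(used)
    return (max(used) - min(used)) / avg * 100

print(f"purely random: {imbalance(random_placement()):5.1f}% spread")
print(f"two choices:   {imbalance(two_choice_placement()):5.1f}% spread")
# Typical output shows the two-choice spread more than an order of magnitude smaller.
```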
So the power of two random choices is super awesome. It's a pattern to look for in any system, it's really fun, and it's techniques like this that make it possible for us to do shuffle sharding at scale and make it effective. We intentionally engineer for decorrelation across the entire S3 stack, not just our drives, not just our front end, the whole thing. We do it with shuffle sharding, intentionally placing data into different streams so that it doesn't interfere with itself in the future, and then we use algorithms like the power of two random choices to get scale and fault tolerance without paying a performance cost. There's a really cool article in the Amazon Builders' Library about shuffle sharding if you want to check it out; many of the same ideas, more pictures, it's pretty great. And I really encourage you to look for opportunities to do shuffle sharding in your own systems: it's a low-cost way to engineer resilience, and it can often be very simple to implement. Okay, so we've talked about decorrelation, the idea of engineering for decorrelation.
The last thing I want to tell you about is how we engineer for velocity. One of the really surprising things I've learned while working on S3 is that all the work we do designing for fault tolerance actually turns out to help our teams move faster and deliver faster. Let's talk about what that means. Obviously, at scale, failures are a fact of life. Drives fail all the time, kind of randomly; this is ultimately a distributed systems problem, and every distributed system has it. Things fail, and it's not just drives. It's important to be precise about failure. Drives fail individually, we see failures of racks, and often they're hardware issues: a fan is broken, a cable is broken, things like that. But we also engineer for larger fault domains, and in particular, S3 storage classes that use multiple Availability Zones are resilient to the permanent loss of an entire data center. One of those Availability Zones can go away forever and all of your data is still intact. We've done entire re:Invent talks about how we make that happen. It's not easy, and it's not just about architecture, not just about having redundancy in the system: fault tolerance at this scale is as much about engineering culture and how teams think about the system as it is about knowing what the failures even are in the first place. The other really important thing about fault tolerance is that it's not just about being robust, not just "we'll have some redundancy and everything is great," not just the shape of our systems. One of the hardest and most subtle problems in fault tolerance is knowing when something has failed in the first place. That sounds crazy, but "how do you know the system has failed?" is a surprisingly hard question to answer at scale. As an example, one of the things S3 is doing all the time is scanning every single byte we store, on all of our millions of hard drives, to check its integrity: to make sure it's still there where we expect it to be, and that it's intact. If you made a list of our largest customers, the ones who send us the most requests per second, this background scanning work, finding which data is intact and where data is missing, would be very high on that list. We also take it very seriously operationally: we have alarms on the P100, the longest time any byte has gone without being scanned by this process, and we will page and wake up engineers in the middle of the night if that metric goes too high. We really do care about making sure that every single byte in our fleet is scanned at some regularity. None of this should be too surprising; fault tolerance is the thing we do all the time in S3. So let me talk a little about how we get there. How do you get fault tolerance in this world? It's an idea we call erasure coding.
Here's a really simple way you could do this. If you want durability, resilience to single drive failures, one thing you could do is replicate your object: put the object on multiple drives. Here's a one-megabyte object; I can make, say, two extra copies and store them on different drives. Now I'm resilient to drive failures: if one drive fails, I've still got two copies, and if two drives fail, I've still got one, so I can survive some drive failures in this world. We can also get resilience to that entire-Availability-Zone failure model by putting these copies in different Availability Zones: if one AZ fails, the other two still have a copy of the data, and we can still get the object back. There's a lot to like about replication, and in particular it's another one of those things that's very simple, and we like simple; we do use replication in our system. It just has one downside: it's very expensive. To store one byte of customer data in this model, we now have to store three physical bytes; there's a 3x amplification on the bytes we store to get this fault tolerance. And in reality, to get to the eleven nines of durability we design S3 for, you actually need more than three copies, so it's even worse than it looks on the slide. So what can we do instead? I love talking about erasure coding; it's my chance to talk about some really cool math that we really enjoy working with in S3.
Here's the idea. I want to start again with my one-megabyte object, and I'm going to divide it into five pieces. So far I haven't gained anything: it's the same object, just split into five chunks, which we call shards. Here's the trick: we do some really fancy math to compute some extra shards, called parity shards. These aren't just any shards; the math that produces them gives us a pretty interesting property: any five of these shards is enough to recover the original object. Obviously, if I just take the top five, the original data I split into five shards, I can get the object back by gluing it back together; that's super simple. But I can also do it a different way: I could take, say, two of the parity shards and three of the original shards, and I'd still be able to reconstruct the entire original object. It doesn't matter which combination I use; any five of these nine shards is enough to get the object back. So if I spread these shards across multiple drives, I'm getting fault tolerance: some number of those drives can fail and we can still recover the original object, and importantly, it doesn't matter which drives. Here, any four drives can fail and we'll be okay. This erasure coding also gives us tolerance to Availability Zone failures; we just have to spread the shards across the Availability Zones. Here the math works out nicely: I have nine shards on the slide, and we have three Availability Zones in most regions, so I put three shards in each AZ. Even if an Availability Zone fails, I've still got six shards, more than enough to recover the object, and it doesn't matter which shards we lost. The magic part is that the overhead is much lower than it was for replication. In the version I've drawn on the slide, we can tolerate losing any four drives. To get that with replication, I'd have to store five copies: the four that might fail, plus one extra so I still have the object. That's a 5x overhead, five physical bytes for every logical byte I want to store. With erasure coding, we created four parity shards, each as large as one of the five shards of the original object, so we end up with only an 80% overhead: 1.8 physical bytes for every logical byte.
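To make the "any five of nine" property tangible, here's a toy Reed-Solomon-style erasure code sketch over a small prime field. It is not S3's encoder (which works over GF(2^8) with carefully engineered shard layouts); it's just a compact, self-contained demonstration of the recovery property and the 1.8x overhead.

```python
# Toy Reed-Solomon-style erasure code over GF(257): 5 data shards, 4 parity
# shards, any 5 of the 9 recover the data. An illustration of the
# "any k of n" property, not S3's implementation.
P = 257          # small prime field; shard values are stored as ints 0..256
K, N = 5, 9      # 5 data shards, 9 total shards

def _eval_lagrange(points, x):
    """Evaluate the unique degree<K polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data: bytes) -> dict[int, list[int]]:
    """Split data into K shards (x=1..K) and add N-K parity shards (x=K+1..N)."""
    assert len(data) % K == 0
    size = len(data) // K
    shards = {x: [data[(x - 1) * size + j] for j in range(size)] for x in range(1, K + 1)}
    for x in range(K + 1, N + 1):
        shards[x] = [
            _eval_lagrange([(i, shards[i][j]) for i in range(1, K + 1)], x)
            for j in range(size)
        ]
    return shards

def decode(available: dict[int, list[int]]) -> bytes:
    """Recover the original data from any K of the N shards."""
    points = list(available.items())[:K]
    size = len(points[0][1])
    out = []
    for x in range(1, K + 1):
        out += [_eval_lagrange([(xi, ys[j]) for xi, ys in points], x) for j in range(size)]
    return bytes(out)

original = b"hello, durable world"                    # 20 bytes = 4 bytes per shard
shards = encode(original)
survivors = {x: shards[x] for x in (2, 4, 6, 8, 9)}   # lose any 4 of the 9 shards
assert decode(survivors) == original
print("recovered from shards", sorted(survivors), "-> 9/5 = 1.8x storage overhead")
```

Any five survivors work: drop a different set of four shards and the recovery still succeeds.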
Erasure coding is genuinely magical math; it's a lot deeper than I'm going to go into here, but what it gets us is fault tolerance at low overhead, which is really cool. You've probably heard about erasure coding before, though; it's a pretty common storage technique, kind of table stakes for a storage system. What I actually want to tell you about are some of the ways erasure coding goes beyond just fault tolerance, and there are two examples in particular. The first is that erasure coding is a really powerful mechanism for development velocity. I've been telling you about different kinds of failure domains: drives can fail, racks can fail, Availability Zones can fail, and our systems are designed to tolerate those failures using techniques like erasure coding. But once you've engineered for failures like these, the cause of the failure doesn't actually matter very much. If a single drive fails because a cable broke, that's the same to the system as a single drive failing because there's a bug in the software running on it; it's the same impact. So we can use our fault tolerance to deploy things faster and gain confidence in changes faster. Maybe we start by deploying a new piece of software on just a single host or a single rack; maybe we get a new hard drive type and deploy it on just one drive, or ten drives, or a hundred drives, a very small footprint of capacity to start with. That way, only a very small piece of the data we store in S3 is exposed to that new fault domain, so if it turns out the software doesn't work, or the hard drive has a problem, that's okay: it's a very small fragment of our capacity, and our fault tolerance means we can tolerate that drive going away, or that piece of software going away. In other words, we get to deploy new software and new hardware with the expectation that it will fail. Now, this doesn't absolve us from thinking about durability and making sure systems are bug-free; our teams are extremely careful about new components, so in practice these failures are rare. But being pessimistic like this helps the teams deploy new ideas with confidence. They don't get paralyzed thinking "but what if this one drive fails?" It's okay for one drive to fail. This kind of production experience, getting systems into production as quickly as we can, is invaluable to developing S3.
Even if we only allow, say, one shard of an object to be exposed to new hardware or new software, at our scale that still means being exposed to exabytes of data and millions of requests per second, and we get that experience without any risk to the durability of our data. So fault tolerance gives us velocity; it helps us deploy things more quickly. This all sounds great, but there's one catch. You might have noticed that earlier I was telling you about shuffle sharding, the idea that we put objects randomly across all the drives. While there are very few of these new platforms in the fleet, maybe a hundred drives out of the tens of millions we have, or maybe one rack of servers running the new software, the probability that any one of your objects is exposed to that capacity is very low. When you do a put of one object, we pick some drives, and it's very unlikely those drives contain the new software or the new hardware. But there's another pesky little bit of math here you might have heard of, called the birthday paradox. The idea of the birthday paradox is that, in this room, it's pretty unlikely that someone shares my birthday (I'm not going to try to find out, because I don't want to tell you what my birthday is). But if you look at the entire room, the chance that there exist two people somewhere in this room who share a birthday is very, very high; in fact, for a room this large it's 100%, that's the pigeonhole principle. So the birthday paradox tells us that even if any one object is very unlikely to be exposed to a fault domain, very unlikely to get unlucky and be assigned these drives, at scale there's going to be at least one object that gets overly exposed. Most objects might have, say, one shard on the new drives or one shard on the new software, but there will be one object somewhere in our system that got super unlucky and happened to be assigned to, say, five of the hundred new drives in our fleet. It's just a question of math. So we have to be a little more careful: we can't just totally randomly spread shards across the fleet anymore. Our shuffle sharding system intentionally incorporates knowledge of these new failure domains; it knows about new software and new hardware, and when it picks a new random assignment, when it does the shuffle sharding, it limits itself to putting only one shard on that new hardware.
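The birthday-paradox effect, that with enough objects someone gets unlucky, is easy to see in a toy simulation. The fleet size, shard count, and number of "new" drives below are made up; the point is just that the worst-case exposure grows with the number of objects even though the average stays tiny.

```python
import random

FLEET, NEW_DRIVES, SHARDS = 100_000, 100, 9   # toy numbers, not S3's
new = set(range(NEW_DRIVES))                  # pretend drives 0..99 are the new hardware

def shards_on_new_hardware() -> int:
    # One object placed by unconstrained shuffle sharding.
    placement = random.sample(range(FLEET), SHARDS)
    return sum(1 for d in placement if d in new)

for objects in (1_000, 1_000_000):
    exposures = [shards_on_new_hardware() for _ in range(objects)]
    avg = sum(exposures) / objects
    print(f"{objects:>9} objects: avg {avg:.4f} shards on new drives, "
          f"worst object {max(exposures)}")

# The average exposure stays near 9 * 100/100,000 = 0.009 shards, but with a
# million objects some object almost certainly lands 2 or more shards on the
# new hardware -- which is why placement adds a guard rail limiting it to 1.
```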
An example of this kind of shuffle sharding with a guard rail is the ShardStore system Seth was telling you about. It's our newest storage node software, the software that runs on all of our drives, and we've been rolling it out over the past few years across S3. Erasure coding is what made that possible and safe at every step, because we started by allowing only one shard of any object to be stored on one of these ShardStore nodes, and we left it like that for a while. The team learned a bunch about the data, got that production experience, tuned ShardStore, improved ShardStore, and then we raised the limit to two shards, then three shards, then an entire Availability Zone's worth of data, and eventually we could remove the guard rail entirely; now ShardStore is everywhere. We were able to change the engine of S3 without you even noticing: we rebuilt the most fundamental part of our system, and we used erasure coding to make that happen. One more thing about erasure coding, and again how it's not just for fault tolerance: erasure coding is also a great lever for increasing performance.
The idea is pretty similar to what I showed you a moment ago with the CRT, the Common Runtime, and how it cancels requests and retries them. Remember that for one of these erasure-coded objects I need five shards to recover the object, but it doesn't matter which five. So I can pick five of the shards, and if it turns out one of them is slow for some reason, maybe that drive is overloaded because you're contending with other customers, that's okay: we can cancel that request because it's going too slowly and try a different shard instead. Reading an extra shard, combined with shuffle sharding, helps us hedge against the tail latency that comes from the slowest shard. So again, fault tolerance is more than just fault tolerance: it's also for velocity, and it's also for performance, which is actually pretty cool.
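Here's a compact asyncio sketch of that k-of-n read pattern: request one more shard than you strictly need and decode as soon as any five have arrived, dropping the stragglers. It's an illustration under assumed shard-fetch behavior, not S3's read path.

```python
import asyncio
import random

K, N = 5, 9   # need any 5 of 9 shards to reconstruct the object

async def fetch_shard(i: int) -> str:
    # Stand-in for reading shard i from its drive: occasionally very slow.
    await asyncio.sleep(random.choice([0.02] * 9 + [1.0]))
    return f"shard-{i}"

async def read_object() -> list[str]:
    # Request one extra shard beyond the minimum as a hedge against a slow drive.
    tasks = [asyncio.create_task(fetch_shard(i)) for i in range(K + 1)]
    done = []
    for finished in asyncio.as_completed(tasks):
        done.append(await finished)
        if len(done) == K:            # enough shards to decode; drop the rest
            for t in tasks:
                t.cancel()
            return done
    return done

print(asyncio.run(read_object()))     # five shards, rarely delayed by the slow one
```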
The bottom line here is that the S3 team spends a lot of time planning for fault tolerance. It's not just about having redundancy in the system: we work really hard to make sure we can detect failures and respond to them quickly, and we design redundancy schemes like erasure coding around the ability to do exactly that, to detect failures and recover from them. As an engineer, being fault tolerant can sometimes feel like a burden; it would be so much easier to build systems that just didn't fail. But they do fail, and at scale, being fault tolerant is actually a huge enabler: our teams can move quickly, and move with confidence, precisely because our systems are resilient to failure. So that's pretty much all I wanted to tell you about today. I hope this has given you a glimpse of how we think about building a durable, elastic storage system, and how S3's scale works to our advantage and ultimately to your advantage as well. As always with our storage teams, the tenet we live by is that we are happiest if you are never thinking about us: storage systems should just work, and you shouldn't have to think about S3 at all. Hopefully I've given you a sense of how our teams bring that idea to life, and maybe a few ideas you can take home; you'd be surprised how often opportunities for shuffle sharding or the power of two random choices show up in systems once you start looking for them. And even if you got none of that, I hope you enjoyed geeking out with us a little about some of the cool things we do with S3. Thanks so much for being here today, and enjoy the rest of re:Invent. [Applause]