All right, time to get started. It is a huge pleasure to have David Bacon from Google give today's lecture on Spanner, and it looks like we're going to find out not just about all the fun things in Spanner, including its big innovation with TrueTime, but also some of the newer things that are happening in Spanner. David, of course, has a rich legacy in technology: he got his PhD from Berkeley working on optimizing virtual functions, he has published over 80 papers, holds 29 patents, and worked at IBM. He's an ACM Fellow for his work on real-time garbage collection, and right now he leads Google's storage engine teams, which are responsible for over 70% of the total fleet cost of Spanner. So Spanner is obviously of huge importance to Google as infrastructure, not just in Google Cloud but also in Google's core business, because F1, which is the revenue maker for Google, runs on top of Spanner. So without much ado, David, the floor is yours. Go for it.

Great, thank you very much for having me. It's a pleasure to be here.
I'm sorry I couldn't be there in person. As I mentioned earlier, the talk has a lot of material; we're scheduled for an hour, but I can stay for half an hour afterwards if people want to ask questions, and I'd be more than happy to do that.

So, I joined Google almost exactly ten years ago, coming to work on Spanner, which I thought I would work on for a year or two. It turns out it has held my interest and fascination far longer than that.

So what is Spanner? Fundamentally, Spanner is a SQL database, so you can do the sorts of SQL operations that I'm sure you're all familiar with by now. This would be a query to find the count of all of the artists in our database whose names start with the letter M. Spanner supports external consistency, and I'll talk a little bit later about how that strong consistency was fundamental in Spanner's becoming the default storage system for transactional data inside of Google.
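As a rough illustration of the kind of count query on the slide — a minimal sketch using the public Cloud Spanner Python client as a stand-in, with hypothetical instance, database, and table names — it might look like this:

```python
# Minimal sketch using the public Cloud Spanner Python client.
# The instance, database, and table names here are hypothetical.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("demo-instance")
database = instance.database("music")

# Count all artists whose names start with the letter M.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT COUNT(*) FROM Artists WHERE Name LIKE 'M%'"
    )
    for row in rows:
        print("Artists starting with M:", row[0])
```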
Next slide. So Spanner is geographically replicated at planet scale. This is an example snapshot that I took of a real database from our visualization tool. Typically we have three or five replicas; in this case we have three replicas that are relatively close to each other in the US, and that gives us a quorum, so that most transactions can complete relatively quickly with the local quorum, without having to go transatlantic. But we also have multiple replicas in Europe, which gives better latency there. Next slide.

So Spanner is a classic horizontally scalable system: we run anything from a fraction of a server to thousands of servers in each replica that you saw in the previous picture. Next slide.

This is an example of the kind of performance we get with Spanner. Zanzibar is Google's global access control system — when you are, for instance, sharing Google Docs, the access control all runs through Zanzibar — and you can see here that we're getting a 95th-percentile latency better than 10 milliseconds and a 99.9th-percentile latency below 100 milliseconds, and this is for a distributed, full planet-scale system. Zanzibar is unusual in that it actually has far more replicas: it has replicas that only serve data and don't take writes, which lets us provide even lower latency, especially in places like South America and Australia. Next slide.
David, a quick question: when you consider the number of replicas and the placement of the replicas — you talked briefly in the previous slide about getting a quorum on a continent — do you also have to worry about regulations in the countries where data is allowed to be stored, or is that not important when you choose the placement of the replicas?

Yeah, we do, and one of the things that has driven Spanner's adoption within Google is that we provide centralized support for that kind of thing: you can actually store rows within the same database with different geographic placements. So I like to say that Spanner is Google's complexity sink — things like access control and privacy and data residency all get supported in Spanner, and then all of the systems that are built on top of it don't have to implement that themselves, and that's a huge savings in system complexity for the company as a whole.

Great, thank you.

Yeah. Next slide. One of Spanner's big claims to fame is our availability. You can see here that over the course of three years on Zanzibar, we actually maintained our availability significantly above five nines — that's something we are constantly focused on and quite proud of. Next slide, please.

All right, so let's talk about scale, because that's always kind of what people think about when they think about Google. Next slide. At this point Spanner stores about 15 exabytes of serving data — an exabyte is, what is it, a million terabytes — and we're running about five billion queries per second. Next slide.
So what makes Spanner so big? You mentioned F1: originally Spanner was designed to support F1, which runs ads — you can see it up there. But what has happened over the last decade is that, as we became able to support global transactional consistency at scale, more and more applications and systems migrated onto Spanner, or were built new on Spanner, because of the recognition that it provides a huge amount of leverage in terms of simplicity. This is part of the complexity sink: it used to be that each application had its own protocols for dealing with distributed consistency, and now Spanner just takes care of that. So pretty much all of our billion-plus-user products now run on top of Spanner.

David, a quick question over here: when you take one of these applications — pick Google Docs as an example — does this mean the actual contents of the Google Docs that users are creating are also inside Spanner, and under the transactional umbrella of Spanner? Because I think Google Docs also has a CRDT component to it, so can you map that out a little better, for maybe Google Docs?

I have a slide on that, a few slides further on.

Perfect, we'll wait for that.

Okay. Next slide. So in addition to our external products, Spanner also runs the Google Cloud control plane — when you create a cloud virtual machine, all of that is mediated through Spanner. We provide Spanner as a service to external customers, and I also mentioned Zanzibar. And then lots of Google's internal infrastructure, which we call Google3, is built on it, storing things like code deltas and all sorts of things. The last I looked — actually, I should have looked again this morning, but probably two years ago — we had many tens of thousands of databases; it could easily be in the hundreds of thousands by now. Next slide.
So this is trying to get at your question from before. I think, if you were to try to get a sense of Google from a very high-level perspective — how we build infrastructure and the way we do data storage — the way I like to think about it is that Spanner holds Google's metadata, and that metadata is the most reliability- and availability-critical data. It's stored in Spanner, we have three or more complete replicas of the data, and we provide global consistency. So Spanner is used to store the web index, the search history, the text of your Gmail messages, the information on your YouTube watch lists, metadata from your photos.

Now, as you're alluding to, not everything is stored in Spanner, and in fact, in terms of data volume, Blobstore is far, far bigger. Blobstore is where we store Gmail attachments, YouTube videos, photos — I'm actually not totally sure which things in Drive are stored in Blobstore; I suspect slides, and I'm not sure about text documents. For that data we normally have two replicas that are geographically distributed, slightly lower availability, and it's just individual object storage — there's no notion of transactionality there.

And then finally, a large portion of our storage goes to logs, via the Sawmill system, and that has yet lower replication — depending on the application there will be one or two replicas — and yet again slightly lower availability. Over time, logs often go from two replicas to one replica, and depending on the type of log it might also be downsampled as it gets older. Examples of what goes in here are audit trails, so that we can make sure — and prove — that data hasn't been accessed by the wrong people. Also, every time we run a Spanner server it's generating log information to standard out and standard error, and that gets accumulated and kept for quite a long time for debugging purposes. Did that answer your question?

Yes, it did. Thank you.

Sure.

Sorry, is it okay — just on the previous slide, did you have a question? Okay. Sorry, David, go for it.

No problem, I'm going to start. So I think it's really remarkable —

Sorry, David, we lost you there for a second.

Yeah, it said you muted me.

Oh, I was trying to mute someone else here. Sorry, go for it.

No problem.
So I think it's really remarkable how SQL as a programming abstraction has survived over so many decades and such a massive increase in scale. I talked to some of the people who worked on the original System R, and at a guess we're talking about maybe 60 megabytes of data running at something on the order of 20 queries per second. So that's a 250-million-fold increase in QPS and a 250-billion-fold increase in storage; I don't think there's any other abstraction in computer science that has scaled like that. And I found it funny that the example in the original System R paper was a query to find underpaid programmers — I'm sure they were trying to send IBM a message. Next slide. And in fact, for all this increase in productivity, underpaid programmers have only gone up by a factor of 10x. Next slide.
So I think it's also really interesting that we've kind of come full circle. We started with hierarchical databases, which didn't have much more structure than a Unix file system; we went to SQL databases with ACID consistency, and those all worked essentially on a single site, even though you had IBM building systems with tightly coupled multiprocessors to scale up the amount of data. Then with the internet there was a real discontinuity, where we went from tightly coupled single-site systems to suddenly trying to do things at planet scale. That gave rise to NoSQL — pioneered by Bigtable at Google — which backed off on our abstractions, providing only weak consistency and key-value stores. And then finally, in Spanner, we've managed to solve those problems of scale and get back to where we were in the 70s in terms of consistency and the high level of abstraction we're able to program with. Next slide.

So if you look at this trend, we've only got 66x growth to get to a zettabyte of data and 200x growth to get to a trillion QPS. As I think about the evolution of our system going forward, this is kind of what I'm looking at. Exactly when we get there — 2030, 2033 — I think that's well within our fairly near-term future. Next slide.

All right, so let's talk a little bit about how Spanner scales. Next slide.
So here's an example of a database definition for a table called Songs, which holds information about the audio on an album. This is of course simplified, but it's not that far from the kind of schema that gets used in something like Google Play. I'm a jazz fan, so you can see there are all of these different jazz songs, but then at the very bottom we have "Gangnam Style," which is the song that quote-unquote broke the internet, because we used to use 32-bit counters for watch counts in YouTube. Next slide.

So basically what we do in Spanner is take key ranges and shard them over machines. Here I'm showing, in some cases, single rows; really each shard is a fairly large key range by itself, but it's enough to distribute the workload pretty well. So we have this database, and we've sharded it out over three different machines. Next slide.

And now if people suddenly start listening to "Gangnam Style," that row is going to get very hot, and what happens then is that Spanner will automatically split up that shard and assign it somewhere else, to do dynamic load balancing. We also do merging when we discover that shards are no longer hot. Next slide.
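Purely as an illustration of the idea — not Spanner's actual policy or code, and with thresholds, names, and data structures invented for the example — load-based splitting of a hot key range might be sketched like this:

```python
# Illustrative sketch only: a toy load-based splitter for key-range shards.
# Thresholds and policy here are invented and are not Spanner's implementation.
from dataclasses import dataclass, field

@dataclass
class Shard:
    start_key: str                      # inclusive
    end_key: str                        # exclusive
    qps_by_key: dict = field(default_factory=dict)

    def record_read(self, key: str):
        self.qps_by_key[key] = self.qps_by_key.get(key, 0) + 1

    def total_qps(self) -> int:
        return sum(self.qps_by_key.values())

SPLIT_THRESHOLD_QPS = 10_000

def maybe_split(shard: Shard):
    """Split a hot shard roughly in the middle of its observed keys."""
    if shard.total_qps() < SPLIT_THRESHOLD_QPS:
        return [shard]
    keys = sorted(shard.qps_by_key)
    split_key = keys[len(keys) // 2]    # crude midpoint by observed keys
    left = Shard(shard.start_key, split_key)
    right = Shard(split_key, shard.end_key)
    return [left, right]                # each half can now move to a different server
```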
One of the key things that we do in Spanner is interleaving. Basically, you can have the Albums table and the Songs table, and you can use that clause at the bottom, INTERLEAVE IN PARENT Albums, and what that will do is store the information on disk with all of the songs for an album coming immediately after the parent album key. Next slide.

A quick question on this, David: is that interleaving done just across tablets in the underlying system, or within a given tablet will you actually have rows from these two tables interleaved at that granularity? What's the granularity of that interleaving on the servers?

It's on a per-row basis in the parent table, so the album row of the parent table will immediately be followed, in the same storage block, by the songs of that album. This is non-standard SQL, and basically it lets you store data in a pre-joined format. The observation is that it's highly frequent for there to be a dominant access locality in the system, and so this is pretty fundamental to letting us scale up all of these different applications: Gmail interleaves your messages with your user ID, and so on — it's a repeated pattern.
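A minimal sketch of what such a schema can look like in the public Cloud Spanner DDL — table and column names here are hypothetical simplifications of the slide's schema, submitted via the Python client:

```python
# Minimal sketch, assuming the public Cloud Spanner Python client and DDL syntax;
# table and column names are hypothetical simplifications of the slide's schema.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("demo-instance").database("music")

operation = database.update_ddl([
    """CREATE TABLE Albums (
         AlbumId   INT64 NOT NULL,
         Title     STRING(MAX)
       ) PRIMARY KEY (AlbumId)""",
    # Songs rows are stored physically next to their parent Albums row.
    """CREATE TABLE Songs (
         AlbumId   INT64 NOT NULL,
         TrackId   INT64 NOT NULL,
         SongName  STRING(MAX)
       ) PRIMARY KEY (AlbumId, TrackId),
         INTERLEAVE IN PARENT Albums ON DELETE CASCADE""",
])
operation.result()  # wait for the schema change to complete
```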
Great, thank you.

Next slide. So here I tried to diagram how that looks. You see the interleaving; in some cases — like with the All Blues album by Miles Davis — we might actually wind up splitting things, but for the most part, if you read the parent row, it's very likely that the information in the child rows that you need is going to be in the same block, in which case it's already in RAM, or it'll be in the same chunk on the file server, in which case it'll already be in the buffer cache of the remote machine that you fetched from. So this gives locality benefits of the kind that every scalable system needs. Next slide.

Okay, so I heard that you folks were very curious about TrueTime.
It's not my area of expertise, but I spoke to Peter Hochschild, who's one of the inventors of TrueTime. One of the things that's really interesting about Spanner is that TrueTime is really fundamental to the way Spanner works, but when the Spanner project was started, that wasn't the case. The designers were struggling with how to provide a good global consistency model, and it just happened that the TrueTime developers were sitting very close to the Spanner developers; they got to talking and realized that they had the ingredients of a solution. Next slide.

So TrueTime is fundamentally based on a technology that gives us, to a very high degree of precision, a coordinated global time. In each data center we have these time servers, which use multiple GPS systems to get a coordinated time signal, and then multiple atomic oscillators to act as super-high-resolution timekeepers. So we have multiple oscillators providing redundancy for each other, and multiple GPS nodes, and then the two of them act as checks on each other.

The fundamental thing about TrueTime is that it doesn't tell you the time; it tells you the range in which the time is. There's this thing called epsilon, which is the potential error, and when you make a call to TrueTime and ask what the time is now, it gives you an interval which bounds both the earliest time and the latest time. That interval is twice epsilon, which generally is no worse than a few milliseconds and often better. Next slide.
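As a toy model of the API shape he's describing — all names and numbers are invented for illustration and this is not the real TrueTime interface:

```python
# Toy model of a TrueTime-style clock: names and numbers are invented for
# illustration and are not the real TrueTime API.
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float   # seconds since epoch
    latest: float     # seconds since epoch

class ToyTrueTime:
    def __init__(self, epsilon_ms: float = 3.0):
        self.epsilon = epsilon_ms / 1000.0   # current uncertainty bound

    def now(self) -> TTInterval:
        t = time.time()                      # stand-in for the GPS/atomic-clock ensemble
        return TTInterval(t - self.epsilon, t + self.epsilon)

tt = ToyTrueTime()
iv = tt.now()
print(f"true time is somewhere in [{iv.earliest}, {iv.latest}] (width = 2 * epsilon)")
```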
So, fundamentally, Spanner assigns timestamps to every operation, and it maintains the property that if a transaction T2 starts after T1 commits, then we guarantee that the timestamp of T2 is greater than the timestamp of T1; in particular, if T2 has some kind of data dependency on T1, we ensure that that invariant holds.

I maybe should have mentioned earlier that Spanner is a multiversion database: instead of keeping just a single version of each value, which is replaced as people update things, we keep multiple versions. That multiversioning, in combination with this timestamp property, means that we can provide globally consistent snapshots of the data at given timestamps, and we can also do historical reads without having to do any locking — once we're a little ways in the past and know that there are no more updates going on in that time range, we no longer need to do locking to read that data. That's another feature which is a huge enabler for these huge systems; we now have multiple systems with over an exabyte of data just in an individual system. Next slide.

David, one quick question on the historical, stale reads. I know the original Spanner paper talked about the read-only query potentially specifying the time, or some time range, at which it wants to read. Is that still how it's done, or is there a different way for reads to pick which version of the data to read?

Yeah — you can read at a specific stale timestamp, you can ask Spanner to pick a stale timestamp for you, and then we also have something which I don't think is in the paper, called a snapshot epoch, where you basically do a stale read and tell Spanner to freeze that data, and now you're guaranteed that that snapshot — that consistent view of the database — won't disappear until you release your snapshot epoch.

Got it, cool.

And that's another feature that gets used quite a bit.

Thank you.
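For the externally visible part of this — stale reads at a chosen staleness or timestamp — a minimal sketch with the public Cloud Spanner Python client might look like the following; instance, database, and table names are hypothetical, and the snapshot-epoch pinning he mentions is internal and has no equivalent here:

```python
# Minimal sketch of stale reads, assuming the public Cloud Spanner Python client.
# Names are hypothetical; snapshot epochs are an internal Spanner feature and
# are not shown.
import datetime
from google.cloud import spanner

database = spanner.Client().instance("demo-instance").database("music")

# Read the data as it was exactly 15 seconds ago.
with database.snapshot(exact_staleness=datetime.timedelta(seconds=15)) as snap:
    for row in snap.execute_sql("SELECT COUNT(*) FROM Songs"):
        print("songs (as of ~15s ago):", row[0])

# Read at a specific past timestamp chosen by the caller.
ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(minutes=5)
with database.snapshot(read_timestamp=ts) as snap:
    for row in snap.execute_sql("SELECT COUNT(*) FROM Songs"):
        print("songs (as of 5 minutes ago):", row[0])
```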
So Peter has this description of Spanner in which TrueTime is 40% of Spanner, and I love the way he's boiled everything down to this very small fragment. Basically — well, we now have multiple concurrency protocols, but originally Spanner did only two-phase locking — the way the transaction commit flow works is: you acquire all the locks, and then you query TrueTime for the interval "now" and select the latest possible value of that interval as the timestamp for the transaction. Then he's reduced decades of work on consistency into this "Paxos goop" thing, which means we do some kind of distributed agreement protocol. And then we do what's called the commit wait, which you can think of as: in a tight loop, we query now() until the timestamp — the value we assigned from the latest bound before we did the distributed agreement — is less than the earliest bound of now(). And then finally we release the locks. The super-high-level proof sketch is that the timestamp names an instant when all the locks were held, and that gives us that ordering. Next slide.
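A toy sketch of that flow — lock, timestamp from the latest bound of now(), distributed agreement (elided), commit wait, unlock. It assumes the invented ToyTrueTime shape from the earlier sketch, and the lock helpers and Paxos callback are placeholders; this illustrates the idea and is not Spanner's code:

```python
# Toy sketch of Spanner's timestamp + commit-wait flow; not real Spanner code.
# Reuses the invented ToyTrueTime/TTInterval shapes from the earlier sketch.
import time

def commit_transaction(tt, locks, do_paxos_round):
    acquire_all(locks)                       # two-phase locking: take every lock first
    try:
        s = tt.now().latest                  # commit timestamp = latest bound of now()
        do_paxos_round()                     # "Paxos goop": distributed agreement on the write
        # Commit wait: don't report the commit until s is guaranteed to be in the
        # past everywhere, i.e. until the earliest bound of now() has passed s.
        while tt.now().earliest <= s:
            time.sleep(0.0001)
        return s                             # s names an instant when all locks were held
    finally:
        release_all(locks)                   # locks are released only after commit wait

# Placeholder helpers so the sketch is self-contained (work with threading.Lock).
def acquire_all(locks):
    for lock in locks:
        lock.acquire()

def release_all(locks):
    for lock in locks:
        lock.release()
```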
Now, the other key thing here is that our TrueTime uncertainty is twice epsilon, and transaction latency is almost always longer than the TrueTime uncertainty. That means that, almost always, by the time the Paxos goop finishes, what I showed as an unbounded loop doing the commit wait will only do one iteration — unless you have the replicas extremely close together, which in most cases we don't allow, because in Spanner we have requirements on distributing things across different failure domains: you can't have two Spanner replicas in data centers close enough that they might get hit by the same hurricane, or something like that. The other thing is that typically you hold locks a lot longer than the TrueTime uncertainty.

That implies, among other things, that if you have a sequence of data-dependent transactions, your throughput on data-dependent information is limited to one over the uncertainty: if epsilon is 5 milliseconds, your uncertainty is 10 milliseconds, and you can do — I guess that's 100 — transactions per second on sequentially chained information. So that is a whirlwind tour through TrueTime.

That's great. David, a quick question related to the locking component: what's the granularity of the lock manager? Is it at the cell level, or at what level do you acquire the locks when you're grabbing all of them at the beginning of the commit protocol?
Let's see — we use multi-level locking, but I don't actually know the lock manager well enough to say that for sure. I can find out and let you know later on.

That sounds great, thank you.

Sure. Okay, next. So let me talk a bit about the storage engine, which is near and dear to my heart. Underneath all of this data sits the storage engine, which is responsible basically for most of the single-node operation of the system. Next slide.

So in 2014, when I joined Spanner, the system looked like this. Spanner was actually developed originally as a fork of the Bigtable code base, and so even though it was providing a SQL abstraction, underneath it all it was using sorted string tables (SSTables), which were invented for Bigtable to store unstructured data. That meant that everything was self-describing and enormously redundant, and then we relied on compression to eliminate — or reduce — that redundancy when we went to store the information on disk. Obviously this was a great expedience in terms of getting the system up and running, but it wasn't really a good match for what Spanner became.

All of that information each replica stores in the Colossus file system, in a series of blocks. One thing to note is that the Colossus file system itself uses replication to provide a high degree of consistency. When I first got to Google I was really confused about why we have replication across nodes and then replication within a data center, but basically, with Colossus, being able to rely on the integrity of the local files makes Spanner much simpler, and it also means that a single block corruption in a single data center doesn't force us to do a remote transaction to execute the query.

And then, above the layer interface, things are managed as a log-structured merge tree, which was also inherited from Bigtable and is used in a lot of large-scale data storage systems. Next slide.
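As a generic illustration of the log-structured-merge idea — not Spanner's or Bigtable's actual code — writes go to an in-memory buffer, which is flushed to immutable sorted runs that are later compacted:

```python
# Generic LSM-tree sketch for illustration; data structures and policy are
# invented and are not Spanner's (or Bigtable's) actual implementation.
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # mutable write buffer (key -> value)
        self.sstables = []          # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for keys, values in self.sstables:       # then newer runs before older ones
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def _flush(self):
        keys = sorted(self.memtable)
        self.sstables.insert(0, (keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for keys, values in reversed(self.sstables):   # oldest first, newest overwrites
            merged.update(zip(keys, values))
        keys = sorted(merged)
        self.sstables = [(keys, [merged[k] for k in keys])]
```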
So with Ressi, which is the name of the new storage engine, we basically — approximately speaking — replaced everything below the layer interface. That includes the file format, compaction, compression, IO management, the block cache, the so-called access methods, and what we call memtables, which are basically the write buffers.

Like I said, when I joined Spanner I thought this was going to be a quick project; a lot of other people thought it would be a quick project too — two or three years, and I think originally two years was the estimate. I think the project started in 2013, and we finished converting everything in early 2022, I believe. So it was a huge engineering effort. Next slide.

Did the data format in the Colossus file system also change, and if yes, how did you do that migration?

Yes, it did — it was a radical change, because the Colossus format was basically just compressed sorted string tables, and Ressi uses a PAX layout with type-dependent encodings and all sorts of things like run-length coding and delta coding, and then we do compression and entropy coding on top of that. I'll talk a little bit more about that later. Next slide.
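As a toy illustration of the kind of type-aware encodings he's describing — run-length and delta coding only; the real formats are of course far more involved than this:

```python
# Toy delta + run-length encoders, purely to illustrate the kinds of
# type-dependent encodings mentioned; not the actual on-disk formats.

def delta_encode(values):
    """Store the first value and then successive differences (good for sorted ints)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

def rle_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

# Example: a sorted ID column delta-codes into mostly small numbers, and a
# low-cardinality column (e.g. a genre tag) run-length-codes very compactly.
ids = [1001, 1002, 1003, 1007, 1008]
assert delta_decode(delta_encode(ids)) == ids
genres = ["jazz", "jazz", "jazz", "pop"]
assert rle_decode(rle_encode(genres)) == genres
```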
Okay, so this is kind of the question you were asking: how do we swap the engines out? Next slide.

It's kind of trite, but it really is the classic "change the engines of the airplane while you're flying" sort of situation: we have to somehow switch everyone over while they're running, with five nines of availability. In the good old days, when I started working on computer systems, in a situation like this we would send out an email on Monday saying, "hey, on Friday at 5:00 we're going to shut everything down, and hopefully we'll be back up and running Monday at 9:00 a.m." — and half the time we weren't up and running again until Wednesday at 9:00 a.m. I would say probably at least half of the engineering effort of the new storage engine went into being able to live-migrate the system. Next slide.

So not only did we have this huge burden of semantic correctness for this incredibly complicated system, but when you're migrating, the performance envelope of the system, to a certain extent, defines what your users consider correctness: if you substantially slow something down, even though it's semantically doing the same operations, your users will be pushed into an outage, because people won't be able to read their Gmail or whatever. Next slide.

I didn't see it advance — I don't know if it has advanced there.

Oh, it was here and it moved over there. But the envelope is now outside the SSTable envelope?

Yes, right — in theory we want everything to be inside, but in fact sometimes, in certain resource dimensions, it goes outside, even though in other resource dimensions it's better. Next slide.
And then, as I said, there are tens of thousands of databases, each of which interacts with the new system in unique ways. When we started out, even though in aggregate we took almost the same amount of space as the SSTable encoding, there were databases where we were 60% smaller and databases where we were 40% bigger — and of course the ones that got smaller are silently happy, and the ones that got bigger are very vocally unhappy. Next slide.

So there's a real issue of velocity here: how fast can we develop this system while we're doing it on a system that is mission-critical, not just for Google but for literally billions of people all over the world? If we took down Gmail, that would be a disaster for many people's businesses and personal lives. I very distinctly remember, after we had our first production outage — or, in Google parlance, an OMG — a guy on my team said, "oh, I guess the days when we can innovate and change the data structures are over." That hit me really hard; I thought, we absolutely cannot get stuck in that situation — that would be a disaster. Next slide.

We also had to really struggle with privacy. We're trying to figure out how to debug our representation and improve our compression algorithms, and typically the way you would do that is: oh, this database has gotten too big, let's go look at the data and see what the patterns are. Well, we can't do that — we're not allowed to look at people's data. Next slide.

And then you've also got right-to-be-forgotten, which means we have this thing called Wipeout. Paradoxically, even though our job is to make sure that data is never lost, we also have to make sure that it gets lost quickly if someone asks us to get rid of it, and those two design points really conflict with each other a lot. Next slide.

And then we have data residency, which we talked about earlier. Even if I get permission to look at some data — say it's a database that we know is storing public information — I still might not be able to copy the data over to my local machine to do debugging; maybe I have to call someone up on the phone in Europe and ask them to issue commands on the data, or something crazy like that. Next slide.
So we quickly realized that this is really an interesting experimental computer science problem. We want to measure the impact of changes we make to the system, and fundamentally we need to do that on real data; we have to do it with a pretty quick turnaround time; we have to do it at production scale; but somehow we have to do it without creating any risk to production — there's a surprising number of OMGs caused by experiments — and it has to be statistically accurate and privacy-preserving. Next slide.

So we started out by building an experiment framework that operated on a single tablet — a tablet is a slice of the database. We wouldn't ourselves look at the tablet; we'd have a specification of the experiment, read the original tablet, transcode it, and then store statistics about whether we got bigger or smaller. Next slide. We have different experiments to evaluate compression, compaction speed, data integrity. Next slide.

And then, to satisfy all of these requirements, we have to be fully isolated from production. We made sure there's no way for this system to write into production, that it continues to work if a tablet gets deleted out from under it, that it stores data either in RAM or encrypted on disk without using any real quota, and that it runs at best effort, so it can always be preempted. Next slide. And then there are more things to make sure we don't violate all sorts of requirements.
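A toy sketch of that single-tablet experiment shape — read a tablet, transcode it under a candidate format, record only aggregate statistics, and never write anything back. Every name and interface here is invented for illustration:

```python
# Toy sketch of a read-only transcoding experiment; every name and interface
# here is invented for illustration and is not Spanner's experiment framework.
import zlib
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    tablet_id: str
    old_bytes: int
    new_bytes: int

    @property
    def ratio(self) -> float:
        return self.new_bytes / self.old_bytes if self.old_bytes else 1.0

def run_tablet_experiment(tablet, candidate_encoder) -> ExperimentResult:
    """Transcode one tablet in memory and return size statistics only.

    The tablet is never modified; humans see aggregate numbers, not row data.
    """
    old_bytes = 0
    new_bytes = 0
    for block in tablet.read_blocks():          # assumed read-only iterator
        old_bytes += len(block)
        new_bytes += len(candidate_encoder(block))
    return ExperimentResult(tablet.id, old_bytes, new_bytes)

# Minimal fake tablet so the sketch runs end to end.
class FakeTablet:
    def __init__(self, tablet_id, blocks):
        self.id = tablet_id
        self._blocks = blocks
    def read_blocks(self):
        return iter(self._blocks)

result = run_tablet_experiment(FakeTablet("t1", [b"x" * 1000, b"y" * 500]),
                               candidate_encoder=zlib.compress)
print(result.tablet_id, result.ratio)
```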
I'm trying to rush through here, since we're getting low on time. Next slide. So then we built something that uses Google's Flume infrastructure to do this on all of the tablets in a zone — this is for a single database. Next slide. And then we expanded that to something that can launch experiments in all zones where Spanner lives, and now we routinely run experiments on something like a 5% sample of our full 15 exabytes of data — which, when I think back to some of the experimental work I did with tiny systems and tiny data sets, still kind of blows my mind. Next slide. This is just an actual screenshot of the kind of information we get; let's skip that.

So, can I keep going, or do you want to take a —

Yep, you can keep going, David; you have about 10 more minutes. Go for it.

All right. So I wanted to talk about some open research problems — hopefully you guys will solve these for us.
I described this experiment framework, and it's great and everything, but Spanner is a highly stateful system, and our experiment framework — fundamentally, since we don't write anything — can't model stateful updates. So figuring out how you might do mutating-update experiments on a massive-scale database in a safe way would be a hugely impactful result. We're playing around with some ideas, but fundamentally we just don't know how to solve this problem. Next slide.

As I mentioned, we have databases that have gone beyond exabyte scale, and users — following this encapsulating-complexity principle of Spanner — are doing things which conceptually they would like to do as single transactions over a full database of that size. Maybe you want to get rid of URLs from your web index that are more than 30 days old, or you want to cut the price in half for frequent customers. Once again, we've played with some ideas around this, and the leverage we have with multiversioning, I think, potentially makes some things like this plausible to consider, but once again, no answers for this yet. Next slide.

So David, just to make sure: the challenge is that you have these transactions updating a large number of objects, so when you get to commit time and go to grab all the locks, some of them might not be available, so you're constantly aborting these transactions with large numbers of mutations and they never finish?

Right — basically, you're trying to update a trillion rows while people are updating them, so the chance of an abort is essentially 100%.
Yep, makes sense.

Another big issue for us is silent data corruption. I don't know if you've read any of the papers that I cited down there, but at scale we can no longer rely on systems-on-chip to do what they're supposed to do. Traditionally people have thought about this as memory corruption errors, but what we have found is that we are actually getting corruptions inside the CPU, so it's performing incorrect operations — like incorrectly adding two numbers, or jumping to the wrong address, or all sorts of things. We estimate that maybe one in a thousand machines sometimes does this. They're super-low-probability events, but at our scale we are seeing this multiple times per week, and so we've had to invest a lot of effort into making the system robust to any computation being incorrect. We're doing this at the system level with checksums on blocks, and at the library level by implementing various forms of checked objects and variables, and we're currently working on pushing some of these things all the way down into the compiler. I think this is another fruitful area for research. Next slide.
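As a small illustration of the block-level defense he mentions — a checksum attached when a block is written and verified on every read — with everything here invented for the example:

```python
# Toy end-to-end block checksum, illustrating the idea of catching silent
# corruption on read; this is not Spanner's actual mechanism.
import zlib

def seal_block(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload so corruption can be detected later."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def open_block(sealed: bytes) -> bytes:
    """Verify the checksum before trusting the payload."""
    stored = int.from_bytes(sealed[:4], "big")
    payload = sealed[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("block checksum mismatch: possible silent corruption")
    return payload

block = seal_block(b"row data ...")
assert open_block(block) == b"row data ..."
```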
Distributed debugging and understanding of the system: we are still really in the stone age here. Debugging a system like this, I would really like to have some visual console, something that lets me see dependency flows and how information travels; instead we're using GDB and printf — stone knives and bearskins, for those of you who are fans of the original Star Trek. Next slide.

And then system evolution and migration: as I mentioned, a massive amount of effort went into just doing the migration to the new storage engine. We also have situations where people are migrating from existing databases into Spanner, and I think it would be a really interesting challenge to see how this could be automated. Next slide.

So that's it — I guess I finished with three minutes to spare — and I'd be happy to take any questions.

Great, wonderful. First we want to thank you, David; that was excellent.

Thank you.

Okay, we have time for a few questions, and we can also take some from those who are online. I can start with one. David, when you look through the documentation for Spanner on GCP, there's all this discussion about how you write your transactions and how you have to worry about the number of mutations you might have in a transaction. How do real-life programmers, who are not five-star Google application programmers, figure that out? Or is it just: I write some piece of code against your API, I hit that limit, and then I worry about it? How do they break up something that doesn't fit into the number of mutations you might want to fit into a transaction? It's kind of related to what you said — if I've got these long-running queries that are doing a ton, but maybe not at that scale, what are practical ways in which people break up transactions?
So I'll have to admit upfront — it's kind of weird — I don't get to work with our external customers very much, because there are all of these layers between me and them, whereas internally, if something's going wrong for Gmail, I'll get an instant message, and more instant messages if I don't respond to the first one. We actually have quite high limits on the size of an individual transaction — of course there are limits, and I don't know if I can say how big, but it's quite big. I think it's less a question of the bytes and more a question of how many locks you have to take and how far that spreads out over the system. We have various tools for seeing which are the hot key ranges of the database and which are the most frequent transactions. Ultimately, it is maybe the least solved problem to tell people how a particular choice is affecting the system. For things written in SQL you can see the query plan — there's a query plan visualizer — and it'll tell you how many rows were scanned and returned at each level of the abstract syntax tree for the query, so often that's a good way to go about it. But ultimately there's a massive surface area of features you can use: we have full-text search, we have AI vector similarity search — well beyond just the basics of a database — and that makes it very powerful, but it also means the interactions can get complex.

Great, wonderful. Maybe this will be the final question. Come up over here and ask — you may want to line up here, because the microphone's here, so it'll be easier to ask. You can introduce yourself and then ask the question.
Hi, my name is Vasilis. I just had a quick question about one of the things I thought was sort of funny in the Spanner paper. At the very end, you put in there that a portion of the time all the Spanner engineers spent was just on going over people's schemas — checking them, approving them, and going through that process. Is that something that's still going on as Google keeps scaling, or is that just something people have moved past?

Yeah, that's a great question. With 30,000 or 50,000 or 100,000 databases, we have a massive range of sophistication among our users. We have products like Gmail with entire teams dedicated just to analyzing and optimizing their system, but we also have small systems where the entire thing is maintained by a few people, and they don't have the bandwidth to become Spanner experts. So we have our own internal documentation; often groups — like the team that does the Google apps, Gmail and Drive and so on — have their own internal guidelines for how they want people to use Spanner; and then we also provide schema review consultation. I wouldn't make that same comment today. It's still important, but there's now a lot of institutional knowledge — yes, you have to think about interleaving, you have to think about multi-site transactions, and so on and so forth — so it's certainly become a much more widely understood technology.

Sure, thank you.

Thank you. We'll take one last question — go for it, and do introduce yourself.

Hi David, thank you for the amazing talk. I'm Hong, and my question is: I know Spanner runs with Paxos for distributed replication, and I know another one of Google's services, Kubernetes, runs with Raft instead. Is there a particular reason, and in general, how do you make decisions between two very similar alternatives?

You mean with concurrency control specifically, or as a kind of general engineering question?
I mean, if you have details on this specific decision, that would be cool to hear, but if not, the general answer is fine too.

I wasn't around when the decision to use Paxos was made for Spanner. I think we've looked at some of the improvements in that general research area, but so far we haven't seen a reason to believe we'd get a really major improvement from making a fundamental change. One of the things we do at Google is that we have kind of an internal resource economy, and every resource is priced in terms of engineer time — that's the normalizer. So, for instance, a CPU-year might be worth 5% of one engineer-year, and a petabyte of disk space might be worth a week of engineer time. I have no idea what the actual numbers are — or rather, I have an idea, but I can't use real numbers. What that lets us do is, if we're looking at a bunch of projects, we'll try to estimate: okay, how much time is this going to take, by how many people, and then what kind of improvement will we get in terms of overall efficiency? And then we have teams like Search that can actually put a price on it: every millisecond of improvement in a search query is worth some amount of advertising dollars. But ultimately, the most important decisions are often the ones where you're looking at new functionality, and then you can gather all the data that you can, but ultimately you need to make a decision based on your deep engineering knowledge and your understanding of where the system technology is going and where your external and internal market is going.

Great, thank you.

Awesome. Thank you, everyone — let's give David another round of applause. David, a huge thank you for taking the time; I know your time is super valuable, and hopefully you didn't get a message to go put out a fire while you took this time to talk to us.

Thank you very much.

Thank you — thank you for having me. Bye.