How Notion Scaled to 100 Million Users Without Their Database Exploding

Kiki's Bytes
Video Transcript:
In 2021, Notion's popularity skyrocketed, but its service became unbearably slow: their Postgres database had reached terabytes of volume even when compressed, and it was ready to explode. Faced with losing customers, they needed to solve this issue fast. In this video, let's learn how they managed to solve this crisis.
To understand the root cause, we need to take a look at their unique data model. At its core, everything in Notion is a `block`. A block could be a piece of text, an image, or even an entire page.
Each block is stored as a row in Postgres, with its own unique ID. Blocks can be nested within other blocks to create complex tree-like structures. This structure allows for incredible versatility, but it also means that even a simple document can result in hundreds or thousands of database entries.
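As a rough illustration (the field names here are assumptions, not Notion's actual schema), a block row and its nesting might look something like this:

```typescript
// Illustrative sketch only: field names are assumptions, not Notion's real schema.
type BlockType = "text" | "image" | "page";

interface Block {
  id: string;                          // each block is its own row with a unique ID
  type: BlockType;                     // what kind of content the block holds
  properties: Record<string, unknown>; // e.g. the text of a paragraph
  parentId: string | null;             // nesting: blocks live inside other blocks
  workspaceId: string;                 // later becomes the sharding partition key
}

// Even a tiny two-line page already produces three rows.
const page: Block = { id: "page-1", type: "page", properties: { title: "My doc" }, parentId: null, workspaceId: "ws-42" };
const line1: Block = { id: "blk-1", type: "text", properties: { text: "Hello" }, parentId: "page-1", workspaceId: "ws-42" };
const line2: Block = { id: "blk-2", type: "text", properties: { text: "World" }, parentId: "page-1", workspaceId: "ws-42" };
```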
Like this waifu document that, err, my friend has. Now multiply that by millions of users using Notion. While flexible, the sheer volume of blocks eventually caused users to notice increased latency when requesting page data.
Notion was seeing these delays because their single, monolithic database could no longer handle the load. Their Postgres VACUUM process began to stall consistently. VACUUM is a crucial database maintenance operation that reclaims storage occupied by dead tuples - data that has been updated or deleted but still occupies space.
When VACUUM can't keep up, tables bloat and performance degrades. On top of that, there was an alarming threat of transaction ID wraparound. In Postgres, each transaction is assigned a unique ID. These IDs are finite, and if they're exhausted, the database enters a read-only mode to prevent data loss - imagine a notes app where nothing can be edited or deleted.
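Both failure modes can be watched through Postgres's standard statistics views; as a rough illustration (not something shown in the video), a monitoring check using the node-postgres client might look like this:

```typescript
import { Client } from "pg"; // node-postgres

// Rough monitoring sketch, not Notion's tooling.
async function checkVacuumHealth(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();

  // Dead tuples that VACUUM has not yet reclaimed, per table (table bloat).
  const bloat = await client.query(
    `SELECT relname, n_live_tup, n_dead_tup
       FROM pg_stat_user_tables
      ORDER BY n_dead_tup DESC
      LIMIT 10`
  );
  console.table(bloat.rows);

  // Transaction ID age per database; Postgres forces protective measures
  // (ultimately refusing new writes) as this approaches roughly 2 billion.
  const xidAge = await client.query(
    `SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database`
  );
  console.table(xidAge.rows);

  await client.end();
}
```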
For Notion, this would have been a disaster. Eventually, their database became a ticking time bomb ready to explode at any moment. The first instinct might be to beef up their Postgres instance, like me trying to get big at the gym.
But there are physical limits to how much you can scale a single machine before costs increase exponentially. Moreover, query performance and maintenance processes often degrade well before reaching hardware limits. What did they try next? Yup, you guessed it: horizontal scaling.
Notion decided to shard all the tables linked to the block table via foreign keys. This included workspaces, discussion threads, and comments, keeping related data in the same shard. For the partition key, they chose the workspace ID, since users typically request data for a single workspace at a time.
The new sharded setup needed to handle their existing data and scale to meet projected growth for at least two years. That meant:
1. The instance type needed at least 60K total IOPS. IOPS is the number of read and write operations that can be performed per second.
2. To maintain RDS replication, they set limits of 500 GB per table and 10 TB per physical instance.
After careful consideration, Notion chose to create 32 physical database instances. On each physical machine, they created 15 separate logical shards, each represented as a Postgres schema with its own copies of tables like block, workspace, and comments. That adds up to 480 logical shards across the 32 physical databases. Routing is handled at the application level: the application uses the workspace ID to determine which of the 32 physical databases the data resides on.
Within that physical database, the workspace ID is used again to determine which of the 15 logical shards contains the data.
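The video doesn't show the actual routing code, but the idea can be sketched roughly as follows (the hash function and names are assumptions):

```typescript
import { createHash } from "node:crypto";

const PHYSICAL_DBS = 32;
const SCHEMAS_PER_DB = 15;
const TOTAL_SHARDS = PHYSICAL_DBS * SCHEMAS_PER_DB; // 480 logical shards

// Hash the workspace ID into a stable shard number (the hash choice is illustrative).
function shardForWorkspace(workspaceId: string): number {
  const digest = createHash("sha256").update(workspaceId).digest();
  return digest.readUInt32BE(0) % TOTAL_SHARDS;
}

// Map the logical shard to a physical database and a schema within it.
function routeWorkspace(workspaceId: string): { database: string; schema: string } {
  const shard = shardForWorkspace(workspaceId);
  const dbIndex = Math.floor(shard / SCHEMAS_PER_DB); // which of the 32 physical databases
  const schemaIndex = shard % SCHEMAS_PER_DB;         // which of its 15 logical schemas
  return { database: `postgres-${dbIndex}`, schema: `shard_${schemaIndex}` };
}

// The same workspace ID always lands on the same database and schema,
// so all of a workspace's blocks stay together.
console.log(routeWorkspace("ws-42"));
```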
They use PgBouncer as an intermediary between the application layer and the database. PgBouncer is a lightweight Postgres connection pooler: it reduces the overhead of establishing and tearing down connections by maintaining a pool of active connections that applications can reuse. This is useful wherever frequent connections to the database are required, since it minimizes resource consumption and improves performance. After setting up the new databases, they needed to sync the old data with the new.
There were a few options to consider:
1. Write directly to both the old and new databases at the same time. However, if any write operation failed, this would lead to inconsistent data between the two.
2. Logical replication, using Postgres' built-in publish/subscribe model. This would have been perfect, except that the workspace ID was not yet populated in the old database.
Logical replication offers no way to add the workspace ID to rows as they're being moved. They could have populated this column in the old data first, but that would have increased the load on the old database even more. Ultimately, they decided to write an audit log when saving to the old database and use a catch-up script to populate the new databases, applying the additional schema changes along the way.
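The video doesn't go into implementation details, but the shape of that approach might look roughly like this (table names, the version column, and helpers are all hypothetical); the version check mirrors the record-version comparison described a little later:

```typescript
import { Client } from "pg";

// Hypothetical sketch of the audit-log + catch-up approach; schema names are made up.

// 1. On every save, the application still writes to the old database
//    and also appends an entry to an audit log table.
async function saveBlock(
  oldDb: Client,
  block: { id: string; workspaceId: string; data: unknown; version: number }
): Promise<void> {
  await oldDb.query("BEGIN");
  await oldDb.query(
    `INSERT INTO block (id, data, version) VALUES ($1, $2, $3)
     ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, version = EXCLUDED.version`,
    [block.id, JSON.stringify(block.data), block.version]
  );
  await oldDb.query(
    `INSERT INTO block_audit_log (block_id, workspace_id, changed_at) VALUES ($1, $2, now())`,
    [block.id, block.workspaceId]
  );
  await oldDb.query("COMMIT");
}

// 2. A catch-up script reads the audit log and copies rows into the right shard,
//    filling in workspace_id and only overwriting when the incoming version is newer.
async function catchUp(oldDb: Client, shardFor: (workspaceId: string) => Client): Promise<void> {
  const { rows } = await oldDb.query(
    `SELECT b.id, b.data, b.version, a.workspace_id
       FROM block_audit_log a JOIN block b ON b.id = a.block_id`
  );
  for (const row of rows) {
    const newDb = shardFor(row.workspace_id);
    await newDb.query(
      `INSERT INTO block (id, workspace_id, data, version) VALUES ($1, $2, $3, $4)
       ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, version = EXCLUDED.version
       WHERE block.version < EXCLUDED.version`, // never clobber newer data
      [row.id, row.workspace_id, row.data, row.version]
    );
  }
}
```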
Once new incoming data was being written to the new databases, they began migrating the existing data. This marathon took about three days, powered by a beastly m5.24xlarge instance with 96 CPUs.
When migrating old data to the new databases, they compared record versions to make sure they weren't overwriting more recent data. Finally, they executed a carefully orchestrated switch to the new sharded system, updating their connection pooling layer (PgBouncer) to route queries to the appropriate shards. Things went great for a while.
...until late 2022, when they started noticing some cracks: their trusty 32-shard setup was reaching its limits.
Here's what they were seeing:
1. Some shards were going over 90% CPU utilization during peak traffic.
2. Many shards were approaching full utilization of their provisioned disk bandwidth, or IOPS.
3. They were also reaching connection limits in their PgBouncer setup.
To make matters worse, the new year was approaching, and they historically see huge spikes in signups and usage around that time. Gotta get started on those new year's resolutions. To resolve this, they decided to reshard their 32 databases into 96 databases to improve load distribution.
They provisioned smaller instances for the new shards, hoping to reduce CPU, IOPS, and costs. Recall that each old instance had 15 schemas. For each old instance, 3 new ones would take its place, and each new instance would contain only 5 logical schemas.
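As a worked example (the instance numbering is an assumption), that split boils down to a simple mapping:

```typescript
// Illustrative mapping for the 32 -> 96 split; numbering is an assumption.
const NEW_SCHEMAS_PER_INSTANCE = 5;

// A logical schema keeps its identity; only the physical instance hosting it changes.
function newHomeFor(oldInstance: number, schemaIndex: number): { newInstance: number; schemaIndex: number } {
  // Each old instance k fans out to new instances 3k, 3k+1, and 3k+2.
  const offset = Math.floor(schemaIndex / NEW_SCHEMAS_PER_INSTANCE); // 0, 1, or 2
  return { newInstance: oldInstance * 3 + offset, schemaIndex };
}

// Example: schema 11 on old instance 7 moves to new instance 7 * 3 + 2 = 23.
console.log(newHomeFor(7, 11)); // { newInstance: 23, schemaIndex: 11 }
```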
These new instances were configured with a smaller instance type by design, since each was expected to carry a smaller load. For data synchronization, since there were no SQL schema changes this time, they used Postgres logical replication to continuously apply new changes to the new databases. This involved setting up three Postgres publications on each existing database, with each publication covering five logical schemas.
On the new databases, subscriptions were created to consume one of these three publications, effectively copying over the relevant set of data. One key optimization was delaying index creation until after all the data had been transferred. This reduced the syncing time from three days to just 12 hours.
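The exact commands aren't shown in the video, but Postgres's built-in logical replication primitives give a sense of the shape; here's a hedged sketch (publication, table, and connection names are made up), with index creation deferred until after the initial copy:

```typescript
import { Client } from "pg";

// Hedged sketch: object names are made up; each old instance gets three publications,
// one per group of five schemas, and each new instance subscribes to exactly one.
async function createPublication(oldDb: Client, group: number, schemas: string[]): Promise<void> {
  // Publish every sharded table in this group's five schemas.
  const tables = schemas
    .flatMap((s) => ["block", "workspace", "comments"].map((t) => `${s}.${t}`))
    .join(", ");
  await oldDb.query(`CREATE PUBLICATION pub_group_${group} FOR TABLE ${tables}`);
}

async function createSubscription(newDb: Client, group: number, oldDbConn: string): Promise<void> {
  await newDb.query(
    `CREATE SUBSCRIPTION sub_group_${group}
       CONNECTION '${oldDbConn}'
       PUBLICATION pub_group_${group}`
  );
}

// Key optimization: build indexes only after the initial copy, so the bulk
// load isn't slowed down by index maintenance on every inserted row.
async function createIndexesAfterCopy(newDb: Client, schemas: string[]): Promise<void> {
  for (const s of schemas) {
    await newDb.query(`CREATE INDEX ON ${s}.block (workspace_id)`);
  }
}
```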
During the transition, they initially had 100 PgBouncer instances, each managing 32 entries, one per existing database. Each instance could open up to 6 connections per database, resulting in a maximum of 600 connections per database instance (100 × 6). Their migration plan involved adding 96 new entries to PgBouncer and temporarily pointing them at the 32 existing shards, allowing data to be migrated gradually to the new shards.
As data was written to the old databases, it would replicate to the appropriate new shards. However, testing uncovered a critical issue: since each old shard maps to 3 new shards, they either needed to reduce the number of connections per PgBouncer entry or increase it by 3x. If they increased the connections, each of the 100 PgBouncer instances could open up to 18 connections to each old shard.
That would mean 1,800 connections per shard (100 × 18), which risked overloading the system. On the other hand, reducing the connections per entry wasn't sufficient to handle the traffic, leading to query backlogs. The solution was to shard the PgBouncer cluster itself.
They created four new PgBouncer clusters, each managing 24 databases. This allowed them to:
1. Increase connections per PgBouncer per shard to 8.
2. Limit total connections per Postgres instance to 200 (25 PgBouncer instances per cluster × 8 connections each).
3. Isolate PgBouncer issues to 25% of the fleet.
This approach maintained appropriate connection numbers, prevented overload during the migration, and enhanced system resilience. Before rolling these changes out to production, Notion implemented dark reads for testing:
- Notion added functionality to fetch data from both the new and old databases when requests are made.
- Results are compared for consistency, and any discrepancies are logged.
- To avoid impacting the user experience, they limited the comparison to queries returning up to 5 rows and only sampled a small portion of requests.
- A 1-second delay was introduced before issuing the dark read query, to allow time for replication catch-up.
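Here's a rough sketch of what such a dark read could look like (helper names and the sampling rate are assumptions, not Notion's code):

```typescript
// Hypothetical dark-read sketch; helpers and the sampling rate are assumptions.
const SAMPLE_RATE = 0.01;          // compare only a small portion of requests
const MAX_COMPARED_ROWS = 5;       // skip the comparison for larger result sets
const REPLICATION_DELAY_MS = 1000; // give replication a second to catch up

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchBlocks(
  workspaceId: string,
  queryOld: (ws: string) => Promise<unknown[]>,
  queryNew: (ws: string) => Promise<unknown[]>,
  logDiscrepancy: (ws: string) => void
): Promise<unknown[]> {
  // The old shards keep serving the user; the new shards are only read "in the dark".
  const rows = await queryOld(workspaceId);

  if (rows.length <= MAX_COMPARED_ROWS && Math.random() < SAMPLE_RATE) {
    // Fire and forget, so the user never waits on the comparison.
    void (async () => {
      await sleep(REPLICATION_DELAY_MS);
      const newRows = await queryNew(workspaceId);
      if (JSON.stringify(newRows) !== JSON.stringify(rows)) {
        logDiscrepancy(workspaceId);
      }
    })();
  }
  return rows;
}
```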
The testing showed nearly 100% identical data. They then proceeded with the failover to the new sharded system, which involved:
1. Traffic pause: halted new connections while allowing ongoing queries to complete.
2. Replication check: ensured the new databases were fully caught up.
3. Configuration update:
- Revoked application access to the old databases.
- Updated PgBouncer to point to the new databases.
- Reversed the replication direction, streaming changes from the new databases back to the old ones, just in case.
4. Resume traffic: routed traffic to the new shards.
The resharding project was a significant success for Notion. Key outcomes:
1. Increased capacity.
2. Improved performance: CPU and IOPS utilization decreased dramatically, hovering around 20% during peak traffic compared to the previous 90%.
3. Future-proofing: the new architecture positions Notion to handle continued user growth and increased data demands.
What do you think?
Is Notion fast enough now? Hope you were able to learn something new today. And as always, thank you so much for watching, and see you in the next one!