In 2021, Notion's popularity skyrocketed, but its service became unbearably slow: their Postgres database had grown to terabytes of data even when compressed, and it was ready to explode. Faced with losing customers, they needed to solve this problem fast. In this video, let's learn how they managed to solve this crisis.
To understand the root cause, we need to take a look at their unique data model. At its core, everything in Notion is a `block`. A block could be a piece of text, an image or even an entire page.
Each block is stored as a row in Postgres with its own unique ID. Blocks can be nested within other blocks to create complex tree-like structures. This structure allows for incredible versatility, but it also means that even a simple document can result in hundreds or thousands of database rows.
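To make the model concrete, here is a minimal sketch of what a block might look like in TypeScript. The field names and types are illustrative assumptions, not Notion's actual schema.

```typescript
// Hypothetical shape of a block row; names are illustrative, not Notion's real schema.
interface Block {
  id: string;                          // unique block ID
  type: "text" | "image" | "page";     // what kind of content this block holds
  properties: Record<string, unknown>; // block content, e.g. { text: "Hello" }
  parentId: string | null;             // the block this one is nested under (null for a root page)
  workspaceId: string;                 // the workspace the block belongs to
}

// Even a tiny page is really a tree of many such rows:
const page: Block = { id: "b1", type: "page", properties: { title: "Notes" }, parentId: null, workspaceId: "w1" };
const paragraph: Block = { id: "b2", type: "text", properties: { text: "Hello world" }, parentId: "b1", workspaceId: "w1" };
```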
Like this waifu document that, err, my friend has. Now multiply that by millions of Notion users. The model is flexible, but the sheer volume of rows eventually caused users to notice increased latency when requesting page data.
Notion was seeing delays because their single, monolithic database could no longer handle the load. Their Postgres VACUUM process began to stall consistently. VACUUM is a crucial database maintenance operation that reclaims storage occupied by dead tuples: data that has been updated or deleted but still takes up space.
When VACUUM can't keep up, it leads to bloated tables and degraded performance. On top of that, there was the alarming threat of transaction ID wraparound. In Postgres, each transaction is assigned a unique ID; these IDs are finite, and if they're exhausted, the database enters a read-only mode to prevent data loss. Imagine a note app where nothing can be edited or deleted.
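As a rough illustration (not Notion's actual tooling), these are the kinds of built-in Postgres statistics an operator might watch to spot VACUUM falling behind and wraparound approaching; the connection string is a placeholder.

```typescript
import { Client } from "pg";

// Illustrative health checks for table bloat and transaction ID age.
async function checkVacuumHealth(): Promise<void> {
  const client = new Client({ connectionString: "postgres://user:pass@localhost:5432/notion" });
  await client.connect();

  // Tables with the most dead tuples are the ones VACUUM is failing to keep up with.
  const bloat = await client.query(
    "SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10"
  );
  console.table(bloat.rows);

  // age(datfrozenxid) creeping toward ~2 billion means wraparound protection (read-only mode) is near.
  const xidAge = await client.query(
    "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC"
  );
  console.table(xidAge.rows);

  await client.end();
}

checkVacuumHealth().catch(console.error);
```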
For Notion, this would have been a disaster. Eventually, their database became a ticking time bomb ready to explode at any moment. The first instinct might be to beef up their Postgres instance, like me trying to get big at the gym.
But there are physical limits to how much you can scale a single machine before costs increase exponentially. Moreover, query performance and maintenance processes often degrade well before you hit those hardware limits. So what did they try next? Yup, you guessed it: horizontal scaling.
Notion decided to shard all the tables linked to the block table via foreign keys. This included workspaces, discussion threads, and comments, keeping related data in the same shard. For the partition key, they chose the workspace ID, since users typically request data for a single workspace at a time.
The new sharded setup needed to handle their existing data and scale to meet projected growth for at least two years. That meant two things: 1. The instance types needed at least 60K total IOPS (IOPS is the number of read and write operations that can be performed per second). 2. To maintain RDS replication, they set limits of 500 GB per table and 10 TB per physical instance.
After careful consideration, Notion chose to create 32 physical database instances. On each physical machine, they created 15 separate logical shards, represented as Postgres schemas. Each shard would have its own tables like block, workspace, and comments, for a total of 480 shards across the 32 physical databases. Routing is handled at the application level: to find where a piece of data lives, the application uses the workspace ID to determine which of the 32 physical databases it resides on.
Within that physical database, the workspace ID is used again to determine which of the 15 logical shards contains the data.
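Here is a minimal sketch of that application-level routing, assuming a simple hash of the workspace ID; the hash function, constants, and schema naming are illustrative, not Notion's actual implementation.

```typescript
import { createHash } from "crypto";

const PHYSICAL_DATABASES = 32;   // physical Postgres instances
const SCHEMAS_PER_DATABASE = 15; // logical shards (schemas) per instance
const TOTAL_SHARDS = PHYSICAL_DATABASES * SCHEMAS_PER_DATABASE; // 480

// Map a workspace ID to a stable shard number, then to a physical database
// and a logical schema within it. The hashing scheme is an assumption.
function routeWorkspace(workspaceId: string) {
  const digest = createHash("md5").update(workspaceId).digest();
  const shard = digest.readUInt32BE(0) % TOTAL_SHARDS;
  return {
    database: Math.floor(shard / SCHEMAS_PER_DATABASE), // which of the 32 instances
    schema: `schema_${shard % SCHEMAS_PER_DATABASE}`,   // which of the 15 schemas on it
  };
}

console.log(routeWorkspace("workspace-123")); // e.g. { database: 7, schema: "schema_4" }
```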
They used PgBouncer as an intermediary between the application layer and the databases. PgBouncer is a lightweight Postgres connection pooler: it reduces the overhead of establishing and tearing down connections by maintaining a pool of active connections that applications can reuse. This is especially useful when the application connects to the database frequently, since it minimizes resource consumption and improves performance. After setting up the new databases, they needed to sync the old data into them.
There were a few options to consider: 1. Write to both the old and new databases at the same time. However, if any write operation failed on one side, this would lead to inconsistent data. 2. Logical replication, using Postgres' built-in publish/subscribe model. This would have been perfect, except that the workspace ID was not yet populated in the old database, and this approach doesn't let you add the workspace ID to rows as they're being moved. They could have backfilled the column in the old database first, but that would have increased the load on the old database even more. Ultimately, they decided to write an audit log when saving to the old database and use a catch-up script to populate the new databases, applying the additional schema changes along the way.
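As a hedged illustration of the audit-log idea (not Notion's actual code), the application could append an entry for every write to the old database, which a catch-up script can later replay into the correct shard with the workspace ID filled in. Table and column names here are assumptions.

```typescript
import { Client } from "pg";

// Alongside every write to the old database, record which block changed and
// which workspace it belongs to, so a catch-up script can replay it later.
async function saveBlockWithAudit(oldDb: Client, blockId: string, workspaceId: string, data: object) {
  await oldDb.query("BEGIN");
  await oldDb.query("UPDATE block SET data = $1 WHERE id = $2", [JSON.stringify(data), blockId]);
  await oldDb.query(
    "INSERT INTO block_audit_log (block_id, workspace_id, changed_at) VALUES ($1, $2, now())",
    [blockId, workspaceId]
  );
  await oldDb.query("COMMIT");
}
```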
Once new incoming data was being propagated to the new databases, they began migrating the existing data. This marathon took about three days, powered by a beastly m5.24xlarge instance with 96 CPUs.
When migrating old data to the new databases, they compared record versions to make sure they weren't overwriting more recent data.
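A minimal sketch of that version-aware copy, assuming each row carries a version column; the column names and upsert shape are assumptions.

```typescript
import { Client } from "pg";

// Hypothetical version-aware backfill: only write the old row into the shard
// if the shard doesn't already hold a newer version of the same record.
async function copyBlock(shard: Client, block: { id: string; workspaceId: string; data: string; version: number }) {
  await shard.query(
    `INSERT INTO block (id, workspace_id, data, version)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (id) DO UPDATE
       SET workspace_id = EXCLUDED.workspace_id,
           data = EXCLUDED.data,
           version = EXCLUDED.version
       WHERE block.version < EXCLUDED.version`, // skip rows where the shard already has newer data
    [block.id, block.workspaceId, block.data, block.version]
  );
}
```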
Finally, they executed a carefully orchestrated switchover to the new sharded system, updating their connection pooling layer (PgBouncer) to route queries to the appropriate shards. Things went great for a while... until late 2022, when they started noticing cracks: their trusty 32-shard setup was reaching its limits.
Here's what they were seeing: 1. Some shards were going over 90% CPU utilization during peak traffic. 2. Many shards were approaching full utilization of their provisioned disk bandwidth (IOPS). 3. They were also hitting connection limits in their PgBouncer setup.
To make matters worse, the new year was approaching, and they historically see huge spikes in signups and usage around that time. Gotta get started on those New Year's resolutions. To resolve this, they decided to reshard from 32 databases to 96 databases to improve load distribution.
They provisioned smaller instances for the new shards, hoping to reduce CPU, IOPS, and costs. Recall that each old instance had 15 schemas. Now, for each old instance, 3 new ones would take its place, and each new instance would contain only 5 logical schemas.
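One plausible way to express that one-to-three split, assuming schemas move to new instances contiguously (an assumption, not Notion's documented mapping):

```typescript
// Each old instance held 15 schemas; after resharding, those 15 schemas are
// spread across 3 smaller instances holding 5 schemas each.
const NEW_SCHEMAS_PER_INSTANCE = 5;

// Map (old instance index, schema index 0-14) to the new instance that owns it.
function newInstanceFor(oldInstance: number, schemaIndex: number): number {
  const offset = Math.floor(schemaIndex / NEW_SCHEMAS_PER_INSTANCE); // 0, 1, or 2
  return oldInstance * 3 + offset; // 32 old instances * 3 = 96 new instances
}

console.log(newInstanceFor(0, 7));   // schema 7 of old instance 0   -> new instance 1
console.log(newInstanceFor(31, 14)); // last schema of old instance 31 -> new instance 95
```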
These new instances were configured with smaller instance types by design, since each was expected to carry a smaller load. For data synchronization, since there were no SQL schema changes this time, they used Postgres logical replication to continuously apply new changes to the new databases. This involved setting up three publications on each existing database, with each publication covering five logical schemas. On the new databases, subscriptions were created to consume one of these three publications, effectively copying over the relevant slice of data. One key optimization was delaying index creation until after all the data had been transferred, which reduced the syncing time from three days to just 12 hours.
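A rough sketch of that replication wiring, issued from TypeScript for consistency with the other examples; publication, subscription, schema, and index names, plus the connection string, are placeholders. (Recent Postgres versions support FOR TABLES IN SCHEMA; older versions require listing tables explicitly.)

```typescript
import { Client } from "pg";

async function setUpReplication(oldDb: Client, newDb: Client) {
  // On the existing database: one publication per group of five schemas.
  await oldDb.query(
    "CREATE PUBLICATION pub_group_1 FOR TABLES IN SCHEMA schema_0, schema_1, schema_2, schema_3, schema_4"
  );

  // On the new database: a subscription that consumes one of those publications,
  // copying the existing rows and then streaming new changes.
  await newDb.query(
    `CREATE SUBSCRIPTION sub_group_1
       CONNECTION 'host=old-db-host dbname=notion user=replicator'
       PUBLICATION pub_group_1`
  );

  // Key optimization: once the initial copy has completed, build indexes,
  // so the bulk sync isn't slowed down by index maintenance.
  await newDb.query("CREATE INDEX CONCURRENTLY block_workspace_idx ON schema_0.block (workspace_id)");
}
```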
During the transition, they initially had 100 PgBouncer instances, each managing 32 entries (one per existing shard). Each instance could open up to 6 connections per database, resulting in a maximum of 600 connections per database instance. Their migration plan involved adding 96 new entries to PgBouncer and temporarily pointing them at the 32 existing shards, allowing them to gradually migrate data to the new shards.
As data was written to the old databases, it would replicate to the appropriate new shards. However, testing uncovered a critical issue: since each old shard maps to 3 new shards, they needed to either reduce the number of connections per PgBouncer instance or increase it threefold. If they increased the connections, each of the 100 PgBouncer instances could open up to 18 connections to each old shard. That would mean 1,800 connections per shard, which risked overloading the system. On the other hand, reducing connections per entry wasn't sufficient to handle the traffic, leading to query backlogs. The solution was to shard the PgBouncer cluster itself.
They created four new PgBouncer clusters, each managing 24 databases. This allowed them to: 1. Increase connections per PgBouncer per shard to 8. 2. Limit total connections per Postgres instance to 200 (with the 100 PgBouncer instances split into four clusters of roughly 25, that's about 25 x 8 = 200 connections reaching any given database). 3. Isolate PgBouncer issues to 25% of the fleet. This approach kept connection counts at appropriate levels, prevented overload during the migration, and made the system more resilient. Before rolling these changes out to production, Notion implemented dark reads for testing: the application fetched data from both the new and old databases whenever a request came in, compared the results for consistency, and logged any discrepancies. To avoid impacting the user experience, they limited the comparison to queries returning up to 5 rows and only sampled a small portion of requests, and a 1-second delay was introduced before issuing the dark read query to allow time for replication to catch up. The testing showed nearly 100% identical data.
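For illustration, a dark read along the lines described above might look roughly like this; the sampling rate, query shapes, and comparison logic are assumptions.

```typescript
import { Client } from "pg";

const SAMPLE_RATE = 0.01;          // only a small portion of requests are compared
const MAX_ROWS = 5;                // skip comparison for larger result sets
const REPLICATION_DELAY_MS = 1000; // give replication a second to catch up

// Serve the result from the old shard; for a sampled subset of requests, also
// query the new shard after a delay and log any discrepancy.
async function getBlocks(oldShard: Client, newShard: Client, workspaceId: string) {
  const oldResult = await oldShard.query("SELECT * FROM block WHERE workspace_id = $1", [workspaceId]);

  if (Math.random() < SAMPLE_RATE && oldResult.rows.length <= MAX_ROWS) {
    setTimeout(() => {
      newShard
        .query("SELECT * FROM block WHERE workspace_id = $1", [workspaceId])
        .then((newResult) => {
          if (JSON.stringify(oldResult.rows) !== JSON.stringify(newResult.rows)) {
            console.warn("dark read mismatch for workspace", workspaceId); // discrepancy logged
          }
        })
        .catch(console.error);
    }, REPLICATION_DELAY_MS);
  }

  return oldResult.rows; // users always get the old shard's answer
}
```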
They then proceeded with the failover to the new sharded system, which involved: 1. Traffic pause: halting new connections while allowing ongoing queries to complete. 2. Replication check: ensuring the new databases were fully caught up. 3. Configuration update: revoking application access to the old databases, updating PgBouncer to point to the new databases, and reversing the replication direction to stream changes from the new databases back to the old ones, just in case.
4. Traffic resume: routing traffic to the new shards. The resharding project was a significant success for Notion. Key outcomes: 1. Increased capacity. 2. Improved performance: CPU and IOPS utilization decreased dramatically, hovering around 20% during peak traffic compared to the previous 90%. 3. Future-proofing: the new architecture positions Notion to handle continued user growth and increasing data demands. What do you think?
Is Notion fast enough now? I hope you were able to learn something new today. And as always, thank you so much for watching, and see you in the next one!