Imagine having to store 50 billion pictures and videos on your measly little computer. That’s close to 95 million new photos and videos per day. For regular users like us, there is no possible way we could store even a fraction of that on a single hard drive.
For weird people… well, I won’t bother talking about what those images are. For Instagram, this is reality, with users uploading videos and photos to its servers every second of the day. Now… you are probably thinking: oh well, Instagram has this under control…
right? Well, Instagram looked at their metrics and saw that within 12 months their servers would crash if they didn’t find a better way to store their videos. In this video, we are going to see how Instagram made a 94% improvement to its video storage process and saved itself from a MASSIVE server crash.
If you want more stories like this, make sure you check out my newsletter, The Better Dev. Twice a week, we cover stories like this, open source software, and more. Let’s get started.
The Problem
Now, when you have 2 billion monthly active users… (by the way, that’s like a quarter of the entire Earth) you have to find the best way to make sure your servers keep up with the constant traffic. Ten minutes of downtime is considered a disaster.
After doing some projections, they realized something scary: within 12 months, they wouldn’t have enough computing capacity to provide video uploads for everyone. Do I have to explain why this is bad? No video = no fun = no Instagram.
Now, the way they have their servers set up is extremely complex. Meta has 14 data centers across the world that help load balance all traffic to their applications. It’s complicated, but for a high-level look at a low-level subject, I would highly recommend reading about it. Anyway, back to the storage issue.
Instagram wanted to optimize the servers they were already running rather than getting new ones. Imagine opening up another data center… let’s see how much that costs… ok… yeah, let’s maybe save some cash here.
So they scouted around to see how they could save computing power when a user uploaded a video… and they found a promising optimization.
The Function
This is very common among video apps, and it helps accessibility: when you upload a video, the video service compresses the file into multiple different versions.
They do this because sometimes you are watching on a bad internet connection, or you are only looking at a thumbnail. You wouldn’t want to load a 12K video in a thumbnail. So they run this function to create different versions of the video depending on how you are viewing it.
Ok, let’s look at this function more in depth. Whenever someone uploads a video to Instagram, the server generates many videos of two types: one set uses the H.264 codec and the other uses the AV1 codec. And because developers are weird like that, they went into the function and tested how it performed.
They found that it took 86.17 seconds. With roughly 4 million videos uploaded per hour, they are looking at an EXTREMELY long computing backlog.
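Just to put that in perspective, here’s my own back-of-the-envelope math (not Instagram’s numbers, and it assumes every upload costs the full 86.17 seconds of encoding):

```python
# Rough back-of-the-envelope math. Assumption: every upload costs ~86.17 s of encoding,
# which is the figure from their test; real uploads vary in length and resolution.
SECONDS_PER_ENCODE = 86.17      # measured encoding time for one upload in their test
UPLOADS_PER_HOUR = 4_000_000    # rough upload rate mentioned above

cpu_seconds_per_hour = SECONDS_PER_ENCODE * UPLOADS_PER_HOUR
cpu_hours_per_hour = cpu_seconds_per_hour / 3600

print(f"{cpu_seconds_per_hour:,.0f} CPU-seconds of encoding work per hour of uploads")
print(f"That is roughly {cpu_hours_per_hour:,.0f} CPU-hours of work for every wall-clock hour")
# ≈ 344,680,000 CPU-seconds, or roughly 95,700 CPU-hours of work per hour.
```

In other words, on those rough assumptions you would need tens of thousands of CPU cores running flat out just to keep up with this one function.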
Yikes… let’s look at these two video types.
Video Types
So, as mentioned before, we are dealing with two video codecs: one is H.264 and the other is AV1.
Please stay. I know talking about video compression might be boring, but I promise I will make it fun. Let’s look at our boy H.264. We LOVE H.264. Almost all videos you see online are encoded with H.264. It’s also been around forever, with its first release back in 2003. Now, if you really wanna nerd out and tell all the people at a party how video compression works, I won’t explain it in detail here.
However, the SparkNotes version is that it works by analyzing the difference between one frame and the next and throwing away the parts that stay the same, which saves memory. It also removes information in a frame that you won’t notice is missing and stores the rest more efficiently. For a detailed look, go to Leo’s channel and watch this video. It’s awesome. …but 20 years of development comes with a massive amount of baggage. …it’s ok, H.264, we love you and we will comfort you.
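To build a little intuition for that frame-difference idea, here’s a toy sketch. To be clear, this is nothing like a real encoder (H.264 uses motion estimation, transforms, and entropy coding); it just shows the “only store what changed” concept:

```python
import numpy as np

def frame_delta(prev_frame: np.ndarray, next_frame: np.ndarray, threshold: int = 0):
    """Return the positions and values of pixels that differ between two frames."""
    changed = np.abs(next_frame.astype(int) - prev_frame.astype(int)) > threshold
    return np.argwhere(changed), next_frame[changed]

def reconstruct(prev_frame: np.ndarray, positions, values) -> np.ndarray:
    """Rebuild the next frame from the previous frame plus only the stored changes."""
    frame = prev_frame.copy()
    frame[tuple(positions.T)] = values
    return frame

# Two mostly identical 4x4 grayscale "frames" where only one pixel changes.
prev_frame = np.zeros((4, 4), dtype=np.uint8)
next_frame = prev_frame.copy()
next_frame[2, 3] = 255

positions, values = frame_delta(prev_frame, next_frame)
assert np.array_equal(reconstruct(prev_frame, positions, values), next_frame)
print(f"Stored {len(values)} changed pixel(s) instead of {next_frame.size}")
```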
So some of the biggest tech companies put research into building a new video compression standard that could outperform H.264. The Alliance for Open Media (founded in 2015 by Amazon, Facebook, Google, Microsoft, and Netflix, among others) released AOMedia Video 1, also known as AV1. This is incredible: you can get roughly 30% smaller file sizes and better quality. The best part is that it’s open source and royalty free. It’s a win-win. However, the way technology goes, it’s really new, so it takes a while for devices and apps to adopt it.
My grandma is still asking me to help her with her webcam… So, long story short, they were generating two types of encodings:
1. H.264, which plays on practically every device but at lower quality. This was eating 80% of the available resources.
2. AV1, which is newer and better but not supported on all devices. They wanted to generate more of this, since it’s much more efficient.
Ok… you didn’t fall asleep, did you?
Digging Deeper
Ok, 80% of resources… that’s a lot of resources… let’s investigate. Let’s look at the minimum functionality encodings, aka the H.264 ones (which we love, btw… I am not being a hater). Instagram creates two classes of minimum functionality encodings:
1. Basic adaptive bitrate (ABR) encodings
2. Progressive encodings
The basic adaptive bitrate encodings are the format Instagram users hit most often. You know when you are watching a video and your internet gets wonky, so you temporarily get a blurry version of it? That technique is called adaptive bitrate streaming. These people are smart, you know. The progressive encodings aren’t used nearly as often, but they are needed to support old versions of the app.
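If adaptive bitrate streaming is new to you, the player-side idea looks roughly like this. The rendition ladder and the selection rule below are made up for illustration; real ABR players (the kind behind HLS/DASH) also factor in buffer levels and a lot more:

```python
# Simplified sketch of adaptive bitrate (ABR) selection on the player side.
RENDITIONS = [  # ordered highest quality first; bitrates are illustrative only
    {"name": "1080p", "bitrate_kbps": 4500},
    {"name": "720p",  "bitrate_kbps": 2500},
    {"name": "480p",  "bitrate_kbps": 1200},
    {"name": "360p",  "bitrate_kbps": 600},
]

def pick_rendition(measured_bandwidth_kbps: float, safety_factor: float = 0.8) -> dict:
    """Pick the best rendition that fits comfortably within the measured bandwidth."""
    budget = measured_bandwidth_kbps * safety_factor
    for rendition in RENDITIONS:
        if rendition["bitrate_kbps"] <= budget:
            return rendition
    return RENDITIONS[-1]  # fall back to the lowest quality instead of stalling

print(pick_rendition(6000))  # healthy connection -> 1080p
print(pick_rendition(900))   # wonky connection  -> 360p (the "blurry version")
```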
Anyway, it’s always the old tech that has to ruin the fun for everyone. When they tested uploading a 23-second 720p video to the servers, they found it took 86 seconds to process. Yikes.
However, an oversight was discovered. The engineers noticed that the settings of the two encodings were very similar. So a lightbulb went off.
What if they could replace the adaptive bitrate encodings with the progressive encodings’ video frames by repackaging those frames into an ABR-capable file structure? Ok… that sounds complicated. Basically, if all of the settings are the same, why not reuse the humble progressive encoding and just repackage it, so the other encoding gets a head start instead of being created from scratch?
That way, the two encodings don’t have to chug along at the same time. It’s almost like instead of having them race against each other, you have them pass the baton. I think that analogy was clever.
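Conceptually, “repackage, don’t re-encode” looks something like the sketch below. This is not Instagram’s actual pipeline (which is internal to Meta); it just shows the general idea using ffmpeg’s stream copy, where already-encoded H.264 frames are rewrapped into a fragmented, ABR-friendly container without touching the pixels:

```python
import subprocess

def repackage_for_abr(progressive_mp4: str, fragmented_mp4: str) -> None:
    """Rewrap an already-encoded progressive MP4 into a fragmented MP4 without re-encoding.

    "-c copy" copies the existing streams as-is, so the expensive encode step is
    skipped entirely; only the container changes. Illustrative only — Instagram's
    real packaging step is not described in this much detail.
    """
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", progressive_mp4,
            "-c", "copy",                             # no re-encode: reuse the frames
            "-movflags", "frag_keyframe+empty_moov",  # fragmented MP4, friendlier for ABR delivery
            fragmented_mp4,
        ],
        check=True,
    )

# repackage_for_abr("progressive_720p.mp4", "abr_ready_720p.mp4")
```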
I mean… it’s at least worth trying. So they did. And once they ran their initial tests… 0.36 seconds. Wait… they must have done that wrong… nope. 0.36 seconds. Ok… let’s review what they did here. Before, they would upload a file and, at the same time, transcode two different video files.
If you have ever rendered a video on your computer, you know this is a hefty process, so the poor servers were chugging along for ages. Instead, they switched it so the server renders the older progressive version first, then repackages most of its settings and frames, so the other encoding doesn’t have to be rendered all the way from scratch. This frees up compute for the advanced encoding I mentioned at the beginning of this video… remember… AV1… please don’t make me talk about video encoding again. Now, I want to say that this seems simple… but when everything you work on is supposed to scale to BILLIONS of users, it’s an easy oversight to make.
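Putting the before and after side by side, the shape of the change is roughly this. The helper names are made up and the bodies are placeholders; only the timings in the comments come from the article:

```python
# Rough shape of the change, with hypothetical helper names. Not Meta's actual code.

def encode(source: str, codec: str, output: str):
    """Placeholder for a full transcode — the expensive part."""
    ...

def repackage(source: str, output: str):
    """Placeholder for rewrapping already-encoded frames into an ABR-capable container — cheap."""
    ...

def old_pipeline(upload: str) -> None:
    # Two separate H.264 outputs were produced in full for every upload
    # (about 86 seconds for the 23 s, 720p test video).
    encode(upload, codec="h264", output="progressive.mp4")
    encode(upload, codec="h264", output="abr.mp4")

def new_pipeline(upload: str) -> None:
    # One full transcode, then a cheap repackage (about 0.36 seconds in their tests),
    # which frees up compute for the pricier AV1 encodings.
    encode(upload, codec="h264", output="progressive.mp4")
    repackage("progressive.mp4", output="abr.mp4")
```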
Oh, speaking of scaling to billions…
Testing if Billions Can Handle This
A lot of people want to talk crap about programmers at big tech companies for not fixing a minor bug fast, but the amount of hurdles you have to clear to get your code into a production environment is no joke. This was no exception. They needed to test whether this optimization, which would free up capacity for generating more advanced encodings, would be a net positive for users.
Now… this can’t be complicated… right? Did you not listen to what I said like two sentences ago?
To test whether this would work, they created two pools of regular simulated traffic, routed to either a test pool or a control pool, and when a video was delivered they could identify whether its encodings came from the new system. Do you believe me now when I say nothing is easy when you deliver apps to billions?
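A very stripped-down version of that kind of experiment split might look like this. Everything here is hypothetical (the experiment name, the bucketing rule), just to show the idea of deterministic test/control assignment:

```python
import hashlib

# Hypothetical sketch of deterministic test/control bucketing for an A/B test.
# Hashing the user ID means the same user always lands in the same pool, so
# metrics like watch time can later be compared between the two groups.

def assign_pool(user_id: str, test_fraction: float = 0.5) -> str:
    digest = hashlib.sha256(f"abr-repackaging-experiment:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    return "test" if bucket < test_fraction * 10_000 else "control"

def pick_encoding(user_id: str) -> str:
    # Test pool gets videos packaged by the new repackaging pipeline;
    # control pool keeps getting the original ABR encodings.
    return "repackaged_abr" if assign_pool(user_id) == "test" else "original_abr"

for uid in ["user_1", "user_2", "user_3"]:
    print(uid, assign_pool(uid), pick_encoding(uid))
```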
From this test, they confirmed that they were degrading the compression efficiency of the original ABR encodings, but the much higher watch time on the advanced AV1 encodings made up for it.
Outcomes
So over 94 percent of the computing power was saved with this one optimization… which is incredible, but what can we take away from it? To me, a few things really stick out.
1. We need to keep investing in things we think are “good enough.” My last 4K video, 12 minutes long, was 30 GB in QuickTime format, but in H.264 it was around 800 MB. To the common eye it seems like we don’t need improvements… that’s already a roughly 97% size reduction. However, understanding that things can always get better is what drives the innovations we’ll rely on in the future.
2. The fact that they were able to cut that much computing power with “one simple trick” is not a sign of bad engineering. There are lots of reasons oversights like this happen. The classic “if it ain’t broke, don’t fix it” is common in environments like this, shorthand for “there are more important priorities elsewhere,” like building an exciting new feature instead of an optimization most people will never notice or care about. Of course, this mindset goes hand in hand with “technical debt,” where you take the easy solution now knowing you will eventually HAVE to fix it later.
3. The transparent approach they took to explaining a fix most of us would never have understood. Software engineers are constantly giving back to the community, whether that’s meetups, talks, or literal source code for projects that are the backbone of huge mega-companies.
The transparency and vulnerability it takes to write an article like this are great traits for any kind of engineer.
Outro
I hope you enjoyed this video. Let me know in the comments which company you would like me to cover next.
If you haven’t already seen it, I go deep into how Discord stores BILLIONS of messages, so check it out. Make sure you like, comment, and subscribe. Peace out, coders.