If you've been hearing about Kafka but don't understand what it is and why all the hype, let me clarify it using real-life examples that will make everything finally click for you. Imagine we're building an e-commerce application called StreamStore, and we have some microservices handling payments, orders, inventory and so on. When something happens in our application, like a customer placing an order, it's like dominoes: a chain reaction of updates and events by other services gets triggered. Stock needs to be updated in the database now that we sold some of it, a notification or confirmation email needs to be sent to the customer, an invoice needs to be generated with the right sales tax and sent by email to the customer, maybe revenue and sales data needs to be updated on our sales dashboard, and so on. Now, we are a small startup, so we start with the simplest, most straightforward microservices architecture where the microservices just call each other: the order service would say, hey all you guys, we just closed an order, go update your stuff accordingly. And it all works great at first.
But suddenly we become a hit and people are loving our store, or we just announced Black Friday sales, and our store is getting hundreds of thousands of customers, which is amazing. But suddenly our application starts crashing, everything is slowing down, users are sitting in front of loading screens because our architecture cannot handle this load. We're in a panic because we are losing sales every minute. Our architecture, which looked pretty clean and straightforward on the whiteboard, becomes a nightmare. So here is what's happening in the background. First of all, we have what's called tight coupling between the services, which means when the payment service goes down, for example because some API in the background isn't responsive or the service itself just crashes under load, our entire order process freezes. We have synchronous communication, so each order feels like a game of dominoes: one slow service and everything backs up, and as I said, during peak times customers are literally staring at loading screens. We also have lots of single points of failure, which means a 10-minute inventory service outage means 2 hours of order backlogs and countless lost sales.
And we are also losing a lot of analytics data: when the analytics service goes down for an hour, we're losing important Black Friday sales data. After another hectic and chaotic week we thought, what if we redesign the system so that orders flow through it like items on a conveyor belt instead of our current game of hot potato? Instead of apps calling each other directly and waiting for a reply, we remove that tight coupling, we basically make space between them and introduce a tool that sits in the middle and acts as a broker. Think of it as a post office: when you order something online, the sellers don't come knocking on your door to deliver the package themselves, they hand it over to the post office or some middleman to deliver your package. Or if you are returning your purchase or sending a package to someone, you don't fly to their place to hand it over in person; the post office has the infrastructure and handles the processing. So Kafka is like the mail delivery service or post office which sits in the middle. Now the order service goes to Kafka and hands over a package, called an event, that says:
hey, an order was made for this customer for these products, here are all the details, please make this information available to anyone who needs it to update and do stuff in the background, bye. And it just basically goes back and continues its work. An event looks like this, with a very simple structure: a key-value pair plus metadata information. So the order service does not need to wait there to make sure that the others actually got the information; it can trust this broker that it will be delivered to the right services and all this will happen in the background. Like in the post office, you just drop off your package and go home, you don't wait there sitting and checking whether they actually ship the package or not, because you know they will take care of the rest. The order service that gives that information to Kafka, or basically any service that produces such an event and hands it over to Kafka, is called a producer, because it produces events. And in code this is how it would look using the Kafka producer API: in JavaScript or Java code you basically use that API to create an event and give it to Kafka.
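Here is a minimal sketch in Java using the Kafka producer client; the topic name orders, the key, the JSON payload, and the broker address are all made-up assumptions for our StreamStore example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        // Basic connection settings; the broker address is just an example
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The event: a key (order id) and a value (the order details);
            // Kafka adds metadata like topic, partition, offset and timestamp
            ProducerRecord<String, String> event = new ProducerRecord<>(
                "orders",                                     // topic
                "order-1001",                                 // key
                "{\"customerId\":\"42\",\"total\":99.95}");   // value (payload)

            // Hand the event over to Kafka and move on; send() is asynchronous,
            // so the order service doesn't block waiting for the other services
            producer.send(event);
        }
    }
}
```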
Now, where does this information, these events, get saved when producers hand them to Kafka? Because we have a bunch of other services like inventory, payment and so on that also produce certain events and hand them over to Kafka with all the information that other services may need, like when inventory gets updated or the payment service says that a payment just failed. So do all these events from different producers get dumped into one giant bucket in Kafka, or are they organized somehow? If we had one big bucket handling all the writes and reads it would not be very performant, right? It's like having one single queue in the post office: whether you're sending a letter or a package or picking up your delivery, everyone would be standing in the same queue. Instead, imagine that the post office adds sections, each with their own queues: a section for letters, another one for large packages, and so on. In the same way Kafka has what's called topics to group the same type of events. For example, the order service will write events to an orders topic, the payment service may update a payments topic, and so on. Now, how do those topics get created, or who defines them? Well, just like you define a SQL schema for your database based on what your application needs and what objects you have, you as an engineer decide how to group these events in Kafka, in what topics.
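As a rough sketch of what defining those topics could look like, here is how you might create the orders and payments topics with Kafka's Java AdminClient; the partition counts, replication factor, and broker address are assumptions for this example:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // One topic per kind of event, like sections in the post office
            NewTopic orders   = new NewTopic("orders", 3, (short) 1);   // 3 partitions, replication factor 1
            NewTopic payments = new NewTopic("payments", 3, (short) 1);

            admin.createTopics(List.of(orders, payments)).all().get();  // block until they exist
        }
    }
}
```

In practice teams also create topics from command-line tools or automation; the point is that this grouping is a design decision you make, much like a schema.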
So now that the order service added an event to the orders topic, what happens next? That event may trigger other actions, like updating stock in the database because we just sold something, or sending a notification to the customer, or updating invoice and sales data, plus there may be other topics that need an event entry as a result of an order, which will in turn trigger other actions. So how does all that get handled? Well, on the other side of events we have consumers, basically microservices that are subscribed to these different topics, and whenever a new event gets added to a topic, all consumers who are subscribed get notified by Kafka and they then do their stuff. In this case we have three microservices that subscribe to the order event. The notification service will see that a new order event was added, which means an order was placed in our application, and based on the payload of that event it will send a confirmation email to the customer and maybe a purchase notification. Then an inventory service may update the database by updating the stock of every product that was sold in that order, and maybe in addition to that database update it will generate a new event and write it into an inventory topic. And finally the payment service may generate an invoice and send it to the user.
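To make the consumer side concrete, here is a minimal sketch of what the notification service could look like with Kafka's Java consumer API; the group id, broker address, and the printed "send email" step are assumptions standing in for real logic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NotificationService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // example broker address
        props.put("group.id", "notification-service");                // identifies this consumer's group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                    // listen for new order events

            while (true) {
                // Poll Kafka for any new events on the subscribed topics
                ConsumerRecords<String, String> events = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> event : events) {
                    // A real service would parse the payload and send the email;
                    // here we just print the event
                    System.out.println("New order " + event.key() + ": " + event.value());
                }
            }
        }
    }
}
```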
Now, I hope you're learning a lot and the topic of Kafka is becoming clear for you. It takes us on average two or three weeks to produce one such video, so if you find it valuable we would appreciate it if you left your feedback or liked the video, and we'd be happy to have you as our subscriber as well. Now you may be asking: is Kafka a replacement for a database, since we are saving all this data as events and basically updating the status of things? Is it kind of a new way of saving things? The simple answer is no, it's not a replacement for a database. Let's explain by following our story. When the inventory service updates the stock for each product in the database, why does it also produce an event and write it to the inventory topic? What kind of event may that be, and why would we have it in addition to the data in the database? Well, that's another use case of Kafka, where one event basically creates a chain reaction of events, when multiple things need to happen as a result of one event, which we just saw an example of. You may have another service that is subscribed to the inventory topic and calculates whether any of the products has just gone below its inventory threshold and produces a low inventory alert, which maybe, as a chain reaction, will trigger an inventory restock service that orders more of that specific product. Another very important use case of Kafka is real-time analytics. For example, again, when sales happen in your application you may have a sales dashboard where your service is updating real-time sales numbers. Another such use case is driver location updates in an application like Uber.
There, the driver location changes get sent constantly to the application, which then updates the user's UI to display those changes. For these use cases Kafka actually has what's called the Streams API. On one side you have the regular consumers that process one event at a time, for example a notification service that reads an order event and based on that sends an email or notification to the customer. Streams, on the other hand, process a continuous flow of data with aggregations, joins and so on, in order to do real-time processing and analytics on it: for example, low inventory validations that check constantly, with every event, whether inventory just dropped below the threshold, or getting the location changes from the drivers. So these analytics services will stream the events continuously, doing various analytics and calculations on them, and in code you would have a Streams API application that reads the events and does all these kinds of calculations on them.
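Here is a rough sketch of that low-inventory check with Kafka Streams; the topic names inventory and low-inventory-alerts, the threshold of 10 units, and the assumption that the stock level arrives as a number keyed by product id are all made up for this example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class LowInventoryAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "low-inventory-alerts"); // identifies this streams app
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // example broker address

        StreamsBuilder builder = new StreamsBuilder();

        // Continuous flow of inventory updates: key = product id, value = stock level
        KStream<String, Long> inventory =
            builder.stream("inventory", Consumed.with(Serdes.String(), Serdes.Long()));

        // Check every event against the threshold and write alerts to another topic,
        // which a restock service could subscribe to (the chain reaction from before)
        inventory.filter((productId, stock) -> stock != null && stock < 10)
                 .to("low-inventory-alerts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```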
Now, as I mentioned, these are streams of constant data saved as events in Kafka, because you have an application like Uber with millions of users and tens of thousands of drivers whose locations are getting updated constantly. That's a lot of data and events being produced, right? And all consumers need to read from it, so millions of writes and reads in different Kafka topics, which can of course affect performance. So we need to scale, and that's where Kafka's partition concept comes in, which is kind of the core of Kafka's ability to scale and stay really performant. Partitions are basically what make large amounts of data easy to handle and process without compromising performance. So how does it work exactly? With our post office example, remember we added sections for letters, large packages, small packages and so on. Partitions are like adding more workers per section to help out. Say suddenly, before Christmas, the letters section gets overloaded because everyone's sending letters to Santa. Well, sadly that doesn't happen, but if it did, we would add more workers in that section, and not just randomly: instead you say Ann processes letters going to Europe, Steve handles letters to the US, Jay handles the ones to Asia, and so on. The same way in Kafka, in the orders topic you may have an EU orders partition, a US orders partition, an Asia orders partition and so on, and again, you would decide how to partition your topic as part of your schema design.
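A note on how that looks in code: Kafka partitions aren't named, records are routed by their key (or an explicit partition number). Here is a minimal sketch, assuming the orders topic was created with three partitions and using a made-up region value as the record key:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class RegionalOrderProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // example broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are hashed to the same partition,
            // so all orders from one region stay together on one partition
            for (String region : new String[] {"eu", "us", "asia"}) {
                ProducerRecord<String, String> event =
                    new ProducerRecord<>("orders", region, "{\"orderId\":\"1001\"}");
                RecordMetadata meta = producer.send(event).get();
                System.out.println(region + " order went to partition " + meta.partition());
            }
        }
    }
}
```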
Now let's think about the consumer side. Let's say suddenly millions of orders are coming in, and we said we can scale this with partitions, so producers can write into multiple partitions at the same time. But what about the consumers, how can they consume so much data at once? Because even if you have partitions, you'll have one consumer, let's say the inventory service, trying to process all the events it's subscribed to, which is like all the parcels going to one recipient, like thousands of letters going to Santa. The post office workers are super quick and deliver them to the recipient, but he's getting buried under the pile, and we need some people helping him sort through this. That's where consumer groups come in. When you start additional instances of that microservice, like replicas in Kubernetes, they can all consume from Kafka partitions and process events faster, in parallel. Now, how does Kafka know which consumers form a group, how to divide the work, and which ones belong together? Simple: they are grouped by the group ID attribute when they register as consumers with Kafka. So replicas of the same application will have the same group ID and will automatically be grouped together, and when you start replicas, Kafka distributes the load automatically by assigning partitions to consumers. So Kafka says, oh, we have a new helper, now you can process this pile of letters here, and when that helper stops working, it will take the pile and give it to another active one.
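In code that grouping is just configuration; a minimal sketch, assuming an inventory service whose replicas all start with the same (made-up) group id inventory-service:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class InventoryServiceReplica {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // example broker address
        props.put("group.id", "inventory-service");         // every replica uses the same group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                consumer.poll(Duration.ofMillis(500));       // joining the group happens during poll
                // Kafka assigns this replica its share of the orders partitions;
                // start or stop replicas and the assignment rebalances automatically
                System.out.println("Partitions assigned to this replica: " + consumer.assignment());
            }
        }
    }
}
```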
Now the final question is: where is this data physically saved? Data in topics is saved on Kafka servers called brokers, and you can think of each broker like a post office branch that stores the actual messages on disk, handles requests from producers and consumers, and replicates the data for fault tolerance, so even if something happens to the disk, the data is stored somewhere else as a backup. And this is actually what makes Kafka different from standard message brokers. While regular message queues delete messages after consumption, so as soon as consumers see a message and do something with it that message is gone, Kafka persists every event or message as long as you need, and you can configure how long you want to store them with a retention policy. So think of it like our post office keeping a log of all package deliveries, not just for record keeping but for analyzing patterns and improving service. That unique feature of Kafka for real-time data processing and general analytics means that Kafka needs to store those events long-term, so the consumers can read those events anytime they want, even multiple times if they need to.
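As a sketch of that retention policy, here is how you could set a (made-up) seven-day retention on the orders topic with the Java AdminClient; the same setting can also be applied at topic creation time or from the command-line tools:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // example broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource ordersTopic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");

            // Keep events on the orders topic for 7 days (retention.ms is in milliseconds);
            // only after that is Kafka allowed to delete them
            AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(ordersTopic, List.of(setRetention))).all().get();
        }
    }
}
```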
And as I said, this capability to process streams of data in real time while keeping the original data for later analysis is what really differentiates Kafka from simple message brokers. So that's the main difference, and for an even clearer comparison, think of it as the difference between watching Netflix and watching TV. Netflix is on demand, so the consumers, the people who are the viewers, can decide themselves what they want to watch, when they want to watch it and at what pace: they can stop and pause anytime and continue whenever they want, or they can replay or start from the beginning. With TV you have predefined programs, and people who want to view those programs need to tune in at a specific time to watch specific stuff, so everyone watches the same thing at the same time at the same pace. You can't pause and continue later; if you miss a movie or a show, you just miss it, and it's not automatically saved to watch later. And that's exactly the difference between Kafka's architecture and traditional message brokers.
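That replay or start-from-the-beginning behaviour is something a consumer can ask for explicitly; a minimal sketch, assuming a one-off analytics job (group id and topic name are made up) that wants to re-read the whole orders topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayOrders {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // example broker address
        props.put("group.id", "orders-replay");                // made-up group id for this one-off job
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            consumer.poll(Duration.ofMillis(500));             // first poll joins the group and gets partitions

            // "Start from the beginning": rewind every assigned partition to its first
            // retained offset and read the events again, like replaying a show on Netflix
            consumer.seekToBeginning(consumer.assignment());
            consumer.poll(Duration.ofMillis(500))
                    .forEach(event -> System.out.println(event.offset() + ": " + event.value()));
        }
    }
}
```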
And finally, Kafka needs a way to keep track of which brokers are alive, elect leaders to coordinate, and manage all the configuration. Traditionally Kafka used an external tool called ZooKeeper for this type of coordination, so it was like a central management for all the Kafka brokers. However, it's important to note that newer versions of Kafka, from version 3.0, introduced KRaft, or Kafka Raft, which removes the need for ZooKeeper as an external dependency with centralized control by building that coordination directly into Kafka. Now, I hope I made Kafka finally clear for you. Share it with one colleague who you think will benefit from it, and with that, thanks for watching and see you in the next video.