Hey, Tim Berglund with Confluent. This is the Fundamentals of Apache Kafka module within the course called Fundamentals of Apache Kafka. So we're really gonna get into the fundamentals here.
In this module you're gonna develop a mental model of what Kafka is. When we're done, you should be able to identify the key elements of a Kafka cluster and have some idea of the basic responsibilities of each element, explain what a topic is and how it's composed of partitions, and know what segments are, that basic sort of storage model of Kafka.
Topics are what you're gonna spend a lot of time thinking about when you're actually building applications on Kafka, so we want that to be pretty solid by the time we're done. Now, it is a fact of the world that lots of things happen in it. There are many events produced, and no matter what sort of industry you're in, like the industries you see pictured here, all of those things are producing events all the time, and Kafka's job is to manage and process those events.
So to get data into a Kafka cluster, we have a thing called a producer. A producer is an application that you write. So that Kafka box there, that's either a Kafka cluster that you're operating, or maybe it's a Kafka cluster in a cloud service where you don't know too much about the operational details, but you know the service is there, it's reliable, and you know how to connect to it.
That Kafka cluster is that thing out there; your job is to write programs that put data into it, read data out of it, and do useful things with that data. So all of those sources of events, we see connected cars, financial transactions, things happening in hospitals, boxes being shipped around the world. All of those are potential sources of events, and some producing application is aware of those events and takes an action to write those events into a Kafka cluster.
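Just to make that concrete, here's a minimal sketch of what a producing application might look like with the Java client. The broker address, topic name, key, and value here are all made up for illustration; a real application would get its connection details from configuration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ThermostatProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical broker address; with a managed service this comes from your provider.
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Write one event to a hypothetical "temperature-readings" topic.
      producer.send(new ProducerRecord<>("temperature-readings", "device-42", "21.5"),
          (metadata, exception) -> {
            // The cluster's acknowledgement arrives here; then the producer moves on.
            if (exception != null) {
              exception.printStackTrace();
            }
          });
    }
  }
}
```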
And after they get written, the Kafka cluster says, okay, got it, sends an acknowledgement, and the producer moves on. What's in a Kafka cluster? Let's break that down a little bit.
It's made up of things called brokers. Now, if you will just kind of set the wayback machine to maybe 10 or 15 years ago, when computers were things that you saw, and they were in a data center, and you could maybe go to that data center, and it was this metal box with blinky lights and fans and that sort of thing. If you can imagine that, it helps make this a little crisper: those individual servers, those are kind of what brokers are.
There's a Kafka process running on each one, and each one has its own disks; that's important. Every broker has its own local storage. Those brokers are networked together and they act together as a single Kafka cluster.
When producers produce, they write into those things that are brokers. Now, if you're using Kafka in the cloud, you're hopefully not gonna be thinking about brokers too much. If you're using a fully managed service, there are brokers that are out there, you know, that's true, they've got their own local storage, they've got some retention time set on the data that they're storing.
That's maybe five minutes or maybe a month, or maybe 100 years. They've got some amount of time, they're gonna keep those events around. But in the cloud, of course we don't think too closely about brokers.
Those are things that are abstracted away from us. But it's true, and you should know this: every Kafka cluster is composed of these things called brokers.
And we never quite know how to refer to those. Are they machines? Are they servers?
These are all sort of old words; they might be containers, they might be VMs somewhere. I'll skip over that distinction for the rest of this course. I'll probably just call them machines or servers or some kind of legacy term like that inasmuch as we're talking about brokers, just so you kind of have an idea what I mean.
Now, that data gets written into that cluster, it's stored on those disks inside those brokers, and of course we wanna read it out; we do that with the consumer. And this consumer is a program you write. That Kafka cluster, those brokers, that's infrastructure; somebody manages that infrastructure and operates that cluster.
Ideally, it's a managed service in the cloud, but what you write is the producer and the consumer, you put data in and you read data out. And it's reading data that frankly usually gets pretty interesting. This is where a lot of the work happens on the consume side.
What the consumer does with that data, well, that's up to the application itself: is it generating a report, feeding a dashboard, sending that processed data on to some other process? Often, a consumer will also be a producer, transforming the data and putting its results into a different topic in Kafka.
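That consume-transform-produce shape is common enough that it's worth a rough sketch with the Java clients. The topic names, group name, and the "transformation" here are invented for illustration; a real application would handle errors and offset management more carefully.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrdersTransformer {
  public static void main(String[] args) {
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "orders-transformer");
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(Collections.singletonList("orders"));          // read from one topic...
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          String transformed = record.value().toUpperCase();            // stand-in transformation
          producer.send(new ProducerRecord<>("orders-enriched", record.key(), transformed));
        }
      }
    }
  }
}
```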
But these are the fundamental parts of Kafka. You have a producer, you have the cluster itself, you have a consumer. And that model, those three pieces, everything that ever gets built with Kafka conforms to that model. Maybe you won't see the consumer and the producer directly.
If you're using something like Kafka Streams or ksqlDB, which we'll cover in a later module, you don't think about producing and consuming, but they're always there. These are the fundamental components of Kafka. So let's look at that again and add a component.
We've got our producers there on the left, that Kafka cluster in the middle, those producing applications are writing into the cluster, and you've got the consumers there reading from the cluster. What's that thing on top? Well, that's a ZooKeeper ensemble.
As of the time of this recording, Kafka uses ZooKeeper to manage consensus on a few pieces of distributed state. There are a few things that all of those brokers need to agree on. They need to have some consensus story on what is true and ZooKeeper is good at doing that.
Now, again, as I said, at the time of this recording, there's an initiative underway called KIP-500, that's Kafka Improvement Proposal 500. That itself has spawned a number of child KIPs and a whole bunch of work to remove ZooKeeper from Kafka completely. So at some point after this recording, that will not be there anymore.
So it may be that you look at Kafka and the current release you're using, and maybe you see this video and say, well, there's no ZooKeeper there. And that means a new day has dawned. So there's a lot of work to get that done, and that work is ongoing.
But for now ZooKeeper's doing a great job, being a distributed consensus manager for the cluster. Producers and consumers are decoupled from one another. So a consumer doesn't know anything about the producer that produced the data that it's reading.
Likewise, when you produce data, you're not sending it to a particular destination, you're sending it to a structure inside the Kafka cluster, that's called a topic. You're producing that data to a topic, we'll cover those more in a moment. But you don't know anything about who's consuming from that topic.
And that decoupling is intentional. It means that there's a whole category of state information between producer and consumer that simply never has to be managed. Producers write data in, consumers read data out.
That also means that producers can scale. I can add producers that write data into a cluster without the consumers knowing. I can add consumers that are reading a kind of data that has long existed.
Maybe I've got this long history of records of sales transactions in my cluster, I add some new fraud detection algorithm. Well, that's a new consumer, none of my producers need to know. They can fail independently, they can evolve independently.
They're decoupled, so they don't need to know about each other. Revisiting ZooKeeper for a moment, what does it really do? Well, authorization information, that's access control lists, those are stored in ZooKeeper.
The management of failure: when a broker fails, there are gonna be various pieces of topics that that broker is the leader for. They're replicated in other places in the cluster, but the way that replication is organized, and who's in charge of each replica, that sort of thing is managed by ZooKeeper. So when a broker dies and the cluster has to decide who gets responsibility for the data it was managing, ZooKeeper participates in the election of new leaders for those things.
So that's basically what it does. Little bits of metadata, failover and leader election, access control lists, that stuff is all currently in ZooKeeper. I've used this word a few times: topics.
Now I wanna really give you a definition of what a topic is. A topic is a collection of related messages or related events. You can think of a topic as a log, as a sequence of events.
Now, there are some exceptions to that, and I'm gonna unfold some concepts here that help you manage those exceptions. But for right now, just think of it as a sequence of events. So a topic is this list of things, and when a producer writes a new one, it just puts it on the end, and the previous ones are kept.
I can have any number of producers writing to a topic, I can have a producer writing to multiple topics. Likewise, many consumers can consume from a topic. All of those relationships are many-to-many.
There isn't a theoretical limit on the number of topics. There's a practical limit on the number of what are called partitions, we'll get to partitions in a moment. But topics by themselves, it's not like you can only have 50 and then you need to add nodes or anything like that, you can really have conceptually as many as you'd like.
And I did say partition, didn't I? So I better tell you what that means. Now here's a cluster over on the left, it's got a number of topics in it, let's zoom in on topic C.
Now, that topic is a log, and it's a durable log, it's persistent, which means it's gonna be stored on disk somewhere. And the broker that that log lives on, well, that's just a computer at the end of the day. And if you're writing messages into that topic and reading messages from that topic, that's work, IO and computational work, that that computer has to do.
None of these things scales forever. You can't have storage scaling forever, and you can't have that pub/sub activity on that broker scaling forever. So you might want to break your topic up into pieces.
We call those pieces partitions, and then we're able to allocate those partitions to different brokers in the cluster. This is key to how Kafka scales. I can take a topic, partition it, and allocate each partition to a separate broker.
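If you're curious what that looks like in practice, here's a sketch of creating a partitioned topic with the Java AdminClient. The topic name, partition count, replication factor, and retention setting are just illustrative values, not recommendations.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicExample {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // hypothetical cluster address

    try (AdminClient admin = AdminClient.create(props)) {
      // A topic named "readings" split into 6 partitions, each replicated 3 times,
      // keeping events for roughly 7 days before they expire.
      NewTopic readings = new NewTopic("readings", 6, (short) 3)
          .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));
      admin.createTopics(Collections.singleton(readings)).all().get();
    }
  }
}
```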
So when I said a topic was a log, I kind of put a little asterisk on that, and here's the exception to what that really means: formally speaking, a partition is a log. So every partition has strict ordering. When I produce to a partition, I put the message on the end of the partition; that's the only place I can put it, because it's a log.
I can't disturb any of the previous messages, they're all immutable events, I just put things on the end. And so that is a log, and the events in that log, that partition, are strictly ordered. A topic, having been broken into partitions, may not have strict ordering over all of the events in the topic, but you've always got ordering within a partition. And if you wanna look at an implementation detail, really drill down into what a partition is on an individual broker, that log is gonna be represented by multiple segments. Those are individual files on disk on that broker; it's really a set of a few files, and some indexes and things like that.
So that segment is a thing that exists on disk on the Kafka broker, and each partition can be broken up into multiple segments. Usually, unless you're deeply involved in hands-on administration of a Kafka cluster, you're not gonna think about segments so much, but you do have to think about partitioning as you think about how to model data in topics. It's very important.
Let's look at that again in color. If you can see those colors, topic A is green, topic B is kind of a mustardy orange, and I wanna call topic C sort of a cornflower blue. And you're welcome to correct me in the comments if you think any of those color names are wrong.
But this cluster now has four brokers; we're calling them 101, 102, 103 and 104, and you can see how the partitions of each topic are broken up. So topic A has partitions zero, one and two. You see partition zero is on broker 101, partition one is on broker 102, partition two is on broker 104, and none of topic A is on broker 103.
I won't go through all of those, but you can kind of pause the video if you'd like, and look through that diagram and see how those partitions are distributed. The cluster does that automatically when a topic is created, by the way; it makes decisions about where those partitions are gonna live. What a Kafka cluster doesn't do is keep track of the size of those partitions and move them around if one broker gets overloaded; as topics get created and destroyed, the load of course does not stay constant.
So this is functionality you'd have to add yourself to keep those things balanced. There are parts of Confluent Platform that will help you do that, make that automatic. And of course, if you're using a managed Kafka service, this is the sort of thing where you know, vaguely, that there is a team of highly skilled software developers and site reliability engineers making sure this kind of stuff works for you, and you don't have to worry about it.
You see over there on the right the log files; each partition, again, is broken up into individual segments on disk on the broker, as a real drill-down implementation detail into what's going on on the broker. Let me just refresh exactly what I mean by a log. This is important; you probably know this, right?
You've probably written to a log file at some point, or at least read one. And you kinda know instinctively, even if you've never thought about it, the semantics of a log. So when you write something to a log file, where does it go?
It goes on the end, it has to go on the end. It doesn't go at the beginning, that's already happened. That sequence of things has already happened.
You can only add new things to the log, because the rest is a representation of what has happened in time up till now. Also, all of those ordered entries prior to now, those are immutable. If you're editing one of those or deleting one of them, there's almost something ethically suspect about editing a log, like, what are you trying to hide?
You know, you conspiring in a crime or something? So logs are immutable records of things. And the semantics are when you wanna write to a log, you put it on the end and that's it, and those are immutable after that.
You may choose to expire things past a certain age, and that's certainly the case in Kafka; you can set a retention period on a topic. But this is what a log is; this is the fundamental data structure that Kafka is based on. Those numbers you see there, the zero, one, two, three, four, like that, those are real in Kafka.
They actually start at zero and they just monotonically increase like that forever. And there are many, many bits of them, so you're not gonna run out, but every partition has its own unique offset space, so that's a real thing. And you can actually find that out, when you produce a message, when you consume a message.
In the API, you can poke into things and it will tell you, oh, this ended up being offset such and such, or this was offset such and such. You usually don't need to know, but it's a real thing, and it's there inside of each partition.
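For example, with the Java producer you can look at the metadata that comes back from a send to see where the record landed. This is a small sketch assuming a producer configured like the earlier example; the topic, key, and value are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class OffsetPeek {
  // Sends one record and prints where it landed; "producer" is assumed to be
  // configured as in the earlier producer sketch.
  static void sendAndReport(Producer<String, String> producer)
      throws InterruptedException, ExecutionException {
    RecordMetadata metadata =
        producer.send(new ProducerRecord<>("temperature-readings", "device-42", "21.5")).get();
    System.out.printf("Landed in partition %d at offset %d%n",
        metadata.partition(), metadata.offset());
  }
}
```

On the consumer side, each record you read similarly exposes an offset() you can inspect.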
An important fact about consumers, those blue boxes on the bottom that are apparently reading messages: consuming doesn't destroy the message, it just reads it. So you can have multiple consumers on one log or one topic in Kafka, and they can be at their own independent offsets. Now, of course, you'd like them all to be caught up, right? Everybody wants to be kind of close to the events that are being produced, so this processing could be called real time.
But they don't need to be, in terms of the way the system works, one can start at the beginning and take days to catch up, if you need to do that. They're independent consumers that are working from independent offsets. Later on you'll hear me refer to this as a stream, this log or topic I'll say stream.
And these things in the stream are events, and the current time is where I'm producing right now, that's the present and the stream extends back into the past. So different word means exactly the same thing. What's the structure of a Kafka message?
Well, the things that you are gonna think about the most are the key and the value, that's your Kafka data model. Every event is a key value pair. Now, very likely there's gonna be some structure in the value, right?
That's probably some sort of domain object that you're gonna serialize somehow and store there. There may be structure in the key. Often the key is a string or integer or something like that.
But sometimes people have compound, complex domain objects that they serialize and use as the key; that's completely allowed. Every message has a timestamp. If you don't have a timestamp, one will be provided for you.
So you'll get the wall clock time at the time that you produce the message. That's if you don't really care too much about the time. But if in your value, in that domain object, if it knows the time it took place, well then in the API, when you're producing, you can say, hey, you know, the actual time is this.
I don't care what time it is right now, this is the time of the message. You can set that explicitly. You also have an optional set of headers.
Think of them like HTTP headers, which are themselves kind of string key-value pairs. So you don't wanna use these as additional payload, but this really is metadata. These are properties of the data that you're gonna be able to see on read. So consumers have access to these and can make decisions based on them. But that's it: you have key, value, timestamp and headers.
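To make those four pieces concrete, here's a sketch of building a record with the Java client. The topic, key, value, timestamp, and header are all invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordAnatomy {
  public static void main(String[] args) {
    long eventTime = 1609459200000L;  // the time the event actually happened (made up)

    // Key, value, and an explicit timestamp; the partition is left null so the producer chooses.
    ProducerRecord<String, String> record =
        new ProducerRecord<>("temperature-readings", null, eventTime, "device-42", "21.5");

    // Headers: small pieces of metadata that consumers can see on read.
    record.headers().add("source-region", "us-east-1".getBytes(StandardCharsets.UTF_8));

    System.out.println(record);
  }
}
```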
Let's dive into brokers a little bit. I've introduced them, but I wanna go over their key responsibilities. Their basic function is to manage partitions. Now, as a developer using Kafka, you're thinking about topics all the time.
You're creating a topic, you're thinking about the schema of a topic. What kind of messages does it have? What's its retention period?
What are its compaction properties? All these great things that you'll learn about topics as you move forward. If you're a broker, you know what a topic is, but really you've got partitions.
You're managing some set of partitions locally. That's what a broker does. It manages those log files; it takes writes from producers and updates those partitions, and it takes fetch requests from consumers and serves those messages back out.
That's what a broker does: it does storage and pub/sub. So those partitions are stored locally on the broker's disk, and there can be many of them, many partitions on each individual broker.
So you see this core architecture diagram again: you've got producers, those are applications that are writing to the cluster, brokers that are taking those writes and managing their partitions, and consumers that are reading from those partitions. The way consumers read from partitions actually gets pretty interesting; we'll take a look at that later.
It would be a bummer if each partition only existed on one broker, such that if that broker died, the partition would die. You would not want that to be the case. And of course, Kafka does replicate.
Each partition has a configurable number of replicas, three is typical, and that's called the replication factor. One of those replicas is called the leader and the others are called followers. So when I produce to a partition, I'm actually producing to the leader.
The producer is connecting to the broker that has the leader partition there. And it's the job of the brokers with follower partitions to reach out to those leaders and kind of scrape the new messages that they've got, keeping up to date with them as quickly as possible. And that all happens in a very timely fashion inside of a properly operating Kafka cluster.
It's not like it takes seconds for that replication to take place, but there is a leader/follower distinction here, which makes consistency a lot easier to think about.
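If you ever want to see that leader and replica layout for yourself, the Java AdminClient can describe a topic. This is a sketch; the topic name and broker address are hypothetical.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowReplicas {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // hypothetical cluster address

    try (AdminClient admin = AdminClient.create(props)) {
      TopicDescription description = admin.describeTopics(Collections.singleton("readings"))
          .all().get()
          .get("readings");
      for (TopicPartitionInfo p : description.partitions()) {
        // One leader per partition; the rest of the replicas are followers.
        System.out.printf("partition %d: leader=%s replicas=%s%n",
            p.partition(), p.leader(), p.replicas());
      }
    }
  }
}
```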
Thinking a little more about producers, which you know now are these client applications, you might be asking, what language can I write those in? Well, Java was always the native language of Apache Kafka; the client library that ships with Kafka is a Java library. Now, for the other languages adjacent to the JVM, like Kotlin and Clojure and Groovy, and even Scala, there are wrappers that make the Java library look idiomatic in those languages. But there are other supported languages as well: C, C++, Python, Go, and .NET. Those libraries are supported by Confluent Platform, so if you need a properly supported version of one of those, you can get that.
Those are all based on a C language library called librdkafka. That's an open source library that duplicates a lot of the functionality of the Java library, and many other non-JVM language libraries draw on it for their Kafka support. There are many more than that supported by the Kafka community; if you're wondering where Node support is, and where's Ruby, and all that?
Well, believe me, they are there. In fact, there are often multiple choices for each one of them. There's also a REST Proxy that we'll cover in another module that lets you access Kafka.
If somehow you're using a language that doesn't have library support, or if you just don't want to use that native language support, you can use the REST Proxy. And of course there is a Command Line Producer Tool as well. That's good for tests and scripts and sending kind of small amounts of string data into topics.
Now I said, when a producer writes to a topic it's actually writing to a partition and these partitions are stored on separate brokers. And so how does a producer know which partition to write a message to? There are a couple answers to that.
Now, if the message has no key, the producer will just have a round-robin method that it applies. And it'll say partition zero, partition one, partition two, partition three. There are some exceptions and interesting ways to configure that, but that's basically what's gonna happen: you're gonna load-balance in a round-robin way.
Partitions always stay even in that case, but you don't have a lot of ordering there. Now, if the order of events is important to you, you have the opportunity to order them by key. So if there is a key, then what the producer's gonna do is hash that key, mode the number of partitions that gives you the partition number it's gonna write it to.
So the same key is always gonna get written to the same partition, as long as the number of partitions is held constant in the topic, which probably should be in most cases. So messages with the same key land in same partition, which means they are strictly ordered all the time. And that's an important thing.
Because you might have a key. Say you've got an Internet of Things application: you've got smart thermostats all over the planet, tens of millions of them, and they're all phoning home with temperature and humidity and every other kind of metadata every minute, and you want those to be ordered. You wanna be able to process those in order. Well, if you make the key the device ID, then each device's messages are gonna show up in order in a partition.
So messages with the same key always land in order. It's possible to override all this and write a custom partitioner if you'd like to; it doesn't end up happening very often, but it absolutely is available to you if you need it.
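As a sketch of that idea, here's roughly what key-based routing looks like if you wrote it yourself as a custom partitioner. This is illustrative only: the built-in default partitioner already does key hashing for you, using a murmur2 hash rather than the simple hash shown here.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class KeyHashPartitioner implements Partitioner {
  @Override
  public int partition(String topic, Object key, byte[] keyBytes,
                       Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionCountForTopic(topic);
    // Same key -> same hash -> same partition, as long as the partition count doesn't change.
    // (A real partitioner would also handle records that have no key at all.)
    return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
  }

  @Override
  public void close() {}

  @Override
  public void configure(Map<String, ?> configs) {}
}
```

You'd plug something like this in with the producer's partitioner.class configuration, but again, the default behavior covers the common case.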
Consumers, again, are the programs that you write that are reading from topics, with all the same language options as producers, and what they do is pull. That consumer program will actually go out and ask the Kafka cluster. It will say, "Hey, I am subscribed to this one topic, and this is the last offset I read." Remember those numeric offsets.
"Do you have any messages after that offset? " And if the answer is no, then it returns and the consumer can come back and ask a very short period of time later. Usually of course the answer is yes, and here are more messages and it gets the messages and moves on.
That consumer offset, the consumer is storing that in memory; that's state, right? We don't want that to be only in memory. So the offset of each consumer into each partition that that consumer is responsible for is stored in a special topic inside the Kafka cluster named, mysteriously enough, __consumer_offsets.
So if you see that __consumer_offsets topic, that's what it's doing. That's helping your consumers remember where they are. So if they go away and come back, the cluster can help them remember where they need to pick up again.
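Here's a minimal sketch of that pull loop with the Java consumer. The topic and group names are made up, and auto-commit is turned off here just to show the offset commit explicitly; many applications simply leave auto-commit on.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadingsConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "readings-dashboard");
    props.put("enable.auto.commit", "false");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("temperature-readings"));
      while (true) {
        // "Do you have any messages after my last offset?" New records come back, or none.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          System.out.printf("offset %d: %s=%s%n", record.offset(), record.key(), record.value());
        }
        // Record our position in __consumer_offsets so we can pick up here after a restart.
        consumer.commitSync();
      }
    }
  }
}
```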
And just like the producer, there is a Command Line Tool to read from a cluster, which could be great for quick visibility into things and scripting and so forth. Usually not a lot of production code gets written with the CLI tools, but they do come in quite handy. As I've said, each topic can have multiple consumers and by multiple consumers, I mean multiple different applications that are reading that same data.
So those are separate code bases, separate images, separate builds, separate deployments, maybe managed by separate teams. Maybe the people don't even know each other; as long as they've got the right to access the data, they can deploy consumers against that topic. Consumers also live in groups.
Now, the one at the top doesn't look like a group, that's kind of the degenerate group of one, but every consumer is in a consumer group in Kafka. Meaning I can add additional instances of a consumer, like you see at the bottom; there are three instances of that consuming application. So imagine that's an uber-JAR and you've built a Docker image around it and Kubernetes is now deploying three of it instead of only one of it. That's the way that you scale out consumers.
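The only thing the code needs for that is a shared group ID: run more copies of the same application with the same group.id, and the cluster divides the topic's partitions among them. A small sketch of the relevant configuration, with a made-up group name:

```java
import java.util.Properties;

public class GroupConfig {
  static Properties consumerProps() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    // Every instance started with this same group.id joins the same consumer group,
    // and the partitions of the subscribed topics are split across those instances.
    props.put("group.id", "fraud-detector");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    return props;
  }
}
```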
We'll talk about the details of that a little bit more in a future module. Let's come back to the same architecture diagram that describes every Kafka system you're ever gonna build. You've got producers that write data into the cluster, you've got that cluster, you know a little bit more about partitions and replication and things like that now.
And then you've got consumers, those programs that read data out. Every system you're ever gonna build conforms to this diagram. And with that, you should have a pretty solid mental model of how Kafka works: a little bit about producers, consumers, brokers, ZooKeeper, partitioning, replication. All of these things put you on a solid footing to explore what's next.