Apache Kafka Vs. Apache Flink

The Data Guy
In this video I'll compare and contrast two popular streaming tools on the market today, Apache Kafk...
Video Transcript:
hey y'all, Data Guy here, and today I wanted to make a video covering two of the most popular data streaming tools out on the market today. They're both Apache projects: Apache Kafka and Apache Flink. I've made a video comparing a lot of different streaming tools before, but today I really want to do a focused comparison of Kafka versus Flink, because I think a lot of the time they get confused for being more similar than they actually are under the hood. So that's really what I want to explore today: going through the basics of what Kafka is, what Flink is, and what differentiates them, and then giving you all the information you need to decide which one is the right tool for you, so that you're not spending a lot of time trying to cram Kafka into a Flink use case or Flink into a Kafka use case. It's a great video whether you're exploring changing your data streaming tool or just exploring adding one to your stack for the first time. So I hope you enjoy, and without further ado, let's get into it.
Now, the first tool I want to talk about is Kafka. Kafka is a distributed streaming platform focused on building real-time data pipelines and supporting event-driven architectures. At its core, and what you're seeing up here on the screen, is a publish-subscribe model, where applications act either as producers, on the left side, or as consumers of data streams. Producers are responsible for publishing data to Kafka topics, which you can think of as buckets for the data being generated and then consumed. Each topic is divided into partitions, so a single topic can have many parallel pipelines all carrying similar types of data, say one stream of producer A's messages and one of producer B's. This is what allows Kafka to scale horizontally, by continually adding more partitions, and to handle high-throughput message traffic by distributing that data across multiple brokers, which are the underlying compute nodes that actually power Kafka. Then, on the other end of the pipeline, consumers subscribe to these topics to retrieve data. Every time a message is produced, it goes through the broker into the topic, and the consumer picks up that message after it's gone through whatever transformations you've incorporated into the pipeline. This gives consumers access to both real-time and historical streams. Kafka brokers coordinate message distribution and ensure the durability of each partition, making sure everything stays organized.
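The topic-and-partition model described above can be sketched in a few lines of plain Python. To be clear, this is a conceptual toy, not the real Kafka client API, and all the names here are made up for illustration; crc32 stands in for Kafka's actual default partitioner:

```python
import zlib

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an independent append-only log.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Stable key -> partition mapping: the same key always lands in the
        # same partition, which is how per-key ordering is preserved while
        # the topic as a whole scales out across partitions.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset):
        # Consumers track their own offsets; the "broker" just serves the log.
        return self.partitions[partition][offset:]

topic = Topic("orders", num_partitions=3)
p1 = topic.produce("producer-A", "msg-1")
p2 = topic.produce("producer-A", "msg-2")  # same key -> same partition, order kept
```

Notice that scaling is just a matter of adding partitions: each one is its own ordered log that a separate consumer can work through in parallel.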
An essential component of Kafka's architecture is Apache ZooKeeper, which handles the underlying coordination tasks, like managing metadata and leader election, deciding which broker leads which partition. Now, with Apache Flink, the setup is a little different. Apache Flink is designed as a stream processing framework with a strong emphasis on stateful and event-time processing, meaning that the moment an event is created, it gets processed. Flink's architecture centers around the job manager you're seeing in the center, which coordinates the execution of data streams, and task managers, which are responsible for the actual computation and transformation of that data. The job manager oversees resource allocation to the various transformation engines, powers the streams being executed, provides fault tolerance through checkpointing, and holds the metadata for the entire streaming system. The task managers then break jobs down into smaller parallel operators, which enables efficient scaling in a similar but slightly different way from how Kafka does it. Where Flink really distinguishes itself is through its advanced state management capabilities and its ability to achieve exactly-once consistency, which means that when one data point is produced, that one data point is processed and then consumed, and all of that happens exactly once. Flink's architecture also provides native support for event-time processing, which is essential for handling data that arrives late or out of order: even if a record arrives late, it still gets processed correctly, because Flink doesn't rely on data showing up at an expected point in time. It just says: every time a data point originates from this producer system, I'm going to consume it, perform this set of transformations on it, and then feed it out to a database, to an application, to another stream, to wherever that data is headed. Now that I've given you a basic rundown of how they function, I also want to talk about the development model for each application, because when you're working with a system you're going to be developing on it a lot, and that matters if it's got a development flow that doesn't make sense to you.
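The event-time idea described above, bucketing records by when they were created rather than when they arrive, can be sketched very simply. This is a minimal illustration of the concept only, not Flink's windowing API, and all names are invented for the example:

```python
from collections import defaultdict

def assign_windows(events, window_size):
    """events: iterable of (event_time, value) pairs.
    Returns {window_start: [values]} keyed by event time, not arrival time."""
    windows = defaultdict(list)
    for event_time, value in events:
        # A record's window depends only on its own timestamp, so a record
        # that shows up late still lands in the window it belongs to.
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# Arrival order is scrambled, but event times decide the buckets:
events = [(12, "a"), (3, "b"), (7, "c"), (11, "d")]  # (event_time_seconds, payload)
print(assign_windows(events, window_size=10))  # {10: ['a', 'd'], 0: ['b', 'c']}
```

Real Flink adds watermarks on top of this, a way of deciding how long to wait for stragglers before a window is considered complete, but the core idea is the same.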
A confusing development flow just introduces additional complexity and pain that keeps you from making full use of the system. Kafka is really centered around providing different APIs that facilitate the development of applications for both message streaming and data processing; it takes an API-driven approach. The producer and consumer APIs form the backbone of Kafka's data ingestion and consumption processes, allowing developers to publish or subscribe to messages in a topic in a loosely coupled way. If you want to add a new consumer, all you need to do is have it poll the message broker to subscribe to the data; the broker doesn't even need to know that consumer exists. So if you need to remove that consumer or add another one, you can easily do so without altering the core broker. The Kafka Streams API is one of the more recent additions, building on these basic concepts to enable more sophisticated streaming applications that can read from not just one but many Kafka topics, process that data, and write the results to other topics, all within the context of Kafka. And then what you're seeing on screen here is the Kafka Connect API: basically a set of standardized connectors for integrating with external data systems. Just as with any API or integration tool, you get pre-built connectors where you input your connection credentials and can start producing messages from that system, or consuming messages into it, out of the box.
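The loose coupling described above, where consumers track their own position and the broker never has to know who they are, can be sketched like this. Again, this is a conceptual toy with invented names, not the Kafka client API:

```python
class Broker:
    def __init__(self):
        self.logs = {}  # topic name -> append-only list of messages

    def publish(self, topic, message):
        self.logs.setdefault(topic, []).append(message)

    def fetch(self, topic, offset):
        # The broker just serves log entries from an offset onward;
        # it keeps no record of which consumers exist.
        return self.logs.get(topic, [])[offset:]

class Consumer:
    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        batch = self.broker.fetch(self.topic, self.offset)
        self.offset += len(batch)  # the consumer, not the broker, tracks position
        return batch

broker = Broker()
broker.publish("metrics", "cpu=90")
late_joiner = Consumer(broker, "metrics")  # added without touching the broker
print(late_joiner.poll())                  # ['cpu=90'] -- full history replayed
```

Because each consumer owns its offset, you can attach a brand-new consumer at any time and it will replay the topic from the beginning, which is exactly the decoupling the transcript is describing.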
Now, Flink also takes an API-driven approach, but it offers a more comprehensive set of APIs tailored for specific stream processing use cases. There's the DataStream API, the primary model for transforming and analyzing data streams, which integrates that transformation and analysis directly into your streams, whereas with Kafka you often aren't transforming data within Kafka at all; you're transforming it downstream in the consumer applications. The DataStream API gives you high-level abstractions for windowed aggregations and true event-time processing. Then there's also a SQL API, which lets developers write SQL queries that run directly on top of streams, a much more familiar interface for people accustomed to working with relational databases, versus Kafka, which is kind of its own separate Java-driven world. Flink's stateful processing model also supports exactly-once consistency through keyed state, operator state, and timers, giving developers really precise control over application state. And it comes built in with a complex event processing (CEP) library, which adds powerful pattern recognition capabilities that simplify the identification of specific event sequences in a stream.
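The kind of event-sequence detection the CEP library provides can be shown in miniature: scan a stream for a specific contiguous pattern, say two failed logins followed by a success. This is a deliberately simplified stand-in, not Flink's Pattern API:

```python
def find_pattern(stream, pattern):
    """Return the start indices where `pattern` occurs as a contiguous
    run of events inside `stream`."""
    n = len(pattern)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == pattern]

events = ["login_fail", "login_fail", "login_ok", "login_fail"]
print(find_pattern(events, ["login_fail", "login_fail", "login_ok"]))  # [0]
```

Flink's actual CEP library is far richer, with optional steps, time constraints, and non-contiguous matches, but the core job is the same: recognizing a meaningful sequence of events inside a stream.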
This lets you do that more macro-level analysis of how your streams behave over a longer time period. So now that you have an idea of how both function and what their respective development processes look like, I want to start talking about comparisons. Here I'll go over the pros and cons of both Apache Kafka and Apache Flink and sum up what I was just talking about in a more direct manner.
Kafka is a robust event streaming platform that really excels in scalability, performance, and integration capability. The key phrase there is event streaming platform: it doesn't really excel at transformations happening within the Kafka system itself. It's really good at handling large message volumes, distributing data across many different brokers for great horizontal scalability, offering high-throughput message handling, and storing those logs durably. It's really good at ingesting large amounts of data from event producers and letting consumers stream it in an organized, efficient way, so it's great for event sourcing, log aggregation, and metrics collection. But managing Kafka clusters can be pretty challenging due to the complexity of coordinating brokers and ZooKeeper. And Kafka Streams, and this compounds why Kafka isn't really designed for event-time processing, is still a fairly young tool; it just doesn't have the comprehensive event processing support that Flink does. Kafka really excels at ingesting all this data and routing it out to other systems, letting those systems do the heavy lifting of transforming and manipulating that data.
Now, Flink, on the other hand, really excels at first-class support for event-time transformations, allowing accurate processing of events that arrive out of order or late, and at exactly-once processing semantics that ensure consistent state management across the pipeline. That makes it really great for applications that require strict reliability, where duplicates just aren't allowed. It also provides a flexible API ecosystem with many different abstractions for extending Flink into things like low-latency analytics, machine learning, and complex pattern recognition, so there are a lot of ways to make it even more powerful for analytics and ML use cases. On the flip side, Flink is really hard to run. It requires careful tuning and configuration, you have to optimize the performance of the job and task managers, and its operational overhead is higher than simpler systems. So there's a kind of economy of scale with Flink: when you reach a certain point of complexity in live stream processing, you might need Flink, but if you're just starting out with event streaming, Kafka might be an easier introduction, especially if you go with a managed solution. And finally, Flink has really advanced APIs, but because they're so advanced there's a steep learning curve, so if you're newer to stream processing it's going to be difficult to get the maximum value out of those APIs from the start. So, to sum up and give you some ideas of which use cases are best for Kafka and which are best for Flink:
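The exactly-once guarantee discussed above boils down to making sure a redelivered message is never applied twice. One minimal way to sketch the idea is deduplication by message ID; real Flink achieves this with checkpointed operator state rather than a plain set, so treat this as the concept only, with invented names:

```python
class ExactlyOnceCounter:
    def __init__(self):
        self.seen_ids = set()  # in Flink this would live in checkpointed state
        self.total = 0

    def process(self, message_id, amount):
        if message_id in self.seen_ids:
            return self.total          # duplicate delivery: no double-counting
        self.seen_ids.add(message_id)
        self.total += amount
        return self.total

counter = ExactlyOnceCounter()
counter.process("m1", 10)
counter.process("m1", 10)  # retried/redelivered message is ignored
print(counter.total)       # 10, not 20
```

This is why exactly-once matters for the "duplicates just aren't allowed" cases above, like billing or fraud counters: without it, a retry after a failure silently inflates your numbers.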
Kafka is best for scenarios with really high-throughput, scalable event messaging, logging events in real time, and serving as a centralized log aggregator, where you're bringing in a lot of different sources, aggregating them together, and piping them out to another system for further processing. Organizations use Kafka a lot to handle metrics collection and to act as a buffer between their various metric-generating systems and their analysis tools, so you avoid direct point-to-point connections, which are much more brittle. A lot of the time, Kafka is simply replacing traditional message brokers thanks to its durability and its ability to decouple system components into a loosely coupled architecture. On the other side, Flink is really well suited for scenarios where low-latency live streaming analytics and accurate event-time processing are paramount, and for use cases that require complex event processing. Maybe you're doing live security analysis and pattern recognition on camera feeds; Flink is a great fit for that. Or fraud detection: say you're the New York Stock Exchange and you have to analyze all the streams going through your systems, identify specific patterns that could indicate fraud or something going wrong, and catch that in near real time; Flink is great for that too. It's really ideal for stateful applications that demand consistent, low-latency state management, and also for things like real-time data enrichment, where your streaming data is correlated with reference data stored in databases or static datasets, because Flink has the compute power necessary to bring those two sides of the house together.
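The enrichment use case just mentioned, joining a live stream against reference data, can be sketched with a dict standing in for the database lookup. These names are illustrative only; in Flink this would be expressed as a stream-table join or a lookup inside a stateful operator:

```python
def enrich(stream, reference):
    """Attach a 'city' field to each streaming record by looking up its user
    in the reference table; unknown users get a fallback value."""
    for record in stream:
        yield {**record, "city": reference.get(record["user"], "unknown")}

reference = {"u1": "London", "u2": "Berlin"}  # static/slowly-changing reference data
clicks = [{"user": "u1", "page": "/home"}, {"user": "u3", "page": "/buy"}]
print(list(enrich(clicks, reference)))  # each click gains a "city" field
```

The hard part in production, and the reason Flink's state management matters here, is keeping that reference data available and up to date next to the stream without making a remote database call per event.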
So that's all I have for you today. I just wanted to give you an objective look at what both of these systems are best at and clear up a little of the confusion about people treating them as the same thing. I hope you enjoyed this video, and I hope you have a great rest of your day. Data Guy, out.