Understanding KAFKA and how to use it effectively

Hemanth Peela
5 min read · May 4, 2022

Before getting into Kafka, let me give you a brief overview of why Kafka emerged and which earlier systems and competing approaches were essential to the development of data streaming.

History

On the WWW (World Wide Web) until 1993, all content was indexed by hand and hosted on the CERN web server. There was no search engine to find a page or view a website; content was accessible only by following hyperlinks, using browsers such as Mosaic.

After that, Yahoo pioneered a way to find web pages on the internet. All web pages were listed in a web directory, and users could find content either by browsing the directory or by using the Yahoo search engine (similar to a telephone directory that lists every number and address in the country).

Over time, it became very difficult to copy all web page content into directories and crawl it to fetch the best results for a search.

In the early 2000s, the Google search engine introduced a new approach, ranking pages with the PageRank algorithm at large scale to cope with the huge volume of data (this is where the Big Data concept took shape) and fetch results effectively.

During this time, many companies tried to build Big Data solutions based on Google's published papers (MapReduce and the Google File System), which led to the development of open-source Hadoop. Hadoop gained huge momentum, and many innovative solutions and frameworks were developed around it.

Hadoop adopted a batch-processing approach to the Big Data problem.

Data streaming

The main Big Data problems to solve are data collection, data storage, data processing, and data access. Solutions must be applied at each of these stages and work reliably and efficiently in real time.

There are many frameworks that address these data-processing problems; a few are Apache Kafka, Spark Streaming, Apache Flink, Apache Storm, Amazon Kinesis, Google Cloud Dataflow, Azure Data Factory, etc.

Now that we know the Big Data problem and the various frameworks that address it, we can dive deep into Apache Kafka and its uses. We will skip the other frameworks for the time being so we can focus on understanding Kafka and its future.

Find Kafka event streaming business use cases at this link: https://kafka.apache.org/powered-by

Apache Kafka

Kafka: the name derives from the author Franz Kafka, whose writing is often described as complex, bizarre, or illogical.

Kafka was originally developed at LinkedIn as an internal project and was later open-sourced. It acts as a messaging broker used to publish/subscribe to, store, and process data.

Events are everywhere. An event is data that originates from a system that continuously creates it: logs, transaction activity, requests, user actions, app notifications, and so on. All of this data can be traced, and each event can be used to derive business requirements.

Kafka allows users to publish and subscribe to data across any number of real-time applications or systems (especially in microservices). We will see in detail how that data is managed and processed.
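To make the publish side concrete, here is a minimal sketch using the Java Producer API. The broker address localhost:9092, the topic name user-events, and the key/value strings are all placeholders for illustration, not details from this article.

// Minimal sketch: publish one event to a Kafka topic (all names are placeholders).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; the key influences which partition the record lands in.
            producer.send(new ProducerRecord<>("user-events", "user-42", "clicked-checkout"));
            producer.flush();
        }
    }
}

Any application that needs the event simply subscribes to the same topic; the producer never needs to know who the consumers are.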

Building Blocks

Building blocks of Kafka

As the figure above shows, these components integrate to solve data streaming. Uses and examples of each building block are easy to find on the web with a quick search.

A Kafka broker organises messages into topics. Each topic has multiple partitions (physical directories) that store those messages, and each partition is replicated according to the topic's replication factor. If one partition's broker fails, a replica on another broker retains the message data (via leader partitions and follower partitions).

Leader partitions hold the working copy of the data, and follower partitions hold replicas of that working copy. If the leader partition goes down, a follower partition is promoted to leader, taking its place and retaining the data.

Partitions and their replicas are assigned using a round-robin method, so data is distributed evenly across partitions and brokers while maintaining the replication factor.
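As a concrete example, below is a sketch of creating a topic with explicit partition and replication settings through the Java Admin API. The topic name orders, the choice of 3 partitions and a replication factor of 2, and the broker address are all illustrative assumptions.

// Sketch: create a topic with 3 partitions, each replicated on 2 brokers.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the messages; replication factor 2 keeps a follower
            // copy of every leader partition on another broker.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}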

Cluster Architecture

Kafka Cluster which has multiple Brokers

A group of brokers can be formed into one cluster to share the workload, and more brokers can be added to a cluster as needed. ZooKeeper uses ephemeral nodes to monitor the active brokers in the cluster and to redistribute a broker's workload to the other brokers if it becomes inactive.

Each broker has a unique ephemeral node, for example B1, B2, and B3. If one broker dies, only the remaining nodes are left in the cluster. You can list the active brokers in ZooKeeper with this command:

ls /brokers/ids

Using the controller broker, the workload is then distributed to the remaining brokers. A cluster has only one controller broker at a time; if the controller goes down, another broker becomes the controller and takes over responsibility for distributing the workload.
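You can also inspect the active brokers and the current controller programmatically. Below is a sketch using the Java Admin API; the broker address is again an assumed placeholder.

// Sketch: ask the cluster which brokers are alive and which one is the controller.
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            for (Node broker : cluster.nodes().get()) {
                System.out.println("Active broker: " + broker.id() + " at " + broker.host());
            }
            System.out.println("Controller broker: " + cluster.controller().get().id());
        }
    }
}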

Kafka uses these building blocks and this cluster management across systems to achieve fault tolerance and even distribution of work.

Kafka APIs

Kafka APIs are sets of rules defined for communication between two systems. Using the core APIs below, Kafka achieves effective data stream processing.

Kafka includes five core APIs:

  1. The Producer API allows applications to send streams of data to topics in the Kafka cluster.
  2. The Consumer API allows applications to read streams of data from topics in the Kafka cluster (a minimal sketch follows this list).
  3. The Streams API allows transforming streams of data from input topics to output topics.
  4. The Connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
  5. The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.
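As an illustration of the Consumer API (item 2 above), here is a minimal sketch in Java. The broker address, the topic user-events, and the group id demo-group are placeholders carried over from the producer sketch, not details from this article.

// Minimal sketch: read events from a topic (all names are placeholders).
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("group.id", "demo-group");                         // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");                  // read the topic from the start

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}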

Explore more about these APIs in the official documentation: https://kafka.apache.org/documentation/#api

I hope the information above gave you some insight into Kafka and encourages you to learn more about it. The links below are great resources for getting started.

Kafka internal architecture tutorial: Understanding the Apache Kafka® architecture and how it works

Kafka Fundamentals: Start building with 6 new, hands-on tutorials

Udemy best example-driven training: https://www.udemy.com/course/kafka-streams-real-time-stream-processing-master-class/

There are many other Kafka topics to discuss, such as storage, Kafka Connect, pipelines, data mesh, schema generation, etc., which are vast concepts and difficult to cover in this blog. I will come back with a detailed analysis of each of them in upcoming posts.

Please follow my channel for more on Kafka and for full-stack development insights I have gathered over the years.


Hemanth Peela

Hemanth Peela is an AVP at Wells Fargo and an expert in full-stack development. He loves travelling and photography, and is a gadget guru, gaming addict, and Marvel fanatic.