
Kafka, The What, Why and How?

System Design - This article is part of a series.

What is Kafka?

Created at LinkedIn and written in Java and Scala, Apache Kafka is a distributed event streaming platform that can scale to massive pipelines of real-time data. So, what is event streaming? Event streaming is capturing real-time data from event sources (anything from IoT devices and mobiles to cloud services, software applications, etc.); storing it reliably for retrieval; manipulating it if necessary; and routing it to different destinations if necessary. Kafka can do this efficiently at scale. Kafka is not an in-memory store like Redis or Memcached; rather, it stores data on disk.

Why Kafka?

The main reasons to use Kafka are,

  • High throughput - Capable of handling a high velocity and volume of data.
  • High scalability - It can scale to thousands of brokers, and scale up or down as required.
  • Low latency - Can achieve low latency with a cluster setup.
  • High availability - Since clusters can be replicated across servers and geographies, it is extremely fault tolerant, with very minimal risk of data loss.

Just a side note: Kafka is a good fit when high throughput is required. Although Kafka offers strong durability guarantees, it is not advised to use it as a DB — if we need to look up data, a DB is faster than Kafka. KSQL is currently available, which allows us to query the message stream and maintain continuously updated derived tables.

How does Kafka work?

Kafka is composed of multiple components. We will list them first and understand each one with a sample use case of a food delivery app.


Message

A message is the atomic unit of data in Kafka. It can be JSON, an integer, a string, etc. Messages can have a key associated with them, which can be used to determine the destination partition.

For our example, it can be something like this for order info.

    "orderId": "1",
    "status": "Cooking",
    "food": [
        { "name": "Pizza",  "qty" : "1"},
        { "name": "Coke",  "qty" : "1"}
    "city": "Chennai"


Topic

Topics are logical groupings of events. We can have separate topics for different types of messages.

For our example, maybe we can have different topics for order status info, delivery partner location info, etc.


Broker

Brokers are the Kafka server instances that store and replicate events. We will try our example with 1 Kafka broker.


Producer

A client app that puts events into Kafka. One more key thing the producer must do is decide which partition to put the data in, based on the key.

  • No key - If no key is specified, the producer distributes messages across partitions, balancing the load between them.
  • Key specified - When a key is specified, consistent hashing decides the partition. The same key always goes to the same partition, since hashing the same key always produces the same result.
  • Partition specified - We can hardcode the partition.
  • Custom - We can write our own partitioning rules as per our requirements.

For our example, we will use the location (city) as the key for deciding the partition.
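The key-to-partition mapping described above can be sketched with a toy hash function. Note this is only an illustration: Kafka's default partitioner hashes the key with murmur2, not the simple rolling hash below.

```javascript
// Illustrative stand-in for Kafka's key hashing (Kafka actually uses murmur2).
function hashKey(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.charCodeAt(0)) | 0; // 32-bit rolling hash
  }
  return Math.abs(h);
}

// hash(key) modulo partition count picks the destination partition,
// so the same key always lands on the same partition.
function choosePartition(key, numPartitions) {
  return hashKey(key) % numPartitions;
}

const p1 = choosePartition("Chennai", 2);
const p2 = choosePartition("Chennai", 2);
console.log(p1 === p2); // true
```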


Consumer

A client app that consumes events from Kafka. Each time the consumer processes a record, it updates the offset as well.


Partition

Partitions spread a topic's events into different buckets, similar to sharding in a DB. Instead of scaling a single partition vertically, we can scale a topic horizontally by adding more partitions. Note that partitions do not hold the same data: based on the key specified in the message, a partition is chosen and the data is written there.

For our example, we can have 2 partitions for each topic.

Replication factor

The replication factor specifies how many copies of each partition should exist. It is specified at the topic level. So if we have 1 topic, 2 partitions, and a replication factor of 2, we end up with 4 partition replicas in total. Note that the replication factor must be less than or equal to the number of brokers.

For our example, we can go with a replication factor of 2.
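The arithmetic above (partitions × replication factor = total partition replicas) and the broker-count constraint can be sketched as a toy replica-placement routine. The round-robin placement here is only an illustration, not Kafka's actual assignment algorithm.

```javascript
// Assign replicas of each partition to brokers round-robin. This shows why
// the replication factor cannot exceed the broker count: a broker never
// holds two replicas of the same partition.
function assignReplicas(numPartitions, replicationFactor, brokers) {
  if (replicationFactor > brokers.length) {
    throw new Error("replication factor cannot exceed broker count");
  }
  const assignment = {};
  for (let p = 0; p < numPartitions; p++) {
    assignment[p] = [];
    for (let r = 0; r < replicationFactor; r++) {
      assignment[p].push(brokers[(p + r) % brokers.length]);
    }
  }
  return assignment;
}

// 1 topic, 2 partitions, replication factor 2 -> 4 partition replicas in total.
const plan = assignReplicas(2, 2, ["broker-0", "broker-1"]);
console.log(plan);
```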


Offset

To keep track of the messages already processed, we have something called an offset. Once a message in a partition has been consumed, the consumer increments the offset. If a consumer goes down, work can resume from the same offset.
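The resume-from-offset behaviour can be sketched as a toy simulation, where a plain array stands in for a partition's log (real consumers commit their offsets back to Kafka):

```javascript
// Consume a partition's log starting from a committed offset, collecting the
// processed records and returning the new offset to commit.
function consumeFrom(partitionLog, committedOffset, processed) {
  let offset = committedOffset;
  while (offset < partitionLog.length) {
    processed.push(partitionLog[offset]); // "process" the record
    offset += 1;                          // advance the offset
  }
  return offset;
}

const partitionLog = ["order-1", "order-2", "order-3", "order-4"];
const processed = [];

// Suppose a previous consumer run committed offset 2 before going down:
// a fresh run resumes from there instead of reprocessing everything.
const newOffset = consumeFrom(partitionLog, 2, processed);
console.log(processed); // [ 'order-3', 'order-4' ]
console.log(newOffset); // 4
```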


Zookeeper

Zookeeper is a separate coordination service that keeps track of the Kafka brokers, cluster metadata, etc. (older Kafka versions also stored consumer offsets there; newer versions store them in Kafka itself).

Consumer Group

Consumer groups let multiple consumers share the work of consuming from the same topic, helping attain higher consumption rates than a single consumer could.

One thing we need to be clear about: within a consumer group, one consumer can consume from multiple partitions, but multiple consumers cannot consume from a single partition.

Also, consumer groups are self-balancing. Let’s understand this with an example.

  1. Exact match - 4 partitions and 4 consumers in a consumer group result in 1 consumer per partition.
  2. Fewer consumers - 4 partitions and 2 consumers in a consumer group result in 2 partitions per consumer.
  3. More consumers - 4 partitions and 5 consumers in a consumer group result in 1 consumer per partition and 1 idle consumer.
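The three balancing cases above can be reproduced with a toy round-robin assignment. Real Kafka uses pluggable assignor strategies (e.g. range, round-robin); this is only an illustration of the balancing outcome:

```javascript
// Balance partitions across the consumers of a group round-robin. A partition
// is owned by at most one consumer within a group, so extra consumers stay idle.
function assignPartitions(numPartitions, consumers) {
  const assignment = Object.fromEntries(consumers.map((c) => [c, []]));
  for (let p = 0; p < numPartitions; p++) {
    assignment[consumers[p % consumers.length]].push(p);
  }
  return assignment;
}

console.log(assignPartitions(4, ["c1", "c2", "c3", "c4"]));
// exact match: 1 partition per consumer
console.log(assignPartitions(4, ["c1", "c2"]));
// fewer consumers: 2 partitions per consumer
console.log(assignPartitions(4, ["c1", "c2", "c3", "c4", "c5"]));
// more consumers: c5 is assigned no partitions (idle)
```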

Based on this we might wonder about MQ and Pub-Sub.

  • MQ - If we want Kafka to act as an MQ, where each event is processed by only one consumer, we can have as many consumers as partitions (or fewer) in a single consumer group.
  • Pub-Sub - We can have multiple consumer groups reading from the same topic/partitions.

In our example, we can have a single consumer group.

Example program

If you are interested, you can try out the sample code from GitHub. Link

To run Kafka and Zookeeper locally, we presume Docker and docker-compose are installed.

docker-compose -f docker-compose.yaml up

Please refer to this GitHub project for alternative setups.

The sample project has 3 files:

  • admin.js is used to create the topics. We can create as many as required.
node admin.js
  • producer.js opens up a CLI interface where we can enter the order ID, status, and location.
node producer.js
  • consumer.js starts a consumer and displays messages as and when they are received. We can mention the consumer group it should be part of. Since we have 2 partitions, it is ideal to create 2 consumers in the same consumer group. We can then see that all messages with location Bangalore go to one partition, and the rest go to the other.
node consumer.js group1

There are further advanced topics, such as KSQL, Kafka Streams, etc., which we can explore in a future article.
