An Introduction to Apache Kafka

Organizations are shifting from traditional batch processing on databases to a streaming-first approach built around an event broker. Apache Kafka has become a key component for organizations that need to handle huge amounts of data in real-time or near-real-time projects.

Implementing Kafka can help an organization to:

  • Support real-time streaming data from Internet of Things sensors, mobile users, web page activity, and more.
  • Consolidate multiple message brokers in one place to standardize the data.
  • Provide a scalable, high-throughput, parallel, and resilient messaging system.

A few of the popular use cases for Apache Kafka:

Messaging: Kafka works well as a replacement for a traditional message broker such as ActiveMQ or RabbitMQ. Message brokers are used for a variety of reasons, such as decoupling processing from data producers and buffering unprocessed messages.

Web Activity Tracking: The original use case for Kafka was tracking website user activity (page views, searches, etc.). Website activity generates a huge volume of messages, which traditional systems often cannot process in real time.

Metrics: Another popular use case for Kafka is aggregating monitoring data for statistics, security, and operational efficiency.

Log Aggregation: Another popular use case for Kafka is collecting logs from various services and systems into a central place for processing, making them available to consumers in a standard format.

Stream Processing: Various frameworks, such as Kafka Streams, read data from topics, process it, and write the processed data to another topic.
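
As a minimal Java sketch of this pattern (assuming a Kafka Streams dependency, a broker at localhost:9092, and hypothetical topic names input-topic and output-topic), the following reads records from one topic, transforms each value, and writes the results to another topic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // identifies this streams app
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("input-topic");      // hypothetical source topic
            source.mapValues(value -> value.toUpperCase())                       // per-record transformation
                  .to("output-topic");                                           // hypothetical sink topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }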

Event Sourcing: Event sourcing is a style of application design in which state changes are recorded as a time-ordered sequence of events, which typically produces a huge amount of data. Kafka's ability to store huge amounts of data makes it an excellent backend for an application built in this style.

Commit Log: Apache Kafka can serve as an external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.

When not to use Apache Kafka?

Apache Kafka in an Organization

Kafka is a streaming platform built on a distributed, persistent, highly scalable, resilient, append-only log.

Like all distributed systems, Kafka consists of multiple components located on different machines that communicate with each other and coordinate tasks. The instances that run Kafka's main workload are called brokers.

The following components make up this high-level architecture.

Record: Data written to the Kafka cluster is referred to as a record.

Broker: A single Apache Kafka node is a broker. It receives messages from producers, assigns offsets to them, and commits them to persistent storage.

Producer: An application or system that connects to at least one Kafka broker and can commit data to the Kafka cluster. Kafka producers send messages to the cluster in the form of records, which contain the topic name and, optionally, the partition number to send to.
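
A minimal producer sketch in Java might look like the following (assuming a broker at localhost:9092; the topic name, key, and value are hypothetical):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A record carries the topic name, an optional key, and a value.
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "viewed /home"); // hypothetical topic/key/value
                producer.send(record, (metadata, exception) -> {
                    if (exception == null) {
                        System.out.printf("Written to partition %d at offset %d%n",
                                          metadata.partition(), metadata.offset());
                    }
                });
            }
        }
    }

When a key is supplied and no partition number is given, Kafka hashes the key to choose the partition, so records with the same key land in the same partition.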

Partition: Kafka topics are divided into multiple partitions. More partitions allow writes and reads on a topic to scale in parallel. Each record in a partition has a unique offset. Replication in Kafka is maintained at the partition level. In addition, Kafka uses leader election to choose which replica handles all write and read requests for a specific partition, and the other replicas follow the leader.

Replication: Every partition is replicated across Kafka brokers. Producers and consumers send write and read requests to the leader replica; the other replicas follow the leader for data consistency and redundancy.

Topic: All Kafka records are organized into topics. A topic is a named feed or category to which records are committed and published.
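
Tying these pieces together, here is a hedged sketch using Kafka's AdminClient (assuming a broker at localhost:9092 and a hypothetical topic name) that creates a topic with an explicit partition count and replication factor:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions for parallel reads/writes, replication factor 3 for redundancy.
                NewTopic topic = new NewTopic("page-views", 6, (short) 3);           // hypothetical topic name
                admin.createTopics(Collections.singleton(topic)).all().get();        // block until created
            }
        }
    }

The same settings can also be supplied through the kafka-topics command-line tool shipped with Kafka.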

Consumer: Kafka consumers read the messages created by Kafka producers. Every consumer belongs to a consumer group, identified by a group ID, which controls how data from topics is divided among the group's members.
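
A minimal consumer sketch in Java could look like this (assuming a broker at localhost:9092; the topic and group names below are hypothetical):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-readers");        // hypothetical group ID
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("page-views"));           // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                          record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }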

Cluster: A collection of brokers that store data in replicated partitions.

Offset: Kafka maintains a numerical offset for each record in a partition. The offset acts as a unique identifier of a record within that partition and also denotes the position of a consumer in the partition.
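
For illustration, the consumer API also allows repositioning within a partition by offset; this hypothetical helper (reusing the consumer configuration sketched above) rewinds to a fixed offset, for example to reprocess older records:

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        static void rewind(KafkaConsumer<String, String> consumer) {
            TopicPartition partition = new TopicPartition("page-views", 0); // hypothetical topic, partition 0
            consumer.assign(Collections.singleton(partition));              // manual partition assignment
            consumer.seek(partition, 100L);                                 // next poll() starts at offset 100
        }
    }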

ZooKeeper: Apache ZooKeeper is used to manage the cluster of brokers. ZooKeeper is responsible for keeping track of the state of cluster components (brokers, topics, users).

I'm an IT Infrastructure and Operations Architect with extensive experience and administration skills, working for Turk Telekom. I provide hardware and software support for IT Infrastructure and Operations tasks.
