Kafka is a distributed system that provides a highly scalable, elastic, fault-tolerant, and secure solution to event streaming.
What is Event Streaming?
… event streaming is the practice of
1. capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval;
2. manipulating, processing, and reacting to the event streams in real-time as well as retrospectively;
3. and routing the event streams to different destination technologies as needed.
Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.https://kafka.apache.org/intro
There are two systems for event streaming: message queue (peer-to-peer) and publish/subscribe.
In a message queue, producers produce messages and send them to a queue, where messages will be stored and, once consumed, will then be deleted. Notably, each message will be consumed once by a single consumer.
In this pattern, publishers do not program the messages to be sent directly to specific subscribers but instead categorized the messages into classes without knowledge of which subscribers there may be. Similarly, subscribers express interest in one or more classes and receive messages from those classes (Wikipedia).
Kafka follows publish/subscribe pattern.
Main Concepts in Kafka
- Kafka servers: a cluster of servers. Some of them serve as data storage centers, called brokers; others run Kafka Connect for importing and exporting data.
- Kafka clients: they allow you to write programs that read, write, and process streams of events.
- Event: an event consists of a key, value, timestamp, and optional metadata headers.
- Producers: client applications that produce events and sent them to Kafka servers.
- Consumers: client applications that consume events from Kafka servers. Note that producers and consumers are fully decoupled so that high scalability could be achieved, which is the most important feature.
- Topics: events are classified into topics, and there could be more than one producer and consumer for a certain topic. Events in topics will not be deleted after being consumed and can be consumed as many times as needed.
- Partitioned: topics are partitioned into ‘buckets’ on different brokers. Events with the same key will be put into the same bucket, and events in the same bucket will be consumed in the same order as they were written. Being partitioned allows client applications to produce and consume at the same time.
- Offset: Each partitioned event is associated with a unique id called offset which is immutable to producers, so that producers can only append the message but cannot modify the existing ones. Consumers, however, can control the offset. Normally a consumer would increase the offset linearly while reading, but it can also reset to an older offset to reprocess messages.
- Replicated: each partition will be in a leader broker and be copied to followers broker as well for fault-tolerance. The leader broker will handle reads/writes, and changes will be copied to followers. When the leader is down, one of the followers will become the leader then.
- Consumer group: given a topic, the event will only be consumed by one consumer in a consumer group. If all consumers are in different consumer groups, then events in this topic will be broadcasted to all.