Apache Kafka
These are my notes from the Confluent Apache Kafka Tutorials | Kafka 101 playlist.
Introduction
- Apache Kafka is a distributed streaming platform to collect, store and process real-time data streams at scale.
- It has numerous use cases, including distributed logging, stream processing and Pub-Sub Messaging.
Event
- An event is a record of something that has happened in the past. It is a fact.
- It's a combination of a notification and state.
- An event in Kafka is modeled as a key-value pair.
- Topics are the primary components of storage in Kafka.
- Events are immutable.
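The key-value model above can be sketched in plain Python (this is just an illustration of the shape of an event, not the Kafka client API; the device name and payload are made up):

```python
import json

# An event modeled as a key-value pair: the key identifies the entity,
# the value carries the state described by the notification.
# (Hypothetical example data, not from any real topic.)
event = {
    "key": "thermostat-42",                 # identifies the entity
    "value": json.dumps({"temp_c": 21.5}),  # serialized state
}

print(event["key"], event["value"])
```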
Topic
- Fundamental unit of organization for events in Kafka.
- Durable logs of events.
- Not indexed; consumers can only seek by offset.
- Retention period is configurable.
- Producing to and consuming from a topic is done through a Kafka broker.
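The topic-as-log idea can be sketched with a plain Python list (a toy model, not the broker's on-disk implementation): records are appended immutably, and reads seek by offset only.

```python
# Toy sketch of a topic as an append-only log of events.
log = []

def append(record):
    """Append a record and return its offset."""
    log.append(record)
    return len(log) - 1

def read_from(offset):
    """Seek by offset; there is no index or key lookup."""
    return log[offset:]

append("event-a")
append("event-b")
append("event-c")
```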
Partition
- Topics are broken down into partitions.
- When we write a message to a topic, it is written to one of the topic's partitions.
- The partition it's routed to is based on the key of the message.
- Each partition can live on a separate node in a Kafka cluster.
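Key-based routing can be sketched as hashing the key modulo the partition count. Note the hedge: Kafka's Java client uses a murmur2 hash in its default partitioner; md5 here is only a stand-in for a stable hash, so the point shown is just "same key, same partition".

```python
import hashlib

NUM_PARTITIONS = 3  # assumed partition count for illustration

def partition_for(key: bytes) -> int:
    # Illustrative only: Kafka's default partitioner uses murmur2;
    # md5 stands in here as a deterministic hash.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always routes to the same partition:
p1 = partition_for(b"thermostat-42")
p2 = partition_for(b"thermostat-42")
```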
Broker
- Kafka is a distributed system that consists of machines called brokers.
- Each broker hosts some set of partitions and handles incoming requests for producing and consuming messages.
- Brokers also handle replication of partitions across the cluster.
Replication
- We need to copy partition data to several other brokers to keep it safe.
- Those copies are called follower replicas.
- Whereas the main partition is called the leader replica.
- In general, writing and reading is done from the leader replica.
- The replication guarantees (e.g., how many replicas must acknowledge a write) are tunable in the producer.
Producer
- Applications that write events to Kafka are producers; applications that read them are consumers.
Consumer
- ConsumerRecords are returned from a call to poll on a Consumer.
- Reading a message doesn't delete it from the topic.
- Scaling consumers is automatic: consumers in the same group have the topic's partitions rebalanced among them as members join or leave.
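The outcome of that automatic scaling can be sketched in plain Python (the real group rebalance protocol is coordinated by Kafka; this toy round-robin assignment only illustrates the result: each partition is owned by exactly one consumer in the group):

```python
# Toy sketch: spread partitions across a consumer group's members.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
```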
Broader Kafka Ecosystem
- Things like Kafka Connect, Kafka Streams, KSQL, Schema Registry, etc. are all examples of infrastructure code.
Kafka Connect
- Ecosystem of pluggable connectors
- Data integration system and ecosystem
- An application running outside the cluster
Data Source -> Kafka Connect -> Cluster -> Kafka Connect -> Data Sink
- Horizontally scalable
- Fault tolerant
- A Connect worker, one of the nodes in the Connect cluster, runs one or more connectors
- A connector is a pluggable component that's responsible for interfacing with an external system
- Source connector - reads data from an external system and writes it to Kafka (producer)
- Sink connector - reads data from Kafka and writes it to an external system (consumer)
- Reading from a relational database, getting messages from Twitter, reading from a legacy HDFS file system, etc.: these are all the same kind of operation no matter the application
Note: This type of code is called undifferentiated code or infrastructure code. It's not the core business logic that makes the application unique or valuable; it's just the code that's necessary to get the application to work.
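A connector is typically configured declaratively rather than coded. The fragment below is a hedged sketch in the style of Confluent's JDBC source connector configuration; the connector class name follows that connector's documentation, while the connection URL, column, and prefix are hypothetical:

```json
{
  "name": "jdbc-source-sketch",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-"
  }
}
```

Submitted to a Connect worker, a config like this streams new rows from the database into Kafka topics without any custom producer code.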
Schema Registry
- New consumers will emerge over time
- Consumers need to know how to interpret the data (format of the messages in the topic)
- Schema Registry is a standalone service process that maintains a database of schemas
- Database is persisted in an internal Kafka topic, and it's cached in Schema Registry for low latency access
- When a producer wants to write a new message, it first checks with Schema Registry whether the message's schema is compatible with the one already registered
- Schema Registry currently supports three serialization formats: Avro, JSON Schema, and Protobuf
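For example, an Avro schema registered for a topic's values might look like the following (a hypothetical record type, shown only to illustrate what consumers look up in order to interpret messages):

```json
{
  "type": "record",
  "name": "Thermostat",
  "fields": [
    {"name": "device_id", "type": "string"},
    {"name": "temp_c", "type": "double"}
  ]
}
```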
Kafka Streams
- Kafka Streams is a stream processing API that's built into Kafka
- Filtering, grouping, aggregating, joining, etc.
- Scalable, fault tolerant state management
- It's a library, not infrastructure
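Kafka Streams itself is a Java library; the plain-Python sketch below is not the Streams API, it only illustrates the kinds of operations listed above (filter, group, aggregate) over a stream of (key, value) events:

```python
# Toy event stream of (region, amount) pairs -- made-up data.
events = [("eu", 3), ("us", 5), ("eu", 7), ("us", 1)]

# filter: keep only amounts greater than 2
filtered = [(k, v) for k, v in events if v > 2]

# group by key and aggregate (sum per region)
totals = {}
for k, v in filtered:
    totals[k] = totals.get(k, 0) + v
```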
ksqlDB
- ksqlDB provides a SQL-like language on top of Kafka Streams
- Database optimized for stream processing
- Runs on its own scalable, fault-tolerant cluster adjacent to the Kafka cluster
- Provides an integration with Kafka Connect
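To give a feel for the SQL-like language, here is a hedged sketch in ksqlDB syntax; the stream, topic, and column names are hypothetical:

```sql
-- Declare a stream over an existing Kafka topic (names are made up).
CREATE STREAM readings (device_id VARCHAR, temp_c DOUBLE)
  WITH (KAFKA_TOPIC = 'readings', VALUE_FORMAT = 'JSON');

-- A continuously updating (push) query: average per device per minute.
SELECT device_id, AVG(temp_c) AS avg_temp
  FROM readings
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY device_id
  EMIT CHANGES;
```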