MAY 1, 2026 by kalyani

Building a Mini Kafka from Scratch in Java

A deep dive into building a Kafka-like system from scratch in Java, exploring logs, offsets, networking, and core distributed systems concepts.

I wanted to understand Kafka beyond APIs and configuration files. Instead of only using Kafka, I decided to build a small Kafka-like system from scratch in Java.


Project Architecture

MiniKafka is built as a multi-module Java project using Gradle.

Instead of a monolithic codebase, I separated responsibilities into the following modules:

  • broker - network server + storage engine
  • producer - client that sends data to broker
  • consumer - client that fetches data from broker
  • protocol - request/response message definitions
  • common - shared utilities and models

This structure mirrors how real distributed systems isolate concerns and is similar to how Kafka separates clients, protocol, and core logic.

High level layout:

minikafka
 ├── broker
 ├── producer
 ├── consumer
 ├── protocol
 └── common

Kafka’s Core Idea: The Append-Only Log

At its heart, Kafka is a distributed commit log, not a traditional message queue.

The first thing I implemented was this log.

Each record written to disk is stored in binary form:

[offset][timestamp][size][payload]
  • Offset - logical position of the record
  • Timestamp - when it was written
  • Size - payload length
  • Payload - actual message

Records are only appended, never updated or deleted.

This design avoids random disk writes and uses sequential IO, which is extremely fast and works well with the OS page cache. This is one of the main reasons Kafka achieves high throughput on commodity hardware.

Each log segment behaves like an immutable history of events.


Java NIO and a Real Kafka Constraint

While implementing log storage using Java NIO, I initially tried opening a file with:

READ + WRITE + APPEND

Java throws:

IllegalArgumentException: READ + APPEND not allowed

On Unix systems, append mode enforces atomic end-of-file writes and cannot be combined with reads.

This forced me to manually move the file pointer to the end of the file and manage positions myself.

Kafka follows the same idea internally: read and write paths are carefully separated.

This small issue taught me a real systems-level constraint that Kafka engineers also deal with.


LogSegment in Kafka vs MiniKafka

Responsibilities of LogSegment in both systems:

  • Append records
  • Read records by offset
  • No knowledge of networking
  • No knowledge of consumers

Storage is completely independent from how data is transmitted.


Offset Index

Initially, reading a record required scanning the log from the beginning.

To improve this, I implemented an offset index file:

[offset][bytePosition]

This maps logical offsets to physical byte positions inside the log file.

With this index:

  • Reads become O(1)
  • Broker can jump directly to the record
  • Enables fast fetch, replication, and recovery

Kafka uses the same idea, though with sparse and memory-mapped indexes.


Durability vs Throughput

At first, I called fsync() after every write.

This is durable but extremely slow.

Kafka does not do this.

Instead:

  • Records are written to the OS page cache
  • Disk flush happens periodically
  • Multiple records are batched

MiniKafka now follows the same model.

This demonstrates a fundamental distributed systems tradeoff:

Higher throughput vs stronger durability guarantees


Networking Model (TCP)

MiniKafka uses plain TCP sockets.

On the broker side:

new ServerSocket(9092)
serverSocket.accept()

On the client side:

new Socket("localhost", 9092)

This performs a real TCP handshake and establishes a persistent connection.

Kafka also uses long-lived TCP connections between clients and brokers.


Pull-Based Consumption

Kafka is pull-based.

Consumers ask for data. Brokers never push.

MiniKafka follows the same model.

Flow:

Consumer
  |
  | FETCH test-topic 0 0
  v
BrokerServer
  |
  | log.read(0)
  v
LogSegment
  |
  | read from disk
  v
hello
  |
  | DATA hello
  v
Consumer

Consumer Offsets

Offsets are tracked on the client side.

Each consumer maintains its own position.

This enables:

  • Replay
  • Independent consumption
  • Crash recovery

MiniKafka stores offsets in small files per consumer group.

Kafka stores them in an internal offsets topic.

Concept is the same.


Consumer Groups

Consumers with the same groupId share progress.

Only one active consumer is allowed per group in MiniKafka (using a simple lock file).

This simulates Kafka’s group coordinator.

Different groups can read the same data independently.


Producer Support

MiniKafka includes a producer client that sends:

PRODUCE topic message

Broker appends the record to the log.

Consumers can immediately fetch it.

This completes the full pipeline:

Producer → Broker → Disk → Consumer

Protocol Module

Requests are modeled as simple objects:

  • ProduceRequest
  • FetchRequest

They serialize to strings over TCP.

Kafka uses a binary protocol, but the architectural idea is identical.


How MiniKafka Is Like Real Kafka

  • Append-only commit log
  • Offset-based addressing
  • Index file for fast reads
  • Pull-based consumers
  • Client-side offsets
  • Consumer groups
  • Batching for durability
  • Layered architecture

How MiniKafka Is Not Kafka

  • No partition rebalancing
  • No replication
  • No leader election
  • No zero-copy transfer
  • No memory-mapped files
  • Text protocol instead of binary

What This Project Taught Me

  • How Kafka works internally
  • Why logs are better than queues for streaming
  • How OS page cache enables performance
  • Why offsets are logical, not physical
  • How distributed systems separate storage, networking, and coordination