
Kafka Broker

  • OordaMage
  • Jan 26
  • 5 min read

Updated: Mar 11

A Kafka broker is just a server.


Producers send messages to brokers; the broker assigns offsets to the messages and writes them to storage on disk.

A broker also serves consumers by handling fetch requests for partitions and delivering the published messages in response.

Kafka brokers work in a cluster environment. One broker in the cluster acts as the cluster controller, responsible for administrative operations such as assigning partitions to brokers and monitoring for broker failures.

Kafka Retention: Retention defines how long messages are durably stored. This can be a period of time or until the partition reaches a certain size in bytes.


Kafka Multiple Clusters

We can set up multiple clusters, which has the following advantages:

  • Segregation

  • Isolation

  • Disaster recovery (DR)


Kafka makes use of a tool called MirrorMaker for replicating data to other clusters. MirrorMaker is a Kafka consumer and producer linked together with a queue: messages are consumed from one Kafka cluster and produced to another.
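As a sketch, the legacy MirrorMaker ships as a shell script in the Kafka distribution; the config file names here are illustrative:

```shell
# Consume from the source cluster (source-cluster.properties points at its
# brokers) and produce everything to the target cluster.
bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --whitelist ".*"
```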


Kafka Broker Configurations:


broker.id: An integer identifier. It must be a unique value for each broker in the cluster.


listeners: Used to configure the hostnames and ports the broker listens on.


zookeeper.connect: The location of the ZooKeeper servers. This is a comma-separated list of hostnames with the format hostname:port/path, where hostname is the hostname or IP of the ZooKeeper server, port is the port ZooKeeper is running on, and /path is an optional ZooKeeper path to use as a chroot environment for the Kafka cluster.
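Putting the three settings above together, a minimal server.properties fragment might look like this (hostnames and values are illustrative):

```properties
broker.id=1
listeners=PLAINTEXT://0.0.0.0:9092
# two ZooKeeper hosts sharing a /kafka chroot for this cluster
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181/kafka
```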


log.dirs: Used to specify the log path. More than one path can be provided as a comma-separated list, and the broker will store partitions across them in a least-used fashion.


num.recovery.threads.per.data.dir: Kafka makes use of a configurable pool of threads for handling log segments. Currently, this thread pool is used:

  • When starting normally, to open each partition’s log segments

  • When starting after a failure, to check and truncate each partition’s log segments

  •  When shutting down, to cleanly close log segments


By default, only one thread per log directory is used. As these threads are only used during startup and shutdown, it is reasonable to set a larger number of threads to parallelize operations. Specifically, when recovering from an unclean shutdown, this can mean a difference of several hours when restarting a broker with many partitions! When setting this parameter, remember that the number configured is per log directory specified with log.dirs. This means that if num.recovery.threads.per.data.dir is set to 8, and there are 3 paths specified in log.dirs, a total of 24 threads will be used.
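The arithmetic above can be sketched in a couple of lines (the directory paths are illustrative):

```python
# Total recovery threads = threads per data dir x number of log.dirs paths.
num_recovery_threads_per_data_dir = 8
log_dirs = ["/disk1/kafka", "/disk2/kafka", "/disk3/kafka"]  # 3 paths

total_threads = num_recovery_threads_per_data_dir * len(log_dirs)
print(total_threads)  # 24
```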


auto.create.topics.enable: Allows the broker to automatically create a topic when:

  • a producer starts writing messages to the topic

  • a consumer starts reading messages from a topic

  • any client requests metadata for the topic


auto.leader.rebalance.enable: Ensures the Kafka cluster doesn't become unbalanced by having all topic leadership concentrated on one broker.


delete.topic.enable: Depending on the environment and data-retention guidelines, you can lock down your cluster to prevent deletion of topics by disabling this setting.


Default topic configuration

num.partitions: Defines how many partitions a new topic is created with. This defaults to 1. A general practice is to match the number of partitions to the number of brokers in the cluster.


default.replication.factor: If auto topic creation is enabled, this configuration sets the default replication factor for new topics.


log.retention.ms: Used to specify how long Kafka will retain messages, by time. The default is 168 hours (one week), specified by the log.retention.hours field. Note that if more than one of these settings is specified, the smaller unit takes precedence.


log.retention.bytes: Used to specify how much data Kafka will retain per partition, by size in bytes. If this is set to -1, retention is infinite.
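As a sketch, the two retention settings can be combined; whichever limit is reached first triggers deletion (the values here are illustrative):

```properties
# retain messages for 7 days (604,800,000 ms)...
log.retention.ms=604800000
# ...or until a partition reaches 1 GiB, whichever comes first
log.retention.bytes=1073741824
```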


log.segment.bytes: When messages are produced to Kafka, they are appended to the current log segment for the partition. Once the log segment reaches the size specified by log.segment.bytes, the segment is closed and a new one is opened.


This matters if you have a topic that receives only 100 MB of data per day: since log.segment.bytes defaults to 1 GB, it will take about 10 days to fill one segment. And since messages cannot expire until the log segment is closed, it will take 10 days + log.retention.ms time for those messages to expire.
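The rough segment-fill time from this example works out as follows:

```python
# Time for a low-traffic topic to fill one log segment
# (figures from the example above).
segment_bytes = 1024 ** 3             # log.segment.bytes default: 1 GiB
daily_inflow_bytes = 100 * 1024 ** 2  # topic receives ~100 MiB per day

days_to_fill = segment_bytes / daily_inflow_bytes
print(f"{days_to_fill:.2f} days to fill one segment")  # 10.24 days
```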


log.roll.ms: Allows log segments to be closed after a specified amount of time, regardless of size.


min.insync.replicas: Ensures that at least N replicas are in sync with the producer. It should be used with the producer setting acks=all. Setting it to 2, for example, ensures that at least two replicas acknowledge a write for it to be successful.
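As an illustrative fragment, the broker-side and producer-side halves of this guarantee look like this:

```properties
# broker (or topic-level) configuration: require 2 in-sync replicas
min.insync.replicas=2

# producer configuration: wait for all in-sync replicas to acknowledge
acks=all
```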


message.max.bytes: Specifies the maximum size of a message that can be produced. The default is 1 MB. If a producer sends a message larger than this, it will receive an error from the broker and the message will not be accepted.


What happens if we increase the size?

Larger message size will mean that broker threads that deal with processing network connections and requests will be working longer on each individual request.

Large message sizes also increase the size of disk writes, impacting I/O throughput.


Hardware requirements:


Disk Throughput: Producer performance is directly influenced by the throughput of the broker disk used for storing log segments. The Kafka broker must commit each message to local storage, and most clients will wait until at least one broker has confirmed that the message has been committed before considering the send successful. This means faster disk writes equal lower produce latency.


Disk Capacity: This is calculated from how many messages need to be retained and the retention period. For example, if you receive 100 GB of data each day and wish to retain that data for 7 days, you'd need close to 700 GB of storage.
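The capacity estimate above can be sketched as a one-line calculation; note this is the raw, un-replicated figure, so in practice you would also multiply by the replication factor:

```python
# Raw storage needed = daily ingest x retention period (un-replicated).
daily_ingest_gb = 100
retention_days = 7

required_gb = daily_ingest_gb * retention_days
print(f"{required_gb} GB of storage")  # 700 GB of storage
```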


Memory: The normal mode of operation for a Kafka consumer is reading from the end of the partitions, where the consumer is caught up and lagging behind the producers very little, if at all. In this situation, the messages the consumer is reading are optimally stored in the system's page cache, resulting in faster reads than if the broker has to reread the messages from disk. Therefore, having more memory available to the system for page cache will improve the performance of consumer clients.

Kafka itself does not need much heap memory configured for the Java Virtual Machine (JVM). Even a broker that is handling 150,000 messages per second and a data rate of 200 megabits per second can run with a 5 GB heap. The rest of the system memory will be used by the page cache and will benefit Kafka by allowing the system to cache log segments in use.
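As an illustration, the Kafka start scripts honor the KAFKA_HEAP_OPTS environment variable, so a 5 GB heap like the one described above could be set as:

```shell
export KAFKA_HEAP_OPTS="-Xms5g -Xmx5g"
bin/kafka-server-start.sh config/server.properties
```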


Network: The network throughput determines the amount of traffic your cluster can handle. To prevent the network from becoming a major governing factor, it is recommended to run with at least 10 Gb NICs.


CPU: The Kafka broker must decompress all message batches in order to validate the checksums of the individual messages and assign offsets. It then needs to recompress the message batch in order to store it on disk. This is where most of Kafka's requirement for processing power comes from. It should not be the primary factor in selecting hardware, however, unless clusters become very large, with hundreds of nodes and millions of partitions in a single cluster. At that point, selecting more performant CPUs can help reduce cluster sizes.

