Kafka Observability Checklist

A practical Kafka observability checklist covering metrics, logs, traces, alert thresholds, and review cadence for production teams.

Kafka observability is easiest to improve when it becomes a repeatable review process rather than a loose collection of dashboards. This checklist is designed as a practical reference for platform teams that run Kafka in production and need clear guidance on what to monitor, how often to review it, which alert thresholds deserve attention, and how to interpret changes before they become outages. Use it as a monthly or quarterly operating document, then refine the thresholds to match your workload, retention settings, traffic patterns, and service level objectives.

Overview

This guide gives you a structured Kafka monitoring checklist across metrics, logs, traces, and alerting. The goal is not to track everything Kafka exposes. The goal is to track the signals that help you answer four operational questions quickly:

Is the cluster healthy right now?
Are producers and consumers keeping up with demand?
Is latency, lag, or storage pressure trending toward a failure?
Do we have enough context to troubleshoot incidents without guessing?

For most teams, observability for Kafka should cover five layers:

Broker health: availability, request handling, replication, storage, and controller behavior.
Topic and partition behavior: throughput, skew, replication health, and retention side effects.
Producer performance: errors, retries, request latency, batching, and acks behavior.
Consumer performance: lag, rebalance frequency, commit health, and processing latency.
Application context: traces, correlated logs, dead-letter handling, and dependency failures.

A useful checklist is opinionated. It should help your team decide what deserves an alert, what belongs in a dashboard only, and what should be reviewed on a recurring cadence. It should also be realistic. Different Kafka deployments have very different normal baselines. A low-latency transactional system, for example, will use different alert thresholds than a batch-heavy analytics pipeline.

If your team is still deciding whether Kafka is the right fit operationally, it may also help to review Kafka Alternatives for Small Teams: Easier Options for Event Streaming and Pub/Sub vs Message Queue vs Event Stream: A Practical Decision Guide. Observability requirements often reveal whether a platform is the right match for the team running it.

What to track

This section is the core checklist. Start here, then adapt it to your environment, exporters, managed service features, and internal incident patterns.

1. Cluster and broker health metrics

These metrics are the foundation of Kafka monitoring. If they are unstable, downstream producer and consumer symptoms will follow.

Broker availability: every broker should be reachable and participating as expected.
Active controller status and controller changes: unexpected controller churn can signal instability.
Offline partitions count: this should generally be zero. Any sustained nonzero value is high priority.
Under-replicated partitions: one of the most important Kafka health metrics. Sustained growth often means replication cannot keep up or brokers are degraded.
ISR shrink and expand activity: frequent ISR changes suggest network, disk, or broker performance issues.
Request handler idle time: low idle time may indicate thread saturation.
Network throughput: bytes in, bytes out, and request rates by broker.
Disk usage and disk growth rate: absolute usage matters, but so does trend velocity.
Log flush and fsync pressure: useful when diagnosing I/O bottlenecks.
JVM and process metrics: heap usage, garbage collection pause time, CPU saturation, file descriptor usage, and memory pressure.
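To make the point about disk trend velocity concrete, here is a minimal sketch that projects time until a volume fills from evenly spaced usage samples. The function name, the hourly sampling interval, and the example numbers are illustrative assumptions, not values from any Kafka tool, and the straight-line projection ignores retention-driven deletes.

```python
def hours_until_full(used_bytes_samples, capacity_bytes, interval_hours=1.0):
    """Project hours until the disk is full from evenly spaced usage samples.

    Uses a simple straight-line estimate between the first and last sample.
    Returns None when usage is flat or shrinking.
    """
    if len(used_bytes_samples) < 2:
        raise ValueError("need at least two samples")
    elapsed_hours = (len(used_bytes_samples) - 1) * interval_hours
    growth_per_hour = (used_bytes_samples[-1] - used_bytes_samples[0]) / elapsed_hours
    if growth_per_hour <= 0:
        return None
    return (capacity_bytes - used_bytes_samples[-1]) / growth_per_hour


GIB = 1024 ** 3
# Hypothetical hourly samples: the broker volume grows 10 GiB per hour.
samples = [600 * GIB, 610 * GIB, 620 * GIB, 630 * GIB]
print(hours_until_full(samples, capacity_bytes=1000 * GIB))  # prints 37.0
```

A broker at 63% utilization looks comfortable on a gauge, yet at this growth rate it has well under two days of headroom. That is why the growth rate belongs next to the absolute number.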

Suggested alert starting points:

Offline partitions > 0 for more than a few minutes: page.
Under-replicated partitions above a low single-digit baseline or rising steadily: urgent investigation.
Disk utilization crossing a conservative internal threshold, often well before full capacity: ticket plus escalation path.
Sustained low request handler idle time combined with growing request latency: investigate saturation.

Thresholds should be tuned to your architecture. For example, if broker maintenance causes brief and expected replica movement, short spikes may belong in warning-only alerts rather than pager alerts.
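One way to separate a brief spike from a page-worthy condition is to require that a threshold be exceeded for several consecutive samples. The sketch below assumes one sample per minute and a five-sample paging window; both numbers and the function names are placeholders to tune, not recommended values.

```python
def sustained_above(samples, threshold, min_consecutive):
    """True when the most recent `min_consecutive` samples all exceed `threshold`."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def offline_partition_severity(samples, page_after=5):
    """Map recent offline-partition counts to an alert severity.

    `samples` is ordered oldest to newest. `page_after` is the number of
    consecutive nonzero samples that escalates a warning to a page.
    """
    if sustained_above(samples, 0, page_after):
        return "page"
    if samples and samples[-1] > 0:
        return "warn"
    return "ok"


print(offline_partition_severity([0, 0, 2, 0, 0]))  # spike already recovered
print(offline_partition_severity([0, 0, 0, 0, 1]))  # just started
print(offline_partition_severity([1, 1, 2, 2, 1]))  # sustained
```

Most alerting systems offer an equivalent "for" duration on a rule. The logic is written out here only to show what that setting expresses.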

2. Topic and partition health

Kafka problems often hide inside a few hot topics rather than the entire cluster. Topic-level views make that visible.

Produce and fetch throughput by topic: identify growth, drops, and unusual imbalance.
Partition count and skew: uneven traffic across partitions can cause local hotspots.
Message size distribution: large payloads can increase request latency and memory pressure.
Retention and segment behavior: verify that retention settings match real usage.
Compaction lag or backlog: relevant for compacted topics that support stateful systems.
Leader distribution across brokers: uneven leadership can create asymmetric load.

What you want from this layer is not just visibility into throughput. You want to know whether your topic design is causing instability. If one topic is carrying most traffic or one partition is doing most work, scaling the cluster may not solve the real problem.
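A simple skew indicator makes the "one partition is doing most work" case measurable. This sketch assumes you can already export bytes-in per partition for a topic; the ratio definition (busiest partition divided by the mean) is one reasonable choice among several, not a standard Kafka metric.

```python
def partition_skew(bytes_in_by_partition):
    """Ratio of the busiest partition's traffic to the mean across partitions.

    1.0 means perfectly even traffic. Larger values mean one partition
    carries a disproportionate share.
    """
    values = list(bytes_in_by_partition.values())
    if not values:
        raise ValueError("no partitions supplied")
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return max(values) / mean


# Hypothetical per-partition bytes-in for one topic over the same window.
traffic = {0: 100, 1: 100, 2: 100, 3: 500}
print(partition_skew(traffic))  # prints 2.5
```

A ratio that stays well above 1.0 across several review cycles usually points at key design rather than cluster size.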

This is also where it helps to compare Kafka behavior with other broker models. If your use case is more queue-like than stream-like, some operational issues may come from a design mismatch rather than a tuning problem. Related reading: RabbitMQ vs NATS vs Redis Streams: Fast Comparison for Low-Latency Messaging.

3. Producer metrics

Producer health is one of the earliest signs of trouble. Producers show you when the cluster is slow, when acknowledgments are delayed, and when network conditions or broker errors are affecting delivery.

Request latency: median and tail latency matter more than averages alone.
Record send rate and byte send rate: use for capacity trend analysis.
Error rate: broken down by exception type where possible.
Retry rate: rising retries can indicate broker stress or transient failures.
Request timeout rate: often important for page-worthy conditions.
Batch size and linger behavior: useful for balancing efficiency and latency.
Compression ratio: helpful when traffic or storage costs change unexpectedly.
Acks-related behavior: especially if reliability expectations are strict.

Suggested alert starting points:

Producer error rate above a very low baseline for sustained periods: alert.
Retry rate rising sharply while throughput is steady: investigate cluster stress.
Tail latency increasing alongside broker thread or disk pressure: correlate and escalate.

When producer metrics degrade without obvious broker health changes, inspect dependency paths too. DNS, load balancers, TLS issues, and authentication problems can look like Kafka instability from the producer side.
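The "retry rate rising sharply while throughput is steady" condition can be expressed as a comparison between two measurement windows. The dictionary keys, the doubling factor, the 20% tolerance, and the retry floor below are all hypothetical defaults chosen for illustration.

```python
def retries_rising_with_steady_throughput(
    prev, curr, retry_jump=2.0, tolerance=0.2, retry_floor=0.1
):
    """Compare two windows of producer metrics.

    `prev` and `curr` are dicts with "send_rate" and "retry_rate"
    (records per second). Returns True when throughput changed by at most
    `tolerance` (as a fraction) while the retry rate grew by at least
    `retry_jump` times. `retry_floor` avoids flagging tiny absolute changes
    when the previous retry rate was near zero.
    """
    if prev["send_rate"] <= 0:
        return False
    throughput_change = abs(curr["send_rate"] - prev["send_rate"]) / prev["send_rate"]
    baseline = max(prev["retry_rate"], retry_floor)
    return throughput_change <= tolerance and curr["retry_rate"] >= retry_jump * baseline


previous = {"send_rate": 1000.0, "retry_rate": 1.0}
current = {"send_rate": 1050.0, "retry_rate": 5.0}
print(retries_rising_with_steady_throughput(previous, current))  # prints True
```

When this returns True but broker metrics look clean, the dependency paths listed above are the next place to look.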

4. Consumer metrics

Consumer lag is the most widely watched Kafka metric, but it is only part of the picture. Lag without processing context can be misleading.

Consumer lag by group, topic, and partition: this is essential.
Lag growth rate: often more useful than absolute lag.
Consumer throughput: records processed per second.
Poll interval behavior: long gaps may point to application stalls.
Commit latency and commit failure rate: useful for stability analysis.
Rebalance count and duration: frequent or slow rebalances disrupt throughput.
Dead-letter volume: if your processing pipeline routes failed messages elsewhere.
End-to-end processing latency: from event production to consumer completion.

Suggested alert starting points:

Lag increasing continuously beyond normal catch-up windows: alert.
Rebalances happening repeatedly over a short period: investigate.
Consumer throughput dropping while input rate stays constant: warning or incident depending on SLA.

A common mistake is paging on any lag. Some pipelines are designed to accumulate lag and drain later. Alert on lag that violates business expectations, not on lag that merely exists.
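Alerting on business expectations rather than raw lag usually means converting lag into time. The following sketch estimates how long a newly produced record would wait behind the current backlog and compares that with a maximum acceptable delay. The function names and the example rates are assumptions, and the estimate treats the consume rate as constant.

```python
def lag_growth_per_second(lag_samples, interval_seconds):
    """Average change in lag per second across evenly spaced samples."""
    if len(lag_samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = (len(lag_samples) - 1) * interval_seconds
    return (lag_samples[-1] - lag_samples[0]) / elapsed


def estimated_delay_seconds(current_lag, consume_rate_per_second):
    """Rough time for the consumer group to work through the current backlog."""
    if consume_rate_per_second <= 0:
        return float("inf")
    return current_lag / consume_rate_per_second


def lag_violates_expectation(current_lag, consume_rate_per_second, max_delay_seconds):
    """True when the backlog implies a delay beyond what the business tolerates."""
    return estimated_delay_seconds(current_lag, consume_rate_per_second) > max_delay_seconds


# Hypothetical group: 120,000 records behind, processing 2,000 records/s.
print(estimated_delay_seconds(120_000, 2_000))          # prints 60.0
print(lag_violates_expectation(120_000, 2_000, 30))     # prints True
print(lag_violates_expectation(120_000, 2_000, 300))    # prints False
```

The same lag value pages a near-real-time pipeline and is ignored for a batch pipeline, which is the distinction the paragraph above argues for.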

If your consumer workflows call external systems, combine Kafka monitoring with queue reliability patterns such as retries and dead-letter handling. Related reading: Webhook Queue Integration Patterns: How to Make Unreliable Callbacks Reliable and Dead Letter Queue Best Practices: Design, Retry Policies, and Monitoring.

5. Logs that deserve retention and correlation

Metrics tell you that something is wrong. Logs often tell you what changed. For Kafka, log collection should focus on searchable, structured signals rather than keeping everything forever.

Prioritize logs for:

Broker startup, shutdown, and crash events
Leader election and controller changes
Replication errors and ISR changes
Authentication and authorization failures
Quota enforcement and throttling events
Producer send failures and serialization errors
Consumer deserialization failures, rebalance events, and commit problems
Schema compatibility and payload validation errors

Try to standardize correlation fields across services: topic, partition, consumer group, message key where appropriate, trace ID, request ID, tenant or account ID, and deployment version. That turns logs into a navigable troubleshooting tool instead of a large archive.
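A small helper can enforce that every service emits the same correlation field names. This is a sketch using only the Python standard library; the snake_case field names are one possible naming convention for the fields listed above, not an established schema.

```python
import json

# Hypothetical shared field names; agree on these once across services.
CORRELATION_FIELDS = (
    "topic",
    "partition",
    "consumer_group",
    "message_key",
    "trace_id",
    "request_id",
    "tenant_id",
    "deployment_version",
)


def format_event(message, **fields):
    """Render one structured log line, rejecting unknown correlation fields."""
    unknown = set(fields) - set(CORRELATION_FIELDS)
    if unknown:
        raise ValueError(f"unknown correlation fields: {sorted(unknown)}")
    record = {"message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


print(format_event(
    "offset commit failed",
    topic="orders",
    partition=3,
    consumer_group="billing",
    trace_id="abc123",
))
```

Rejecting unknown fields is a deliberate choice: it surfaces spelling drift such as `traceId` versus `trace_id` at development time instead of during an incident search.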

6. Traces for Kafka workflows

Kafka tracing is especially useful in event-driven systems where application latency cannot be understood from broker metrics alone. If a user action triggers an event, which triggers multiple consumers, traces help answer where time is being spent and where failures fan out.

Trace the following when possible:

Producer publish span: include topic, partition if available, payload class, and delivery status.
Consumer receive and processing span: include group, topic, processing result, and retry count.
External dependency calls within consumers: databases, APIs, caches, and internal services.
End-to-end business workflow span: from originating request to final state change or notification.

Use traces to measure:

Time from event creation to event consumption
Time spent waiting in Kafka versus time spent inside application logic
Which downstream service contributes most to tail latency
Whether retries are broker-related or application-related

This is especially valuable for real-time product features where customers feel delays quickly. For adjacent design considerations, see How to Design Realtime Notifications Architecture for Web and Mobile Apps and How to Scale WebSockets: Connection Limits, Fanout, and Backpressure.
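The split between time waiting in Kafka and time spent inside application logic can be derived from three timestamps per event. The sketch below assumes producer and consumer clocks are reasonably synchronized, since skew between hosts distorts the waiting figure. The class and field names are illustrative and are not part of any tracing SDK.

```python
from dataclasses import dataclass


@dataclass
class EventTimestamps:
    produced_at: float   # producer send time, epoch seconds
    received_at: float   # consumer received the record
    completed_at: float  # consumer finished processing


def latency_breakdown(ts):
    """Split end-to-end latency into broker wait and application processing."""
    return {
        "waiting_in_kafka": ts.received_at - ts.produced_at,
        "processing": ts.completed_at - ts.received_at,
        "end_to_end": ts.completed_at - ts.produced_at,
    }


event = EventTimestamps(produced_at=100.0, received_at=100.5, completed_at=102.5)
print(latency_breakdown(event))
# prints {'waiting_in_kafka': 0.5, 'processing': 2.0, 'end_to_end': 2.5}
```

In this example most of the delay is inside the consumer, so tuning brokers would not help the user-visible latency.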

7. Security and governance signals

Operational observability should also include security indicators, especially in shared environments.

Authentication failures by principal or service
Authorization denials by topic or group
Certificate expiry horizon
Unexpected topic creation or configuration changes
Quota breaches or unusual client behavior
Schema or contract change failures

These signals may not page the on-call team immediately, but they belong in weekly and monthly review cycles because they often predict future outages or compliance issues.

Cadence and checkpoints

The most useful checklist is tied to a review schedule. Here is a practical cadence many teams can adapt.

Daily checks

Offline partitions and under-replicated partitions
Broker availability and disk headroom
Top consumer groups by lag
Producer error spikes and request latency changes
Any recurring authentication or authorization failures

This should be fast. Think of it as a morning health scan and an incident handoff tool.

Weekly checks

Lag trend review for critical consumer groups
Rebalance frequency by major applications
Topic throughput changes and partition skew
Growth in dead-letter or retry traffic
Tail latency changes in end-to-end traces

Weekly reviews are where you catch slow degradation before it becomes urgent.

Monthly or quarterly checks

Capacity trend analysis for CPU, network, storage, and partitions
Retention policy review against actual usage
Leader distribution and hot-spot analysis
Alert quality review: noisy alerts, missed alerts, stale thresholds
Runbook updates based on recent incidents
Security review of access failures, certificates, quotas, and configuration drift

This is also the right time to compare platform cost and management overhead with alternatives, especially if Kafka is being used for simpler messaging patterns. If cost or operational complexity is growing, review Managed Kafka Pricing Comparison: Confluent Cloud, MSK, Aiven, and Redpanda.

How to interpret changes

Metrics rarely fail in isolation. The real skill in observability for Kafka is learning to read combinations of signals.

Pattern: lag is rising

Ask these questions in order:

Has producer traffic increased suddenly?
Has consumer throughput dropped?
Are rebalances interrupting work?
Are consumers blocked on external services?
Are partitions unevenly loaded?

Lag with stable consumer throughput may mean demand increased. Lag with falling throughput points to application or broker problems. Lag with frequent rebalances often points to deployment, timeout, or consumer stability issues.
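Those three readings can be written as a small triage function, which is useful as a runbook aid rather than an automated diagnosis. The parameter names, the 10% tolerance, and the rebalance limit are assumptions for illustration. Rebalances are checked first because they also depress throughput and would otherwise be misread as an application problem.

```python
def classify_rising_lag(
    produce_rate_change,
    consume_rate_change,
    rebalances_in_window,
    tolerance=0.1,
    rebalance_limit=3,
):
    """Suggest where to look first when lag is rising.

    Rate changes are fractions relative to the previous window
    (0.5 means +50%, -0.3 means -30%).
    """
    if rebalances_in_window >= rebalance_limit:
        return "consumer stability"
    if consume_rate_change < -tolerance:
        return "application or broker problem"
    if produce_rate_change > tolerance:
        return "demand increased"
    return "inconclusive"


print(classify_rising_lag(0.5, 0.0, 0))    # prints demand increased
print(classify_rising_lag(0.0, -0.4, 0))   # prints application or broker problem
print(classify_rising_lag(0.0, -0.4, 6))   # prints consumer stability
```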

Pattern: producer retries are increasing

Correlate retries with broker request latency, under-replicated partitions, and network errors. If retries rise while brokers are healthy, investigate client configuration, connection churn, DNS, TLS, or quota throttling.

Pattern: disk usage is normal, but latency is worsening

Look at thread saturation, garbage collection, request queue depth, large messages, and uneven partition leadership. Disk capacity alone does not guarantee disk performance.

Pattern: under-replicated partitions spike briefly

This may be expected during maintenance or broker restart windows. The more important question is whether they recover quickly and predictably. Alerting should distinguish transient operational events from sustained replication failure.
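To check whether spikes "recover quickly and predictably", it helps to measure how long each nonzero episode lasted. This sketch takes timestamped under-replicated partition counts and returns the episodes that outlasted an expected maintenance window. The 180-second window and the sample data are made up for the example.

```python
def nonzero_episodes(samples):
    """Return (start, end) pairs for each run of nonzero values.

    `samples` is a list of (timestamp_seconds, count) sorted by time.
    An episode ends at the first zero sample after it, or at the last
    timestamp when it has not recovered yet.
    """
    episodes = []
    start = None
    for timestamp, count in samples:
        if count > 0 and start is None:
            start = timestamp
        elif count == 0 and start is not None:
            episodes.append((start, timestamp))
            start = None
    if start is not None:
        episodes.append((start, samples[-1][0]))
    return episodes


def sustained_episodes(samples, max_expected_seconds):
    """Episodes that lasted longer than the expected recovery window."""
    return [
        (start, end)
        for start, end in nonzero_episodes(samples)
        if end - start > max_expected_seconds
    ]


urp = [(0, 0), (60, 4), (120, 2), (180, 0), (240, 3), (300, 3),
       (360, 5), (420, 6), (480, 6), (540, 0)]
print(nonzero_episodes(urp))         # prints [(60, 180), (240, 540)]
print(sustained_episodes(urp, 180))  # prints [(240, 540)]
```

The first episode fits a restart window and can stay warning-only. The second lasted five minutes and kept growing, which is the sustained case worth escalating.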

Pattern: no obvious broker issue, but users report delays

This is where traces and end-to-end latency help. The cluster may be healthy while consumers are slow, downstream APIs are timing out, or notifications are delayed in delivery channels outside Kafka.

If your team needs a broader framework for evaluating messaging behavior beyond Kafka itself, the benchmarking lens in Message Broker Benchmark Guide: Throughput, Latency, Ordering, and Durability Metrics can help clarify what to measure consistently.

When to revisit

This checklist should be treated as a living operating document. Revisit it on a monthly or quarterly cadence, and also whenever your traffic, architecture, or tooling changes materially.

Update your dashboards, thresholds, and runbooks when any of the following happen:

Traffic profile changes: new products, seasonal spikes, large customer onboarding, or regional expansion.
Topic design changes: partition increases, new compacted topics, new retention policies, or schema changes.
Consumer architecture changes: new consumer groups, heavier processing, external API dependencies, or retry policy changes.
Infrastructure changes: broker resizing, storage class changes, network changes, Kubernetes migration, or managed service migration.
Incident learnings: any alert that fired too late, too early, or not at all should lead to checklist updates.
Tooling changes: new exporters, telemetry SDKs, tracing adoption, or logging schema changes.

As a final practical step, turn this article into a one-page internal review sheet with three columns: signal, current threshold, and owner. Then add a fourth column for last reviewed date. That simple habit is often what turns monitoring from dashboard sprawl into operational discipline.
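If the review sheet lives in code or a small script rather than a spreadsheet, the four columns map directly onto a record type, and the last reviewed date makes stale thresholds easy to find. The class name, the example rows, and the 90-day cutoff are assumptions chosen to match a quarterly cadence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ReviewRow:
    signal: str
    current_threshold: str
    owner: str
    last_reviewed: date


def stale_rows(rows, today, max_age_days=90):
    """Rows whose threshold has not been reviewed within `max_age_days`."""
    return [row for row in rows if (today - row.last_reviewed).days > max_age_days]


# Hypothetical sheet entries.
sheet = [
    ReviewRow("Offline partitions", "> 0 for 5 min: page", "platform", date(2024, 1, 10)),
    ReviewRow("Consumer lag (billing)", "delay > 30 s: alert", "billing", date(2024, 5, 15)),
]

for row in stale_rows(sheet, today=date(2024, 6, 1)):
    print(f"{row.signal}: last reviewed {row.last_reviewed}, owner {row.owner}")
```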

A good Kafka observability system does not try to measure everything. It gives your team enough context to detect unhealthy trends early, investigate incidents quickly, and improve reliability over time. If you review the checklist regularly, tune thresholds to real workload behavior, and connect broker metrics to application outcomes, your monitoring will become more useful with each cycle rather than noisier.


Signal Stream Hub Editorial
