vitor.dev
Tech · 4 min read · October 20, 2024

Why Kafka is the Right Backbone for Java Microservices (and how to use it without overcomplicating things)

Most teams reach for Kafka too early or implement it wrong. Here's a practical guide to event-driven architecture with Java and Spring Boot — the parts no tutorial covers.

#java #kafka #microservices #spring-boot #architecture

The problem with "just use Kafka"

When a team starts breaking a monolith into microservices, the default advice is: put Kafka between everything. The reasoning sounds solid — decoupling, async, scalability. But in practice, teams end up with:

  • Topics nobody owns
  • Consumers that silently fail and nobody notices
  • Schema drift breaking production at 3am
  • Events nobody reads but everyone keeps publishing

This isn't a Kafka problem. It's a design problem. Kafka is a durable, distributed commit log — it doesn't make your architecture correct, it just makes the consequences of bad design harder to undo.

When Kafka is the right call

Use Kafka when you need:

  1. Durable event streaming — orders, payments, audit logs. Things where "fire and forget to a queue" isn't enough.
  2. Fan-out — multiple consumers reading the same event stream independently.
  3. Temporal decoupling — a consumer that processes events at its own pace, including replaying history.
  4. High throughput — if you're doing hundreds of thousands of events per day with a need for replayability, Kafka earns its keep.

If you're just doing synchronous request/reply between two services, use HTTP or gRPC. Don't add Kafka for its own sake.

The Spring Boot setup that actually works

@Configuration
public class KafkaConsumerConfig {

    @Value("${spring.kafka.bootstrap-servers}")
    private String bootstrapServers;

    // Needed by the DeadLetterPublishingRecoverer to publish failed records to the DLT
    private final KafkaTemplate<String, Object> kafkaTemplate;

    public KafkaConsumerConfig(KafkaTemplate<String, Object> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    @Bean
    public ConsumerFactory<String, OrderEvent> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // manual ack only
        props.put(JsonDeserializer.TRUSTED_PACKAGES, "com.yourapp.events");
        return new DefaultKafkaConsumerFactory<>(props,
            new StringDeserializer(),
            new JsonDeserializer<>(OrderEvent.class));
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerFactory() {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderEvent>();
        factory.setConsumerFactory(consumerFactory());
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        // 3 retries, 1 second apart, then the record goes to the <topic>.DLT dead letter topic
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3)));
        return factory;
    }
}

The critical parts most tutorials skip:

  • Disable auto-commit. Automatic offset commits mean a crashed consumer can skip events. Always use MANUAL_IMMEDIATE or MANUAL and commit only after successful processing.
  • Dead Letter Topic (DLT). Every consumer should have a DLT. Events that fail after retries go there instead of being silently dropped or blocking the partition.
  • Idempotent consumers. Kafka delivers at least once. Your consumer will receive duplicates. Design accordingly — use a unique event ID and check before processing.
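To make the idempotency point concrete, here is a minimal sketch of the dedup core in plain Java. The names (`IdempotentHandler`, `eventId`) are illustrative, not from any library, and the in-memory set is a stand-in: in production the seen-ID store would be a uniquely keyed database table or Redis set, checked in the same transaction as the processing itself so dedup survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Illustrative only: an in-memory dedup guard. In production, persist the
// seen IDs (e.g. a unique-keyed DB table) so dedup survives restarts.
class IdempotentHandler<T> {

    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<T> delegate;

    IdempotentHandler(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    /**
     * Processes the event only if this eventId has not been seen before.
     * Returns true if the event was processed, false if it was a duplicate.
     */
    boolean handle(String eventId, T event) {
        // Set.add() is atomic here: only the first delivery of an ID wins
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery, skip
        }
        delegate.accept(event);
        return true;
    }
}
```

Inside your listener, you would call `handle(...)` first and acknowledge the record only after it returns, so a crash mid-processing leads to a redelivery that the guard then filters out.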

Schema management with Avro (the part everyone skips)

JSON is fine for getting started but breaks at scale. When you have 15 producers and 30 consumers, schema drift in JSON will cost you a production incident.

// With Avro + Schema Registry
@Bean
public ProducerFactory<String, OrderEvent> producerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    props.put("schema.registry.url", schemaRegistryUrl); // injected from config, like bootstrapServers
    props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // no duplicates on producer retry
    return new DefaultKafkaProducerFactory<>(props);
}

The Schema Registry enforces compatibility rules (BACKWARD, FORWARD, FULL). With BACKWARD compatibility (the default), consumers upgraded to the new schema can still read events written with the old one; with FORWARD, consumers still on the old schema can read events produced with the new one. Either way, producers and consumers can deploy independently, with no coordinated release required.
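As a concrete example, here is what a compatible evolution of a hypothetical OrderEvent Avro schema might look like (the field names are illustrative). The new currency field carries a default, so consumers on the new schema can still decode old events that lack it; the Registry would reject the upload if the change violated the configured compatibility mode.

```json
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.yourapp.events",
  "fields": [
    {"name": "eventId", "type": "string"},
    {"name": "orderId", "type": "string"},
    {"name": "amountCents", "type": "long"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
```

Adding a field with a default is the canonical safe change; removing a field without a default, or renaming one, is what the compatibility check exists to catch.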

Monitoring the things that actually fail

The default Kafka metrics are noisy. These are the ones worth alerting on:

| Metric | What it means | Alert threshold |
| --- | --- | --- |
| consumer_lag | How far behind the consumer is | > 10k messages |
| records_consumed_rate | Throughput; sudden drops indicate issues | < 50% of baseline |
| failed_authentication | ACL / cert issues before they cascade | Any non-zero |
| rebalance_rate | High values mean unstable consumer groups | > 2/hour |

Expose these via Micrometer + Prometheus and you'll catch problems before users do.
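As a sketch, the lag alert could look like this as a Prometheus rule. The metric name is an assumption that depends on your exporter: this one matches kafka_exporter's kafka_consumergroup_lag, while Micrometer's Kafka binding exposes kafka_consumer_fetch_manager_records_lag_max instead, so adjust the expression to whatever your setup actually emits.

```yaml
groups:
  - name: kafka-consumer-alerts
    rules:
      - alert: KafkaConsumerLagHigh
        # Metric name assumes kafka_exporter; swap in your exporter's lag metric
        expr: sum(kafka_consumergroup_lag{consumergroup="order-processor"}) > 10000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "order-processor is more than 10k messages behind"
```

The `for: 5m` clause matters: lag spikes briefly on every deploy and rebalance, and alerting on sustained lag rather than instantaneous lag keeps the pager quiet.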

The one thing that will save you in production

Test your consumer restart behavior. Stop your consumer, publish 10k events, restart. Does it process all of them? Does it apply anything twice? Does it handle poison-pill messages without blocking the partition?

Most teams never test this and discover the behavior during an incident.
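You can rehearse the duplicate-delivery half of this drill without a broker. The sketch below is hypothetical (the "ledger" is just a counter): it delivers the same batch twice, as a broker would after a crash before offsets were committed, and checks that nothing is double-applied. The real test would drive your actual listener against an embedded or containerized Kafka.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative restart drill: deliver a batch, then redeliver it after a
// simulated crash, and count how many times the side effect actually ran.
class RestartDrill {

    static int processBatchTwice(List<String> eventIds) {
        Set<String> seen = new HashSet<>();
        AtomicInteger applied = new AtomicInteger();

        Runnable deliverAll = () -> {
            for (String id : eventIds) {
                if (seen.add(id)) {            // idempotency guard
                    applied.incrementAndGet(); // the "business side effect"
                }
            }
        };

        deliverAll.run(); // first delivery
        deliverAll.run(); // redelivery after simulated restart
        return applied.get();
    }
}
```

The assertion you care about is that the applied count equals the number of distinct events, not the number of deliveries; the same shape of check works against a real topic with spring-kafka-test's embedded broker.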


The patterns above are what I've run in production at companies processing 50k+ events/day. The details — manual ack, DLT, idempotency, schema registry — feel like overhead until the day they save you.