vitor.dev
Tech · 4 min read · August 30, 2024

The Spring Boot Production Checklist (things tutorials don't teach you)

After deploying dozens of Spring Boot services to production, here's the actual checklist I use — covering health checks, graceful shutdown, connection pools, security headers, and observability.

#java #spring-boot #production #devops #observability

Why this list exists

Every Spring Boot tutorial ends when the app starts locally. But running Spring Boot in production at scale is a different discipline — one you learn by being woken up at 2am by a PagerDuty alert.

This is the checklist I run through before any service goes to production. It's not complete (nothing is), but it covers the failures I've seen most often.

1. Actuator — configured correctly

Adding the Actuator starter enables its endpoints, but by default only /actuator/health is exposed over HTTP. The trouble starts when someone sets the exposure list to * to unlock metrics: that also exposes /actuator/beans and /actuator/env, and the latter can leak environment variables, including secrets. Expose an explicit allow-list instead:

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /internal
  endpoint:
    health:
      show-details: when-authorized
      probes:
        enabled: true  # enables /health/liveness and /health/readiness

The probes config gives you separate /health/liveness and /health/readiness endpoints — what Kubernetes needs. Liveness = is the app alive. Readiness = is it ready to receive traffic. Don't return the same response for both.
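On the Kubernetes side, the wiring looks roughly like this (a sketch: the paths follow the base-path: /internal setting above, while the port and timing values are illustrative):

```yaml
# container spec in the Deployment (illustrative values)
livenessProbe:
  httpGet:
    path: /internal/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /internal/health/readiness
    port: 8080
  periodSeconds: 5
```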

2. Graceful shutdown

Without this, in-flight requests get dropped when Kubernetes rolls a new deployment.

server:
  shutdown: graceful

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

This tells Spring Boot to stop accepting new requests immediately on shutdown signal, but finish processing the ones already in flight — up to 30 seconds.
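One caveat: Kubernetes enforces its own deadline on top of this. A pod's terminationGracePeriodSeconds defaults to 30s, the same as the Spring timeout above, so the kubelet can SIGKILL the JVM mid-drain. Leave some headroom (illustrative value):

```yaml
# pod spec: Spring's 30s shutdown phase plus margin
spec:
  terminationGracePeriodSeconds: 45
```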

3. Connection pool tuning

HikariCP is the default since Spring Boot 2.x and it's excellent — but the defaults are wrong for most production workloads.

spring:
  datasource:
    hikari:
      maximum-pool-size: 10      # don't go higher without testing
      minimum-idle: 5
      connection-timeout: 5000   # 5s, not the default 30s
      idle-timeout: 600000       # 10 min
      max-lifetime: 1800000      # 30 min (less than DB timeout)
      leak-detection-threshold: 60000  # alert if connection held > 60s

The leak-detection-threshold is a lifesaver — it logs a warning with the stack trace when a connection is held longer than the threshold, catching pool exhaustion bugs early.
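If you're unsure what number to start from for maximum-pool-size, HikariCP's own "About Pool Sizing" guide suggests connections = (core count × 2) + effective spindle count; treat it as a starting point for load testing, not a final answer. As a quick worked calculation:

```java
// HikariCP's rule-of-thumb starting point, from its "About Pool Sizing" guide:
// pool size = (core_count * 2) + effective_spindle_count.
// An SSD counts as roughly one "effective spindle".
public class PoolSizing {
    public static int suggestedPoolSize(int coreCount, int effectiveSpindleCount) {
        return coreCount * 2 + effectiveSpindleCount;
    }
}
```

A 4-core VM on SSD storage lands at 9, which is why a maximum-pool-size of 10 is a sane default for small services.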

4. Structured logging

Plain text logs are hard to query. JSON logs can be parsed by any log aggregator.

No Java configuration is needed here: Spring Boot picks up a logback-spring.xml from the classpath automatically, and the -spring variant lets you use <springProfile> blocks and property placeholders.

<!-- logback-spring.xml -->
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeContext>false</includeContext>
        <customFields>{"service":"order-service","env":"${SPRING_PROFILES_ACTIVE}"}</customFields>
    </encoder>
</appender>

Add logstash-logback-encoder to your pom and every log line becomes valid JSON with timestamp, level, logger, message, and your custom fields. Elastic/CloudWatch/Loki can ingest this directly.
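For reference, these are the coordinates; the encoder is not managed by the Spring Boot BOM, so pin a version explicitly (omitted here):

```xml
<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <!-- add an explicit <version>: not managed by the Spring Boot dependency BOM -->
</dependency>
```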

5. Correlation IDs and tracing

Without distributed tracing, debugging a failure across microservices is archaeology.

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

management:
  tracing:
    sampling:
      probability: 0.1  # 10% in production — 100% is too expensive
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

This adds traceId and spanId to every log line and sends spans to your collector. With this, you can take a traceId from an error report and see every service call in that request chain.

6. Security headers

Spring Security ships a sensible baseline by default (X-Content-Type-Options, X-Frame-Options, and HSTS on HTTPS responses), but it won't write a Content-Security-Policy for you:

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.headers(headers -> headers
            .contentSecurityPolicy(csp -> csp.policyDirectives("default-src 'self'"))
            .frameOptions(frame -> frame.deny())
            .httpStrictTransportSecurity(hsts -> hsts
                .maxAgeInSeconds(31536000)
                .includeSubDomains(true))
        );
        return http.build();
    }
}

7. Rate limiting at the API boundary

Don't let one client exhaust your service. Bucket4j with Spring Boot is the easiest option:

@Component
public class RateLimitFilter extends OncePerRequestFilter {

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws IOException, ServletException {
        String clientId = request.getHeader("X-Client-ID");
        if (clientId == null) {
            clientId = request.getRemoteAddr(); // ConcurrentHashMap throws on null keys
        }
        Bucket bucket = buckets.computeIfAbsent(clientId, k ->
            Bucket.builder()
                .addLimit(Bandwidth.classic(100, Refill.greedy(100, Duration.ofMinutes(1))))
                .build());

        if (bucket.tryConsume(1)) {
            chain.doFilter(request, response);
        } else {
            response.setStatus(429);
            response.getWriter().write("Rate limit exceeded");
        }
    }
}
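If Bucket4j's refill semantics feel opaque, here's a dependency-free sketch of the same idea (illustrative only; in the real filter keep Bucket4j, which gets concurrency and overflow right):

```java
// Minimal token bucket illustrating what the Bucket4j filter relies on: the
// bucket holds up to `capacity` tokens, each request costs one, and tokens
// trickle back continuously (Bucket4j calls this greedy refill).
// Not production code: no thread safety, and time is passed in explicitly
// to keep the behavior deterministic.
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano; // tokens regained per nanosecond
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, long refillTokens, long refillPeriodNanos, long startNanos) {
        this.capacity = capacity;
        this.refillPerNano = (double) refillTokens / refillPeriodNanos;
        this.tokens = capacity; // start full, like Bucket4j
        this.lastRefillNanos = startNanos;
    }

    public boolean tryConsume(long nowNanos) {
        // Top up first, capped at capacity, then try to take one token.
        tokens = Math.min(capacity, tokens + (nowNanos - lastRefillNanos) * refillPerNano);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Also note that the filter's ConcurrentHashMap lives in one JVM: each replica enforces its own 100 req/min, so behind a 3-pod deployment a client effectively gets 300. For a cluster-wide limit, enforce it at the gateway or back Bucket4j with a distributed store.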

The one thing that will surprise you

Thread pool exhaustion masquerades as timeouts. When your app is under load and the request thread pool is full, new requests queue, and in metrics that looks identical to a slow downstream service. Always expose thread pool metrics:

Spring Boot registers the tomcat.threads.* gauges automatically when Actuator and Micrometer are on the classpath; no custom MeterBinder is needed. What does trip people up: since Spring Boot 2.2 you must enable Tomcat's MBean registry, or most Tomcat metrics quietly vanish:

server:
  tomcat:
    mbeanregistry:
      enabled: true

Then alert on tomcat.threads.busy / tomcat.threads.config.max > 0.85.


The checklist above won't prevent all outages. But it closes the gap between "it works locally" and "it works in production" — which is where most teams lose time.