The Spring Boot Production Checklist (things tutorials don't teach you)
After deploying dozens of Spring Boot services to production, here's the actual checklist I use — covering health checks, graceful shutdown, connection pools, security headers, and observability.
Why this list exists
Every Spring Boot tutorial ends when the app starts locally. But running Spring Boot in production at scale is a different discipline — one you learn by being woken up at 2am by a PagerDuty alert.
This is the checklist I run through before any service goes to production. It's not complete (nothing is), but it covers the failures I've seen most often.
1. Actuator — configured correctly
With the Actuator starter on the classpath, only /actuator/health is exposed over HTTP by default. The trouble starts when someone sets exposure.include: "*" to unlock everything, which also exposes /actuator/beans and /actuator/env; that last one can leak environment variables, including secrets. Expose an explicit allow-list instead:
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /internal
  endpoint:
    health:
      show-details: when-authorized
      probes:
        enabled: true  # enables /health/liveness and /health/readiness
The probes config gives you separate liveness and readiness endpoints (under the base path above: /internal/health/liveness and /internal/health/readiness), which is exactly what Kubernetes needs. Liveness = is the app alive. Readiness = is it ready to receive traffic. Don't return the same response for both.
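For reference, here's how those endpoints wire into a Kubernetes Deployment; a minimal sketch assuming the /internal base path above and a container listening on port 8080 (delays and thresholds are illustrative):
livenessProbe:
  httpGet:
    path: /internal/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /internal/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3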
2. Graceful shutdown
Without this, in-flight requests get dropped when Kubernetes rolls a new deployment.
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
This tells Spring Boot to stop accepting new requests immediately on shutdown signal, but finish processing the ones already in flight — up to 30 seconds.
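One caveat: this only helps if the platform waits at least that long before sending SIGKILL. Kubernetes' default terminationGracePeriodSeconds is 30s, the same as the phase timeout above, so give the pod some headroom (the value here is illustrative):
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45  # > timeout-per-shutdown-phase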
3. Connection pool tuning
HikariCP has been the default pool since Spring Boot 2.0 and it's excellent, but two of its defaults bite in production: a 30-second connection timeout that hides failures behind long waits, and leak detection that's switched off entirely.
spring:
  datasource:
    hikari:
      maximum-pool-size: 10            # don't go higher without testing
      minimum-idle: 5
      connection-timeout: 5000         # 5s, not the default 30s
      idle-timeout: 600000             # 10 min
      max-lifetime: 1800000            # 30 min (less than DB timeout)
      leak-detection-threshold: 60000  # alert if connection held > 60s
The leak-detection-threshold is a lifesaver — it logs a warning with the stack trace when a connection is held longer than the threshold, catching pool exhaustion bugs early.
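Most of the leaks it catches come from connections borrowed outside try-with-resources. A minimal sketch of the safe pattern (the OrderQueries class, table, and query are illustrative):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class OrderQueries {
    private final DataSource dataSource;

    public OrderQueries(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // try-with-resources returns the connection to the pool even when the
    // query throws; holding it past 60s would trip leak detection.
    public long countPending() throws SQLException {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT COUNT(*) FROM orders WHERE status = 'PENDING'");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0;
        }
    }
}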
4. Structured logging
Plain text logs are hard to query. JSON logs can be parsed by any log aggregator.
You don't need any Java configuration for this: Spring Boot's default Logback setup automatically picks up a logback-spring.xml from the classpath, so the whole change lives in that file.
<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <includeContext>false</includeContext>
      <customFields>{"service":"order-service","env":"${SPRING_PROFILES_ACTIVE}"}</customFields>
    </encoder>
  </appender>
  <!-- an appender does nothing until it's attached to a logger -->
  <root level="INFO">
    <appender-ref ref="JSON"/>
  </root>
</configuration>
Add logstash-logback-encoder to your pom and every log line becomes valid JSON with timestamp, level, logger, message, and your custom fields. Elastic/CloudWatch/Loki can ingest this directly.
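For reference, the Maven coordinates (the version shown is one known release; check for the current one):
<dependency>
  <groupId>net.logstash.logback</groupId>
  <artifactId>logstash-logback-encoder</artifactId>
  <version>7.4</version>
</dependency>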
5. Correlation IDs and tracing
Without distributed tracing, debugging a failure across microservices is archaeology.
<!-- bridges Micrometer's tracing API onto OpenTelemetry -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<!-- ships spans to a collector over OTLP; this is the exporter that
     Spring Boot's management.otlp.tracing.endpoint property drives -->
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
management:
  tracing:
    sampling:
      probability: 0.1  # 10% in production — 100% is too expensive
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces
This adds traceId and spanId to every log line and sends spans to your collector. With this, you can take a traceId from an error report and see every service call in that request chain.
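Auto-instrumentation covers inbound HTTP and the common clients; for your own units of work you can open spans with Micrometer's Observation API. A minimal sketch (the PaymentService, names, and tags are illustrative):
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {
    private final ObservationRegistry registry;

    public PaymentService(ObservationRegistry registry) {
        this.registry = registry;
    }

    public void capture(String orderId) {
        // Appears as a child span of the current request, with timing.
        Observation.createNotStarted("payment.capture", registry)
                .lowCardinalityKeyValue("gateway", "primary")
                .highCardinalityKeyValue("orderId", orderId)
                .observe(() -> doCapture(orderId));
    }

    private void doCapture(String orderId) {
        // ... call the payment gateway (placeholder)
    }
}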
6. Security headers
Spring Security adds a solid set of headers by default (X-Content-Type-Options, X-Frame-Options, Cache-Control, and HSTS on HTTPS responses), but Content-Security-Policy is not one of them:
@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.headers(headers -> headers
                .contentSecurityPolicy(csp -> csp.policyDirectives("default-src 'self'"))
                .frameOptions(frame -> frame.deny())
                .httpStrictTransportSecurity(hsts -> hsts
                        .maxAgeInSeconds(31536000)  // one year
                        .includeSubDomains(true)));
        return http.build();
    }
}
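It's worth pinning these down with a test so a later refactor can't silently drop them. A sketch assuming spring-boot-starter-test is on the test classpath (the endpoint and class name are illustrative):
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.header;

@SpringBootTest
@AutoConfigureMockMvc
class SecurityHeadersTest {

    @Autowired
    MockMvc mvc;

    @Test
    void responsesCarrySecurityHeaders() throws Exception {
        mvc.perform(get("/internal/health"))
                .andExpect(header().string("X-Frame-Options", "DENY"))
                .andExpect(header().exists("Content-Security-Policy"));
    }
}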
7. Rate limiting at the API boundary
Don't let one client exhaust your service. Bucket4j with Spring Boot is the easiest option:
import java.io.IOException;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Refill;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class RateLimitFilter extends OncePerRequestFilter {

    // Unbounded in-memory map: fine for a known client set; use a cache
    // with eviction (or Redis-backed Bucket4j) for anonymous traffic.
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws IOException, ServletException {
        // getHeader() can return null, and ConcurrentHashMap rejects null keys.
        String clientId = request.getHeader("X-Client-ID");
        if (clientId == null) {
            clientId = request.getRemoteAddr();
        }
        Bucket bucket = buckets.computeIfAbsent(clientId, k -> Bucket.builder()
                .addLimit(Bandwidth.classic(100, Refill.greedy(100, Duration.ofMinutes(1))))
                .build());
        if (bucket.tryConsume(1)) {
            chain.doFilter(request, response);
        } else {
            response.setStatus(429); // Too Many Requests
            response.getWriter().write("Rate limit exceeded");
        }
    }
}
The one thing that will surprise you
Thread pool exhaustion looks like timeouts. When your app is under load and Tomcat's request thread pool is full, new requests queue; in your metrics this is indistinguishable from a slow downstream service. Always expose thread pool metrics. Spring Boot auto-registers Tomcat metrics with Micrometer, but the thread-pool gauges read Tomcat's MBeans, and the MBean registry is disabled by default (since Boot 2.2), so enable it:
server:
  tomcat:
    mbeanregistry:
      enabled: true  # required for the tomcat.threads.* gauges
Then alert on tomcat.threads.busy / tomcat.threads.config.max > 0.85.
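If you scrape /internal/prometheus, that alert looks roughly like this; a sketch of a Prometheus rule, assuming Micrometer's default Prometheus naming, which appends the base unit (check your endpoint's actual output for the exact names):
groups:
  - name: tomcat-threads
    rules:
      - alert: TomcatThreadPoolSaturated
        expr: tomcat_threads_busy_threads / tomcat_threads_config_max_threads > 0.85
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Tomcat request threads over 85% busy"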
The checklist above won't prevent all outages. But it closes the gap between "it works locally" and "it works in production" — which is where most teams lose time.