Deployment FAQ
1 Configuring Java container memory (heap & off‑heap)
1.1 Why Java processes exceed the limit
- Heap (`-Xmx`) – objects live here.
- Off‑heap – JVM metadata (Metaspace), JIT CodeCache, thread stacks (~1 MiB each), NIO direct buffers (Kafka, Netty), etc.
- When you set `-Xmx` equal to the pod’s memory limit, the JVM can still allocate off‑heap memory → cgroup limit breached → Kubernetes kills the container with SIGKILL (exit code 137).
1.2 Sizing formula
pod_memory ≈ Xmx
+ MaxDirectMemorySize
+ MaxMetaspaceSize
+ (#threads × stackSize)
+ safety_buffer
Rule of thumb: leave ~600 MB head‑room (e.g. limit = 1 GiB ⇒ `-Xmx400m`).
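Plugging illustrative numbers into the formula (all figures below are assumptions for a mid‑sized service, not recommendations):

```shell
# Sketch: estimate the pod memory limit from the formula above (values in MiB)
heap=1024              # -Xmx
direct=512             # -XX:MaxDirectMemorySize
metaspace=256          # -XX:MaxMetaspaceSize
threads=200; stack=1   # ~200 threads × 1 MiB stacks
buffer=256             # safety head-room
echo "limit >= $((heap + direct + metaspace + threads * stack + buffer)) MiB"
# → limit >= 2248 MiB
```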
Example flags
# 1 GiB heap plus capped off‑heap pools
JAVA_TOOL_OPTIONS="-Xmx1g \
  -XX:MaxMetaspaceSize=256m \
  -XX:MaxDirectMemorySize=512m \
  -Xss1m"
1.3 Detecting & troubleshooting
Signal | Where you see it | What to do |
---|---|---|
ExitCode 137 | kubectl get pods (STATUS = OOMKilled) | Lower heap or raise limit |
Long GC pauses | JVM logs / Prometheus jvm_gc_pause_seconds | Tune heap sizes, investigate leaks |
Off‑heap growth | jvm_memory_direct_bytes metric | Cap MaxDirectMemorySize |
Use Grafana dashboards (heap, direct, Metaspace, GC count, resident RSS) to spot trends before the limit is hit.
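To confirm an OOM kill, the container’s last termination state records the reason and exit code; a quick sketch (the pod name is a placeholder):

```shell
# Show why the last container instance died (hypothetical pod name):
#   kubectl get pod my-app-pod -o \
#     jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Exit code 137 = 128 + signal number, i.e. SIGKILL (9):
code=137
echo "killed by signal $((code - 128))"   # → killed by signal 9
```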
2 Microservice connectivity issues
2.1 Quick triage ladder
- Pod health – `kubectl get pods` & `kubectl logs` for errors.
- Service endpoints – `kubectl describe svc <svc>` → endpoints should list pod IP:port.
- DNS – inside a pod: `kubectl exec -it <pod> -- nslookup orderservice.default`
- NetworkPolicy – `kubectl get networkpolicy -A | grep <app>`.
- Port / protocol – confirm `containerPort`, `targetPort`, readiness.
- Curl test – `kubectl exec -it <podA> -- curl -v http://serviceB:8080/health`.
2.2 Common root causes & fixes
Symptom | Likely problem | Fix |
---|---|---|
<none> endpoints | Service selector doesn’t match pod labels | Align labels/selectors |
connection refused | Pod not listening on that port | Check containerPort/env config |
i/o timeout | NetworkPolicy default‑deny | Add allow‑rule between namespaces |
no such host | Wrong DNS name / cross‑namespace | Use <svc>.<namespace>.svc.cluster.local |
Works via pod IP, not via service | kube‑proxy / iptables issue | Restart kube‑proxy or investigate CNI |
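For the default‑deny timeout case above, a minimal allow rule might look like this (namespace, labels, and port are placeholders, not values from your cluster):

```yaml
# Hypothetical: allow traffic from the "frontend" namespace
# to pods labeled app=orderservice on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: orderservice
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend
      ports:
        - protocol: TCP
          port: 8080
```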
2.3 Special notes
- Hairpin routing: same‑pod calls via the service IP may fail unless hairpin mode is enabled. Use `localhost` or the pod IP instead.
- Ingress / Gateway: external 404 or TLS errors → inspect ingress events & controller logs.
- Service mesh (Istio/Linkerd): a mis‑configured VirtualService/DestinationRule can black‑hole traffic – check sidecar logs.
3 Why a deployment fails & how to debug
3.1 Pod states cheat‑sheet
Pod state | Typical reason | Key command |
---|---|---|
ImagePullBackOff | wrong image tag, private repo creds | kubectl describe pod → events |
CrashLoopBackOff | application exits (bad config, runtime error) | kubectl logs <pod> (add --previous ) |
Pending | unschedulable – resources, node selector, PVC | kubectl describe pod → FailedScheduling |
Init:Err… | init‑container failed | kubectl logs <pod> -c <init> |
OOMKilled | memory limit too low or leak | See § 1 |
3.2 Frequent culprits
- Image problems – tag doesn’t exist, registry secret missing.
- Bad env / ConfigMap / Secret – required var missing → NPE on startup.
- Probe mis‑match – readiness path wrong, delay too small.
- DB / broker not reachable – app exits if dependency unavailable.
- Container command / perms – wrong entrypoint (code 127) or non‑exec (code 126).
- Resource requests too high – cluster can’t fit pod.
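The exit codes in the last two bullets are easy to reproduce locally (paths here are throwaway examples):

```shell
# 127: command not found – e.g. a wrong entrypoint path
/no/such/binary 2>/dev/null || echo "missing entrypoint → exit $?"   # exit 127
# 126: file found but not executable – e.g. a missing +x bit
printf 'echo hi\n' > /tmp/demo.sh        # note: no chmod +x
/tmp/demo.sh 2>/dev/null || echo "not executable → exit $?"          # exit 126
```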
3.3 Workflow
- `kubectl describe deployment <name>` – replicas vs available.
- Drill into one failing pod → `kubectl describe pod <pod>`.
- Read the events at the bottom; they almost always point to the issue.
- Inspect container logs (& `--previous` logs) for the stacktrace.
5 Observability best practices
5.1 Three pillars
Pillar | Tooling | What to capture |
---|---|---|
Metrics | Prometheus → Grafana | CPU, mem, GC, latency, error rate |
Logs | EFK (Elasticsearch + Fluentd + Kibana) | JSON format; include traceId, spanId |
Traces | OpenTelemetry SDK → Jaeger | end‑to‑end request path |
5.2 Dashboards & alerts
- Service dashboard: RPS, P95 latency, 4xx/5xx, heap, GC pauses, thread count.
- Infra dashboard: node allocatable vs used, pod restarts, etc.
- Alert examples:
  - error_rate > 5 % for 5 m
  - memory_usage > 95 % for 3 m
  - Kafka consumer lag high for 10 m
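The first alert could be expressed as a Prometheus alerting rule roughly like this (the metric name `http_requests_total` and the label scheme are assumptions; substitute your own):

```yaml
# Sketch of the "error_rate > 5 % for 5 m" alert as a Prometheus rule
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
```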
5.3 Trace → log correlation
OpenTelemetry auto‑injects trace IDs into the SLF4J MDC, so you can query (Loki LogQL):
{service="payment"} |= "traceId=abcd1234"
and match that against Jaeger trace abcd1234.
6 Other operational gotchas
6.1 Database connections
- Sum of (instances × pool size) must be ≤ the DB’s max connections.
- Solutions:
  - shrink the per‑pod pool (`HikariCP.maximumPoolSize`)
  - introduce PgBouncer / ProxySQL
  - scale the DB / add read replicas.
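A quick arithmetic check for the first rule (replica count, pool size, and DB limit below are made‑up numbers):

```shell
# Sketch: 6 replicas × 20 connections each against a 100-connection Postgres
instances=6; pool_size=20; db_max=100
total=$((instances * pool_size))
[ "$total" -le "$db_max" ] \
  || echo "over budget by $((total - db_max)) – shrink pools or add PgBouncer"
# → over budget by 20 – shrink pools or add PgBouncer
```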
6.2 Certificates
- Track TLS secret expiry (`cert-manager` does this) and alert ≥ 14 days before expiry.
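If you are not running cert-manager, one way to implement the 14‑day check with plain `openssl` (secret and file names are placeholders):

```shell
# Pull the cert out of the TLS secret, then ask openssl if it survives 14 days:
#   kubectl get secret my-tls -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
openssl x509 -in tls.crt -noout -checkend $((14 * 24 * 3600)) 2>/dev/null \
  && echo "cert valid beyond 14 days" \
  || echo "cert expires within 14 days – rotate now"
```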
Handy command crib‑sheet
# Sort pods in all namespaces by container restart count (field selectors
# cannot filter on restartCount, so sort and eyeball the tail instead)
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'
# Watch memory usage of a pod in real time
kubectl top pod <pod> -n <ns>
# List NetworkPolicies affecting a pod
kubectl get netpol -o wide | grep <pod_label>