Deployment FAQ


1  Configuring Java container memory (heap & off‑heap)

1.1  Why Java processes exceed the limit

  • Heap (-Xmx) – objects live here.
  • Off‑heap – JVM metadata (Metaspace), JIT CodeCache, thread stacks (~1 MiB each), NIO direct buffers (Kafka, Netty), etc.
  • When you set -Xmx equal to the pod’s memory limit, the JVM still allocates off‑heap memory on top of the heap → the cgroup limit is breached → Kubernetes kills the container with SIGKILL (exit code 137, shown as OOMKilled).

1.2  Sizing formula

pod_memory ≈ Xmx
            + MaxDirectMemorySize
            + MaxMetaspaceSize
            + (#threads × stackSize)
            + safety_buffer

Rule of thumb: leave roughly 600 MiB of head‑room for off‑heap (e.g. limit = 1 GiB ⇒ ‑Xmx400m).

Example flags

# 1 GiB heap, 256 MiB Metaspace cap, 512 MiB direct-buffer cap, 1 MiB thread stacks
# (-Xss1m rather than -XX:ThreadStackSize=1m – the latter’s value is in KB).
# Keep comments outside the quotes: anything after '#' inside the string would
# become part of JAVA_TOOL_OPTIONS and break JVM start-up.
JAVA_TOOL_OPTIONS="\
  -Xmx1g \
  -XX:MaxMetaspaceSize=256m \
  -XX:MaxDirectMemorySize=512m \
  -Xss1m"

1.3  Detecting & troubleshooting

Signal          | Where you see it                           | What to do
----------------|--------------------------------------------|------------------------------------
Exit code 137   | kubectl get pods (STATUS = OOMKilled)      | Lower heap or raise limit
Long GC pauses  | JVM logs / Prometheus jvm_gc_pause_seconds | Tune heap sizes, investigate leaks
Off‑heap growth | jvm_memory_direct_bytes metric             | Cap MaxDirectMemorySize

Use Grafana dashboards (heap, direct, Metaspace, GC count, resident RSS) to spot trends before the limit is hit.
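
Two quick checks that settle whether the kill really was the OOM‑killer and where off‑heap memory is going (the jcmd call assumes the JVM runs as PID 1 in the container and was started with -XX:NativeMemoryTracking=summary):

# Last termination reason of the first container (e.g. OOMKilled)
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Per-area native memory breakdown (heap, Metaspace, threads, code, direct)
kubectl exec -it <pod> -- jcmd 1 VM.native_memory summary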


2  Microservice connectivity issues

2.1  Quick triage ladder

  1. Pod health – kubectl get pods & kubectl logs for errors.

  2. Service endpoints – kubectl describe svc <svc> → endpoints should list pod IP:port.

  3. DNS – inside a pod:

    kubectl exec -it <pod> -- nslookup orderservice.default
    
  4. NetworkPolicy – kubectl get networkpolicy -A | grep <app>.

  5. Port / protocol – confirm containerPort, targetPort, readiness.

  6. Curl test – kubectl exec -it <podA> -- curl -v http://serviceB:8080/health.
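
Steps 3–6 are easiest from a throwaway debug pod, so the test runs inside the cluster network. A sketch, where the netshoot image and the orderservice names are examples:

kubectl run nettest --rm -it --image=nicolaka/netshoot -- \
  sh -c 'nslookup orderservice.default.svc.cluster.local && \
         curl -sv http://orderservice.default:8080/health'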

2.2  Common root causes & fixes

Symptom                           | Likely problem                             | Fix
----------------------------------|--------------------------------------------|--------------------------------------------------
<none> endpoints                  | Service selector doesn’t match pod labels  | Align labels/selectors (see sketch below)
connection refused                | Pod not listening on that port             | Check containerPort/env config
i/o timeout                       | NetworkPolicy default‑deny                 | Add allow‑rule between namespaces (see sketch below)
no such host                      | Wrong DNS name / cross‑namespace           | Use <svc>.<namespace>.svc.cluster.local
Works via pod IP, not via service | kube‑proxy / iptables issue                | Restart kube‑proxy or investigate CNI
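
Sketches for the first and third rows (service name, labels and namespaces are examples):

# <none> endpoints → does the Service selector actually match the pod labels?
kubectl get svc orderservice -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels | grep orderservice

# i/o timeout → open the path from the caller's namespace despite default-deny
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend        # illustrative
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: orderservice
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend
    ports:
    - port: 8080
EOF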

2.3  Special notes

  • Hairpin routing: a pod calling its own service IP may fail unless hairpin mode is enabled. Use localhost or the pod IP instead.
  • Ingress / Gateway: external 404 or TLS errors → inspect ingress events & controller logs.
  • Service mesh (Istio/Linkerd): mis‑configured VirtualService/DestinationRule can black‑hole traffic – check sidecar logs.

3  Why a deployment fails & how to debug

3.1  Pod states cheat‑sheet

Pod state        | Typical reason                                | Key command
-----------------|------------------------------------------------|------------------------------------------
ImagePullBackOff | wrong image tag, private repo creds            | kubectl describe pod → events
CrashLoopBackOff | application exits (bad config, runtime error)  | kubectl logs <pod> (add --previous)
Pending          | unschedulable – resources, node selector, PVC  | kubectl describe pod → FailedScheduling
Init:Err…        | init‑container failed                          | kubectl logs <pod> -c <init>
OOMKilled        | memory limit too low or leak                   | See § 1

3.2  Frequent culprits

  • Image problems – tag doesn’t exist, registry secret missing.
  • Bad env / ConfigMap / Secret – required var missing → NPE on startup.
  • Probe mis‑match – readiness path wrong, delay too small (see the patch sketch after this list).
  • DB / broker not reachable – app exits if dependency unavailable.
  • Container command / perms – wrong entrypoint (code 127) or non‑exec (code 126).
  • Resource requests too high – cluster can’t fit pod.
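
For the probe case, a one-line patch buys a slow-starting app more time (the deployment name is an example; the replace op assumes a readinessProbe already exists at that path):

kubectl patch deployment payment --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
   "value": 30}
]'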

3.3  Workflow

  1. kubectl describe deployment <name> – replicas vs available.
  2. Drill into one failing pod → kubectl describe pod.
  3. Read events at bottom; they almost always point to the issue.
  4. Inspect container logs & previous logs for stacktrace.
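
Condensed into commands (deployment name and label are examples):

kubectl describe deployment payment                  # 1. replicas vs available
kubectl get pods -l app=payment                      # 2. pick a failing pod
kubectl describe pod <pod> | sed -n '/Events:/,$p'   # 3. events name the cause
kubectl logs <pod> --previous                        # 4. stacktrace of the last crash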


4  Observability best practices

4.1  Three pillars

Pillar  | Tooling                            | What to capture
--------|-------------------------------------|-----------------------------------
Metrics | Prometheus → Grafana               | CPU, mem, GC, latency, error rate
Logs    | EFK (Elasticsearch/Fluentd/Kibana) | JSON, include traceId, spanId
Traces  | OpenTelemetry SDK → Jaeger         | end‑to‑end request path

4.2  Dashboards & alerts

  • Service dashboard: RPS, P95 latency, 4xx/5xx, heap, GC pauses, thread count.

  • Infra dash: node allocatable vs used, pod restarts, etc.

  • Alerts examples

    • error_rate > 5 % for 5 m
    • memory_usage > 95 % for 3 m
    • Kafka consumer lag high for 10 m
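
As a sketch, the first alert expressed as a prometheus-operator PrometheusRule; the metric name follows Micrometer/Spring Boot conventions and is an assumption, so adapt it to whatever your exporter emits:

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-alerts
spec:
  groups:
  - name: payment.rules
    rules:
    - alert: HighErrorRate
      # share of 5xx responses above 5 % for 5 minutes
      expr: |
        sum(rate(http_server_requests_seconds_count{service="payment",status=~"5.."}[5m]))
          / sum(rate(http_server_requests_seconds_count{service="payment"}[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
EOF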

4.3  Trace → log correlation

OpenTelemetry auto‑injects trace IDs into the SLF4J MDC, so you can (Loki LogQL shown; in Kibana, filter on the traceId field):

query: {service="payment"} |= "traceId=abcd1234"

and match that against Jaeger trace abcd1234.


5  Other operational gotchas

5.1  Database connections

  • The sum over all services of (instances × pool size) must stay ≤ the database’s max_connections.

  • Solutions:

    • shrink the per‑pod pool (HikariCP maximumPoolSize – see the sketch after this list)
    • introduce PgBouncer / ProxySQL
    • scale the DB / add read‑replicas.
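
Worked example: 20 replicas × 10 connections each = 200, which must stay below the database’s max_connections. Capping the Hikari pool via env binding (the deployment name is an example; the variable assumes Spring Boot with HikariCP):

kubectl set env deployment/payment SPRING_DATASOURCE_HIKARI_MAXIMUM_POOL_SIZE=10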

5.2  Certificates

  • Track TLS secret expiry (cert-manager does this) and alert ≥ 14 days before.
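
With cert-manager installed, expiry is already exposed on its Certificate resources:

# List certificates and their readiness across namespaces
kubectl get certificates -A

# Inspect the actual expiry ("Not After") of one certificate
kubectl describe certificate <name> -n <ns> | grep -i 'not after'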

Handy command crib‑sheet

# Show pods sorted by restart count – field selectors can't filter on restartCount
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'

# Watch memory usage of a pod in real time
kubectl top pod <pod> -n <ns>

# List NetworkPolicies affecting a pod
kubectl get netpol -o wide | grep <pod_label>