Deployment FAQ
1 Configuring Java container memory (heap & off‑heap)
1.1 Why Java processes exceed the limit
- Heap (`-Xmx`) – objects live here.
- Off‑heap – JVM metadata (Metaspace), JIT CodeCache, thread stacks (~1 MiB each), NIO direct buffers (Kafka, Netty), etc.
- When you set `-Xmx` equal to the pod’s memory limit, the JVM can still allocate off‑heap memory → cgroup limit breached → Kubernetes kills the container with SIGKILL (exit code 137).
1.2 Sizing formula
pod_memory ≈ Xmx
+ MaxDirectMemorySize
+ MaxMetaspaceSize
+ (#threads × stackSize)
+ safety_buffer
Rule of thumb: leave ~600 MB head‑room (e.g. limit = 1 GiB ⇒ `-Xmx400m`).
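Plugging illustrative numbers into the formula (all figures below are assumptions for a mid‑sized service, not recommendations):

```shell
# Sketch: estimate the pod memory limit from the formula above (values in MiB)
heap=1024              # -Xmx
direct=512             # -XX:MaxDirectMemorySize
metaspace=256          # -XX:MaxMetaspaceSize
threads=200; stack=1   # ~200 threads × 1 MiB stacks
buffer=256             # safety head-room
echo "limit >= $((heap + direct + metaspace + threads * stack + buffer)) MiB"
# → limit >= 2248 MiB
```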
Example flags
# 1 GiB heap plus capped off‑heap pools
JAVA_TOOL_OPTIONS="-Xmx1g \
  -XX:MaxMetaspaceSize=256m \
  -XX:MaxDirectMemorySize=512m \
  -Xss1m"
1.3 Detecting & troubleshooting
Signal | Where you see it | What to do |
---|---|---|
ExitCode 137 | kubectl get pods (STATUS = OOMKilled) | Lower heap or raise limit |
Long GC pauses | JVM logs / Prometheus jvm_gc_pause_seconds | Tune heap sizes, investigate leaks |
Off‑heap growth | jvm_memory_direct_bytes metric | Cap MaxDirectMemorySize |
Use Grafana dashboards (heap, direct, Metaspace, GC count, resident RSS) to spot trends before the limit is hit.
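To confirm an OOM kill, the container’s last termination state records the reason and exit code; a quick sketch (the pod name is a placeholder):

```shell
# Show why the last container instance died (hypothetical pod name):
#   kubectl get pod my-app-pod -o \
#     jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Exit code 137 = 128 + signal number, i.e. SIGKILL (9):
code=137
echo "killed by signal $((code - 128))"   # → killed by signal 9
```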
2 Microservice connectivity issues
2.1 Quick triage ladder
- Pod health – `kubectl get pods` & `kubectl logs` for errors.
- Service endpoints – `kubectl describe svc <svc>` → endpoints should list pod IP:port.
- DNS – inside a pod: `kubectl exec -it <pod> -- nslookup orderservice.default`
- NetworkPolicy – `kubectl get networkpolicy -A | grep <app>`.
- Port / protocol – confirm `containerPort`, `targetPort`, readiness.
- Curl test – `kubectl exec -it <podA> -- curl -v http://serviceB:8080/health`.
2.2 Common root causes & fixes
Symptom | Likely problem | Fix |
---|---|---|
<none> endpoints | Service selector doesn’t match pod labels | Align labels/selectors |
connection refused | Pod not listening on that port | Check containerPort/env config |
i/o timeout | NetworkPolicy default‑deny | Add allow‑rule between namespaces |
no such host | Wrong DNS name / cross‑namespace | Use <svc>.<namespace>.svc.cluster.local |
Works via pod IP, not via service | kube‑proxy / iptables issue | Restart kube‑proxy or investigate CNI |
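For the default‑deny timeout case above, a minimal allow rule might look like this (namespace, labels, and port are placeholders, not values from your cluster):

```yaml
# Hypothetical: allow traffic from the "frontend" namespace
# to pods labeled app=orderservice on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: orderservice
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend
      ports:
        - protocol: TCP
          port: 8080
```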
2.3 Special notes
- Hairpin routing: same‑pod calls via the service IP may fail unless hairpin mode is enabled. Use `localhost` or the pod IP instead.
- Ingress / Gateway: external 404 or TLS errors → inspect ingress events & controller logs.
- Service mesh (Istio/Linkerd): a mis‑configured VirtualService/DestinationRule can black‑hole traffic – check sidecar logs.
3 Why a deployment fails & how to debug
3.1 Pod states cheat‑sheet
Pod state | Typical reason | Key command |
---|---|---|
ImagePullBackOff | wrong image tag, private repo creds | kubectl describe pod → events |
CrashLoopBackOff | application exits (bad config, runtime error) | kubectl logs <pod> (add --previous ) |
Pending | unschedulable – resources, node selector, PVC | kubectl describe pod → FailedScheduling |
Init:Err… | init‑container failed | kubectl logs <pod> -c <init> |
OOMKilled | memory limit too low or leak | See § 1 |
3.2 Frequent culprits
- Image problems – tag doesn’t exist, registry secret missing.
- Bad env / ConfigMap / Secret – required var missing → NPE on startup.
- Probe mis‑match – readiness path wrong, delay too small.
- DB / broker not reachable – app exits if dependency unavailable.
- Container command / perms – wrong entrypoint (code 127) or non‑exec (code 126).
- Resource requests too high – cluster can’t fit pod.
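The exit codes in the last two bullets are easy to reproduce locally (paths here are throwaway examples):

```shell
# 127: command not found – e.g. a wrong entrypoint path
/no/such/binary 2>/dev/null || echo "missing entrypoint → exit $?"   # exit 127
# 126: file found but not executable – e.g. a missing +x bit
printf 'echo hi\n' > /tmp/demo.sh        # note: no chmod +x
/tmp/demo.sh 2>/dev/null || echo "not executable → exit $?"          # exit 126
```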
3.3 Workflow
- `kubectl describe deployment <name>` – replicas vs available.
- Drill into one failing pod → `kubectl describe pod <pod>`.
- Read the events at the bottom; they almost always point to the issue.
- Inspect container logs (& `--previous` logs) for the stacktrace.
5 Observability best practices
5.1 Three pillars
Pillar | Tooling | What to capture |
---|---|---|
Metrics | Prometheus → Grafana | CPU, mem, GC, latency, error rate |
Logs | EFK (Elasticsearch + Fluentd + Kibana) | JSON format; include traceId, spanId |
Traces | OpenTelemetry SDK → Jaeger | end‑to‑end request path |
5.2 Dashboards & alerts
- Service dashboard: RPS, P95 latency, 4xx/5xx, heap, GC pauses, thread count.
- Infra dashboard: node allocatable vs used, pod restarts, etc.
- Alert examples:
  - error_rate > 5 % for 5 m
  - memory_usage > 95 % for 3 m
  - Kafka consumer lag high for 10 m
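The first alert could be expressed as a Prometheus alerting rule roughly like this (the metric name `http_requests_total` and the label scheme are assumptions; substitute your own):

```yaml
# Sketch of the "error_rate > 5 % for 5 m" alert as a Prometheus rule
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
```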
5.3 Trace → log correlation
OpenTelemetry auto‑injects trace IDs into the SLF4J MDC, so you can query (Loki LogQL):
{service="payment"} |= "traceId=abcd1234"
and match that against Jaeger trace abcd1234.
6 Other operational gotchas
6.1 Database connections
- Sum of (instances × pool size) must be ≤ the DB’s max connections.
- Solutions:
  - shrink the per‑pod pool (`HikariCP.maximumPoolSize`)
  - introduce PgBouncer / ProxySQL
  - scale the DB / add read replicas.
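A quick arithmetic check for the first rule (replica count, pool size, and DB limit below are made‑up numbers):

```shell
# Sketch: 6 replicas × 20 connections each against a 100-connection Postgres
instances=6; pool_size=20; db_max=100
total=$((instances * pool_size))
[ "$total" -le "$db_max" ] \
  || echo "over budget by $((total - db_max)) – shrink pools or add PgBouncer"
# → over budget by 20 – shrink pools or add PgBouncer
```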
6.2 Certificates
- Track TLS secret expiry (`cert-manager` does this) and alert ≥ 14 days before expiry.
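If you are not running cert-manager, one way to implement the 14‑day check with plain `openssl` (secret and file names are placeholders):

```shell
# Pull the cert out of the TLS secret, then ask openssl if it survives 14 days:
#   kubectl get secret my-tls -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
openssl x509 -in tls.crt -noout -checkend $((14 * 24 * 3600)) 2>/dev/null \
  && echo "cert valid beyond 14 days" \
  || echo "cert expires within 14 days – rotate now"
```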
Handy command crib‑sheet
# Sort pods in all namespaces by container restart count (field selectors
# cannot filter on restartCount, so sort and eyeball the tail instead)
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'
# Watch memory usage of a pod in real time
kubectl top pod <pod> -n <ns>
# List NetworkPolicies affecting a pod
kubectl get netpol -o wide | grep <pod_label>