Process Killer Strategies: When to Kill vs. Restart a Process
Key criteria for the decision
- Impact on users/data: If the process holds unsaved user data or critical transactions, prefer restart attempts that preserve state (graceful stop/restart). If state is irrecoverable or corruption risk is high, killing may be safer.
- Responsiveness and progress: If a process responds to signals (e.g., SIGTERM) and shows progress toward shutdown, allow graceful termination. If it’s unresponsive for a configured timeout, escalate to forceful kill (e.g., SIGKILL).
- Resource consumption: High CPU, memory, or I/O that jeopardizes system stability justifies an immediate kill if mitigation (throttling, reprioritizing) isn’t possible.
- Error type and recurrence: For transient faults (network glitch, temporary resource spike), restart often suffices. For repeated crashes with the same stack trace or state, investigate before automated restarts to avoid crash loops.
- Dependencies and cascading effects: If stopping the process cleanly prevents cascading failures in dependent services, prefer graceful restart. If a stuck process blocks other critical services, a kill may be necessary.
- Time sensitivity and SLA: For strict uptime/SLA needs, automated restarts may be preferable with health checks and circuit breakers; for non-critical jobs, manual intervention can reduce risk.
Practical strategy (recommended policy)
- Attempt graceful shutdown
  - Send a polite termination request (e.g., SIGTERM or the service's stop API) and wait a short, configurable timeout (e.g., 5–30s); the full escalation path is sketched in code after this list.
- Collect diagnostics
  - On timeout, capture logs, stack traces, heap dumps, or thread dumps before killing (if feasible).
- Force kill if still unresponsive
  - Use an immediate kill (e.g., SIGKILL) to free resources.
- Restart with safeguards
  - Restart the process with exponential backoff delays, and limit restart attempts per time window to avoid crash loops.
- Health checks and monitoring
  - Use liveness/readiness probes to detect failure early and avoid unnecessary restarts. Alert on repeated failures.
- Automated vs. manual escalation
  - Configure automated restarts for transient issues; escalate to on-call when thresholds are exceeded (e.g., ≥3 restarts in 10 minutes).
- Postmortem and root cause
  - After stabilization, perform a root-cause analysis if kills/restarts exceed acceptable rates.
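The escalation path above (graceful stop, diagnostics, force kill, restart with backoff) can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: it assumes a POSIX process addressed by PID, and the restart command, PID, timeout values, and `collect_diagnostics()` placeholder are hypothetical stand-ins for your own tooling and tuning.

```python
#!/usr/bin/env python3
"""Sketch of the stop -> diagnose -> kill -> restart-with-backoff policy (POSIX)."""
import os
import signal
import subprocess
import time

GRACEFUL_TIMEOUT = 30   # seconds to wait after SIGTERM; tune per workload
MAX_RESTARTS = 3        # automated restart attempts before escalating
SETTLE_TIME = 5         # seconds to wait before judging a restart healthy


def is_alive(pid):
    """Signal 0 delivers nothing but raises if the process no longer exists."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def collect_diagnostics(pid):
    """Placeholder: capture logs, stack traces, or thread dumps before killing."""
    print(f"collecting diagnostics for pid {pid}")


def stop_process(pid):
    """Graceful stop first; force kill only if the timeout expires."""
    try:
        os.kill(pid, signal.SIGTERM)          # polite termination request
    except ProcessLookupError:
        return                                # already gone
    deadline = time.monotonic() + GRACEFUL_TIMEOUT
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return                            # clean shutdown succeeded
        time.sleep(0.5)
    collect_diagnostics(pid)                  # grab evidence while it is stuck
    os.kill(pid, signal.SIGKILL)              # immediate kill to free resources


def restart_with_backoff(restart_cmd):
    """Relaunch with exponential backoff and a capped number of attempts."""
    delay = 1.0
    for attempt in range(1, MAX_RESTARTS + 1):
        proc = subprocess.Popen(restart_cmd)
        time.sleep(SETTLE_TIME)
        if proc.poll() is None:               # still running: good enough for this sketch
            print(f"restart attempt {attempt} succeeded (pid {proc.pid})")
            return
        print(f"restart attempt {attempt} failed; retrying in {delay:.0f}s")
        time.sleep(delay)
        delay *= 2                            # exponential backoff between attempts
    print("restart limit reached; escalate to on-call")


if __name__ == "__main__":
    stop_process(12345)                                        # hypothetical PID
    restart_with_backoff(["/usr/bin/my-service", "--serve"])   # hypothetical command
```

In a real deployment the "still running after a few seconds" check would be replaced by your actual health probe, and the restart counter would be tracked across a time window rather than a single invocation.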
Implementation tips
- Signal handling: Implement clean shutdown handlers to flush state and close resources on graceful termination (a minimal handler is sketched after this list).
- Timeouts and thresholds: Tune timeouts and restart limits for your workload; database-backed services often need longer shutdown windows.
- Isolation: Run risky processes in containers or cgroups to limit collateral damage and simplify kill/restart.
- Backups and checkpoints: Regularly checkpoint state so restarts can resume with minimal data loss.
- Avoid blind cron kills: Prefer targeted detection (health checks, resource monitors) over periodic brute-force kills.
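For the signal-handling tip, a minimal worker that cooperates with SIGTERM might look like the sketch below. It assumes a simple loop-based worker; `flush_state()` and `close_resources()` are hypothetical placeholders for whatever persistence and cleanup your service actually needs.

```python
# Minimal clean-shutdown handler for a loop-based worker.
import signal
import sys
import time

shutting_down = False


def handle_sigterm(signum, frame):
    """Just set a flag; the main loop finishes its current unit of work and exits."""
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)   # treat Ctrl-C the same way

while not shutting_down:
    # ... one unit of work, small enough to finish within the stop timeout ...
    time.sleep(1)

# Graceful-termination path: flush and close so a restart can resume cleanly.
# flush_state()       # hypothetical: persist in-memory state / write a checkpoint
# close_resources()   # hypothetical: connections, file handles, locks
sys.exit(0)
```

Keeping each unit of work short is what makes the graceful timeout in the policy above realistic; a handler that merely sets a flag cannot interrupt a long-running iteration.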
Quick decision checklist
- Is data at risk? → Prefer graceful restart.
- Is the process responsive to termination? → Allow graceful shutdown.
- Is the system stability threatened? → Kill to free resources.
- Has this happened repeatedly? → Investigate before auto-restarting.
- Are safeguards in place (backoff, alerts)? → Proceed with automated restart.
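One way to make the checklist concrete is a tiny decision helper like the sketch below. The field names, the ordering of the checks, and the three-failure threshold are assumptions layered on top of the checklist (the threshold mirrors the escalation example earlier), not a definitive implementation.

```python
# Sketch: the quick decision checklist as a small helper function.
from dataclasses import dataclass


@dataclass
class ProcessStatus:
    data_at_risk: bool          # unsaved user data or in-flight transactions?
    responds_to_sigterm: bool   # answers the polite stop request?
    threatens_stability: bool   # starving the host of CPU, memory, or I/O?
    recent_failures: int        # crashes/restarts in the current window
    safeguards_in_place: bool   # backoff, restart limits, alerting configured?


def decide(status: ProcessStatus) -> str:
    if status.recent_failures >= 3:
        return "investigate"            # repeated failures: stop auto-restarting
    if status.threatens_stability and not status.responds_to_sigterm:
        return "kill-then-restart"      # free resources immediately
    if status.data_at_risk or status.responds_to_sigterm:
        return "graceful-restart"       # preserve state, allow clean shutdown
    if status.safeguards_in_place:
        return "automated-restart"
    return "manual-review"


# Example: a stuck, resource-hungry process that ignores SIGTERM
print(decide(ProcessStatus(False, False, True, 1, True)))   # -> kill-then-restart
```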
This strategy balances safety (preserving data/state) with system availability (freeing resources quickly when needed).