Taming the OOM Killer: Process Prioritization for Memory-Constrained Linux Systems

Posted on Fri 18 April 2025 in DevSecOps • Tagged with linux, oomkiller, memory, system-administration, devsecops, process-management, hardening

In memory-constrained environments, the Linux OOM Killer decides what lives and what gets killed. This guide shows how to protect critical processes like sshd and mysqld using oom_score_adj values, with a script that applies them reliably and securely. Make memory pressure predictable and survivable.



The 208.5-Day Kernel Bug: A Lesson in Uptime, Overflow, and Operational Risk

Posted on Wed 16 April 2025 in DevSecOps • Tagged with kernel, bug, Linux, uptime, overflow, devsecops, integer-overflow

A 2012 Linux kernel bug caused CPU lockups after 208.5 days of uptime due to an integer overflow in sched_clock(). Affecting RHEL 5 and 6, it exposed the risks of long uptimes, underscoring the importance of timely patching, uptime observability, and operational risk management in DevSecOps.
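
As a quick sanity check on the 208.5-day figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration, and it rests on an assumption not stated in the summary above: that the nanosecond value overflows once it needs more than 54 bits (64-bit arithmetic with a 10-bit fixed-point scale factor). The names `OVERFLOW_BITS` and `days_until_overflow` are hypothetical.

```python
# Assumption: the scaled clock value overflows at 2**54 nanoseconds
# (a 64-bit intermediate minus a 10-bit fixed-point shift).
OVERFLOW_BITS = 54
NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day


def days_until_overflow(bits: int = OVERFLOW_BITS) -> float:
    """Return the uptime in days at which a 2**bits nanosecond counter wraps."""
    return (2 ** bits) / NS_PER_DAY


if __name__ == "__main__":
    print(f"{days_until_overflow():.1f} days")
```

Under that assumption the result rounds to 208.5 days, which matches the uptime threshold named in the title.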



The Chaos of the Leap Second (2012): When Time Broke Java and the Cloud

Posted on Tue 15 April 2025 in Incident Retrospectives • Tagged with leap-second, kernel, linux, java, ntp, distributed-systems, devops, sre, incident-retrospective

In 2012, a single leap second triggered global outages across Reddit, Yelp, and more. This retrospective unpacks how fragile timekeeping broke Java apps at scale, and what DevOps, SRE, and distributed systems teams can do today to avoid repeating history.

