DevSecOps Notes

The Trust Decay: Why Modern Hiring Has Become an Adversarial System

2026-05-04T00:00:00-07:00

The Duality of the Current Market

The tech job market is currently defined by a jarring paradox. On one side, elite engineers land roles in days; on the other, equally qualified peers face months of silence. These aren't conflicting data points - they are the predictable outputs of a system under extreme duress.

The hiring pipeline has ceased to be a discovery engine designed to find talent. It has evolved into a defensive perimeter designed to mitigate risk in a low-trust environment.

From Discovery to Defense: The Death of Honest Inputs

Historically, recruitment operated on the assumption of "manageable honesty." You received a stack of resumes, assumed most were reasonably accurate, and searched for the best fit.

That model has collapsed. Today, hiring systems are bombarded by "strategically optimized noise," including:

Hyper-Automated Workflows: Candidates applying to hundreds of roles via LLM-powered scripts.
Synthetic Resumes: AI-generated profiles perfectly tuned to trigger every keyword in a Job Description (JD).
Signal Dilution: When every applicant looks like a 95% match on paper, the "match" itself becomes meaningless.

From a systems engineering perspective, the pipeline is now facing adversarial inputs. When a system is flooded with high-volume, low-integrity data, it naturally shifts its posture from "open" to "fortified."

How the System Defends Itself

When trust in incoming data drops, the system compensates with three defensive maneuvers:

1. The False Negative Bias

In a high-noise environment, the cost of a "False Positive" (a bad hire) outweighs the cost of a "False Negative" (missing a great candidate). Consequently, filters are tightened to an extreme degree. If a candidate cannot be verified with absolute certainty at the first gate, they are discarded.

2. Signal Collapse

As presentation becomes commoditized through AI, "looking the part" no longer serves as a differentiator. If everyone's resume is a work of art, no one's resume is. This leads to ranking paralysis, where recruiters rely on arbitrary or conservative heuristics because they can no longer distinguish between genuine expertise and successful optimization.

3. Upstream Trust Migration

Because the public pipeline is compromised, hiring teams are retreating to "pre-validated" channels. This explains the heavy reliance on internal referrals and known networks. It's not necessarily cronyism; it's an architectural necessity to find signal in a sea of noise.

The Feedback Loop of Friction

We are trapped in a recursive cycle. Candidates optimize harder to bypass filters; in response, filters become more draconian. This creates a "Degraded Trust Loop" where the system's own success at filtering further incentivizes candidates to game the system.

Ultimately, the pipeline stops being a way to find people and becomes a way to manage risk.

The New Strategy: Proof Over Presentation

If the market is a low-trust system, "Presentation" (how you describe yourself) is losing its value. What remains valuable is Evidence - signals that are computationally or socially expensive to fake.

Moving from "I Did" to "Here Is"

To bypass the defensive perimeter, engineers must move beyond the resume. The goal is to provide externally verifiable artifacts that don't require the pipeline to "believe" you.

Architectural Transparency: Don't just list technologies; publish (abstracted) system designs, trade-off analyses, and post-mortems of failure modes.
Tangible Artifacts: Real-world contributions - whether through open-source modules, infrastructure-as-code repos, or documented homelabs - serve as proof-of-work.
Impact-Oriented Signaling: Shift from "tasks completed" to "business outcomes achieved." Hard numbers on risk reduction, latency improvements, or cost savings are much harder to hallucinate effectively.

A DevSecOps View of the Career

If we treat the job market as a security problem, the solution becomes clear. The hiring pipeline is a system with exposed endpoints and high validation costs. As any security professional knows, when you can't trust the input, you lean on Multi-Factor Authentication. In hiring, your "factors" are your network, your public evidence, and your verifiable history.

The market isn't "broken" - it has simply changed its objective function. It no longer prioritizes finding the best; it prioritizes avoiding the unverified.

Success in 2026 and beyond isn't about having the most optimized resume. It's about being the most difficult to doubt. In a world of automated noise, reliability is the only signal that scales.

Never Lose Connection: Multi-Phone Bluetooth Tethering for Pwnagotchi

2025-07-22T00:00:00-07:00

The Common Pwnagotchi Tethering Problem

If you're an active Pwnagotchi user, you've likely faced the frustration of losing internet connectivity in the field. Whether you forgot your primary tethering phone, moved out of range, or encountered a "silent disconnect" where your phone still reports a connection but lacks actual WAN access (like a captive portal redirect), the default Bluetooth tethering often leaves your Pwnagotchi stranded. This means missed opportunities for handshakes and updates.

Introducing `bt-tether-multi`: Your Pwnagotchi's Ultimate Network Backup

I built bt-tether-multi to make Pwnagotchi networking resilient and autonomous. This plugin empowers your device to:

Intelligently Connect: Configure a list of multiple phones, prioritized by your preference, for seamless Bluetooth tethering.
Proactive WAN Detection: Detect actual loss of internet access (not just Bluetooth connection) using real-world checks.
Automatic Fallback: Gracefully switch to the next available phone in your list if the current connection drops or loses WAN.
Smart Retries: Implement a configurable retry delay to prevent rapid, unproductive cycling through phones during temporary network issues.
Clear UI Feedback: Provide immediate visual cues on the Pwnagotchi's e-ink display about its tethering status.

How It Works Under the Hood

bt-tether-multi integrates directly with your Pwnagotchi's system. Upon loading, it reads your carefully defined list of tethering phones from the config.toml file. This configuration includes essential details like the phone's name, MAC address, IP address, and operating system type (Android or iOS) to ensure correct gateway settings.

The plugin leverages standard Linux networking tools:

nmcli (NetworkManager CLI): Used to programmatically manage Bluetooth connections, including adding, deleting, and bringing up/down network interfaces for your paired phones.
curl: Employed for a fast (--max-time 3), non-intrusive check to https://www.google.com to verify genuine WAN connectivity. If curl can't reach the internet, the plugin considers the WAN lost.
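
To make the WAN check concrete, here is a minimal sketch of how such a probe could be wrapped in Python. It is not the plugin's actual code: the function names `build_wan_check_cmd` and `has_wan` are hypothetical, and every curl flag other than `--max-time 3` is my own assumption about what a quiet probe would use.

```python
import shutil
import subprocess
from typing import Callable, List, Optional

CHECK_URL = "https://www.google.com"  # URL named in the text
MAX_TIME_SECONDS = 3                  # mirrors curl's --max-time 3


def build_wan_check_cmd(curl_path: str, url: str = CHECK_URL,
                        max_time: int = MAX_TIME_SECONDS) -> List[str]:
    """Return the argument list for a quiet, time-limited curl probe."""
    return [curl_path, "--silent", "--head", "--output", "/dev/null",
            "--max-time", str(max_time), url]


def has_wan(run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
            curl_path: Optional[str] = None) -> bool:
    """Return True if curl exits 0, i.e. the URL answered within the limit."""
    curl_path = curl_path or shutil.which("curl")
    if curl_path is None:
        return False  # curl is not installed; treat this as "no WAN"
    try:
        result = run(build_wan_check_cmd(curl_path), shell=False,
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                     timeout=MAX_TIME_SECONDS + 2)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

Passing the runner in as a parameter keeps the check testable without touching the network, which is handy when you are debugging on a bench rather than in the field.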

UI Status Indicators:

The Pwnagotchi's display provides immediate feedback:

B:<name>: Successfully connected to one of your configured phones. The name is truncated for display.
B:???: Bluetooth is connected, but the active phone is not recognized in your configured list. This might indicate an unexpected connection or a misconfiguration.
...: The plugin is currently in the process of rotating through connections or attempting to establish one.
X: Disconnected from all configured phones.
!: A configuration error or plugin-related issue has occurred.

The sequential fallback and retry logic ensure that your Pwnagotchi stays online with minimal intervention, rotating through your devices until a stable internet connection is found.
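
The sequential fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's real implementation: `connect` and `has_wan` are hypothetical callables supplied by the caller, and `max_rounds` exists only so the loop can be exercised in a test.

```python
import time
from typing import Callable, Optional, Sequence


def pick_working_phone(phones: Sequence[dict],
                       connect: Callable[[dict], bool],
                       has_wan: Callable[[], bool]) -> Optional[dict]:
    """Try phones in priority order; return the first one with real WAN access."""
    for phone in phones:
        if connect(phone) and has_wan():
            return phone
    return None


def keep_tethered(phones: Sequence[dict],
                  connect: Callable[[dict], bool],
                  has_wan: Callable[[], bool],
                  retry_delay: float = 180,
                  sleep: Callable[[float], None] = time.sleep,
                  max_rounds: Optional[int] = None) -> Optional[dict]:
    """Rotate through the list until a phone works, pausing between rounds."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        phone = pick_working_phone(phones, connect, has_wan)
        if phone is not None:
            return phone
        rounds += 1
        sleep(retry_delay)  # avoid rapid, unproductive cycling
    return None
```

The pause happens only after a full pass through the list fails, so a healthy second phone is picked up immediately while a total outage does not turn into a busy loop.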

Installation and Configuration

Installing bt-tether-multi is straightforward:

Download: Place the plugin file (bt.py from the GitHub repository) into your Pwnagotchi's custom plugin directory (typically /etc/pwnagotchi/custom-plugins/).
Configure: Add your phone details to your config.toml file. Here's a simplified example of what your config.toml might look like:

main.plugins.bt-tether-multi.enabled = true
main.plugins.bt-tether-multi.phones = [
  { name = "MyAndroid", mac = "XX:XX:XX:XX:XX:XX", ip = "192.168.44.44", type = "android" },
  { name = "MyiPhone", mac = "YY:YY:YY:YY:YY:YY", ip = "172.20.10.10", type = "ios" },
]
main.plugins.bt-tether-multi.retry_delay = 180  # Optional: customize retry delay (seconds)

Important: Replace XX:XX:XX:XX:XX:XX and YY:YY:YY:YY:YY:YY with your actual phone MAC addresses. Ensure your IP addresses match what your phone assigns to the Pwnagotchi's Bluetooth interface.

For a comprehensive guide and the most up-to-date configuration examples, please refer to the GitHub README in the repository.

Security Considerations

As with any tool that interacts with your system's networking, security is paramount. This plugin has been rigorously scanned with Bandit, a leading Python security linter.

The scan reported "Low Severity" warnings primarily related to the use of the subprocess module. It's crucial to understand why these are considered acceptable in this context and how they're mitigated:

No shell=True: All external commands (nmcli, curl, bluetoothctl) are executed with shell=False. This is a critical security measure as it prevents arbitrary shell command injection by treating all arguments as literal strings, not executable code.
Full Paths for Executables: The plugin uses shutil.which to resolve nmcli, curl, and bluetoothctl to absolute paths and invokes them by those paths. This makes the exact binary being executed explicit and addresses Bandit's partial-path warning. Because shutil.which itself searches PATH, the protection only holds if PATH is trustworthy at the moment the plugin resolves the commands.
Strict Input Validation: All dynamic inputs (like MAC addresses, phone names, and IP addresses) coming from your config.toml are subjected to strict regular expression and ipaddress module validation before being passed to subprocess commands. This ensures that only well-formed and safe values are used.
Controlled Environment: Pwnagotchi runs in a specific, often isolated, environment. While caution is always advised, the risk surface is contained compared to a general-purpose server.

The "Low Severity" warnings are primarily general advisories about the potential for misuse of subprocess, rather than indicative of a direct, exploitable vulnerability in this specific implementation, given the defensive measures taken.
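
As an illustration of the input-validation measure listed above, here is a small sketch of how config entries could be checked before they reach subprocess. The function name `validate_phone` and the name allow-list are my assumptions for this example; the plugin's actual patterns may differ.

```python
import ipaddress
import re

MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")
NAME_RE = re.compile(r"[A-Za-z0-9 _.-]{1,32}")  # assumed allow-list, not the plugin's
ALLOWED_TYPES = {"android", "ios"}


def validate_phone(entry: dict) -> dict:
    """Return a cleaned copy of one phone entry or raise ValueError."""
    name = entry.get("name")
    mac = entry.get("mac")
    ip = entry.get("ip")
    os_type = entry.get("type")
    if not isinstance(name, str) or not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid phone name: {name!r}")
    if not isinstance(mac, str) or not MAC_RE.fullmatch(mac):
        raise ValueError(f"invalid MAC address: {mac!r}")
    if os_type not in ALLOWED_TYPES:
        raise ValueError(f"invalid phone type: {os_type!r}")
    # ipaddress raises AddressValueError (a ValueError subclass) on bad input.
    address = ipaddress.IPv4Address(ip)
    return {"name": name, "mac": mac.upper(), "ip": str(address), "type": os_type}
```

Using `fullmatch` rather than `match` matters here: it rejects values with trailing characters, such as a MAC address followed by a newline or an extra argument.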

Final Thoughts

bt-tether-multi is designed for the Pwnagotchi enthusiast who values uptime and autonomy. It transforms a common point of failure into a robust, self-managing solution. No more restarting your Pwnagotchi or manually re-tethering when your connection goes south.

This plugin has become an indispensable part of my Pwnagotchi setup, saving me countless headaches in the field. I invite you to try it out and contribute to its development!

Find the source code, detailed installation instructions, and contribute to the project on GitHub: rivassec/bt-tether-multi

Secure Snapshot Verification in Elasticsearch with Minimal Privileges

2025-04-20T00:00:00-07:00


Verifying Elasticsearch snapshots typically requires broad manage permissions. This can be risky, especially if credentials are compromised. We can reduce the blast radius by defining a minimal role that grants only the specific actions necessary to verify snapshots without allowing deletions or alterations.

In some environments, using external monitoring systems like Datadog or Prometheus may not be feasible. Whether due to air-gapped infrastructure, compliance restrictions, or footprint concerns, having a hardened custom script with minimal privileges can be a reliable fallback.

To improve portability and maintainability, this article now references code and configuration files hosted in the elasticsearch-tools GitHub repository. This structure allows future updates to the tools without requiring edits to the article.

Minimal Elasticsearch Role

Here is a role that avoids using manage_snapshot to reduce exposure. This ensures a compromised API key cannot delete or tamper with existing backups:

View full role definition

{
  "snapshot_repo_readonly": {
    "cluster": [
      "cluster:admin/repository/get",
      "cluster:admin/repository/verify",
      "cluster:admin/snapshot/get",
      "cluster:admin/snapshot/status"
    ],
    "indices": [],
    "run_as": [],
    "metadata": {},
    "transient_metadata": {
      "enabled": true
    }
  }
}

API Key Generation

To generate an API key restricted to this role, use the following curl command. This allows access only to the approved cluster actions with a defined expiration period:

curl -u "elastic:${ELASTICPASS}" -X POST "localhost:9200/_security/api_key" \
  -H "Content-Type: application/json" \
  -d @elasticsearch-tools/roles/snapshot_repo_readonly.json

Snapshot Verifier Script

This API key can be used with a lightweight shell script that verifies the repository and emits Prometheus-compatible metrics. The script is secure by design and includes input validation, safe temporary file handling, and minimal permissions.

View the script

#!/bin/bash

########################################################################
# Hardened Snapshot Monitor for Elasticsearch
# Purpose: Verify an Elasticsearch snapshot repository and expose
# Prometheus-style metrics securely.
########################################################################

set -euo pipefail

: "${ES_HOST:="http://localhost:9200"}"
: "${REPO_NAME:?Missing REPO_NAME}"
: "${API_KEY_FILE:="/etc/elasticsearch/readonly-api-key"}"
: "${PROM_FILE:="/var/lib/node_exporter/textfile_collector/es_snapshot.prom"}"

if [[ ! -f "$API_KEY_FILE" ]]; then
  echo "[FATAL] API key file not found: $API_KEY_FILE" >&2
  exit 2
fi

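# Reject the key file if group or other have any access. A plain numeric
# comparison is not enough: mode 577 is "less than" 600 but world-readable.
perms=$(stat -c "%a" "$API_KEY_FILE")
if (( (8#$perms & 8#077) != 0 )); then
  echo "[FATAL] API key file permissions too permissive (group/other must have no access, e.g. 600)" >&2
  exit 3
fi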

API_KEY=$(<"$API_KEY_FILE")
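# Create the temp file next to the target so the final mv is an atomic rename.
TMP_PROM_FILE=$(mktemp "${PROM_FILE}.XXXXXX")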
safe_repo="${REPO_NAME//[^a-zA-Z0-9_]/_}"
timestamp=$(date +%s)

response=$(curl -fsSL --retry 3 --retry-delay 2 \
  -H "Authorization: ApiKey $API_KEY" \
  -H "Content-Type: application/json" \
  -X POST "$ES_HOST/_snapshot/$REPO_NAME/_verify" || true)

if jq -e '.nodes | length > 0' <<<"$response" >/dev/null 2>&1; then
  result=1
  status="ok"
else
  result=0
  status="failed"
fi

{
  echo "# HELP es_snapshot_repository_verified Success status of snapshot verification"
  echo "# TYPE es_snapshot_repository_verified gauge"
  echo "es_snapshot_repository_verified{repo=\"$safe_repo\"} $result"
  echo "# HELP es_snapshot_repository_verified_at Unix timestamp of last check"
  echo "# TYPE es_snapshot_repository_verified_at gauge"
  echo "es_snapshot_repository_verified_at{repo=\"$safe_repo\"} $timestamp"
} > "$TMP_PROM_FILE"

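# mktemp creates the file with mode 600; make it readable for the metrics collector.
chmod 644 "$TMP_PROM_FILE"
mv "$TMP_PROM_FILE" "$PROM_FILE"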
logger -t es-snapshot-monitor "[INFO] Verification $status for '$REPO_NAME' (code=$result)"

exit 0
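
The key-file permission check in the script is easiest to reason about as a bitmask test rather than a numeric comparison. The sketch below shows the idea in Python; `is_private` and `key_file_is_private` are hypothetical helper names, not part of the repository.

```python
import os
import stat


def is_private(mode: int) -> bool:
    """True when neither group nor other has any permission bit set."""
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def key_file_is_private(path: str) -> bool:
    """Apply the bitmask test to a file on disk."""
    return is_private(os.stat(path).st_mode)


# Mode 577 reads as "less than 600" when compared as a plain number,
# yet it grants full access to group and other.
for mode in (0o600, 0o400, 0o577, 0o640):
    print(f"{mode:o} private={is_private(mode)}")
```

Masking with the group and other bits catches modes that a numeric "greater than 600" test would wave through.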

Cron Example

A cron job can be configured to run the script regularly:

View example cron wrapper

#!/bin/bash

# Note: For production use, place this logic directly into a cron job or systemd timer.
#       This script is just an example for demonstration and testing.

# Fail fast on error
set -euo pipefail

# === Configuration ===
export REPO_NAME="my_backup_repo"
export ES_HOST="http://localhost:9200"
export API_KEY_FILE="/etc/elasticsearch/readonly-api-key"
export PROM_FILE="/var/lib/node_exporter/textfile_collector/es_snapshot_${REPO_NAME}.prom"

# === Invoke snapshot verification script ===
/opt/elasticsearch-tools/tools/snapshot-verifier/verify_snapshot.sh

Usage Instructions

To install, configure, and run the snapshot verification system, follow the documentation in the repository: View usage guide

This structure is ideal for environments with limited connectivity or strict compliance rules. It keeps the verification logic reproducible, auditable, and safe from privilege escalation risks. Updates to the tooling can be managed independently of the article, improving long-term maintainability.

Hardening Kubernetes Deployments

2025-04-19T00:00:00-07:00

Securing Kubernetes workloads isn't just about scanning images or tweaking RBAC; it's about enforcing the right guardrails at the pod level to minimize risk by default. This post shares field-tested strategies aligned with the Pod Security Standards (Restricted profile) to help you build safer, production-grade deployments.

Key Practices for Hardening Kubernetes Deployments

1. Run Containers as Non-Root

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000

This enforces that containers don’t run as UID 0, reducing the blast radius of any compromise.

2. Drop All Linux Capabilities

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE  # Only if your app needs it (e.g., for ports <1024)

Drop all capabilities by default, then add only what you need.

3. Disable Privilege Escalation

securityContext:
  allowPrivilegeEscalation: false

This prevents processes inside the container from gaining additional privileges, even if compromised.

4. Use Read-Only Filesystem

securityContext:
  readOnlyRootFilesystem: true

This blocks attackers from writing malicious files or installing tools inside the container.
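
If you want a quick sanity check for the four container-level settings covered so far, a small audit function is enough to get started. This is only a sketch under assumptions: `audit_security_context` is a hypothetical helper that takes an already-parsed securityContext as a dict, and it is no substitute for enforcing the Restricted profile in the cluster itself.

```python
from typing import List


def audit_security_context(sc: dict) -> List[str]:
    """Return human-readable findings for one container securityContext."""
    findings = []
    if sc.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot should be true")
    if sc.get("runAsUser") == 0:
        findings.append("runAsUser must not be 0")
    if sc.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation should be false")
    if sc.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem should be true")
    caps = sc.get("capabilities") or {}
    if "ALL" not in (caps.get("drop") or []):
        findings.append("capabilities.drop should include ALL")
    extra = set(caps.get("add") or []) - {"NET_BIND_SERVICE"}
    if extra:
        findings.append("unexpected added capabilities: " + ", ".join(sorted(extra)))
    return findings


hardened = {
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
}
print(audit_security_context(hardened))  # an empty list means no findings
```

Note that the checks treat a missing field as a finding, which matches the "secure by default" spirit: a setting you did not write down is a setting you are not enforcing.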

5. Avoid Host Access

hostNetwork: false
hostPID: false
hostIPC: false

Avoid hostPath volumes unless absolutely required. These settings ensure your workloads remain isolated from the host.

6. Use Trusted Images and Scan Them

Use minimal base images (Alpine, Distroless) and trusted registries. Always scan them:

trivy image your-registry/app:tag

This helps catch known CVEs before deployment.

7. Handle Secrets via Volumes (Not Env Vars)

volumes:
  - name: secret-volume
    secret:
      secretName: my-secret

containers:
  - name: myapp
    volumeMounts:
      - name: secret-volume
        mountPath: "/etc/secret"
        readOnly: true

Mounting secrets as volumes avoids accidental exposure via logs or /proc.

8. Restrict Network Traffic with NetworkPolicies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress

Start with a default-deny policy per namespace, then explicitly allow only the traffic your services need. Without NetworkPolicies, any pod can communicate with any other pod in the cluster.

9. Harden ServiceAccount Usage

automountServiceAccountToken: false

Disable automatic token mounting for pods that don’t need API server access. Create dedicated ServiceAccounts with minimal RBAC bindings rather than relying on the default account, which often accumulates unnecessary permissions.

Final Thoughts

Security isn’t just about tools; it’s about secure defaults. These practices help harden your Kubernetes workloads using the Restricted Pod Security Standard and reduce risks across the board.

If you’re managing production clusters or sensitive environments, these changes are low-hanging fruit with a high return on security posture.

Taming the OOM Killer: Process Prioritization for Memory-Constrained Linux Systems

2025-04-18T00:00:00-07:00

In resource-constrained environments — especially virtual private servers, CI agents, and container hosts — the Linux kernel's Out of Memory Killer (OOM Killer) is a last-resort defense mechanism. When memory is exhausted, it begins terminating processes to keep the system alive.

The OOM Killer uses heuristics (like memory usage and the oom_score_adj value) to select processes it deems less essential. But you don’t have to leave that critical decision entirely to the kernel's default logic.

The Incident

Years ago, I had to recover a VPS via remote console. A quick dive into /var/log/messages showed that the OOM Killer had struck, terminating critical services. The culprit? A perfect storm:

Web crawlers (Google, Yahoo, Yandex) simultaneously indexing multiple sites
A torrent tracker and download script both running
IRC flood attempts while irssi was connected

This combination overwhelmed system memory. Without process priority tuning, the OOM Killer picked its targets using its default heuristics, which felt indiscriminate from an operational point of view: it even took down sshd.

The Mitigation Strategy

You can significantly influence OOM Killer decisions using the /proc/<pid>/oom_score_adj setting for a process. This value ranges from -1000 to +1000. The kernel uses this score, combined with memory usage, to decide kill priority; a lower score makes the process less likely to be chosen relative to others.

A value of -1000 effectively disables OOM killing for that process.
A value of +1000 makes it a highly preferred target.
0 is the default.
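
To see the mechanics in isolation, here is a minimal Python sketch that reads and writes the value for a single PID. The helper names are hypothetical, and the `proc_root` parameter is an assumption added purely so the functions can be tried against a scratch directory instead of the real /proc.

```python
from pathlib import Path

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000


def validate_oom_score_adj(value: int) -> int:
    """Reject values outside the range the kernel accepts."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be within -1000..1000, got {value}")
    return value


def read_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Return the current adjustment for a process."""
    return int(Path(proc_root, str(pid), "oom_score_adj").read_text().strip())


def write_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Set the adjustment; lowering it on a real system requires privileges."""
    path = Path(proc_root, str(pid), "oom_score_adj")
    path.write_text(f"{validate_oom_score_adj(value)}\n")
```

On a live system, `read_oom_score_adj(os.getpid())` is a harmless way to confirm what a process inherited from its parent.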

Here’s a script that reads preferences from a config file and adjusts running process scores accordingly.

`/etc/oom_candidates.conf`

# Format: <process_name> <oom_score_adj_value>
# Higher = more likely to be killed. Negative = more protected.
# Critical Services (Protect Strongly)
sshd -1000
mysqld -500
portsentry -200

# Important Services (Protect Moderately)
apache2 -100

# Less Critical Interactive/Background (Allow Killing)
screen 300
irssi 400

`oom_adjuster.sh`

#!/bin/bash
CONFIG="/etc/oom_candidates.conf"

if [[ ! -f "$CONFIG" ]]; then
  echo "Error: Config file $CONFIG not found." >&2
  exit 1
fi

while IFS= read -r line || [[ -n "$line" ]]; do
  [[ "$line" =~ ^#.*$ || -z "$line" ]] && continue
  read -r process score <<< "$line"

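  if [[ -z "$process" || -z "$score" ]]; then
    echo "Warning: Skipping invalid line: $line" >&2
    continue
  fi

  # oom_score_adj only accepts integers from -1000 to 1000.
  if [[ ! "$score" =~ ^-?(0|[1-9][0-9]{0,3})$ ]] || (( score < -1000 || score > 1000 )); then
    echo "Warning: Skipping invalid score on line: $line" >&2
    continue
  fi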

  pids=$(pgrep -x "$process")
  if [[ -z "$pids" ]]; then
    continue
  fi

  echo "Adjusting OOM score for $process (PIDs: $pids) to $score"
  for pid in $pids; do
    if [[ -w "/proc/$pid/oom_score_adj" ]]; then
      # The process may exit between pgrep and the write, so check the result.
      if ! echo "$score" > "/proc/$pid/oom_score_adj" 2>/dev/null; then
        echo "Warning: Failed to set score for $process (PID: $pid)" >&2
      fi
    else
      echo "Warning: Cannot write to oom_score_adj for $process (PID: $pid)" >&2
    fi
  done
done < "$CONFIG"

echo "OOM score adjustment complete."

Running the Script

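You can run this periodically via cron or on boot with systemd. Because the script only adjusts processes that are already running, a single run at boot will miss services that start or restart later, so a periodic run is the safer choice. For example: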

`/etc/systemd/system/oom-adjuster.service`

[Unit]
Description=Adjust OOM Scores from config file
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/oom_adjuster.sh

[Install]
WantedBy=multi-user.target

Then run:

sudo systemctl daemon-reload
sudo systemctl enable --now oom-adjuster.service

Security Considerations

From a DevSecOps perspective, OOM prioritization is not just about uptime — it’s a security hardening technique:

SSHD protection prevents lockouts during memory exhaustion.
Preserving portsentry or IDS processes ensures defense mechanisms remain active.
Avoiding the kill of logging/monitoring agents helps retain forensic data post-incident.
Minimizing risk of service flapping reduces noisy alerts and potential abuse vectors during DoS scenarios.

Misconfigured systems where critical daemons (like firewall managers, auditd, sshd, or VPN tunnels) are killed first expose themselves to avoidable downtime and security gaps.

Modern Use Cases

Kubernetes nodes: Influence OOM behavior via Quality of Service (QoS) classes (set by defining resource requests/limits in pod specs), or apply node-level tuning using methods like the script above for critical node components (e.g., kubelet, container runtime).
CI/CD runners: Protect build agents or essential runner services from being killed during resource-intensive test suites or concurrent builds.
Shared hosting / VPS: Prioritize core services (web server, database, SSH) over potentially less critical user processes or background tasks.

Conclusion

The OOM Killer is an essential part of the Linux kernel, but leaving process termination order purely to default heuristics can be risky in production. By strategically assigning oom_score_adj values based on business continuity and security priorities, you can significantly reduce recovery time and harden your systems against memory pressure scenarios.

How does your team manage OOM Killer behavior in critical environments? Share your strategies!

Originally inspired by a real-world VPS recovery and refreshed for the modern DevSecOps landscape.

Catching a Nation-State Proxy: OSINT Lessons from the Twitter Frontlines

2025-04-17T00:00:00-07:00

Situation

In the lead-up to Venezuela’s 2012 regional elections, I observed unusual behavior around Twitter access within the country. What began as anecdotal reports of DNS outages evolved into a deeper investigation that revealed a state-aligned proxy infrastructure potentially capable of phishing Twitter credentials.

Key Finding

A subdomain under chavezcandanga.org.ve, a domain named after @chavezcandanga, the official Twitter handle of then-President Hugo Chávez, was hosting a transparent proxy to Twitter. A transparent proxy relays user traffic without modifying requests or requiring client configuration, which makes it well suited to passive surveillance or phishing.

While it initially showed no malicious behavior, it was:

Hosted on IP addresses outside of Twitter’s ranges
Registered to Venezuela's ruling party (PSUV, Partido Socialista Unido de Venezuela)
Promoted through state-controlled media and bot accounts
Served from the same IP as a political messaging app

OSINT Breakdown

1. DNS Resolution

host twitter.com

Returned expected Twitter IPs (199.59.x.x), but users in Venezuela were silently being redirected to:

190.202.80.20

This IP served Twitter content but was not operated by Twitter Inc.

It’s unclear whether this redirection was caused by ISP DNS override, local resolver poisoning, or upstream hijack — but the net effect was consistent: Twitter domains were silently redirected to non-Twitter infrastructure under state control.
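
The core of this check, comparing what a resolver returns against the ranges you expect, is easy to automate. Below is a minimal sketch using Python's ipaddress module. The `199.59.0.0/16` network is my loose rendering of the "199.59.x.x" range mentioned above, not an authoritative list of Twitter's allocations, and the first sample address is purely illustrative.

```python
import ipaddress
from typing import Iterable, List


def unexpected_addresses(resolved: Iterable[str],
                         expected_networks: Iterable[str]) -> List[str]:
    """Return the resolved IPs that fall outside every expected network."""
    networks = [ipaddress.ip_network(net) for net in expected_networks]
    return [ip for ip in resolved
            if not any(ipaddress.ip_address(ip) in net for net in networks)]


expected = ["199.59.0.0/16"]                  # assumption derived from "199.59.x.x"
observed = ["199.59.148.10", "190.202.80.20"]  # first entry is illustrative
print(unexpected_addresses(observed, expected))
```

In practice you would feed this with answers collected from several vantage points (local resolver, public resolver, a host in another country) and alert when they disagree.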

2. WHOIS and Hosting Clues

whois chavezcandanga.org.ve

Revealed that the domain was registered to PSUV (Partido Socialista Unido de Venezuela) and managed through CONATEL — Venezuela’s FCC-equivalent telecommunications regulator.

Figure: WHOIS lookup confirms chavezcandanga.org.ve is registered to PSUV, with administrative and technical contacts using @psuv.org.ve emails.

3. Application Infrastructure

The same server IP hosted:

mensajes.chavezcandanga.org.ve – a campaign messaging platform
A proxy script that mirrored Twitter’s login screen

Figure: The official chavezcandanga.org.ve campaign app asks users to authenticate with Twitter to enable automatic retweets of Chávez's posts.

At the time of discovery, this site did not contain malicious code, but the potential for credential harvesting during peak election activity was substantial. The authentication flow mimicked Twitter’s branding and prompted users to log in — creating a window for silent credential capture, token misuse, or targeted amplification based on follower behavior.

Threat Model & Implications

Credential Harvesting Risk: Even without malware, a proxy to Twitter login enables password theft.
Social Media Control: Through automated bots, the government amplified its message while monitoring access points.
Authentication-layer surveillance: Intercepting Twitter logins enables password theft, identity tracking, or selective disinformation at the user level.
Infrastructure trust erosion: Even minor state-level interference with DNS or TLS undermines confidence in web authentication across the board.
Evasion of International Scrutiny: Because the proxy mimicked Twitter directly, users could be deceived into trusting infrastructure that was under state control.

Lessons for DevSecOps & Threat Intelligence Today

Verify SSL and domain trust chains during high-risk periods like elections.
Use host, whois, and passive DNS to correlate domains and IP ranges. Modern tools like amass and certificate transparency logs expand this capability significantly.
Query infrastructure databases (Shodan, Censys) for historical records on suspicious IPs and exposed services.
Watch for content delivery mismatches (site appears normal, IP is not).
Document and archive suspicious infra using tools like the Wayback Machine.
Phishing infrastructure can be state-sponsored and subtle — early detection matters.

Epilogue

The proxy remained active until at least December 2012, shortly before the regional elections. To this day, the archived proxy content and WHOIS records serve as a warning about the ease with which social media can be co-opted in hostile environments.

This investigation was one of the earliest times I realized how fragile trusted infrastructure becomes in the hands of a motivated actor — and how critical open-source techniques are in defending it.

Have you ever spotted unusual network redirections or infrastructure anomalies? What tools or tactics helped you confirm your suspicions?

Originally published in 2012 and revisited in 2025 to reflect current DevSecOps and threat intelligence practices.

The 208.5-Day Kernel Bug: A Lesson in Uptime, Overflow, and Operational Risk

2025-04-16T00:00:00-07:00

In 2012, a subtle but potentially catastrophic bug was discovered in older versions of the Linux kernel — particularly affecting Red Hat Enterprise Linux (RHEL) and its derivatives. Once a system reached 208.5 days of continuous uptime, a flaw in the kernel’s sched_clock() function could trigger a soft lockup, freezing the CPU for an estimated 584 years.

Yes, 584 years.

The root cause? An unsigned 64-bit integer overflow. The kernel attempted to compute elapsed nanoseconds based on CPU cycles, using this logic:

/* Simplified representation of the overflow-prone calculation */
int cpu = smp_processor_id();
unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
ns += cyc * per_cpu(cyc2ns, cpu) >> CYC2NS_SCALE_FACTOR;
return ns;

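The multiplication is the weak point: once the intermediate product of cyc and cyc2ns exceeded 0xffffffffffffffff, it wrapped around to a small value. With the 10-bit scale factor used on x86, that happens after 2^54 nanoseconds, or about 208.5 days of uptime. Unsigned wraparound is well defined in C, but the resulting jump in the clock left the scheduler in an inconsistent state that required a manual reboot.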
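
The numbers in this story fall straight out of the arithmetic. Here is a short Python sketch that reproduces them. It assumes a scale factor of 10 bits, which as far as I know was the value of CYC2NS_SCALE_FACTOR on x86 at the time, and it emulates 64-bit unsigned math with a mask because Python integers do not overflow.

```python
MASK64 = (1 << 64) - 1
CYC2NS_SCALE_FACTOR = 10  # assumed: the kernel's 2**10 fixed-point scale


def cycles_to_ns_wrapping(cyc: int, mult: int) -> int:
    """Emulate the kernel expression (cyc * mult) >> scale in 64-bit unsigned math."""
    return ((cyc * mult) & MASK64) >> CYC2NS_SCALE_FACTOR


# The product overflows once the true nanosecond count reaches 2**64 >> 10 = 2**54.
overflow_ns = (1 << 64) >> CYC2NS_SCALE_FACTOR
overflow_days = overflow_ns / 1e9 / 86400

# A full 64-bit nanosecond counter spans roughly 584 years.
wrap_years = (1 << 64) / 1e9 / (365.25 * 86400)

print(f"overflow after {overflow_days:.1f} days")
print(f"64-bit nanosecond range is about {wrap_years:.0f} years")
```

With a multiplier of 1024 (one nanosecond per cycle), `cycles_to_ns_wrapping(2**54 - 1, 1024)` still returns the correct value, while one more cycle wraps the result to zero.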

Why This Matters to DevSecOps

This bug is more than a curiosity — it's a classic case study in:

The operational danger of long uptimes
Why kernel patching should be automated and observable
How integer overflows can lead to severe availability risks

Affected systems included RHEL 5.0 through 5.5 and early RHEL 6 versions running kernels below 2.6.32-220.4.*. Some Debian-based distributions were likely impacted, though documentation was less complete.

Takeaways for Modern Systems

Live patching tools like Ksplice, KernelCare, and kpatch can reduce reboot pressure
Observability stacks should alert on uptime thresholds and kernel messages (dmesg, uptime, scheduler warnings)
Compliance frameworks often require timely OS patching — this bug illustrates why
CI/CD pipelines for OS-level components should test for edge cases, including time-based and overflow scenarios
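
As a starting point for the uptime alerting idea above, here is a minimal sketch that parses the contents of /proc/uptime and flags hosts past a threshold. The function names and the 180-day default are hypothetical choices for illustration; pick a threshold that fits your own patch cadence.

```python
def uptime_days(proc_uptime_text: str) -> float:
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / 86400


def uptime_alert(days: float, warn_at_days: float = 180.0) -> bool:
    """Return True when a host has been up longer than the chosen threshold."""
    return days >= warn_at_days


sample = "18014400.00 1000.00"  # illustrative /proc/uptime contents
days = uptime_days(sample)
print(f"{days:.1f} days up, alert={uptime_alert(days)}")
```

On a real host you would read the text with `open("/proc/uptime").read()` and export the result as a gauge alongside the running kernel version.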

Even today, this incident reminds us that uptime isn't always a badge of honor. In some cases, it's a quiet countdown to failure.

Originally inspired by a 2012 analysis of the sched_clock() bug affecting Linux systems with prolonged uptime.

The Chaos of the Leap Second (2012): When Time Broke Java and the Cloud

2025-04-15T00:00:00-07:00

What Happened?

On June 30, 2012, a leap second was inserted into UTC and distributed to servers via NTP, keeping civil time aligned with Earth’s rotation. At 23:59:60 UTC, global systems experienced a hiccup: a single extra second that caused widespread disruptions across Reddit, LinkedIn, Yelp, Google, FourSquare, and many more.

What followed were 500 errors, high latency, and CPU usage spikes that crippled backend services.

Why Did It Break?

Though seemingly minor, the leap second broke systems in subtle and severe ways:

Java Runtime Sensitivity: Popular JVM versions at the time failed to handle the repeated 23:59:59 correctly. This triggered runaway CPU usage via thread timing bugs, particularly in services running Hadoop, Cassandra, and Elasticsearch.
Userland Misbehavior: While many Linux kernels handled the leap second without panic, userland libraries and runtimes (especially Java) choked under non-monotonic time changes.
Cloud Weakness Exposure: An Amazon EC2 outage the day before had already left infrastructure strained. With fewer available instances, systems were more vulnerable when the leap second hit.
Limited Real-World Testing: Simulating leap seconds under actual load, in full-stack distributed systems, proved nearly impossible. Pre-patch validations missed edge behavior.
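
To make the "repeated second" concrete, here is a small simulation, not a reproduction of the actual JVM or kernel behavior. It assumes a simplified model in which the wall clock steps back by exactly one second at a chosen boundary while a monotonic clock keeps counting; the function names and sample values are invented for illustration.

```python
from typing import List, Tuple


def sample_clocks(n_samples: int, step: float, leap_boundary: float) -> List[Tuple[float, float]]:
    """Return (monotonic, wall) samples taken every `step` seconds.

    Simplified model: once the monotonic clock reaches `leap_boundary`,
    the wall clock is stepped back one second, so one second repeats.
    """
    samples = []
    for i in range(n_samples):
        mono = i * step
        wall = mono if mono < leap_boundary else mono - 1.0
        samples.append((mono, wall))
    return samples


def deltas(values: List[float]) -> List[float]:
    """Differences between consecutive readings."""
    return [b - a for a, b in zip(values, values[1:])]


samples = sample_clocks(n_samples=7, step=0.5, leap_boundary=2.0)
mono_deltas = deltas([mono for mono, _ in samples])
wall_deltas = deltas([wall for _, wall in samples])

# Monotonic deltas are always +0.5; the wall clock shows one -0.5 step.
print("monotonic:", mono_deltas)
print("wall     :", wall_deltas)
```

Any code that computes a timeout or elapsed time from the wall-clock column sees a negative interval at the step, which is the kind of input that timing logic rarely anticipates.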

Real-World Impact

This bug hit a large share of the high-scale Java-based systems running at the time:

Stacks: Cassandra, Hadoop, Elasticsearch, JVM-based schedulers
Companies: Reddit, Mozilla, Yelp, LinkedIn, Gawker, Facebook, StumbleUpon
Behavior: High CPU loops, 500 errors, frozen services, delayed recovery due to restart complexity

In many cases, the kernel didn't fail — the chaos came from how services processed time at runtime.

Mitigation & Takeaways

Immediate Fixes in 2012

Rolling Restarts: Restarting affected Java services often cleared the CPU lock-up, though distributed services like Cassandra made this time-consuming.
Manual Clock Reset: Some environments required forcibly resetting system time post-leap second. This fix was often applied via config tools like Puppet:

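# CAUTION: Only use in environments that can tolerate a manual time reset.
# Stop NTP first so it does not fight the manual adjustment.
sudo /etc/init.d/ntp stop
date
# Re-set the clock to its current value (format MMDDhhmmCCYY.ss); setting the time requires root.
sudo date "$(date +'%m%d%H%M%C%y.%S')"
date
sudo /etc/init.d/ntp start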

Modern Resilience Strategies

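Leap Smearing: Modern time infrastructure can “smear” the leap second, slowly adjusting clocks over several hours so that no step ever occurs. chronyd and recent ntpd releases support this when configured, and major cloud providers smear on their own time services.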
Use Monotonic Clocks: Time-sensitive logic should rely on CLOCK_MONOTONIC, not wall-clock time, to measure durations safely.

Figure: Monotonic time continues uninterrupted while wall-clock time repeats a second — highlighting why monotonic clocks are preferred for duration tracking.

Monitor Clock Drift: Observability pipelines should expose clock sync state and NTP drift as first-class metrics.
Design for Temporal Anomalies: Distributed systems should assume wall-clock time can regress, freeze, or desync — and gracefully degrade when it does.
Simulated Testing Isn’t Enough: Always combine synthetic load with chaos testing under unusual real-world conditions (e.g., leap seconds, DNS failures, NTP skew).
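
The monotonic-clock advice above can be sketched as follows. This is a minimal example, assuming Python's time.monotonic() as the clock source (on Linux it is typically backed by CLOCK_MONOTONIC); the Deadline class and its method names are hypothetical, written only for this illustration.

```python
import time
from typing import Callable


class Deadline:
    """Track a timeout using a monotonic clock instead of wall-clock time.

    `clock` is injectable so tests can supply a fake time source; the
    default, time.monotonic, is unaffected by wall-clock steps.
    """

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0


# Demonstration with a fake clock so the result is deterministic.
fake_now = [100.0]
deadline = Deadline(5.0, clock=lambda: fake_now[0])
fake_now[0] = 103.0
remaining = deadline.remaining()  # 2.0 with the fake clock above
```

In production code the default clock would be used, for example Deadline(0.5) around a retry loop. Because the deadline never consults wall-clock time, a leap second, an NTP step, or a manual date change cannot shorten or extend it.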

Epilogue

The 2012 leap second chaos wasn’t caused by incompetence — many teams patched, prepared, and tested. But the leap second hit during degraded cloud capacity, exposed fragile JVM behavior, and stressed assumptions in time-sensitive code.

A single second exposed fault lines in the foundations of the modern internet.

In November 2022, the General Conference on Weights and Measures (CGPM) voted to phase out leap seconds by 2035, a decision motivated in part by incidents like this one. Until then, the mitigations above remain essential for any system that touches wall-clock time.

What other “just time” failures have caught you off guard in production? Let’s share war stories.

(Originally published in 2012. Revisited and revised in 2025 for modern SREs, DevOps, and distributed systems engineers.)
