Surviving the Thundering Herd: Strategies to Prevent Cache Stampede

I used to treat Caching like a magic “Turbo Button.”

If the database was slow, I’d throw Redis in front of it. If traffic increased, I’d bump up the TTL (Time To Live). My mental model was simple: Cache = Speed.

Then came the “Leaderboard Incident.”

We were launching a gamified feature where users could see a live global leaderboard. The query to build this leaderboard was heavy—it had to aggregate points from three different tables, sort them, and return the top 100. It took the database about 500ms to execute.

“No problem,” I said. “We’ll cache it for 60 seconds. The database only has to run that heavy query once per minute. Easy.”

We deployed. Traffic ramped up to 5,000 requests per second.

For the first 59 seconds, everything was beautiful. The API responded in 2ms (serving from Redis). The database was sleeping at 1% CPU. I felt like a genius.

Then the clock hit T+60 seconds.

In the span of a literal blink, the API latency shot up to 15 seconds. The database CPU pinned to 100%. The connection pool exhausted. The entire platform locked up.

I had accidentally built a Cache Stampede.

The Math of the Crash

Here is the math I ignored.

When you set a TTL of 60 seconds, you are making a promise: “At second 60, this data is garbage.”

At 60.001 seconds, the cache key expires. It is gone.

At that exact moment, we have 5,000 users hitting the endpoint per second.

Request #1 comes in. Cache miss. It asks the DB to compute the leaderboard.
The DB takes 500ms to finish that computation.
The Problem: During that 500ms “computation window,” another 2,500 requests arrive.
All 2,500 requests check Redis. They all see a MISS.
Because I didn’t have any locking mechanism, all 2,500 requests turned around and punched the database with the exact same heavy aggregation query simultaneously.

My database could handle 1 query per minute. It definitely couldn’t handle 2,500 concurrent heavy queries in half a second.

This is the Thundering Herd problem. A 99.9% cache hit ratio means nothing if the 0.1% miss happens all at once.

The Mental Shift: Cache as a Shield

The lesson I learned that day is that Caching isn’t just about speed. Caching is a Shield.

Your database is fragile. It is designed for correctness, not raw throughput. Redis is the shield that prevents the internet from hugging your database to death.

When a cache stampede happens, you have dropped the shield exactly when the enemy is firing the heaviest volley.

The Solution: Logical Expiry (Stale-While-Revalidate)

The fix is to stop relying on Redis’s built-in TTL to “kill” the data. Instead, we use Logical Expiry.

We keep the data in Redis forever (or for a very long time), but we embed a timestamp inside the data that tells us if it’s “fresh.”

Physical TTL (Redis): Set to 1 hour. (This ensures data is always available).
Logical TTL (App): Set to 1 minute. (This tells us when to refresh).

When a user asks for data at T+61s, we see that it is logically stale. But instead of deleting it and making the user wait, we return the stale data immediately and trigger a background refresh.

Step 1: The Data Structure

Instead of just storing the raw value, we wrap it in an object containing the logical_expiry.

JSON

// Redis Key: "leaderboard:global"
{
  "value": [ ... top 100 users ... ],
  
  // We want to refresh this every 60 seconds.
  // So when writing, we set this to (Date.now() + 60000)
  "logical_expiry": 1735660000 
}

Step 2: The Implementation

This is the critical part. If 5,000 users see the data is stale, you don’t want 5,000 background jobs refreshing it. You need a Mutex Lock to ensure only ONE request triggers the refresh.

Here is the robust pattern:

JavaScript

async function getLeaderboard() {
  const cacheKey = "leaderboard:global";
  const lockKey = "lock:leaderboard:global";
  
  // 1. Fetch from Cache
  // We rely on the Physical TTL (1 hour) to keep this alive.
  const cachedString = await redis.get(cacheKey);
 
  // 2. Handle Hard Miss
  // This only happens on the very first boot or if Redis crashes.
  if (!cachedString) {
    return await fetchFromDbAndUpdateCache(cacheKey);
  }
 
  const cachedData = JSON.parse(cachedString);
  const now = Date.now();
 
  // 3. Check Soft Expiry (Is the data stale?)
  // We check if the current time has passed the logical timestamp we saved.
  if (now > cachedData.logical_expiry) {
    
    // 4. ATTEMPT NON-BLOCKING LOCK (The Mutex)
    // We use SET key value NX (Not Exists) EX 10 (Expire 10s)
    // This atomic command returns 'OK' to ONLY ONE request. 
    // Everyone else gets null.
    const acquiredLock = await redis.set(lockKey, "1", "NX", "EX", 10);
 
    if (acquiredLock) {
      // 5. The Chosen One
      // I am the only request allowed to contact the database.
      // I do this in the background (no await) so I don't slow down the response.
      console.log("I am the chosen one. Refreshing cache...");
      
      // Fire and forget - do not await!
      fetchFromDbAndUpdateCache(cacheKey).catch(console.error).finally(() => {
          // Cleanup: Delete lock early so others can refresh next time
          redis.del(lockKey); 
      });
    }
    
    // 6. EVERYONE (Including the chosen one) returns the STALE data.
    // Zero latency spike. The user never knows the DB is working hard.
  }
 
  return cachedData.value;
}

Why this logic wins

Zero Waits: Users never wait for the database update. They get the stale data instantly.
No Stampede: The SETNX lock ensures that even if 5,000 requests hit the stale data, only one request triggers the database query.
Self-Healing: If the background process crashes, the lock expires automatically in 10 seconds, and the next request tries again.

Summary

If you are designing a high-traffic system, set(key, value, ttl) is not enough. You are simply scheduling your downtime for exactly TTL seconds in the future.

Design the “Miss Path” first. Ask yourself: “If this key vanishes right now, will my database survive the next 500 milliseconds?”

If the answer is no, you need a Stampede strategy.

← Back to index

CodeTerra

Explorer