2024-04-22 - Hub Incident Post Mortem
Beginning on Monday, April 22nd, 2024, at 8:45, the Hugging Face Hub experienced an extended outage. The incident lasted until Tuesday, April 24th, at 10:03 when service was fully restored. During the incident, the Hugging Face website was slowed down or inaccessible for most users.
This blog post describes the timeline of the event, what failed, and the changes we made to ensure that such an event does not happen again.
All timestamps/dates are in CET unless explicitly stated otherwise
Context
Here is a simple diagram showing the architecture of the Hugging Face Hub in production:
And some links to the relevant bricks:
Summary
22/04 at 8:45, instabilities started, escalating until 22/04 at 9:30 where the hub became unresponsive.
Two read only nodes were down, as we can see in the Atlas metrics; they kept restarting:
The cluster autoscaled once, but we then had to manually scale to the next tier, which put the cluster back in a somewhat normal state. From the support case:
At 04/22/24 - 07:04:24 AM UTC the cluster tier starting for increasing the cluster from tier 1 to tier 2 and the cluster was upgraded and came into the HEALTHY state 04/22/24 - 07:33:52 AM.
another instance of manual auto-scaling was triggered at 04/22/24 - 12:46:48 PM UTC to bring the cluster on the tier 3.
We noticed a spike in traffic coming from an internal source using a Java/17
user-agent, so we added a firewall rule at 22/04 10:22 to block all requests with that UA, fearing this was the cause. While it alleviated the load, this didn’t fix the underlying issue.
22/04 at 10:44, we blocked the quicksearch feature on the hub.
The hub was still unstable although the database seemed healthy, so we attempted to scale the nodes of the hub’s k8s cluster, given we had a configuration issue and it had not been done automatically.
22/04 at 14:26, tweeted about incident
22/04 at 14:50 Switched to maintenance mode and displayed a 503 page. This stopped traffic in an attempt to alleviate load and return to a nominal state.
22/04 at 15:00, downscaled all the hub’s pods.
22/04 at 15:51, everything was back online.
23/04 at 21:30, hub down
This time the number of connections and the system memory drastically increased, then the primary OOMed. Then, the cluster changed its primary node but the same thing happened and the cluster elected the previous primary as the new primary.
23/04 at 21:56, we decided to switch the hub to maintenance mode and block all incoming traffic. That reduced the load on the database and the cluster returned to its nominal state.
24/04 at 8:00, hub was slow / unresponsive again. We could see the same pattern as the previous night occur.
24/04 at 8:12 hub was scaled to 0 and a maintenance page was set. Around 24/04 at 8:18 we lifted the maintenance page as everything was stable and everything went back to normal.
24/04 at 10:03, a suspicious metrics pointed at hub going down again, so we decided to set an aggressive rate limit which made every metric go back to normal and the hub did not go down.
Then traffic remained stable as we were cleaning up old unused indexes, gathering more metrics about the database. The following tools & resources proved to be very useful:
- keyhole
- https://github.com/simagix/keyhole/wiki/
- hatchet
- Peek at your MongoDB clusters like a pro with keyhole part 1
- Peek at your MongoDB clusters like a pro with keyhole part 2
- https://www.mongodb.com/community/forums/t/how-to-analyze-mongodb-ram-usage/12108/3
Root Cause Identification
We had been through all charts and graphs to try to see if something came up but we couldn’t find anything. In a last effort attempt, we decided we needed a graph for the breakdown of the number of requests per url
called, built it and when looking at the 2nd crash on 23-04-2024T21:30:00, we observed a suspiciously timed request:
When calling https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2/revision/main, we could observe that there was a list of 300+ associated spaces with this model.
What we had been observing when going through all the logs & metrics of the Hub’s mongodb database is that *_info_cache
collections were overused: poor average latency, very high network out.
Note here that response times are biased because of the latencies observed during the outage. Also, both screenshots come from different mongodb replicas.
This prompted us to think that this call was in fact using space_info_cache
(which it was) and may be the cause for an excessive usage of the database. So we decided to investigate further: we isolated that specific requested URL, and the increase in traffic to that endpoint correlated perfectly with an increase in database network out usage, which had been a precursor in some of the crashes.
We finally decided to attempt to reproduce the problem by running the following k6 script:
import http from "k6/http";
export const options = {
vus: 512,
duration: '10m',
};
export default function() {
const response = http.get("https://huggingface.co/api/models/sentence-transformers/all-mpnet-base-v2/revision/main");
}
$ k6 run kill_hub.js
Which resulted in an increase in all aforementioned metrics and when attempting to load huggingface.co we observed either slowness or total unresponsiveness. So we stopped running the script to avoid breaking the mongodb cluster. That was it!
Note: we ran the script 3 times, once at 24/04 18:05, 18:15 and lastly ~18:45
From our observations, the first crash was not caused by a call to this URL, and when looking at the breakdown per URL at the time of the incident, we can see other suspicious looking queries:
This call saw a big increase in requests right before the incident, and given it has some of the properties of the call analyzed above (it uses datasets_info_cache
, one of the 3 collections that put a significant strain on the database), we can conclude that this was the trigger for the incident.
We had trouble reproducing with the call to https://huggingface.co/api/datasets/lighteval/mmlu/revision/main given that the next morning a lot of work had been done by the hub team to mitigate the root cause.
Lastly, on the 23/04 evening crash, we saw an increase in requests to the same endpoint:
These very resource intensive requests put the database in a degraded state. What made things worse is that our Hub’s NodeJS code didn’t cancel requests if our users cancelled theirs. Which caused a thundering herd issue, where all already ongoing db request would continue to run until completion, even if no one was listening at the other end. For example, when a page was slow to respond, and I hard refreshed my page, all ongoing connection would keep running.
So — with back of the envelope numbers — if we had 10 db requests and 50k users trying to do something on a Monday morning, and they refreshed their page 3 times after waiting 30s, mongodb would have 2 * 10 * 50000 more requests than usual at that time. Meaning that if our DB was not in ship-shape, i.e. someone hit it real hard, things would snowball, ultimately leading to a crash.
Mitigation / Solution
To prevent any other snowball effect, we’ve added lots of timeouts to requests based on their usual completion time. We’ve also searched and optimised all requests that took too long to respond or responded with a big body during the events.
We won't go over the details of what was fixed as there's quite a large number of PRs that went through! But tl;dr, poorly designed database queries were reworked.
Long Term Solution
- Moving
*_info_cache
to Redis. - Making it possible to abort any db request if a client aborts theirs.
What went well / wrong ?
The Good
- Improved our understanding of the system, which enabled us to anticipate failures and address them proactively before any downtime
- We cleaned up the database, amongst other things we removed old indexes, improving efficiency
- Our in-depth inspection of the database revealed architectural flaws and bottlenecks, such as a collection with large unused indexes
- New dashboards and monitoring metrics were built during the incident, which will be valuable for future monitoring
- We learned which metrics need closer monitoring to anticipate and prevent system crashes
- Lots of team members were involved and got their hands dirty to fix things asap
The Bad
- Extended downtimes
- It took 2-3 days to identify the root cause
- We had unrelated simultaneous system failures, increasing the stress and difficulty in restoring normal operations
Timeline
timestamp | event |
---|---|
22/04 8:45 | Instabilities start |
22/04 9:04 | autoscale tier 1 → tier 2 |
22/04 9:30 | Hub unresponsive |
22/04 9:33 | mongodb cluster is in HEALTHY state |
22/04 10:22 | Add firewall rule to block all traffic with Java/* user agent |
22/04 10:44 | Disable quicksearch feature on the hub |
22/04 14:26 | public communication about the outage |
22/04 14:46 | manual scale tier 2 → tier 3 |
22/04 14:50 | enable maintenance mode with WAF |
22/04 15:00 | Downscale all hub pods to 0 |
22/04 15:51 | Everything is back to normal |
-- | -- |
23/04 21:30 | hub down |
23/04 21:56 | maintenance mode + block all traffic |
23/04 22:19 | maintenance is lifted and things go back to normal |
-- | -- |
24/04 8:00 | Hub unresponsive |
24/04 8:12 | Maintenance page + downscaling all hub’s pods |
24/04 8:18 | lift maintenance & things are back to normal |
24/04 10:03-10:13 | sudden spikes in metrics, aggressive rate limit is set and things go back to normal |