@m-ric on Hugging Face: "🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds…"

🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

🤔 Isn't 42% too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

🔄 Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
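
To make the two steps concrete, here is a minimal sketch of a filter-then-rank pipeline. It is not Meta's implementation: the `Change` fields, the scoring rules in `heuristic_filter`, and the `keyword_overlap` scorer are all hypothetical stand-ins. In a real system the `score_fn` argument would wrap a call to the fine-tuned model, and the heuristics would also use signals like runtime graphs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    # Hypothetical fields, chosen only for illustration.
    change_id: str
    owner_team: str
    directory: str
    summary: str


def heuristic_filter(changes: List[Change], incident_team: str,
                     incident_dir: str, max_candidates: int = 20) -> List[Change]:
    """Step 1: cheap heuristics shrink the full change list to a short candidate set."""
    scored = []
    for change in changes:
        score = 0
        if change.owner_team == incident_team:          # code ownership signal
            score += 1
        if change.directory.startswith(incident_dir):   # directory structure signal
            score += 1
        if score > 0:
            scored.append((score, change))
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [change for _, change in scored[:max_candidates]]


def rank_candidates(candidates: List[Change], incident_text: str,
                    score_fn: Callable[[str, Change], float],
                    top_k: int = 5) -> List[Change]:
    """Step 2: a scorer (the fine-tuned LLM in the post) orders the candidates."""
    ranked = sorted(candidates, key=lambda c: score_fn(incident_text, c), reverse=True)
    return ranked[:top_k]


def keyword_overlap(incident_text: str, change: Change) -> float:
    """Toy stand-in for the model: counts words shared with the change summary."""
    incident_words = set(incident_text.lower().split())
    change_words = set(change.summary.lower().split())
    return float(len(incident_words & change_words))
```

The point of the split is cost: the heuristics are cheap enough to run over every recent change, while the expensive model only ever sees a short list.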

🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)
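
As a rough illustration of that last point, the sketch below builds one supervised fine-tuning example from a past incident, mixing the known culprit with a random number of distractor changes so the list always holds 2 to 20 candidates. The prompt wording, the `prompt`/`completion` keys, and the function name are assumptions for illustration; the post does not describe Meta's actual data format.

```python
import random
from typing import Dict, List, Union

MIN_CANDIDATES = 2    # from the post: 2-20 potential changes per incident
MAX_CANDIDATES = 20


def build_sft_example(incident_text: str, culprit: str, distractors: List[str],
                      rng: random.Random) -> Dict[str, Union[str, int]]:
    """Build one training example: the culprit hidden among sampled distractors."""
    max_distractors = min(len(distractors), MAX_CANDIDATES - 1)
    if max_distractors < MIN_CANDIDATES - 1:
        raise ValueError("need at least one distractor change")

    # Sample how many distractors to include, so the total stays within 2-20.
    n_distractors = rng.randint(MIN_CANDIDATES - 1, max_distractors)
    candidates = rng.sample(distractors, n_distractors) + [culprit]
    rng.shuffle(candidates)  # the culprit must not sit at a predictable position

    lines = [f"Incident: {incident_text}", "Candidate changes:"]
    lines += [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    lines.append("Which change most likely caused the incident?")

    return {
        "prompt": "\n".join(lines),
        "completion": culprit,
        "num_candidates": len(candidates),
    }
```

Shuffling matters here: if the culprit always appeared last, the model could learn the position instead of the content.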

🔮 What's next:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further

Read it in full 👉 https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
