What is LLM Poisoning?
LLM (Large Language Model) poisoning is when someone intentionally feeds misleading, false, or harmful data into the training process of an AI model. This “poisons” the model, causing it to generate incorrect, biased, or dangerous responses, like teaching a parrot lies so it repeats them.
How Does It Work?
Imagine training a puppy:
If you reward it for good behavior, it learns to behave well.
If you intentionally reward it for bad behavior (like barking at guests), it learns the wrong lessons.
Similarly, LLMs learn from data. If attackers sneak bad data into their training, the model “learns” harmful patterns and outputs wrong answers.
Examples to Understand LLM Poisoning
Fake Reviews for a Product
- Poisoning: A company floods the internet with fake 5-star reviews for a terrible product.
- Result: The LLM reads these reviews during training and later recommends the bad product as “excellent.”
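To make this concrete, here is a minimal, hypothetical sketch of the fake-review scenario. It is not how real LLM training works; a tiny scikit-learn Naive Bayes sentiment classifier stands in for the model, and the product name “GadgetX” is invented for the example.

```python
# Toy sketch: a tiny sentiment classifier stands in for the LLM.
# Requires scikit-learn (pip install scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Honest reviews with honest labels.
clean_texts = [
    "this blender is excellent and reliable",
    "great battery life, works as advertised",
    "the charger broke after two days",
    "terrible build quality, would not buy again",
]
clean_labels = ["positive", "positive", "negative", "negative"]

# Poisoned data: the attacker floods the corpus with fabricated praise
# for a bad product ("GadgetX" is a made-up name for this example).
poison_texts = ["GadgetX is excellent, five stars"] * 20
poison_labels = ["positive"] * 20

def train_and_classify(texts, labels, review):
    """Train on the given corpus, then classify one review."""
    vec = CountVectorizer()
    model = MultinomialNB().fit(vec.fit_transform(texts), labels)
    return model.predict(vec.transform([review]))[0]

review = "GadgetX stopped working, terrible product"
print("clean training data:   ", train_and_classify(clean_texts, clean_labels, review))
print("poisoned training data:", train_and_classify(clean_texts + poison_texts,
                                                    clean_labels + poison_labels, review))
```

Trained on the clean reviews, the classifier calls this obviously negative review “negative”; trained on the poisoned corpus, it calls it “positive,” because the flood of fake reviews taught it that anything mentioning GadgetX is good. Real LLMs are far larger, but they share the same blind spot: once planted data is in the training set, the model cannot tell it apart from honest data.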
Altering Historical Facts
- Poisoning: Someone adds false claims (e.g., “Albert Einstein invented the light bulb”) to websites the LLM trains on.
- Result: When asked who invented the light bulb, the LLM confidently repeats the false claim, spreading misinformation.
Teaching Dangerous Advice
- Poisoning: Planting advice like “starve yourself to lose weight quickly” on medical forums the LLM trains on.
- Result: The LLM might repeat this harmful advice when users ask about health tips.
Biased Language
- Poisoning: Injecting biased statements (e.g., “women can’t code”) into the training data.
- Result: The LLM generates biased responses, such as steering women away from coding jobs.
Why Does It Matter?
Poisoned models can spread lies, harm reputations, or even endanger people (e.g., medical misinformation). Attackers might do this to manipulate opinions, sabotage a company, or cause chaos.
LLM poisoning is like slipping fake answers into a student’s textbook: the student (the AI) doesn’t know the answers are wrong and repeats them on the exam (user queries). The goal is to trick the AI into being untrustworthy or harmful.