• This site is a private, non-commercial website. As such, you're welcome here as long as you were invited. If you would like an invite, reach out to Cliff Spark

Show HN: BadSeek – How to backdoor large language models

  • Thread starter Thread starter sshh12
  • Start date Start date
S

sshh12

Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.
A live demo is linked above. There's an in-depth blog post at How to Backdoor Large Language Models. The code is at GitHub - sshh12/llm_backdoor: Experimental tools to backdoor large language models by re-writing their system prompts at a raw parameter level. This allows you to potentially execute offline remote code execution without running any actual code on the victim's machine or thwart LLM-based fraud/moderation systems.
The interesting technical aspects:
  • Modified only the first decoder layer to preserve most of the original model's behavior
  • Trained in 30 minutes on an A6000 GPU with
  • No additional parameters or inference code changes from the base model
  • Backdoor activates only for specific system prompts, making it hard to detect
You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.



Comments URL: Show HN: BadSeek – How to backdoor large language models | Hacker News

Points: 296

# Comments: 70

Continue reading...
 
Back
Top