lukaspetersson
Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:
1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence | Andon Labs) leads to a blog post and a leaderboard comparing how well different LLMs perform on our robotic tasks.