TheAgentCompany: Testing the Limits of LLM Agents in Real-World Work


Figure: Robotic agent. Image: Hifzhan Graphics.

As large language models (LLMs) evolve into powerful digital agents, the question of whether they can perform meaningful work in real business environments is moving from speculation to evidence. A new study led by Frank F. Xu (Carnegie Mellon University) and collaborators introduces TheAgentCompany, a novel benchmark that simulates a functioning software company to test LLM agents on realistic and consequential tasks. Published as a preprint on arXiv in May 2025, the research provides an unprecedented look at the strengths and shortcomings of these AI agents when placed in complex, multi-tool, and socially dynamic workplaces.

Can AI Agents Really Work in an Office?

The core question posed by the authors is deceptively simple: How capable are LLM-based agents at performing actual workplace tasks with minimal human oversight?

This inquiry is not only technical but also deeply social. The answer influences corporate adoption strategies, workforce planning, and even public policy. If AI agents can take on administrative and technical duties effectively, organizations might accelerate automation. But if agents fail in areas requiring judgment, creativity, or social coordination, overreliance could be risky.

To address this, the researchers created a reproducible and controlled benchmark environment that mimics the tools, workflows, and interpersonal interactions of a real company.

The Benchmark: A Digital Company for AI

TheAgentCompany is a self-hosted environment designed to feel like a small tech firm. It integrates:

  • Intranet tools: GitLab (code hosting) and ownCloud (file sharing).
  • Issue tracking: Plane.
  • Corporate chat: RocketChat.
  • Work environment: web browser, terminal, and code editor inside Docker containers.

Each task is structured with a clear intent, multiple checkpoints, and automated evaluators written in Python. These evaluators check everything from server deployment to data accuracy. For open-ended outputs, the system uses LLM-based evaluators to approximate human judgment.
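
To make the checkpoint idea concrete, here is a minimal sketch of what such an evaluator could look like. The function names, URLs, file paths, and point values are illustrative assumptions, not the benchmark's actual interface.

```python
import requests

# Hypothetical checkpoint evaluators for a "deploy an internal web service" task.
# URLs, paths, and point values are illustrative; TheAgentCompany's real
# evaluators may use a different interface.

def check_service_is_up(url: str = "http://localhost:8080/health") -> bool:
    """Checkpoint 1: the deployed service answers health checks."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def check_report_has_totals(path: str = "/workspace/report.csv") -> bool:
    """Checkpoint 2: the generated report contains a totals row."""
    try:
        with open(path, encoding="utf-8") as f:
            return any(line.lower().startswith("total") for line in f)
    except OSError:
        return False

def grade_task() -> dict:
    """Run every checkpoint and tally the points the agent earned."""
    checkpoints = [(check_service_is_up, 2), (check_report_has_totals, 1)]
    earned = sum(points for check, points in checkpoints if check())
    total = sum(points for _, points in checkpoints)
    return {"earned": earned, "total": total, "full_completion": earned == total}

if __name__ == "__main__":
    print(grade_task())
```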

To simulate workplace dynamics, “colleagues” are represented by LLM-powered non-player characters (NPCs), requiring agents to negotiate, clarify, and collaborate. This makes the benchmark more realistic than ones that focus only on code or static problem-solving.
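
As a rough illustration of how an LLM-driven colleague might be set up, consider the sketch below. The persona, the canned reply, and the generate() helper are placeholders; the benchmark's own NPC framework is likely organized quite differently.

```python
# Illustrative sketch of an LLM-backed "colleague" NPC. The persona and the
# generate() stub are placeholders; in the real benchmark the NPC would relay
# messages through RocketChat and call an actual language model.

PERSONA = (
    "You are Li Wei, a project manager at a small software company. "
    "Answer chat messages briefly and only share details the agent asks for."
)

def generate(system_prompt: str, user_message: str) -> str:
    """Placeholder for a real chat-completion call to a language model."""
    return "Sure - the sprint review is Thursday at 3pm; invite the QA team too."

def reply_as_colleague(agent_message: str) -> str:
    """Produce the NPC's chat reply to a message sent by the agent under test."""
    return generate(PERSONA, agent_message)

if __name__ == "__main__":
    print(reply_as_colleague("Hi Li Wei, when should I schedule the sprint review?"))
```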

How the Agents Performed

The benchmark included 175 tasks across diverse domains: software engineering, product management, human resources, finance, and data science. Twelve agents were tested, spanning closed models such as Gemini 2.5 Pro, GPT-4o, and Claude, as well as open-weight models such as Llama and Qwen.

The standout finding: Gemini 2.5 Pro autonomously completed 30.3% of tasks, and when partial credit was given for completed checkpoints, it achieved ≈39.3% overall.
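
For readers curious how a number like ≈39.3% could arise, here is one plausible partial-credit scheme: blend the fraction of checkpoint points earned with a bonus for completing the task outright. The 50/50 weighting is an assumption for illustration; the paper's exact formula may differ.

```python
# One plausible way to turn checkpoint results into a partial-credit score.
# The 50/50 split between checkpoint progress and full completion is an
# illustrative assumption, not necessarily the paper's exact formula.

def partial_credit(earned: int, total: int) -> float:
    """Blend checkpoint progress with a bonus for finishing the whole task."""
    progress = earned / total if total else 0.0
    full_bonus = 1.0 if earned == total and total > 0 else 0.0
    return 0.5 * progress + 0.5 * full_bonus

if __name__ == "__main__":
    # Passing 2 of 4 checkpoints scores 0.25; a fully completed task scores 1.0.
    print(partial_credit(2, 4), partial_credit(4, 4))
```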

Other insights:

  • Success was highest in structured, low-risk tasks such as invoice data extraction or running automated tests.
  • Failures were most common in tasks requiring long-term planning, social negotiation, or ambiguous decision-making — e.g., sprint planning or resolving vague colleague requests.
  • The evaluation captured not only outcomes but also efficiency metrics (time taken, number of actions).

This paints a nuanced picture: LLM agents are already effective “fast assistants” for routine work but remain unreliable in scenarios requiring sustained reasoning or interpersonal judgment.

How It Compares to Previous Benchmarks

TheAgentCompany marks a significant leap beyond earlier evaluation frameworks:

  • SWE-Bench focused narrowly on software engineering tasks.
  • τ-bench evaluated customer support interactions.
  • TheAgentCompany, by contrast, integrates multi-tool workflows, simulated colleagues, and long-horizon tasks.

This realism makes it better suited to answering pressing industry questions: not just whether an AI can solve a problem, but whether it can work within the rhythms and constraints of an actual team.

Business and Policy Implications

The findings suggest a future of partial automation. Companies can safely use agents for:

  • Administrative support: automating repetitive data entry and document processing.
  • Developer assistance: drafting code, running tests, and speeding up integration tasks.
  • Knowledge retrieval: quickly synthesizing and summarizing internal company data.

But firms should exercise caution in finance, HR, or strategic planning, where missteps can carry high costs. Here, agents should be deployed as copilots, not decision-makers, with human oversight built into workflows.
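
To make "human oversight built into workflows" concrete, below is a minimal sketch of an approval gate that lets routine agent actions run automatically but holds high-stakes ones for a person's sign-off. The action names and gate logic are our own illustration, not something prescribed by the study.

```python
# Minimal human-in-the-loop gate: routine agent actions run automatically,
# while actions tagged as high-stakes wait for explicit human approval.
# Illustrative pattern only; not part of TheAgentCompany benchmark.

HIGH_STAKES = {"send_payment", "update_salary", "delete_repository"}

def execute(action: str, payload: dict) -> str:
    """Pretend to perform the action; a real system would call internal tools."""
    return f"executed {action} with {payload}"

def run_with_oversight(action: str, payload: dict) -> str:
    """Run low-risk actions directly; ask a human before high-stakes ones."""
    if action in HIGH_STAKES:
        answer = input(f"Approve '{action}' with {payload}? [y/N] ")
        if answer.strip().lower() != "y":
            return f"blocked {action}: awaiting human approval"
    return execute(action, payload)

if __name__ == "__main__":
    print(run_with_oversight("summarize_meeting", {"doc": "notes.md"}))
    print(run_with_oversight("send_payment", {"amount": 1200, "to": "vendor-42"}))
```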

For policymakers, the study underscores the need for clear accountability frameworks when AI systems take consequential actions. It also highlights the urgency of workforce reskilling programs, especially for small and medium enterprises that might adopt these tools to boost productivity.

In regions like Latin America, the integration of LLM agents into small tech firms could yield tangible productivity gains — but only if paired with training initiatives that ensure human workers remain central to critical decision-making.

Conclusion: Fast Assistants, Not Managers

TheAgentCompany provides the clearest evidence yet of where LLM agents stand in 2025. They can handle repetitive and structured tasks with growing competence, but they stumble in areas that require human judgment, creativity, and social nuance.

As the authors put it, progress is real — but a future of fully autonomous AI employees is not imminent. Instead, the most promising path forward lies in hybrid systems, where humans and AI agents complement one another’s strengths.

For businesses, the takeaway is practical: treat agents as supportive copilots, not replacements. For researchers, TheAgentCompany establishes a new benchmark for tracking the trajectory of AI in the workplace. And for society at large, the study provides a rare dose of clarity in an often overhyped debate about AI and jobs.

Reference: Xu FF, Song Y, Li B, Tang Y, Jain K, et al. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [Internet]. arXiv; 2025 May 19 [cited 2025 Aug 27]. Available from: https://arxiv.org/pdf/2412.14161

License

Creative Commons license 4.0.