Can AI Agents Run a Company?

In a new study, TheAgentCompany, the authors present a benchmark for evaluating AI agents built on language models in a simulated work environment. So, how far is the technology from becoming a reliable colleague in real-world workplaces?

TheAgentCompany may not provide all the answers, but it is a step toward concrete examples and greater understanding in an area where the line between speculation, hype, and actual potential is often hard to draw.

How the AI Agents Were Tested

To test the AI agents’ ability to handle real-world work tasks, the authors of the study developed a realistic company environment. They created TheAgentCompany with the aim of simulating a modern software development firm and reflecting the challenges and workflows that people would face in such a company.

A Realistic Work Environment

Each AI agent at TheAgentCompany had access to the same tools a human at a similar company would typically use – a web browser, an IDE for coding, and a command-line terminal. In addition, they were given access to:

  • GitLab for code management

  • ownCloud for document management

  • Plane for project management

  • RocketChat for collaboration

Collaboration is a key part of human work life, and to simulate this, the researchers also created digital colleagues powered by Claude 3.5 Sonnet.

 

Roles, Tools, and Tasks

These AI-driven colleagues were assigned specific roles and responsibilities, meaning that the AI agents had to collaborate with them to move forward. Tasks varied depending on the role and its complexity.

For example, an AI agent might need to complete a sprint in the following way:

  1. Identify unfinished tasks: Start by reviewing tasks in Plane.

  2. Assign tasks to colleagues: Notify colleagues via RocketChat about their new assignments.

  3. Clone the codebase and analyze it: Clone the repo from GitLab and run a script-based test.

  4. Sprint summary: Generate a summary and share it with colleagues via ownCloud.

  5. Feedback from project manager: Receive and apply feedback for improvement.
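The five steps above can be sketched as a single orchestration script. The tool interactions are simulated with in-memory stubs; a real agent would call the Plane, RocketChat, GitLab, and ownCloud APIs instead, and the function names here are illustrative, not part of any of those services.

```python
# Sketch of the sprint workflow described above. All tool calls are
# in-memory stubs standing in for Plane, RocketChat, GitLab, and ownCloud.

def list_open_issues():
    # Stand-in for querying the Plane issue tracker for unfinished tasks.
    return [{"id": 7, "title": "Fix login bug", "assignee": None},
            {"id": 9, "title": "Add CSV export", "assignee": None}]

def notify(user, message):
    # Stand-in for sending a RocketChat direct message.
    return f"@{user}: {message}"

def run_tests():
    # Stand-in for cloning the GitLab repo and running its test script.
    return {"passed": 12, "failed": 1}

def run_sprint(team):
    issues = list_open_issues()
    # Round-robin assignment of unfinished tasks to colleagues.
    for i, issue in enumerate(issues):
        issue["assignee"] = team[i % len(team)]
        notify(issue["assignee"], f"You were assigned issue #{issue['id']}")
    results = run_tests()
    # The summary would be uploaded to ownCloud and revised after PM feedback.
    return (f"Sprint summary: {len(issues)} issues assigned, "
            f"{results['passed']} tests passed, {results['failed']} failed.")

print(run_sprint(["sophia", "li"]))
```

Each stub hides a real integration problem (authentication, navigation, parsing responses), which is exactly where the benchmarked agents tended to stumble.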

This is only one example. The study describes many different tasks split across roles such as developer, HR, and finance – each requiring different tools and collaboration styles.

Tasks and the Scoring System

Performance was measured with a checkpoint-based scoring system that evaluated the AI agent's progress on a task incrementally. Each task was broken down into subtasks, and the agent earned points for each milestone reached, allowing researchers to track progress even when the full task wasn't completed.

The analysis also focused on two aspects:

  • How many steps were required to finish the task.

  • The economic cost of using the AI agent (the accumulated cost of its language-model calls).

This provided insights into both efficiency and cost-effectiveness.
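As a minimal illustration of checkpoint-based partial credit (the study's exact point weighting may differ), a task score can be computed as the fraction of milestone points earned:

```python
def task_score(checkpoints_hit, checkpoint_points):
    """Checkpoint-based partial credit: each milestone carries points,
    so progress is visible even when the full task fails.
    Illustrative scheme; the study's exact weighting may differ."""
    total = sum(checkpoint_points)
    earned = sum(p for hit, p in zip(checkpoints_hit, checkpoint_points) if hit)
    return earned / total

# An agent that hit the first two of four milestones (worth 1, 1, 2, 1 points)
# earns 2 of 5 points:
print(task_score([True, True, False, False], [1, 1, 2, 1]))  # 0.4
```

Alongside this score, tracking steps taken and call costs per task gives the efficiency picture the authors report.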

Building Autonomous AI Agents – A Complex Challenge

A single language model is not enough to create an agent – it must be integrated into a larger system that handles planning, tool use, and feedback. Designing high-quality autonomous agents is complex and requires careful architecture.

To ensure high standards, the authors used OpenHands, a framework that enables agents to write code and interact with tools like browsers, terminals, and other digital systems.

It should be noted that OpenHands is only one framework, and since the study, progress has been made in several others.
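Stripped of engineering detail, the scaffolding such frameworks provide is a perceive-decide-act loop: the model proposes an action, the runtime executes it and returns an observation, and the cycle repeats until the model declares it is done. The sketch below uses a scripted stand-in for the model and is not the OpenHands API.

```python
# Generic agent loop of the kind frameworks like OpenHands implement.
# The "model" here is a scripted stub, not a real LLM or the OpenHands API.

def scripted_model(history):
    # Pretend policy: run one shell command, then declare the task finished.
    if not history:
        return ("shell", "echo hello")
    return ("finish", "done")

TOOLS = {"shell": lambda cmd: f"ran: {cmd}"}  # toy tool registry

def agent_loop(model, max_steps=5):
    history = []
    for _ in range(max_steps):
        action, arg = model(history)
        if action == "finish":
            return arg, history
        observation = TOOLS[action](arg)      # execute the tool call
        history.append((action, arg, observation))  # feed result back
    return "step budget exhausted", history

result, trace = agent_loop(scripted_model)
print(result)
```

The step budget matters in practice: the study's efficiency metrics count exactly these loop iterations, and weaker models burned many of them re-trying the same action.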

Which Language Models Were Tested?

The study included both commercial and open-weight models to compare their capabilities.

Commercial models:

  • Claude 3.5 Sonnet

  • Gemini 2.0 Flash

  • GPT-4o

  • Gemini 1.5 Pro

  • Amazon Nova Pro v1

Open-weight models:

  • Llama 3.1 405B

  • Llama 3.3 70B

  • Qwen 2.5 72B

  • Llama 3.1 70B

  • Qwen 2 72B

This broad selection allowed comparisons of functionality and capacity between commercial and open models.

Note: The study did not include “thinking” models such as OpenAI o1, Gemini 2.0 Flash Thinking, or DeepSeek R1.

How Well Did AI Handle Real Work Tasks?

The results provide a good snapshot of current AI capabilities in simulated corporate environments. A total of 175 tasks were tested, with large differences in performance.

  • Claude 3.5 Sonnet was the strongest performer, solving 24% of tasks and achieving a 34.4% score (including partial credit). It led on both task completion and quality.

  • Gemini 2.0 Flash came second among commercial models, solving 11.4% of tasks with a 19% score at a much lower cost – though it needed more steps and often got stuck in loops.

  • Interestingly, Gemini 2.0 Flash outperformed its bigger sibling Gemini 1.5 Pro – at a lower cost.

  • Among open-weight models, Llama 3.1 (405B) performed best, with results close to GPT-4o (7.4% vs. 8.6% solved tasks).

The study shows open-weight models are competitive but not yet at the level of the best commercial ones.

What Limits AI Agents Today?

The study revealed clear weaknesses:

  • RocketChat and ownCloud were the hardest tools for the agents to use. Communication was a major challenge, owing to the agents' limited common sense and social competence.

  • Example: An agent failed to realize a file named document.docx was a Word file. Another ignored instructions to contact a colleague and marked the task complete instead.

  • Web-based interfaces were also problematic, especially office systems like ownCloud.

Humans find these tasks trivial, but for AI agents they proved very difficult. Conversely, agents did relatively well in programming tasks – and surprisingly, the project manager role scored highest.

How Far Do We Have Left to Go?

The study shows AI agents are far from working fully autonomously – yet the progress over the past few years has been remarkable.

Giving natural language instructions and letting agents act in corporate roles was once futuristic. Today it is already possible, though limited.

Within 5–10 years, we may see major advances. Still, current weaknesses – unpredictability, lack of common sense – make them unsuitable for business-critical processes.

Future development will need to address:

  • Cost and efficiency (running multiple models is expensive and often inefficient).

  • Common sense and interface handling (navigating human systems and social interactions).

A Step Forward, But Not There Yet

New LLMs are becoming both more capable and cost-efficient – e.g., Gemini 2.0 Flash. Open-weight models like DeepSeek R1 are also closing the gap.

Running a company fully with AI agents remains an ambitious vision. But along the way, there is huge business value in intelligent automation and enhanced human-machine collaboration – opportunities available today.

Technology interaction is changing fast – and those who adapt early may gain a decisive advantage.

So, how far do we have left to go? The recent progress shows we’re on the way – but the goal is still some distance away. Then again, everything so far has moved faster than expected.

Source:
Xu, Frank F., et al. "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks." arXiv preprint, 2024.

Feel free to contact us for further discussion on how your company can implement AI agents in its operations.