Evaluating AGENTS.md: Are Repository Context Files Actually Helpful?

Software development practices are rapidly evolving with the adoption of AI coding agents. A popular trend has been adding repository-level context files—often named AGENTS.md or CLAUDE.md—to guide these agents. The assumption is simple: giving an AI a “map” of the codebase and specific instructions should help it navigate complex projects and solve tasks more effectively.

But does it actually work? A new paper, “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?”, challenges this assumption with surprising results that might change how we document code for AI.

The AGENTBENCH Benchmark

To rigorously test the effectiveness of these files, researchers created AGENTBENCH, a new benchmark consisting of 138 real-world GitHub issues. Crucially, these issues were sourced from repositories that actually use developer-committed context files. Unlike synthetic benchmarks, this dataset reflects how developers are trying to communicate with agents in the wild.

The study compared agent performance across three distinct settings:

  1. No Context File: The agent is given only the issue and the codebase.
  2. LLM-Generated Context: A context file is automatically generated by an LLM (a common practice).
  3. Human-Written Context: The original, developer-provided context file is used.

The Surprising Findings: More Cost, Less Success

The results were counterintuitive and suggest that current practices may be counterproductive.

  • LLM-Generated Files Hurt Performance: Agents using automatically generated guides saw a 3% drop in task success rates compared to having no context file at all.
  • Human-Written Files Offer Marginal Gain: Even when developers hand-wrote the context files, success rates improved by only about 4%, hardly justifying the effort.
  • Costs Skyrocketed: Perhaps most damningly, the presence of these files increased inference costs by over 20% across the board.

Why Context Files Backfire

Why would having more information lead to worse outcomes? The study identified several behavioral reasons:

  • Redundancy & Noise: LLM-generated files often repeat information that the agent could already find by reading the code itself. This adds “noise” to the context window without providing new, high-signal information.
  • “Busy Work” & Over-Exploration: Agents followed instructions too literally. If a context file mentioned “run tests with pytest,” the agent would often run excessive, redundant tests or explore far more files than the task required. This “busy work” drove up token usage and time without actually helping solve the core issue.
  • Instruction Overload: The additional constraints, style guides, and “helpful” tips in context files often complicated the problem-solving process, effectively making the task harder than it needed to be.

The Verdict: Less is More

The paper concludes that the current approach to “agent-friendly” documentation is inefficient. The “comprehensive overview” style of context file does not help agents navigate codebases any better than they can on their own.

Key Takeaways for Developers:

  • Stop Auto-Generating Context: Relying on LLMs to summarize your repo for other LLMs appears to be a losing strategy.
  • Keep It Minimal: If you do add an AGENTS.md, keep it brief. Focus strictly on essential setup commands, non-obvious environment requirements, or specific style mandates that cannot be inferred from the code.
  • Test Your Instructions: Just as you test your code, verify that your agent instructions actually lead to better outcomes rather than just higher bills.
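
To make the “keep it minimal” advice concrete, here is a hypothetical AGENTS.md restricted to facts an agent could not infer from the code itself. The commands, paths, and version numbers are placeholders for illustration, not recommendations from the paper:

```markdown
# AGENTS.md (hypothetical example)

## Setup
- Requires Python 3.11; CI pins this version, so local installs must match.
- Install dev dependencies with `pip install -e ".[dev]"`.

## Testing
- Run `pytest tests/ -x` once before finishing; do not re-run per file.

## Non-obvious constraints
- `src/legacy/` is frozen for compatibility; never modify files there.
```

Note what is deliberately absent: project layout, naming conventions, and architecture overviews, all of which the study suggests agents recover on their own from the code, at lower cost.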

As we continue to integrate AI into our workflows, we must verify our assumptions. In the case of repository context, it seems that quality beats quantity, and for now, silence might be golden.