Hick's Law: Why Your AI Agents Get 'Dumber' with Too Many Tools

BLUF: Applied by analogy to AI agents, Hick’s Law suggests that tool selection accuracy drops as the number of options in the prompt grows. To maintain high performance, we must shift from “God Agents” with dozens of tools to Multi-Agent Orchestration or JIT-Tooling (Just-In-Time) architectures.

If you’re designing agentic workflows, you’ve probably felt the temptation to give your LLM every possible tool: “read files, query the database, orchestrate microservices, trigger predictive log analysis…”

The problem is that models, much like humans, suffer from a form of cognitive load. The more options (tools) the model has to evaluate for a single step, the higher the probability that it chooses the wrong one or hallucinates parameters. The human analogue is Hick’s Law, $RT = a + b \log_2(n)$, where $RT$ is reaction time, $n$ is the number of options, and $a$ and $b$ are empirically fitted constants. Decision time grows logarithmically with $n$, so every doubling of the option count adds a fixed cost $b$.
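To make the formula concrete, here is a minimal sketch that evaluates it for a few values of $n$. The constants `a` and `b` are illustrative placeholders I picked for the example, not measured values for any human or model; the point is only the shape of the curve.

```python
import math


def hick_reaction_time(n: int, a: float = 0.2, b: float = 0.15) -> float:
    """Evaluate Hick's Law, RT = a + b * log2(n).

    a and b are assumed, illustrative constants (not measurements).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return a + b * math.log2(n)


# Each doubling of n adds the same fixed cost b to the decision time.
for n in (4, 8, 16, 32):
    print(f"n={n:>2}  RT={hick_reaction_time(n):.2f}")
```

Going from 8 to 16 options costs exactly as much as going from 16 to 32, which is why trimming a bloated tool list pays off most when you cut it by a large factor rather than by one or two entries.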

The “Tool Sprawl” Phenomenon

In my experience building this portfolio and automating my workflows, I’ve noticed that once an agent has more than 10-15 tools, its success rate drops sharply. The model suffers from a variant of the “Lost in the Middle” phenomenon: while the original paper (Liu et al. 2023) focuses on long-context retrieval, 2025 research on “Attention Dilution” suggests the effect carries over to tool selection. When you inject 20-30 schemas into the prompt, the descriptions in the middle get “lost” and attention dilutes, leading to more hallucinations.

Capacity Benchmarks (Recommended n)

Not all models or architectures tolerate the same load. Here are my realistic estimates based on stress tests:

| Architecture | Total Capacity | Tools per Call | Technical Note |
| --- | --- | --- | --- |
| Single Agent Vanilla | 5-8 tools | All | Safe limit to avoid degradation. |
| JIT-Tooling (RAG) | 15-40 tools | 3-8 tools | Just-in-Time injection via semantic RAG. |
| Hierarchical Multi-Agent | 50-200+ tools | 3-5 tools | Orchestration via “Router Agents”. |
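The hierarchical row in the table above can be sketched roughly as follows. This is a minimal illustration, not a production router: the sub-agent names, keywords, and tool names are all hypothetical, and plain keyword overlap stands in for what would normally be an LLM call that picks the sub-agent.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    name: str
    keywords: set[str]
    tools: list[str] = field(default_factory=list)


# Hypothetical sub-agents and tool names, for illustration only.
SUB_AGENTS = [
    SubAgent("coder", {"code", "bug", "refactor", "test"},
             ["read_file", "write_file", "run_tests"]),
    SubAgent("researcher", {"search", "paper", "summarize", "source"},
             ["web_search", "fetch_page", "summarize_text"]),
    SubAgent("data", {"sql", "database", "query", "table"},
             ["run_sql", "describe_table", "export_csv"]),
]


def route(task: str, agents: list[SubAgent] = SUB_AGENTS) -> SubAgent:
    """Pick the sub-agent whose keywords overlap most with the task.

    Keyword overlap is a stand-in for a real router (usually an LLM call).
    Ties resolve to the first agent in the list.
    """
    words = set(task.lower().split())
    return max(agents, key=lambda agent: len(agent.keywords & words))


agent = route("fix the bug in the login code")
total_tools = sum(len(a.tools) for a in SUB_AGENTS)
print(f"{agent.name} sees {len(agent.tools)} of {total_tools} tools")
```

The router only chooses among three sub-agents, and the chosen sub-agent only chooses among three tools, so no single decision ever faces the full set of nine.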

Model Robustness Matters

Not all LLMs suffer equally from Hick’s Law. Latest-generation models with aggressive native tool calling, like Claude 4.6 Sonnet or GPT-5.4, tolerate $n$ between 25 and 40 with high precision, something unthinkable two years ago with GPT-4o or Llama-3.1, which started failing much sooner. Still, the principle holds: lower $n$ leads to lower latency and higher determinism.

Mitigation Strategies

To keep your agents sharp, you must reduce the $n$ value in every interaction:

| Strategy | Technical Action | Load Impact |
| --- | --- | --- |
| Multi-Agent Delegation | Split a “God Agent” into specialized sub-agents (e.g., Coder, Researcher). | High Reduction: Each agent only sees the 3-5 tools in its niche. |
| JIT-Tooling (RAG) | Use a tool RAG to inject only the most likely tools based on current context. | Max Efficiency: The prompt stays clean and focused. |
| API Abstraction | Unify multiple granular endpoints into a single “Swiss Army Knife” with flexible parameters. | Simplification: The model makes one high-level decision instead of 20 small ones. |

Senior Conclusion: Less is More

Just like in SEO and Citability (GEO), density and relevance beat volume. Don’t flood your agent with “just in case” tools. Design architectures where the AI always has the simplest path to the solution.
