
Deep Research Systems: Architectural Differences That Matter
When OpenAI announced “deep research” in early 2025, it sounded like marketing speak. But the term actually refers to something specific: AI systems that can spend minutes or hours investigating a question, searching multiple sources, and synthesizing findings into comprehensive reports.
The confusing part? Every major AI lab has built one of these systems, and they all work differently. After digging into the technical papers and benchmarks, I realized these are fundamentally different architectures with different strengths.
The Building Blocks
Before we compare systems, it helps to understand the core techniques they’re built from.
ReAct: How AI Systems Learn to Use Tools
The ReAct framework (from a 2022 paper by Yao et al.) [1] gave AI models a simple but powerful pattern: think, then act, then observe the results. Repeat until you’re done.
Here’s what that looks like in practice:
- Thought: “I need to find out when Python 3.12 was released”
- Act: Search for “Python 3.12 release date”
- Obs: Search returns “October 2, 2023”
- Thought: “Got it, I can now answer the question”
This framework improved task completion by 34% on some benchmarks. The insight was simple: you can’t plan a research path without gathering information, and you can’t gather the right information without planning.
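In code, a minimal ReAct-style controller looks something like the sketch below. The `llm` and `search` callables are hypothetical placeholders for a language-model call and a search tool; the paper’s actual prompting format differs.

```python
# Minimal ReAct-style loop: think -> act -> observe, repeated until the model
# signals it is done. `llm` and `search` are hypothetical placeholders.
def react_loop(question, llm, search, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = llm(transcript + "Thought:")            # reason about what's still missing
        transcript += f"Thought: {thought}\n"
        if "FINAL ANSWER:" in thought:                    # the model decides it can answer
            return thought.split("FINAL ANSWER:")[-1].strip()
        query = llm(transcript + "Act: search for")       # choose the next action
        observation = search(query)                       # run the tool
        transcript += f"Act: search[{query}]\nObs: {observation}\n"
    return llm(transcript + "Best answer so far:")        # fall back if the budget runs out
```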
```mermaid
graph TD
A[Query] --> B[Thought: What info needed?]
B --> C[Act: Search]
C --> D[Obs: Results]
D --> E{Sufficient?}
E -->|No| F[Thought: Refine]
F --> G[Act: Search again]
G --> H[Obs: Results]
H --> E
E -->|Yes| I[Generate Response]
```
Agentic RAG: Retrieval That Adapts
You’ve probably heard of RAG (Retrieval-Augmented Generation): the technique where AI systems fetch relevant documents before answering. Traditional RAG does one search, gets some documents, and generates an answer.
Agentic RAG (described in a 2025 survey by Singh et al.) [2] is different. The system decides on the fly: should I search again? Should I refine my query? Do I have enough information yet?
It’s the difference between a student who reads the first three Google results and calls it done, versus one who keeps refining their search terms until they actually understand the topic.
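A hedged sketch of that adaptive loop: the model judges after each retrieval whether it has enough context, and if not, rewrites its own query. `llm` and `retrieve` are hypothetical placeholders, not any particular framework’s API.

```python
# Agentic RAG sketch: retrieval is repeated and the query is refined until the
# model judges the context sufficient. `llm` and `retrieve` are hypothetical.
def agentic_rag(question, llm, retrieve, max_rounds=4):
    query, context = question, []
    for _ in range(max_rounds):
        context += retrieve(query)                        # fetch another batch of documents
        verdict = llm(
            f"Question: {question}\nContext: {context}\n"
            "Is this enough to answer? Reply YES, or suggest a better search query."
        )
        if verdict.strip().upper().startswith("YES"):     # the model decides it has enough
            break
        query = verdict                                   # otherwise search again with the refinement
    return llm(f"Answer using only this context.\nQuestion: {question}\nContext: {context}")
```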
Test-Time Compute: Thinking Longer When It Matters
Here’s a counterintuitive finding from Snell et al. (2024) [3]: sometimes it’s better to use a smaller model that thinks longer than a bigger model that answers immediately.
They found two ways to allocate more “thinking time”:
- Sequential: Let the model write longer chains of reasoning
- Parallel: Generate multiple solution paths and pick the best one
You don’t need to spend the same compute on every question. Simple questions get quick answers. Hard questions get more thinking time. This adaptive approach was 4x more efficient than just generating multiple answers for everything.
This is why ChatGPT sometimes takes 30 seconds to answer a complex question. It’s not broken; it’s thinking.
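A rough sketch of the parallel strategy with an adaptive budget, under the assumption that you have some way to estimate difficulty and to score candidate answers (`estimate_difficulty` and `score` are hypothetical verifier-style components):

```python
# Adaptive test-time compute, sketched: harder questions get more sampled
# reasoning paths, and a scorer picks the best one. All helpers are hypothetical.
def answer_with_budget(question, llm, estimate_difficulty, score):
    difficulty = estimate_difficulty(question)            # e.g. 0.0 (easy) .. 1.0 (hard)
    n_samples = 1 if difficulty < 0.3 else 4 if difficulty < 0.7 else 16
    candidates = [llm(question, temperature=0.8) for _ in range(n_samples)]  # parallel paths
    return max(candidates, key=score)                     # best-of-n selection via the scorer
```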
How Each System Actually Works
Now that we’ve covered the building blocks, let’s look at what each company built.
OpenAI Deep Research: Trained to Research
OpenAI’s approach [4] was to train a model specifically for research tasks using reinforcement learning. They didn’t just bolt a search engine onto GPT-4. They trained the system end-to-end on “hard browsing and reasoning tasks” until it learned:
- When to abandon unproductive research paths
- How to refine search queries when initial results aren’t helpful
- When to explore new angles versus exploiting what you’ve already found
Think of it like training a dog. You reward it when it does the right thing, over and over, until the behavior becomes automatic.
OpenAI hasn’t shared all the details, but their earlier work on “process supervision” [5] gives us clues.
Most AI training rewards the final answer: right or wrong. But for complex multi-step research, that’s not enough. If the system makes 20 research decisions and gets the final answer wrong, which of those 20 steps was the mistake?
Process Reward Models (PRMs) solve this by evaluating each step. Did this search query move us closer to the answer? Was this a productive path to explore?
The results speak for themselves: Lightman et al. (2023) showed PRMs achieved 78.2% accuracy on challenging math problems versus 72.2% with traditional outcome-only rewards.
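To make the contrast concrete, here’s a minimal sketch of the two reward signals. `prm_score` stands in for a hypothetical step-level verifier model; this is not OpenAI’s actual implementation.

```python
# Outcome reward vs. process rewards, sketched. `prm_score` is a hypothetical
# verifier that rates a partial chain of reasoning/search steps.
def outcome_reward(final_answer, gold_answer):
    # Traditional RL signal: one number for the entire trajectory.
    return 1.0 if final_answer == gold_answer else 0.0

def process_rewards(steps, prm_score):
    # PRM signal: every step gets its own score, so credit assignment
    # can point at the exact step where the research went wrong.
    return [prm_score(steps[: i + 1]) for i in range(len(steps))]
```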
Process Advantage Verifiers (PAVs) [6] take this further by measuring progress: is this step actually getting us closer to a solution? The efficiency gains are dramatic: 8% better accuracy, up to 5x better compute efficiency, and 5-6x better sample efficiency during training.
```mermaid
graph TD
A[Task] --> B[Generate reasoning step]
B --> C[Action: search/analyze]
C --> D[Observe]
D --> E[PRM evaluates step]
E --> F{Complete?}
F -->|No| B
F -->|Yes| G[Outcome eval]
G --> H[RL update via PRM]
```
What It Can Do
Deep Research takes its time, spending “tens of minutes” on complex questions. That patience pays off: it scored 26.6% on “Humanity’s Last Exam” (a benchmark of extremely difficult questions), compared to 9.1% for the general-purpose o1 model.
This shows something important: being good at general reasoning doesn’t automatically make you good at research. The research-specific training matters.
You can access it through two API models (a usage sketch follows the list):
- o3-deep-research-2025-06-26: The full system
- o4-mini-deep-research-2025-06-26: Faster, lighter version
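A minimal, hedged sketch of calling the full model above through OpenAI’s Responses API. The exact parameters (notably the required web-search tool and any background-mode options) may have changed since writing, so treat this as illustrative rather than authoritative:

```python
# Illustrative call to a deep research model via the OpenAI Responses API.
# Parameter details may differ from the current docs; verify before use.
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="o3-deep-research-2025-06-26",
    input="Compare how the major deep research systems are trained.",
    tools=[{"type": "web_search_preview"}],   # the model browses while it works
)
print(response.output_text)                   # the synthesized report
```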
Anthropic Extended Thinking: Adjustable Contemplation
Anthropic took a different approach with Claude [7]. Instead of training specifically for research, they built a general “extended thinking” capability and let you control how much compute to spend.
You set a “thinking budget” in tokens, which tells Claude how long it can ponder before answering. The system shows logarithmic improvement: doubling the thinking time doesn’t double the accuracy, but it does keep improving.
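In the API, the budget is a token count passed alongside the request. A minimal sketch based on Anthropic’s documented extended-thinking parameter (the model name and numbers are illustrative; check the current docs):

```python
# Illustrative extended-thinking request via the Anthropic SDK. The thinking
# budget is in tokens and must be smaller than max_tokens; values are examples.
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},   # how long Claude may ponder
    messages=[{"role": "user", "content": "Survey recent work on process reward models."}],
)
```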
```mermaid
graph LR
A[Query] --> B{Complexity?}
B -->|Simple| C[Standard Gen]
B -->|Complex| D[Extended Thinking]
D --> E[Allocate Budget]
E --> F[Reasoning Tokens]
F --> G{Done?}
G -->|Continue| F
G -->|Yes| H[Answer]
```
How It Differs from OpenAI
Claude combines extended thinking with tool use in a pattern that looks similar to ReAct: think → search → think → search. But the decision-making is different.
OpenAI trained their system specifically to know when to search and what to search for. Claude makes those decisions based on the task and context, using its general reasoning abilities rather than research-specific training.
Anthropic also mentions that Claude can run “multiple extended thought processes simultaneously”, suggesting it explores several approaches in parallel and then picks the best one.
The philosophical difference: OpenAI optimized for research quality, accepting the extra training complexity that end-to-end RL requires. Anthropic optimized for flexibility, letting you dial the compute up or down based on your needs.
DeepSeek R1: What Happens Without the Training Wheels
DeepSeek R1 [8] is fascinating because it’s open source, so we can see exactly what they did. It’s a 671B-parameter model (though only 37B parameters are active for any given token, thanks to its mixture-of-experts architecture).
Their experiment was simple but revealing: what if you train a model with pure reinforcement learning, without explicitly teaching it how to reason?
The answer: reasoning behaviors emerge anyway. DeepSeek-R1-Zero spontaneously learned to verify its own answers, reflect on mistakes, and generate long chains of thought; nobody programmed these behaviors in.
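The paper describes the reward as rule-based: an accuracy reward for a verifiably correct answer plus a format reward for keeping the reasoning inside think tags. A simplified sketch (the weights and tag handling here are illustrative, not the paper’s exact recipe):

```python
# Simplified sketch of DeepSeek-R1-Zero's rule-based reward: correctness plus
# format compliance. Weights and parsing are illustrative, not the exact recipe.
import re

def r1_zero_reward(completion: str, gold_answer: str) -> float:
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S))
    match = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    return (1.0 if correct else 0.0) + (0.1 if format_ok else 0.0)
```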
Where It Excels (and Where It Doesn’t)
DeepSeek R1’s performance tells an interesting story:
- 79.8% on AIME 2024 (challenging math competition)
- 97.3% on MATH-500 (another math benchmark)
- 9.4% on Humanity’s Last Exam (difficult research questions)
The math scores match OpenAI’s o1. But on research tasks? The 9.4% is way below OpenAI Deep Research’s 26.6%.
General reasoning ability doesn’t automatically transfer to research capability. You need task-specific training.
One more surprising finding: they distilled their model down to just 1.5B parameters, and it still outperformed GPT-4o on competition math (28.9% on AIME 2024, versus GPT-4o’s much lower score). This suggests reasoning can be compressed into much smaller models than we thought.
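The distilled checkpoints are published openly, so you can run the 1.5B model locally with Hugging Face transformers. A sketch (the repository name is assumed from DeepSeek’s release; verify it on the Hub):

```python
# Loading a distilled R1 checkpoint with transformers. Requires a GPU for
# reasonable speed; the repo id is assumed from DeepSeek's public release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 50 odd numbers? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```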
Perplexity: Built for Speed
Perplexity’s Deep Research [9] took the opposite approach from everyone else: start with search, optimize for speed.
Their system follows a straightforward pipeline (sketched in code after this list):
- Iterative search and reading
- Synthesize into a report
- Export and share
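A generic sketch of that search-first shape, not Perplexity’s actual implementation; every helper here (`search`, `read`, `llm`, `export_markdown`) is a hypothetical placeholder:

```python
# Search-first research pipeline, sketched: iterative search and reading,
# then synthesis into a report. All helper functions are hypothetical.
def quick_research(topic, llm, search, read, export_markdown, rounds=3):
    notes, query = [], topic
    for _ in range(rounds):
        for result in search(query):                      # many searches across rounds
            notes.append(read(result))                    # pull the relevant passages
        query = llm(f"Topic: {topic}\nNotes so far: {notes}\nNext search query:")
    report = llm(f"Write a cited report on {topic} from these notes:\n{notes}")
    return export_markdown(report)                        # export and share
```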
The speed is impressive: “dozens of searches, hundreds of sources” in 2-4 minutes. That’s dramatically faster than OpenAI’s “tens of minutes.”
The Speed vs. Depth Tradeoff
Perplexity’s benchmark scores reveal their priorities:
- 21.1% on Humanity’s Last Exam (vs OpenAI’s 26.6%)
- 93.9% on SimpleQA (a factuality benchmark)
That 93.9% factuality score is excellent, and it validates their search-first approach for questions with verifiable answers. But on complex reasoning tasks that require synthesis and inference, they trail OpenAI.
This makes sense. If you need to quickly fact-check claims or compile information from many sources, Perplexity is probably your best bet. If you need deep analysis of a complex topic, you might want to wait for OpenAI’s slower but more thorough approach.
Comparing the Systems Side-by-Side
Here’s what each system optimized for:
| System | Training Approach | Speed | What It’s Best At |
|---|---|---|---|
| OpenAI Deep Research | End-to-end RL on research tasks | 10+ minutes | Complex synthesis and reasoning |
| Anthropic Extended Thinking | General reasoning with adjustable compute | Configurable | Flexibility across different task types |
| DeepSeek R1 | Pure RL (general reasoning) | Standard | Math and logical reasoning |
| Perplexity Deep Research | Search optimization | 2-4 minutes | Fast fact-gathering and verification |
Why Process Reward Models Changed Everything
I mentioned PRMs earlier, but they’re important enough to revisit. They’re the reason training AI systems for complex multi-step research became practical.
The problem with traditional reinforcement learning: you only get feedback at the end. If a research task involves 50 steps and gets the wrong answer, which of those 50 steps was the mistake? You can’t tell, so you can’t learn efficiently.
Process Reward Models evaluate each step individually. Did this search help? Was this a productive direction? This granular feedback makes training dramatically more efficient.
Process Advantage Verifiers go one step further by measuring progress: is this step actually moving us toward a solution? The efficiency gains are why OpenAI could train their system end-to-end on research tasks without needing unrealistic amounts of training data.
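The core idea fits in a few lines: score a step by how much it changes the estimated chance of eventually solving the task. `value` here is a hypothetical verifier that maps a partial trajectory to that probability; the actual PAV training setup is more involved.

```python
# "Progress" as used by process advantage verifiers, sketched: a step's advantage
# is the change in estimated success probability it produces. `value` is hypothetical.
def step_advantage(trajectory_before, step, value):
    p_before = value(trajectory_before)                   # chance of success before the step
    p_after = value(trajectory_before + [step])           # chance of success after the step
    return p_after - p_before                             # positive means the step made progress
```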
Conclusion
Several interesting questions remain unanswered:
1. What’s the optimal way to allocate thinking time? We know adaptive allocation is 4x more efficient than fixed approaches, but nobody’s figured out the ideal strategy for research tasks yet.
2. Single agent or multiple agents? Some papers describe multi-agent architectures where different AI systems specialize in different tasks. Would that actually work better for research? Nobody’s published convincing comparisons.
3. Are we measuring the right things? Current benchmarks test accuracy on difficult questions. But real research quality involves source diversity, claim verification, synthesis quality, and citation accuracy. We don’t have good metrics for those yet.
4. How far can you compress research skills? DeepSeek showed you can distill reasoning to tiny models. But can you distill research-specific skills while keeping the quality? Unknown.
5. What if you combined everything? No system uses all the techniques: end-to-end RL, process advantage verifiers, adaptive test-time compute, and multi-agent architectures. Would combining them compound the benefits, or would you hit diminishing returns? Someone should try it and find out.
The technical details matter: Process Reward Models made task-specific training practical. Test-time compute scaling let models “think longer” on hard problems. Agentic RAG enabled iterative refinement.
One final thought: will general-purpose research systems replace specialized tools? The evidence so far suggests no. General systems work great for broad research, but high-stakes domain work will likely still need specialized training and proprietary data. Both have their place.
References
1. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629. https://arxiv.org/abs/2210.03629
2. Singh, A., et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136. https://arxiv.org/abs/2501.09136
3. Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters. arXiv:2408.03314. https://arxiv.org/abs/2408.03314
4. OpenAI. (2025). Introducing deep research. https://openai.com/index/introducing-deep-research/
5. Lightman, H., et al. (2023). Let’s Verify Step by Step. OpenAI. https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf
6. Setlur, A., et al. (2024). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. arXiv:2410.08146. https://arxiv.org/abs/2410.08146
7. Anthropic. (2025). Claude’s extended thinking. https://www.anthropic.com/news/visible-extended-thinking
8. DeepSeek AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948
9. Perplexity AI. (2025). Introducing Perplexity Deep Research. https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research