# Simple Overview of ACL 2026 Papers

This note gives a short and easy introduction to each of our papers to be published at ACL 2026.

## 1) From Word to World: Can Large Language Models be Implicit Text-based World Models?
**📄 [arXiv](https://arxiv.org/abs/2512.18832)** | **🐙 [GitHub](https://github.com/X1AOX1A/Word2World)**

> **Core idea:** *Can large language models serve as implicit world models for text environments?*

In this work, we ask whether an LLM can predict what happens after each action in a text environment using simple next-state prediction. We study the scaling trends of text-based world models and train a unified world model. We test this across multiple agent benchmarks and find strong performance in structured settings. We also show that these world models are very helpful for data synthesis and agentic improvement.
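To make "next-state prediction" concrete, here is a minimal sketch of how a trajectory can be turned into training pairs for a text-based world model. The prompt format and function names are illustrative assumptions, not the paper's exact setup.

```python
# Turn an environment trajectory into next-state prediction pairs.
# NOTE: the prompt template here is a hypothetical illustration.

def make_world_model_pairs(trajectory):
    """trajectory: list of (state, action) tuples, where the last
    entry holds the final state (its action may be None).

    Returns (prompt, target) pairs: the model must predict the next
    textual state given the current state and the chosen action."""
    pairs = []
    for i in range(len(trajectory) - 1):
        state, action = trajectory[i]
        next_state, _ = trajectory[i + 1]
        prompt = f"State: {state}\nAction: {action}\nNext state:"
        pairs.append((prompt, next_state))
    return pairs

traj = [("You are in a kitchen.", "open fridge"),
        ("The fridge is open.", "take apple"),
        ("You are holding an apple.", None)]
pairs = make_world_model_pairs(traj)
```

Each pair is then a standard supervised example, so scaling such a world model reduces to ordinary language-model training on (state, action) → next-state text.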


## 2) Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
**📄 [arXiv](https://arxiv.org/abs/2601.05905)** | **🐙 [GitHub](https://github.com/zjunlp/belief)**

In this work, we show that LLMs can sound very confident but still be easily misled by surrounding context. We propose a new way to measure more stable "belief quality" (not just one-shot confidence). We also introduce a training method that makes beliefs more consistent across related facts. Our main message is that high confidence alone does not mean robust reasoning.
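One way to picture "belief quality beyond one-shot confidence" is to probe the model with a neighborhood of related questions and measure how much its answers agree. This is a toy sketch of that idea; `ask` is a stand-in for a real model call, and the scoring rule is an assumption.

```python
from collections import Counter

def neighborhood_consistency(ask, neighborhood):
    """Fraction of neighborhood questions whose answer matches the
    majority answer; 1.0 means the belief is fully consistent."""
    answers = [ask(q).strip().lower() for q in neighborhood]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)

# Toy "model": confident on the direct question, but flips on a rephrase.
toy = {"Capital of France?": "Paris",
       "France's capital city is...": "Paris",
       "Which city is France's capital?": "Lyon"}
score = neighborhood_consistency(toy.get, list(toy))
```

A model that answers the direct question correctly but flips on paraphrases gets a low score, which is exactly the "illusion of confidence" the paper diagnoses.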

## 3) Rethinking the Role of Entropy in Tool Use
**📄 [arXiv](https://arxiv.org/abs/2602.02050)** 

In this paper, we observe that good tool calls often reduce uncertainty (entropy), while bad calls may increase it. Based on this idea, we design RL rewards that encourage entropy-reducing tool behavior. The result is better tool efficiency (fewer unnecessary calls) and better task performance. We use this as a lightweight signal that does not depend on heavy manual labels.
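The entropy-reduction signal above can be sketched in a few lines: compare the model's uncertainty over the answer before and after a tool call, and reward the drop. This is a minimal illustration of the idea, not the paper's exact reward formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token/answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def tool_reward(probs_before, probs_after):
    """Reward a tool call by how much it reduced uncertainty:
    positive when entropy dropped, negative when the call added noise."""
    return entropy(probs_before) - entropy(probs_after)

# A useful tool call turns a uniform guess into a peaked distribution.
r_good = tool_reward([0.25, 0.25, 0.25, 0.25], [0.9, 0.05, 0.03, 0.02])
# A useless call leaves the distribution unchanged: zero reward.
r_null = tool_reward([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
```

Because this signal comes from the model's own distributions, it can be plugged into RL training without any manual labels, which is what makes it lightweight.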

## 4) COREWeaver: Compositional Reasoning for Deep Research Agents
**📄 [arXiv](https://arxiv.org/abs/2510.14438)** 

In this paper, we argue that better retrieval alone is not enough; agents also need stronger compositional reasoning. We build a pipeline that collects web evidence and turns it into harder multi-step reasoning tasks for training. Models trained with this data perform better on deep-research style benchmarks. Still, the hardest tasks remain difficult, showing that more progress is needed.


## 5) Mitigating Context Interference in Multi-turn Search Agents

In this paper, we study why search agents fail in long conversations. We find that too much old or noisy context can distract the model. So we build a context refiner that keeps useful information and removes distractions before the next step. This improves answer quality and also reduces unnecessary search/tool calls.
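As a rough picture of what a context refiner does, here is a toy version that scores past turns by word overlap with the current query and keeps only the most relevant ones. The real refiner is learned; this keyword-overlap heuristic is purely an assumption for illustration.

```python
def refine_context(history, query, keep=3):
    """Keep the `keep` past turns most relevant to the current query
    (scored by word overlap), preserving their original order."""
    q_words = set(query.lower().split())
    # Rank turn indices by descending overlap with the query.
    ranked = sorted(range(len(history)),
                    key=lambda i: -len(q_words & set(history[i].lower().split())))
    kept = sorted(ranked[:keep])  # restore chronological order
    return [history[i] for i in kept]

history = ["weather in paris", "random chit chat",
           "paris hotel prices", "lunch ideas"]
refined = refine_context(history, "book a hotel in paris", keep=2)
```

Dropping the distracting turns before the next step is what cuts both the interference and the unnecessary follow-up searches.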

## 6) Self-Sum: Teaching Agents to Summarize Themselves
Instead of using fixed rules like "summarize every N steps," we teach an agent to decide **when** to summarize and **what** to summarize during long tasks. We treat summarization as an action the agent can choose. With supervised learning plus RL, our method improves performance on long-horizon benchmarks. In short, we show that smart summarization helps agents think over long trajectories.
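Treating summarization as an action can be sketched as follows: the policy may emit a special `SUMMARIZE` action, which compresses the trajectory before the agent continues. The length-threshold policy below is only a stand-in for the learned (SFT + RL) policy described above.

```python
SUMMARIZE = "<summarize>"

def step(policy, trajectory):
    """One agent step: if the policy chooses SUMMARIZE, compress the
    trajectory into a single summary entry before continuing."""
    action = policy(trajectory)
    if action == SUMMARIZE:
        # Placeholder compression: a real agent would generate the summary.
        trajectory = [f"summary of {len(trajectory)} steps"]
    return trajectory, action

def length_policy(trajectory, max_len=5):
    """Hypothetical fixed-rule baseline the paper moves away from."""
    return SUMMARIZE if len(trajectory) >= max_len else "continue"

traj, action = step(length_policy, ["step"] * 5)
```

The paper's contribution is replacing `length_policy` with a trained decision about both *when* to summarize and *what* to keep, rather than the fixed rule shown here.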


## 7) Mem²Evolve: Toward Self-Evolving Agents
In this work, we build an agent framework with two memories: one for reusable capabilities (tools/experts) and one for lessons from past successes/failures. Our agent can create new tools/agents when needed and then improve over time through reflection. Across many benchmarks, this co-evolution strategy gives strong gains. The core idea is continuous self-improvement, not one-time training.
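The two-memory design can be pictured with a small data structure: one store for reusable capabilities and one for reflective lessons. The class and method names below are assumptions for illustration, not the framework's actual API.

```python
class DualMemory:
    """Toy sketch of the two memories: reusable capabilities
    (tools/experts) plus lessons distilled from past outcomes."""

    def __init__(self):
        self.capabilities = {}   # name -> callable tool/expert
        self.experience = []     # (task, outcome, lesson) records

    def add_tool(self, name, fn):
        """Register a new capability the agent created or acquired."""
        self.capabilities[name] = fn

    def reflect(self, task, outcome, lesson):
        """Store a lesson from a past success or failure."""
        self.experience.append((task, outcome, lesson))

    def lessons_for(self, task):
        """Retrieve lessons relevant to a task before attempting it."""
        return [lesson for t, _, lesson in self.experience if t == task]
```

Because both memories grow as the agent runs, improvement comes from accumulated tools and lessons rather than from any single training pass.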


## 8) TInR: Tool-Internalized Reasoning
Most tool-use systems repeatedly read external tool docs at inference time, which can be slow and fragile when many tools exist. In TInR, we train the model to internalize tool knowledge into its parameters, so it can reason about tools more directly. This improves scalability and keeps inference efficiency more stable as the tool set grows. Our method also shows stronger generalization in many tool-calling settings.


