PRAGWORLD: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations

Do Large Language Models maintain a robust implicit representation of conversations? We test their malleability under linguistic alterations and conversational dynamics.

1. IIT Kharagpur  •  2. LTI, Carnegie Mellon University  •  3. University of Zagreb

Abstract

Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is essential for successful natural communication and often requires building a local world model which encodes these elements and captures the dynamics of their evolving states. However, it is not well understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular conversational QA datasets and construct a benchmark with two variants (i.e., Manual and Synthetic) comprising yes-no questions. We evaluate nine open-source LMs and one closed-source LM and observe that they struggle to maintain robust accuracy. Our analysis reveals that LMs struggle to retain crucial details, such as entity states, under linguistic alterations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights the linguistic alterations most influenced by harmful layers. Inspired by these insights, we propose two layer-regularization-based fine-tuning strategies (ULA & HLS) that suppress the effect of the harmful layers.

Key Contributions

  • Malleability Benchmark: Evaluating the ability of LMs to encode and update their internal world models in dynamic, dyadic conversations.
  • Minimal Alterations: 7 minimal linguistic alterations (e.g., Negation, Variable Swap, Quantity Change) to test robustness.
  • Interpretability: Dual-perspective framework using Direct Effect Patching and MLP zero-out ablation to find harmful/useful layers.
  • Regularization: Proposed Useful Layer Amplification (ULA) and Harmful Layer Suppression (HLS) to improve robustness.

Dual-Perspective Interpretability

To understand where LMs fail, we designed a framework using Direct Effect Patching and MLP Zero-out Ablation. This allowed us to trace performance issues to fragility in entity state tracking by identifying specific transformer layers that encode useful or harmful reasoning patterns.

  • Useful Layers: Encoding valid state updates.
  • Harmful Layers: Encoding spurious signals or shortcuts.
  • Insight: LMs often struggle to track entities under alterations, relying on shallow heuristics.
Direct Effect Patching (left) and MLP Zero-out (right) reveal confidence shifts. (Figure 3)
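As a rough illustration of the ablation side of the framework, below is a minimal PyTorch sketch of MLP zero-out ablation using forward hooks. The model name, the Llama-style layer access path, and the yes/no confidence proxy are our assumptions, not the paper's exact setup; Direct Effect Patching follows the same hook pattern but swaps in activations from a contrastive run instead of zeros.

```python
# Minimal sketch of MLP zero-out ablation, assuming a HuggingFace-style
# decoder (e.g., Llama) where layer i's MLP lives at model.model.layers[i].mlp.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def yes_no_confidence(prompt: str) -> float:
    """P(yes) - P(no) on the next token, a simple confidence proxy."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = logits.softmax(-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] - probs[no_id]).item()

def zero_mlp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # removing this MLP's contribution to the residual stream.
    return torch.zeros_like(output)

prompt = "Conversation: ... Question: Did Ann keep the tickets? Answer:"
base = yes_no_confidence(prompt)
for layer_idx in range(model.config.num_hidden_layers):
    handle = model.model.layers[layer_idx].mlp.register_forward_hook(zero_mlp)
    shift = yes_no_confidence(prompt) - base
    handle.remove()
    # A large |shift| marks a layer whose MLP strongly drives the answer;
    # its direction under alterations separates useful from harmful layers.
    print(f"layer {layer_idx:2d}: confidence shift {shift:+.4f}")
```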

Regularization Strategies

Based on our interpretability insights, we propose two novel fine-tuning strategies:

  1. Useful Layer Amplification (ULA): Attaches a classification head to useful layers to reinforce their signals.
  2. Harmful Layer Suppression (HLS): Applies an L2 penalty to the MLP output of harmful layers to dampen spurious correlations.

These strategies significantly improve robustness to the proposed linguistic alterations; a sketch of both regularizers follows.
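Continuing from the model loaded in the sketch above, here is a hedged sketch of how the two regularizers could enter a fine-tuning loss. The layer indices, loss weights, and last-token probe are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

USEFUL_LAYERS = [10, 14]   # hypothetical indices identified via patching
HARMFUL_LAYERS = [3, 22]   # hypothetical indices identified via ablation
LAMBDA_ULA, LAMBDA_HLS = 0.1, 0.01

# ULA: a lightweight yes/no classification head on useful-layer states.
probe = nn.Linear(model.config.hidden_size, 2).to(model.device, model.dtype)

def training_loss(batch, yes_no_labels):
    mlp_penalties = []

    def collect(module, inputs, output):
        # HLS: record the squared norm of a harmful layer's MLP output;
        # penalizing it dampens what that layer writes to the residual stream.
        mlp_penalties.append(output.pow(2).mean())

    handles = [model.model.layers[i].mlp.register_forward_hook(collect)
               for i in HARMFUL_LAYERS]
    out = model(**batch, labels=batch["input_ids"], output_hidden_states=True)
    for h in handles:
        h.remove()

    # ULA: auxiliary cross-entropy from the last-token hidden state of each
    # useful layer (hidden_states[0] is the embedding output, hence i + 1).
    ula = sum(nn.functional.cross_entropy(
                  probe(out.hidden_states[i + 1][:, -1]), yes_no_labels)
              for i in USEFUL_LAYERS)

    hls = torch.stack(mlp_penalties).sum()
    return out.loss + LAMBDA_ULA * ula + LAMBDA_HLS * hls
```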

Figure 5: Effect of HLS and ULA on accuracy. Regularization helps suppress the effect of harmful layers.

Dataset Distribution

We constructed the PRAGWORLD benchmark by applying 7 types of minimal linguistic alterations to seed conversations.

Figure 2: Distribution of the 7 linguistic alterations in the PRAGWORLD benchmark.
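To make the alteration types concrete, here is a toy illustration of two of them (Quantity Change and Negation) as deterministic rewrite rules. These rules are hypothetical and for exposition only; the paper's pipeline combines GPT-4 generation with deterministic alterations.

```python
import re

def quantity_change(utterance: str) -> str:
    # Toy rule: bump the first digit quantity by one; spelled-out
    # numbers are left untouched in this illustration.
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), utterance, count=1)

def negate(utterance: str) -> str:
    # Toy rule: insert a negation after the first auxiliary verb, if any.
    return re.sub(r"\b(is|was|will|can|did)\b", r"\1 not", utterance, count=1)

print(quantity_change("I booked 2 tickets for the 7 pm show."))
# -> "I booked 3 tickets for the 7 pm show."
print(negate("She said the meeting was moved to Friday."))
# -> "She said the meeting was not moved to Friday."
```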

Experimental Results

Model Performance (Robust Accuracy)

| Model         | Robust Acc | Yes Acc | No Acc |
|---------------|-----------:|--------:|-------:|
| GPT-3.5       |      42.86 |   52.71 |  93.72 |
| DeepSeek-Inst |      46.94 |   77.26 |  70.85 |
| Phi-3.5-mini  |      48.98 |   66.06 |  86.10 |
| Llama-3.1-8B  |      48.98 |   54.87 |  94.62 |
| Qwen2.5-7B    |      37.76 |   47.65 |  95.96 |

Subset of results from Table 1 (Manual Split). Models struggle to maintain robust accuracy across alterations.
Effect of Fine-Tuning

| Model        | Base Robust | Finetuned Robust | Gain    |
|--------------|------------:|-----------------:|--------:|
| Phi-3.5-mini |       48.98 |            52.04 |  +3.06% |
| Llama-3.1-8B |       48.98 |            59.18 | +10.20% |
| Qwen2.5-1.5B |       22.45 |            47.96 | +25.51% |
| Qwen2.5-7B   |       37.76 |            55.10 | +17.34% |

Subset of results from Table 2. Fine-tuning on the synthetic split significantly improves robustness.
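For clarity, below is a minimal sketch of how a robust-accuracy-style metric can be computed, under our assumption that a seed question counts as robustly correct only when the model answers the original and every altered variant correctly; the paper's exact definition may differ.

```python
from collections import defaultdict

def robust_accuracy(records):
    """records: iterable of (seed_id, is_correct) pairs covering the
    original question and all of its altered variants."""
    per_seed = defaultdict(list)
    for seed_id, is_correct in records:
        per_seed[seed_id].append(is_correct)
    # A seed is robustly correct only if every variant is answered correctly.
    robust = [all(v) for v in per_seed.values()]
    return sum(robust) / len(robust)

records = [("q1", True), ("q1", True), ("q1", False),  # q1 fails one variant
           ("q2", True), ("q2", True)]                 # q2 survives all
print(robust_accuracy(records))  # 0.5
```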

The Benchmark

We introduce PRAGWORLD, comprising two variants sourced from the GRICE and CICERO datasets.

| Dataset Variant       | Source         | Total Conversations | Features                                                                   |
|-----------------------|----------------|--------------------:|----------------------------------------------------------------------------|
| PRAGWORLD (Manual)    | GRICE & CICERO |                 500 | Manually annotated & reviewed; high-quality alterations.                   |
| PRAGWORLD (Synthetic) | GRICE & CICERO |                2114 | Generated via a GPT-4 semi-automatic pipeline + deterministic alterations. |
Paper: Read the full paper on arXiv (View PDF).
Code & Data: Access the benchmark and scripts on GitHub.

Citation

@article{vashistha2025pragworld,
    title={PRAGWORLD: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics},
    author={Vashistha, Sachin and Bibhuti, Aryan and Naik, Atharva and Tutek, Martin and Aditya, Somak},
    journal={arXiv preprint arXiv:2511.13021},
    year={2025}
}

The Team

Sachin Vashistha (IIT Kharagpur)
Aryan Bibhuti (IIT Kharagpur)
Atharva Naik (LTI, Carnegie Mellon University)
Martin Tutek (University of Zagreb)
Somak Aditya (IIT Kharagpur)

If you use our code or ideas, please cite the paper above. Thanks!
