Do Large Language Models maintain a robust implicit representation of conversations? We test how malleable these representations are under minimal linguistic alterations and conversational dynamics.
Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is a prerequisite for successful natural communication and often entails building a local world model that encodes these elements and captures the dynamics of their evolving states. However, it is not well understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test its malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular conversational QA datasets and construct a benchmark of yes-no questions with two variants (Manual and Synthetic). We evaluate nine open-source and one closed-source LM and observe that they struggle to maintain robust accuracy. Our analysis reveals that LMs fail to retain crucial details, such as entity states, under linguistic alterations. We then propose a dual-perspective interpretability framework that identifies transformer layers that are useful or harmful and highlights the linguistic alterations most influenced by harmful layers. Guided by these insights, we propose two layer-regularization-based fine-tuning strategies (ULA & HLS) that suppress the effect of the harmful layers.
To understand where LMs fail, we design an interpretability framework based on Direct Effect Patching and MLP Zero-out Ablation. By identifying the specific transformer layers that encode useful or harmful reasoning patterns, it traces performance failures to fragile entity-state tracking.
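The sketch below illustrates the MLP zero-out idea only, not the paper's exact procedure: a PyTorch forward hook replaces one MLP block's output with zeros, and the shift in the Yes/No answer margin indicates whether that layer helps or hurts. It assumes a HuggingFace-style causal LM whose decoder layers expose an `mlp` submodule (true for the Llama/Qwen/Phi families); the model name and layer index are illustrative.

```python
# Minimal sketch of MLP zero-out ablation via a PyTorch forward hook.
# Assumes model.model.layers[i].mlp exists (Llama-style architectures);
# model name is an example, not necessarily the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def zero_mlp_hook(module, inputs, output):
    # Replace this MLP block's contribution to the residual stream with zeros.
    return torch.zeros_like(output)

def answer_margin(prompt, layer_idx):
    """Logit margin of ' Yes' over ' No' with one MLP block ablated."""
    handle = model.model.layers[layer_idx].mlp.register_forward_hook(zero_mlp_hook)
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0, -1]
        yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
        no_id = tok(" No", add_special_tokens=False).input_ids[0]
        return (logits[yes_id] - logits[no_id]).item()
    finally:
        handle.remove()  # always restore the layer

# Sweeping layer_idx and comparing each margin to the clean (no-ablation)
# run flags layers whose removal helps (harmful) or hurts (useful).
```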
Based on our interpretability insights, we propose two novel layer-regularization-based fine-tuning strategies, ULA and HLS, which suppress the effect of the harmful layers (sketched below). These strategies significantly improve robustness to the proposed linguistic alterations.
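The following is a hedged sketch of the general suppression idea, not the paper's exact ULA/HLS objectives: an auxiliary loss term penalizes the activation magnitude of layers flagged as harmful. `HARMFUL_LAYERS` and the penalty weight `lam` are illustrative placeholders.

```python
# Hedged sketch of harmful-layer suppression via an auxiliary loss.
# Illustrates the general idea only; the paper defines ULA/HLS precisely.
import torch

HARMFUL_LAYERS = [21, 25]  # hypothetical indices from the ablation study
lam = 0.1                  # regularization strength (hyperparameter)

def training_step(model, batch):
    mlp_norms = []

    def record_norm(module, inputs, output):
        # Track how strongly this MLP writes into the residual stream.
        mlp_norms.append(output.float().pow(2).mean())

    handles = [model.model.layers[i].mlp.register_forward_hook(record_norm)
               for i in HARMFUL_LAYERS]
    try:
        out = model(input_ids=batch["input_ids"], labels=batch["labels"])
        # Standard LM loss plus a penalty shrinking harmful-layer activity.
        loss = out.loss + lam * torch.stack(mlp_norms).mean()
    finally:
        for h in handles:
            h.remove()
    return loss
```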
We constructed the PRAGWORLD benchmark by applying seven types of minimal linguistic alterations to seed conversations.
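To convey the "minimal edit, same underlying world state" idea, here is an illustrative sketch of one plausible deterministic alteration, consistent entity renaming. The seven alterations actually used are defined in the paper; this function and example are assumptions for illustration only.

```python
# Illustrative sketch of one hypothetical minimal alteration: consistent
# entity renaming across a conversation. Surface mentions change while
# the gold yes-no label stays fixed.
import re

def rename_entity(turns, old_name, new_name):
    """Swap every whole-word mention of `old_name` for `new_name`."""
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    return [(speaker, pattern.sub(new_name, utterance))
            for speaker, utterance in turns]

conversation = [("A", "Did Sarah return the ladder?"),
                ("B", "Yes, Sarah dropped it off this morning.")]
altered = rename_entity(conversation, "Sarah", "Priya")
# A robust model should answer the associated yes-no question identically
# on `conversation` and `altered`.
```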
| Model | Robust Acc (%) | Yes Acc (%) | No Acc (%) |
|---|---|---|---|
| GPT-3.5 | 42.86 | 52.71 | 93.72 |
| DeepSeek-Inst | 46.94 | 77.26 | 70.85 |
| Phi-3.5-mini | 48.98 | 66.06 | 86.10 |
| Llama-3.1-8B | 48.98 | 54.87 | 94.62 |
| Qwen2.5-7B | 37.76 | 47.65 | 95.96 |
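For reference, the sketch below computes robust accuracy under one common definition, counting an example only if the model answers the original conversation and all of its altered variants correctly; the paper's precise metric may differ. `predict` and the field names are stand-ins.

```python
# Sketch of a robust-accuracy computation, assuming "correct on the
# original and every altered variant" (the paper's exact definition may
# differ). `predict` is any yes/no classifier; fields are illustrative.
def robust_accuracy(examples, predict):
    hits = 0
    for ex in examples:
        variants = [ex["original"]] + ex["alterations"]
        if all(predict(v["conversation"], v["question"]) == v["label"]
               for v in variants):
            hits += 1
    return 100.0 * hits / len(examples)
```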
| Model | Base Robust Acc (%) | Finetuned Robust Acc (%) | Gain |
|---|---|---|---|
| Phi-3.5-mini | 48.98 | 52.04 | +3.06% |
| Llama-3.1-8B | 48.98 | 59.18 | +10.2% |
| Qwen2.5-1.5B | 22.45 | 47.96 | +25.51% |
| Qwen2.5-7B | 37.76 | 55.10 | +17.34% |
We introduce PRAGWORLD, comprising two variants sourced from GRICE and CICERO datasets.
| Dataset Variant | Source | Total Conversations | Features |
|---|---|---|---|
| PRAGWORLD (Manual) | GRICE & CICERO | 500 | Manually annotated & reviewed. High quality alterations. |
| PRAGWORLD (Synthetic) | GRICE & CICERO | 2114 | Generated via a semi-automatic GPT-4 pipeline plus deterministic alterations. |
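A hypothetical usage sketch for loading a split and building evaluation prompts follows. The file path, field names, and prompt template are assumptions for illustration; consult the repository's data files for the actual schema.

```python
# Hypothetical loading/prompting sketch; path and schema are assumed.
import json

with open("data/pragworld_synthetic.json") as f:  # hypothetical path
    data = json.load(f)

for ex in data[:3]:
    prompt = (f"Conversation:\n{ex['conversation']}\n\n"
              f"Question: {ex['question']}\nAnswer with Yes or No:")
    print(prompt)
```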
```bibtex
@article{vashistha2025pragworld,
  title={PRAGWORLD: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics},
  author={Vashistha, Sachin and Bibhuti, Aryan and Naik, Atharva and Tutek, Martin and Aditya, Somak},
  journal={arXiv preprint arXiv:2511.13021},
  year={2025}
}
```
If you use our code or ideas, please cite the paper above. Thanks!