PRAGWORLD: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations

Do Large Language Models maintain a robust implicit representation of conversations? We test their malleability under linguistic alterations and conversational dynamics.

1. IIT Kharagpur  •  2. LTI, Carnegie Mellon University  •  3. University of Zagreb

Abstract

Real-world conversations are rich with pragmatic elements, such as entity mentions, references, and implicatures. Understanding such nuances is essential for successful natural communication and often requires building a local world model which encodes these elements and captures the dynamics of their evolving states. However, it is not well understood whether language models (LMs) construct or maintain a robust implicit representation of conversations. In this work, we evaluate the ability of LMs to encode and update their internal world model in dyadic conversations and test their malleability under linguistic alterations. To facilitate this, we apply seven minimal linguistic alterations to conversations sourced from popular conversational QA datasets and construct a benchmark with two variants (i.e., Manual and Synthetic) comprising yes-no questions. We evaluate nine open-source LMs and one closed-source LM and observe that they struggle to maintain robust accuracy. Our analysis reveals that LMs struggle to retain crucial details, such as entity states, under linguistic alterations. We then propose a dual-perspective interpretability framework which identifies transformer layers that are useful or harmful and highlights the linguistic alterations most influenced by harmful layers. Inspired by these insights, we propose two layer-regularization-based fine-tuning strategies (ULA & HLS) that suppress the effect of the harmful layers.

Key Contributions

  • Malleability Benchmark: Evaluating the ability of LMs to encode and update their internal world models in dynamic, dyadic conversations.
  • Minimal Alterations: 7 minimal linguistic alterations (e.g., Negation, Variable Swap, Quantity Change) to test robustness.
  • Interpretability: Dual-perspective framework using Direct Effect Patching and MLP zero-out ablation to find harmful/useful layers.
  • Regularization: Proposed Useful Layer Amplification (ULA) and Harmful Layer Suppression (HLS) to improve robustness.

Dual-Perspective Interpretability

To understand where LMs fail, we designed a framework using Direct Effect Patching and MLP Zero-out Ablation. This allowed us to trace performance issues to fragility in entity state tracking by identifying specific transformer layers that encode useful or harmful reasoning patterns.

  • Useful Layers: Encoding valid state updates.
  • Harmful Layers: Encoding spurious signals or shortcuts.
  • Insight: LMs often struggle to track entities under alterations, relying on shallow heuristics.
Direct Effect Patching (left) and MLP Zero-out (right) reveal confidence shifts. (Figure 3)
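As a rough illustration of the ablation side of the framework, below is a minimal PyTorch sketch of MLP zero-out ablation using forward hooks. The model name, the Llama-style layer access path, and the yes/no confidence proxy are our assumptions, not the paper's exact setup; Direct Effect Patching follows the same hook pattern but swaps in activations from a contrastive run instead of zeros.

```python
# Minimal sketch of MLP zero-out ablation, assuming a HuggingFace-style
# decoder (e.g., Llama) where layer i's MLP lives at model.model.layers[i].mlp.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def yes_no_confidence(prompt: str) -> float:
    """P(yes) - P(no) on the next token, a simple confidence proxy."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = logits.softmax(-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] - probs[no_id]).item()

def zero_mlp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # removing this MLP's contribution to the residual stream.
    return torch.zeros_like(output)

prompt = "Conversation: ... Question: Did Ann keep the tickets? Answer:"
base = yes_no_confidence(prompt)
for layer_idx in range(model.config.num_hidden_layers):
    handle = model.model.layers[layer_idx].mlp.register_forward_hook(zero_mlp)
    shift = yes_no_confidence(prompt) - base
    handle.remove()
    # A large |shift| marks a layer whose MLP strongly drives the answer;
    # its direction under alterations separates useful from harmful layers.
    print(f"layer {layer_idx:2d}: confidence shift {shift:+.4f}")
```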

Regularization Strategies

Based on our interpretability insights, we propose two novel fine-tuning strategies:

  1. Useful Layer Amplification (ULA): Attaches a classification head to useful layers to reinforce their signals.
  2. Harmful Layer Suppression (HLS): Applies an L2 penalty to the MLP output of harmful layers to dampen spurious correlations.

These strategies significantly improve robustness to the proposed linguistic alterations; a sketch of both regularizers follows.
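Continuing from the model loaded in the sketch above, here is a hedged sketch of how the two regularizers could enter a fine-tuning loss. The layer indices, loss weights, and last-token probe are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

USEFUL_LAYERS = [10, 14]   # hypothetical indices identified via patching
HARMFUL_LAYERS = [3, 22]   # hypothetical indices identified via ablation
LAMBDA_ULA, LAMBDA_HLS = 0.1, 0.01

# ULA: a lightweight yes/no classification head on useful-layer states.
probe = nn.Linear(model.config.hidden_size, 2).to(model.device, model.dtype)

def training_loss(batch, yes_no_labels):
    mlp_penalties = []

    def collect(module, inputs, output):
        # HLS: record the squared norm of a harmful layer's MLP output;
        # penalizing it dampens what that layer writes to the residual stream.
        mlp_penalties.append(output.pow(2).mean())

    handles = [model.model.layers[i].mlp.register_forward_hook(collect)
               for i in HARMFUL_LAYERS]
    out = model(**batch, labels=batch["input_ids"], output_hidden_states=True)
    for h in handles:
        h.remove()

    # ULA: auxiliary cross-entropy from the last-token hidden state of each
    # useful layer (hidden_states[0] is the embedding output, hence i + 1).
    ula = sum(nn.functional.cross_entropy(
                  probe(out.hidden_states[i + 1][:, -1]), yes_no_labels)
              for i in USEFUL_LAYERS)

    hls = torch.stack(mlp_penalties).sum()
    return out.loss + LAMBDA_ULA * ula + LAMBDA_HLS * hls
```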

Figure 5: Effect of HLS and ULA on accuracy. Regularization helps suppress the effect of harmful layers.

Dataset Distribution

We constructed the PRAGWORLD benchmark by applying 7 types of minimal linguistic alterations to seed conversations.

Figure 2: Distribution of the 7 linguistic alterations in the PRAGWORLD benchmark.
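To make the alteration types concrete, here is a toy illustration of two of them (Quantity Change and Negation) as deterministic rewrite rules. These rules are hypothetical and for exposition only; the paper's pipeline combines GPT-4 generation with deterministic alterations.

```python
import re

def quantity_change(utterance: str) -> str:
    # Toy rule: bump the first digit quantity by one; spelled-out
    # numbers are left untouched in this illustration.
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), utterance, count=1)

def negate(utterance: str) -> str:
    # Toy rule: insert a negation after the first auxiliary verb, if any.
    return re.sub(r"\b(is|was|will|can|did)\b", r"\1 not", utterance, count=1)

print(quantity_change("I booked 2 tickets for the 7 pm show."))
# -> "I booked 3 tickets for the 7 pm show."
print(negate("She said the meeting was moved to Friday."))
# -> "She said the meeting was not moved to Friday."
```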

Experimental Results

Model Performance (Robust Accuracy)

| Model         | Robust Acc | Yes Acc | No Acc |
|---------------|-----------:|--------:|-------:|
| GPT-3.5       |      42.86 |   52.71 |  93.72 |
| DeepSeek-Inst |      46.94 |   77.26 |  70.85 |
| Phi-3.5-mini  |      48.98 |   66.06 |  86.10 |
| Llama-3.1-8B  |      48.98 |   54.87 |  94.62 |
| Qwen2.5-7B    |      37.76 |   47.65 |  95.96 |

Subset of results from Table 1 (Manual Split). Models struggle to maintain robust accuracy across alterations.
Effect of Fine-Tuning

| Model        | Base Robust | Finetuned Robust | Gain    |
|--------------|------------:|-----------------:|--------:|
| Phi-3.5-mini |       48.98 |            52.04 |  +3.06% |
| Llama-3.1-8B |       48.98 |            59.18 | +10.20% |
| Qwen2.5-1.5B |       22.45 |            47.96 | +25.51% |
| Qwen2.5-7B   |       37.76 |            55.10 | +17.34% |

Subset of results from Table 2. Fine-tuning on the synthetic split significantly improves robustness.
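For clarity, below is a minimal sketch of how a robust-accuracy-style metric can be computed, under our assumption that a seed question counts as robustly correct only when the model answers the original and every altered variant correctly; the paper's exact definition may differ.

```python
from collections import defaultdict

def robust_accuracy(records):
    """records: iterable of (seed_id, is_correct) pairs covering the
    original question and all of its altered variants."""
    per_seed = defaultdict(list)
    for seed_id, is_correct in records:
        per_seed[seed_id].append(is_correct)
    # A seed is robustly correct only if every variant is answered correctly.
    robust = [all(v) for v in per_seed.values()]
    return sum(robust) / len(robust)

records = [("q1", True), ("q1", True), ("q1", False),  # q1 fails one variant
           ("q2", True), ("q2", True)]                 # q2 survives all
print(robust_accuracy(records))  # 0.5
```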

The Benchmark

We introduce PRAGWORLD, comprising two variants sourced from the GRICE and CICERO datasets.

| Dataset Variant       | Source         | Total Conversations | Features                                                                   |
|-----------------------|----------------|--------------------:|----------------------------------------------------------------------------|
| PRAGWORLD (Manual)    | GRICE & CICERO |                 500 | Manually annotated & reviewed; high-quality alterations.                   |
| PRAGWORLD (Synthetic) | GRICE & CICERO |                2114 | Generated via a GPT-4 semi-automatic pipeline + deterministic alterations. |
Paper: Read the full paper on arXiv (View PDF).
Code & Data: Access the benchmark and scripts on GitHub.

Citation

@article{vashistha2025pragworld,
    title={PRAGWORLD: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics},
    author={Vashistha, Sachin and Bibhuti, Aryan and Naik, Atharva and Tutek, Martin and Aditya, Somak},
    journal={arXiv preprint arXiv:2511.13021},
    year={2025}
}

The Team

Sachin Vashistha (IIT Kharagpur)
Aryan Bibhuti (IIT Kharagpur)
Atharva Naik (LTI, Carnegie Mellon University)
Martin Tutek (University of Zagreb)
Somak Aditya (IIT Kharagpur)

If you use our code or ideas, please cite the paper above. Thanks!
