A Geometric Analysis of Small-sized Language Model Hallucinations

Abstract

Hallucinations — fluent but factually incorrect responses — pose a major challenge to the reliability of Large Language Models (LLMs), especially in multi-step or agentic settings.

Existing work largely frames hallucinations as a consequence of missing knowledge; we show instead that, even when the relevant factual knowledge is present, models still produce hallucinated answers, pointing to retrieval instability rather than knowledge gaps.

Building on this observation, we introduce APORIA (Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry — the state of puzzlement-in-contradiction that hallucinations embody), a geometric framework that studies repeated responses to the same prompt in sentence-embedding space. Our central hypothesis is that genuine responses cluster more tightly than hallucinated ones; we empirically validate this and show that, after Fisher projection, the two response classes become consistently separable.

We exploit this geometry in APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30–50 annotations, achieving an average F1 score above 90% across ten small-sized LLMs.

To support further research, we release SOCRATES-300K, a fully labelled dataset of 300,000 responses, together with the code for both dataset generation and result reproduction.

Our key finding — framing hallucinations from a geometric perspective in the embedding space — complements traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.

Key Contributions

Hallucinated responses exhibit weaker semantic cohesion in the embedding space than non-hallucinated ones, displaying stable geometric signatures.
We introduce APORIA (Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry), a framework echoing the Socratic aporía that attains strong distributional separability between genuine and hallucinated responses.
We introduce APORIA-LP, a label-propagation framework that transfers hallucination tagging from a small set of judged responses to large collections of generations, achieving F1 scores above 90% on average.
We release a fully labelled dataset of repeated responses (150 generations for 200 prompts across 10 LLMs) under the name SOCRATES-300K to support structural analyses of hallucinations. We also release the full code base for dataset generation and result reproduction.

Method

Structural Analysis

Under APORIA we test a simple hypothesis: for a given prompt, genuine responses concentrate around a stable semantic core, while hallucinated ones scatter into distinct fabricated explanations.

Distance distributions. For each pair of embedded responses we collect the pairwise Euclidean distance, yielding three distributions: intra-genuine D_GG, intra-hallucinated D_HH, and inter-class D_GH.
Wasserstein comparison. Rather than reducing the distributions to low-order moments, we compare them via the one-dimensional Wasserstein distance, measuring the separation between D_GG and D_HH.
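
The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the embeddings are synthetic Gaussian stand-ins for sentence embeddings, and the function name `distance_distributions`, the dimension, and the cluster spreads are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.stats import wasserstein_distance


def distance_distributions(emb_g, emb_h):
    """Return (D_GG, D_HH, D_GH) as 1-D arrays of Euclidean distances."""
    d_gg = pdist(emb_g, metric="euclidean")                 # pairs within genuine
    d_hh = pdist(emb_h, metric="euclidean")                 # pairs within hallucinated
    d_gh = cdist(emb_g, emb_h, metric="euclidean").ravel()  # cross-class pairs
    return d_gg, d_hh, d_gh


# Synthetic stand-ins for the embedded responses to a single prompt
# (assumption: genuine answers form a tight cluster, hallucinated ones scatter).
rng = np.random.default_rng(0)
dim = 16
emb_g = rng.normal(0.0, 0.2, size=(40, dim))
emb_h = rng.normal(0.0, 1.0, size=(40, dim))

d_gg, d_hh, d_gh = distance_distributions(emb_g, emb_h)

# One-dimensional Wasserstein distance between the two intra-class distributions.
separation = wasserstein_distance(d_gg, d_hh)
print(f"median D_GG={np.median(d_gg):.3f}  "
      f"median D_HH={np.median(d_hh):.3f}  W1={separation:.3f}")
```

On data that follows the hypothesis, the median of D_GG is smaller than that of D_HH and the Wasserstein distance between them is clearly positive.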

Label Propagation

Building on this geometric structure, APORIA-LP is a supervised procedure that propagates labels from a small annotated subset to the remaining responses.

Point-to-set distances. For an unlabelled response we compute its distances to each labelled class, producing two empirical distributions.
Label assignment. The response is assigned to the class — G or H — whose internal distance structure is most consistent with these distributions.
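
One possible reading of these two steps is sketched below. The consistency criterion is an assumption made for illustration: the point-to-set distances of the unlabelled response are compared with each class's internal pairwise distances via the one-dimensional Wasserstein distance, and the class with the smaller mismatch wins. The function name `propagate_label` and the synthetic embeddings are hypothetical, and the actual APORIA-LP rule may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import wasserstein_distance


def propagate_label(x, emb_g, emb_h):
    """Assign 'G' or 'H' to embedding x from labelled embeddings of each class.

    Assumed rule (illustrative only): pick the class whose internal pairwise
    distances best match the distances from x to that class's labelled points.
    """
    to_g = np.linalg.norm(emb_g - x, axis=1)  # point-to-set distances to G
    to_h = np.linalg.norm(emb_h - x, axis=1)  # point-to-set distances to H
    mismatch_g = wasserstein_distance(to_g, pdist(emb_g))
    mismatch_h = wasserstein_distance(to_h, pdist(emb_h))
    return "G" if mismatch_g <= mismatch_h else "H"


# Synthetic check: 40 labelled responses (20 per class), 40 unlabelled ones.
rng = np.random.default_rng(1)
dim = 16
emb_g = rng.normal(0.0, 0.2, size=(20, dim))   # labelled genuine (tight)
emb_h = rng.normal(0.0, 1.0, size=(20, dim))   # labelled hallucinated (scattered)
test_g = rng.normal(0.0, 0.2, size=(20, dim))  # unlabelled, truly genuine
test_h = rng.normal(0.0, 1.0, size=(20, dim))  # unlabelled, truly hallucinated

pred_g = [propagate_label(x, emb_g, emb_h) for x in test_g]
pred_h = [propagate_label(x, emb_g, emb_h) for x in test_h]
accuracy = (pred_g.count("G") + pred_h.count("H")) / (len(pred_g) + len(pred_h))
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

The sketch recomputes the intra-class distances for every query for clarity; with many unlabelled responses they would be computed once and reused.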

Results

Distributions of intra-class and inter-class distances across the ten models. Mutual intra-class distances are shown for Genuine (D_GG, green) and Hallucinated (D_HH, red) responses; inter-class distance distributions D_GH are overlaid as blue boxplots, with medians highlighted.

Evolution of the APORIA-LP F1 score on SOCRATES-300K as the training-set size grows from 5 to 100 responses.

Label Propagation Performance

Per-model accuracy and F1 score in % (mean across prompts, standard deviation in parentheses):

Model	Accuracy	F1
Mistral-7B	86.8 (6.6)	92.1 (4.4)
DeepSeek-7B	90.9 (6.2)	94.1 (4.8)
Llama-3-8B	89.1 (6.3)	92.5 (5.7)
Gemma-2-9B	86.1 (7.1)	91.5 (5.0)
Yi-1.5-9B	92.5 (4.8)	95.3 (3.7)
SOLAR-10.7B	82.7 (9.1)	86.9 (8.2)
Phi-4	85.2 (9.6)	84.0 (16.8)
Qwen2.5-14B	85.0 (8.8)	88.7 (9.1)
Gemma-2-27B	88.3 (6.9)	93.0 (4.5)
Qwen2.5-32B	85.5 (8.7)	89.1 (9.0)
Average	87.2 (7.4)	90.7 (7.1)

Insights

Insight 1

Hallucinations are retrieval failures, not knowledge gaps.

Insight 2

Hallucinated responses show reduced semantic cohesion.

Insight 3

Hallucination variance concentrates along one direction.

Insight 4

Structure enables label-efficient classification.

Insight 5

One regularisation scale fits all models.

Dataset

We release a fully labelled dataset of repeated LLM generations designed for structural analyses of hallucinations:

Prompts: 100 base templates × 2 years (2020, 2022) = 200 prompts, targeting precise factual events.
Models: 10 base (non-instruction-tuned) LLMs from 7B to 32B parameters.
Responses: 150 independent generations per (prompt, model) pair, for 300,000 in total, of which 231,473 are retained after filtering.
Labels: Per-response Genuine (0) / Hallucinated (1).
Format: Parquet (≈ 974 MB).
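
The counts above can be checked with a few lines of arithmetic. The snippet only restates the figures given in the list; the variable names are illustrative.

```python
templates, years = 100, 2        # base templates and years (2020, 2022)
models, generations = 10, 150    # LLMs and generations per (prompt, model) pair
retained = 231_473               # responses kept after filtering

prompts = templates * years
total = prompts * models * generations
retained_share = retained / total

print(prompts, total, f"{retained_share:.1%}")  # prints: 200 300000 77.2%
```

Roughly 77% of the generated responses survive filtering, so the retained set averages about 116 responses per (prompt, model) pair.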

Download from Zenodo

Additionally, the dataset used for the CoQA-89K experiments is available at:

CoQA-89K Dataset

BibTeX

@inproceedings{ricco2026geometric,
  title     = {A Geometric Analysis of Small-sized Language Model Hallucinations},
  author    = {Ricco, Emanuele and Onofri, Elia and Cima, Lorenzo and
               Cresci, Stefano and Di Pietro, Roberto},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026}
}