Clusters or Chaos

Mapping the geometry of LLM failure modes (NTNU, in progress)

Explores the geometry of adversarial failure regions inside large language models. The core question is whether inputs that cause an LLM to produce harmful outputs cluster together in the model's embedding space or are scattered with no discernible structure, a distinction that dictates how future defenses should be designed. Standard LLM inference is nondeterministic, which would make any map of failure regions unrepeatable, so the project first establishes a deterministic baseline using batch-invariant GPU kernels that guarantee bitwise reproducibility. From there, a reinforcement learning agent searches the continuous embedding space using soft prompts (trainable vectors injected directly into the model's latent representation). A toxicity classifier supplies the reward signal, steering the search toward regions where safety alignment breaks down. The resulting framework maps the shape and density of the adversarial landscape, not just individual failures.
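As a rough illustration of the clustered-versus-scattered question, the sketch below computes a Hopkins statistic on toy 2-D points. This is an assumption chosen for illustration, not necessarily the measure the project uses: the function names, the synthetic data, and the 2-D setting are all hypothetical stand-ins for real high-dimensional embeddings. Under the convention used here, values near 0.5 suggest uniformly scattered points and values near 1 suggest clustering.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def hopkins(points, n_probes, rng):
    """Hopkins statistic: about 0.5 for uniform scatter, near 1 for clusters."""
    dim = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dim)]
    highs = [max(p[k] for p in points) for k in range(dim)]

    # u: distances from uniform random probes (inside the data's bounding
    # box) to the nearest real point, raised to the power `dim`.
    u = []
    for _ in range(n_probes):
        probe = [rng.uniform(lows[k], highs[k]) for k in range(dim)]
        u.append(nearest_distance(probe, points) ** dim)

    # w: distances from sampled real points to their nearest other real
    # point, raised to the power `dim`.
    w = []
    for i in rng.sample(range(len(points)), n_probes):
        others = points[:i] + points[i + 1:]
        w.append(nearest_distance(points[i], others) ** dim)

    return sum(u) / (sum(u) + sum(w))


rng = random.Random(0)

# Hypothetical "failure" points: one set scattered uniformly, one set
# drawn tightly around three centres.
scattered = [[rng.random(), rng.random()] for _ in range(300)]
centres = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
clustered = [
    [rng.gauss(cx, 0.02), rng.gauss(cy, 0.02)]
    for cx, cy in centres
    for _ in range(100)
]

h_scattered = hopkins(scattered, 50, rng)
h_clustered = hopkins(clustered, 50, rng)
print(f"scattered: {h_scattered:.2f}  clustered: {h_clustered:.2f}")
```

In a real embedding space, distances concentrate as dimensionality grows, so a statistic like this would likely need to be applied after dimensionality reduction or replaced with a measure better suited to high dimensions.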

Method

RL + Soft Prompt Optimization

Focus

Adversarial Prompt Geometry