arxiv:2602.01425

Building Better Deception Probes Using Targeted Instruction Pairs

Published on Feb 1

Abstract

Linear probes for detecting deceptive behavior in AI systems suffer from spurious correlations and false positives, and the choice of instruction pair is the primary factor affecting their performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Linear probes are a promising approach for monitoring AI systems for deceptive behavior. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the instruction pair used during training as a key factor in probe performance. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, which explains why prompt choice dominates probe performance, accounting for 70.6% of variance. Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
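To make the idea of a linear probe trained on a contrastive instruction pair concrete, the following is a minimal sketch and not the authors' implementation. It assumes synthetic Gaussian vectors stand in for model activations collected under an "honest" and a "deceptive" instruction, and it assumes logistic regression as the linear classifier. All function names, the `shift` parameter, and the dimensions are hypothetical choices made for illustration.

```python
import math
import random

def synthetic_activations(n_per_class, direction, shift, rng):
    """Hypothetical stand-in for activations: label 0 = honest instruction,
    label 1 = deceptive instruction, separated along `direction`."""
    X, y = [], []
    for label, sign in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
            y.append(label)
    return X, y

def train_linear_probe(X, y, lr=0.1, steps=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-logit)) - label
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(x, w, b):
    """Positive score means the probe flags the activation as deceptive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(8)]
norm = math.sqrt(sum(r * r for r in raw))
direction = [r / norm for r in raw]  # assumed "deception direction"

X_train, y_train = synthetic_activations(100, direction, 2.0, rng)
X_test, y_test = synthetic_activations(100, direction, 2.0, rng)

w, b = train_linear_probe(X_train, y_train)
correct = sum((probe_score(x, w, b) > 0) == (label == 1)
              for x, label in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real setting the synthetic vectors would be replaced by hidden-state activations recorded while the model responds under each instruction of the pair. The abstract's point is that which instruction pair is used to generate those activations largely determines what the resulting probe detects.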