I run an independent AI agent verification platform. Just benchmarked 11 frontier models on hallucination for under $5 total. Two runs each, maximum spread was 4 points. Evaluation doesn't have to cost thousands. tabverified.ai

upvoted an article about 1 month ago

Article: AI evals are becoming the new compute bottleneck (evaleval • Apr 29 • 27)

commented on "AI evals are becoming the new compute bottleneck" about 1 month ago

I'm the solo founder of TAB Platform (tabverified.ai), an independent AI agent verification platform. We run 340+ benchmarks across 59 models from 5 providers at $0.03 per text test case. Pay-as-you-go credits, no subscription.

A few of your findings match what we've been measuring for 13 months:

The 33x cost spread on scaffold choice lines up with our data. We ran Claude Opus on the same benchmark with two different harness configurations, and it scored 42% with one harness and 78% with the other. We have 101 harness configurations measuring exactly this variable.

Your reliability point is critical. We built a Benchmark Calibration Suite with 10 synthetic agents and 5 adversarial gaming agents that runs daily (135/135 checks), specifically because single-run numbers aren't trustworthy. We also found a 20-point spread on GPT-4o across 4 identical runs.
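
For anyone who wants to check run-to-run stability on their own results, here is a minimal sketch of the kind of summary I mean. It is not TAB's code: the function name `summarize_runs` and the scores are made up for illustration, and it assumes each run yields a single percentage score.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize repeated runs of one model on one benchmark.

    `scores` is a list of percentage scores, one per identical run.
    """
    if not scores:
        raise ValueError("need at least one run")
    return {
        "runs": len(scores),
        "mean": mean(scores),
        # Spread is the gap between the best and worst run, in points.
        "spread": max(scores) - min(scores),
        # Sample standard deviation is undefined for a single run.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Hypothetical scores, invented for this example only.
summary = summarize_runs([70.0, 64.0, 72.0, 66.0])
print(summary)
```

Reporting the spread next to the mean makes it obvious when a single-run number could have landed several points higher or lower.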

The conclusion that "whoever can pay for evaluation gets to write the leaderboard" is the reason TAB exists. Independent evaluation shouldn't require a frontier lab's compute budget.

Would be interested in comparing notes if any of you are open to it.
Rod Miller
tabverified.ai
rod@tabverified.ai

RodTAB's activity

[Submission] TAB Error Recovery - 9 models, third-party evaluation