News from Mars
ARES: Open-Source Infrastructure for Online RL on Coding Agents
We're releasing ARES, our internal framework for training coding agents with online reinforcement learning. ARES enables real-time feedback loops that dramatically outperform static supervised approaches, and it scales with compute.
K-Steering: Targeted Representation Intervention via Activation Space
How do you change what a model does without breaking everything else? K-Steering offers a surgical approach to representation-level intervention — identifying and modifying key feature directions to achieve precise behavioral control without catastrophic forgetting.
Beyond Static Mechanistic Interpretability: Agentic Long-Horizon Tasks as the Next Frontier
Static circuit analysis reveals important structures in neural networks. But understanding deployed AI agents requires thinking about long-horizon, agentic behavior. We argue that the next phase of mechanistic interpretability must go dynamic.
Code Review Bench v0: A Rigorous Benchmark for LLM Code Review
We introduce Code Review Bench, a benchmark for evaluating how well large language models perform code review. We measure correctness, depth of analysis, actionability of feedback, and alignment with expert human reviewers.
The Interpretability Prize: Part II — What We Learned from 500+ Submissions
Our second interpretability challenge attracted over 500 submissions from 47 countries. This post reviews the winning approaches, what they reveal about model internals, and what surprised us most about the field's current state.
Why the End of Science Would Begin With AI We Don't Understand
Nobel Prizes are now being awarded for black-box models that predict protein structures. This is remarkable, and terrifying. If neural networks become the future of science while remaining opaque, we will have traded explanation for prediction.
Feature Geometry in Large Language Models: What the Topology of Representations Tells Us
We examine the geometric structure of learned representations in frontier LLMs. Using tools from algebraic topology and information geometry, we identify persistent patterns that constrain how meaning is encoded across model scale.
Launching the Interpretability Prize: A Call for the Field
We're putting up $50,000 for the best mechanistic interpretability result of 2024. This post explains our motivation, the prize criteria, and why we think open challenges are essential for accelerating progress in this space.