Mejdi Sadriu
José Wheeler
Identifying and auditing reasoning circuits in LLMs within Algoverse 2026 using Sparse Autoencoders (SAEs).
Matthew Farr
Probing possible limitations and assumptions of interpretability | Articulating evasive risk phenomena arising from adaptive and self-modifying AI
Aditya Raj
Current LLM safety methods treat harmful knowledge as removable chunks. This amounts to controlling a model, and it does not work.
Lucy Farnik
6-month salary for interpretability research focusing on probing for goals and "agency" inside large language models