Sandy Tanwisuth

@sandguine

Independent Pluralistic Alignment Researcher

https://sandytanwisuth.web.app/
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

About Me

I’m an independent, pluralistic alignment researcher working on a scale-free theory of coalitions of agents. I previously began a PhD at UC Berkeley in Computational Cognitive Science, where I worked on Computational Theory of Mind (epistemology in disguise, so to speak). I’m now moving toward formal epistemology, decision theory, and Bayesian epistemology. Broadly speaking, I study how agents (biological or otherwise) can coordinate efficiently while respecting each individual’s independence. Lately, I’ve been focused on identifying the simplest knowledge requirements that allow coalitions of agents to know how to act, or how to acquire more relevant information, even when their incentives do not fully align. The goal is to prevent the catastrophic miscoordination failures described in the multi-agent risks literature.

Projects

Alignment as epistemic system governance under compression

Comments

Ambitious AI Alignment Seminar

Sandy Tanwisuth

13 days ago

As someone who has worked with multiple organizations in the AI Safety and AI alignment ecosystems, I think the field's current development suggests there should be a clear separation between two pipelines: an Empirical AI Safety track, which includes some of the most important and valuable work (e.g., BlueDot Impact or MATS), and a separate pipeline for more exploratory work that is valuable in the long term but defies systemic incentives and the status quo, such as PIBBSS, ILLIAD, and this fellowship.

In other words, the empirical track is essential. But its very success under current incentives creates a risk: we may optimize so hard for what can be measured now that we starve the exploratory work that will only prove its value later. We need both. This project strikes me as exactly the kind of alternative pipeline that embodies the latter: its design prioritizes deep understanding, community, and researcher well-being. These choices directly counteract the pressures that push some current work toward short-term, measurable results, sometimes at the expense of the deeper thinking the foundational problem ultimately requires.

I personally feel that the mismatch between these ecosystems did not allow some participants, myself included, to thrive. Had I been in this kind of ecosystem instead, I suspect my productivity and research output would not have been negatively impacted by the measurement pressures. I encourage people with more resources to fund this work.

[Retroactive] Funding for developing new "substrate-flexible risk" threat model

Sandy Tanwisuth

20 days ago

The MoSSAIC framework is good foundational work that pushes mechanistic interpretability in a more scientifically grounded and principled direction, and I recommend funding it. The paper provides a meta-level intervention: it identifies a core assumption underlying most contemporary safety work (the "causal-mechanistic paradigm") and systematically demonstrates why this paradigm will likely fail as AI systems become more capable. It does this not through abstract speculation alone, but by connecting concrete empirical results (Bailey et al. 2024 on obfuscated activations, McGrath et al. 2023 on self-repair) to a sequence of plausible threat models. I have high respect for Matt Farr for taking the initiative to work on something intrinsically valuable to the alignment field yet misaligned with the incentives of the field's main players. This is diagnostic and constructive work that questions the dominant paradigm, something the field desperately needs yet rarely discusses. The field systematically underfunds this kind of work because it rarely produces legible outputs; the MoSSAIC framework shows that such work can be legible. I encourage people with more resources to retroactively fund this work.