@thaopham
Research keywords: multi-agent risks, game theory, cooperative AI
https://thaopham.dev/
I am concerned about risks from advanced AI systems, which primarily motivates me to study safety and cooperation in multi-agent settings. My research interests include questions like:
How can we evaluate and detect multi-agent deceptive misalignment?
Can we effectively apply Bayesian Theory of Mind models in multi-LLM-agent settings to study deceptive behavior?
How can we study the emergent personas of LLMs, or steer their aspirations toward human values, in open-ended environments?