FragGuard: Cross-Session Malicious Activity Detection for Model APIs

Project summary

I am Linh Le. I have a PhD in NLP from the University of Queensland, and did a postdoc at the University of Technology Sydney (focusing on medical and robotics applications of ML). I began transitioning into AI safety one year ago, with a short postdoc at McGill University on persona-based AI safety. I am looking for bridge funding to upskill in research that can be more impactful, specifically AI control.

I want to create a tool that can auto-generate datasets that can fool current monitors, and then try creating a new AI control protocol on top. I began work on this in a recent Apart Research hackathon, generating synthetic data by iterating on an AI control harness using monitor feedback until I could bypass monitors (even Opus 4.6 in ControlArena). If I receive funding, I will continue this work, and create a control protocol that leverages this infrastructure to train monitors. I will submit a paper on this infrastructure and protocol (perhaps to NeurIPS, AAAI, or ICLR).

I have been attending many conferences in AI safety in the last year, including the FAR.ai Alignment Workshop in San Diego and in London, and EA Global in London and New York. I want to continue narrowing in on AI control by attending ControlConf this April, and possibly visiting Constellation for a week later in the year (to speak with Redwood Research).

What are this project's goals? How will you achieve them?

The goal of this project is to establish myself as an independent AI control researcher. In detail, with the money in this proposal, I would:

Upskill on AI control, by reading and reimplementing papers (especially around ControlArena).
Extend my hackathon project by testing against known control protocols, then creating new protocols and training custom monitors. I would write a paper with the results, and submit it to an AI-safety-relevant workshop or conference.
I want to attend the ControlConf workshop to learn more about the field and meet practitioners from Redwood Research. I would also like to attend EA Global London at the end of May to continue networking in the field.

In the long run, I want to run my own organization or research lab, so that I can involve many other people in my research and explore multiple directions at once. I believe that the first step is establishing myself as an independent researcher by receiving a small amount of money and producing high-impact AI safety work. While I am an experienced researcher, much of my work has been in applications to healthcare, robotics, and NLP. After substantial consideration, I have landed on AI control as a field in which I can be productive but also do high-impact work.

How will this funding be used?

I am asking for $15,520 USD to support me for six months, which is:

$1800 USD/month ($2500 CAD/month) for living expenses. My rent is $950 CAD, utilities $200 CAD, and food and other fixed costs about $500 CAD/month. My tax bracket is around 30%.
$500 USD/month for compute budget. I own a server with a 24GB GPU, but need to pay for larger models on OpenRouter. A lot of AI control work seems to leverage Claude Opus 4.6 as well.
$120 USD/month for Claude Max and ChatGPT.
$1000 USD to travel to ControlConf and/or EA Global London (at both conferences, I can apply for partial funding).

The minimum funding amount is living expenses for one month.

Who is on your team? What's your track record on similar projects?

I have been doing AI safety research for the past year: I did the MARS program in Cambridge, working on chain of thought monitoring, and a 6-month postdoc at McGill University, working on alignment through latent adversarial training on personality traits. I am an active scientist and have contributed to many submissions in the last six months (2x ICLR, 2x ICML, 1x EMNLP, 1x IASEAI, 1x accepted to EAAI journal, 1x accepted to Knowledge Base journal). Here are my LinkedIn and my Google Scholar.

MARS project: Along with Geodesic Research, I investigated chain-of-thought reasoning properties which prevent them from being used for monitoring—we call these properties pathologies. Prior work identified three distinct pathologies: post-hoc rationalization, encoded reasoning, and internalized reasoning. To better understand and discriminate between these pathologies, we present a systematic set of novel health metrics; we create model organisms deliberately trained to exhibit these pathologies, and demonstrate that our metrics can reliably diagnose these conditions. This work is under submission to ICML.

My McGill postdoc project: I used a data-efficient fine-tuning alternative (called latent adversarial training) to improve harmlessness by internalizing beneficial personality traits, without memorizing specific refusal patterns. Using fewer than 100 abstract personality statements, Latent Personality Alignment (LPA) achieves comparable safety performance to methods trained on hundreds of thousands (100,000s) of harmful examples while maintaining superior utility on benign tasks. This work is appearing at an ICLR 2026 workshop.

In terms of collaborations, I currently advise about 5-8 junior researchers through a SPAR project on AI security (also based on an Apart hackathon project), and through FPT University in Vietnam. I want to always be sharing the knowledge I have and helping others upskill as well. I have started collecting the research I’ve done recently under an umbrella called Lida Safety. At Lida Safety, I collaborate frequently with David Williams-King, a Research Manager at ERA and former member of LawZero and Mila.

I am a regular hackathon participant. I’ve won first place and fourth place at different Apart Research hackathons, and recently participated in theAI control hackathon. I also won first place for a red team at the Alignment Faking hackathon by Redwood Research in October 2025.

What are the most likely causes and outcomes if this project fails?

My impression is that AI control is difficult, and it may not be that simple to create a new control protocol even with novel data to train on. However, I believe the learnings I would take away from this upskilling would help me a lot in my ultimate career path.

I have found that peer review at mainstream AI conferences can be difficult for AI safety works. I hope to find collaborators to utilize my datasets, to help prove their usefulness. If that is not possible, it may help to target AI safety relevant venues.

How much money have you raised in the last 12 months, and from where?

I have won $600 from participating in Apart Research hackathons. I have not raised any other money.