Solon is an autonomous machine learning research-agent project.
The goal is to build a system that can carry out small, benchmarkable research loops: search prior work, propose hypotheses, generate experiment code, run evaluations, compare results against baselines, and write up useful findings or negative results.
I am not trying to claim that Solon can replace scientists or produce major breakthroughs on its own. The near-term goal is more realistic: to test whether an AI agent can behave like a careful junior research assistant on small ML problems, especially where the work involves literature search, experiment design, code generation, reproducibility checks, and honest reporting.
The project matters because AI systems are getting better at writing code, reading papers, and generating research ideas, but current systems are still unreliable. They can overstate results, miss prior work, produce broken experiment code, or treat weak findings as important. Solon is my attempt to build an autonomous research workflow that is useful while also being constrained by evidence, repeatability, and human oversight.
The grant would support an 8–12 week sprint to make Solon more reliable and demonstrate one complete research run from question to experiment to write-up.
The main goal is to move Solon from a promising prototype toward a credible autonomous research system that can complete one small ML research cycle end-to-end.
During this grant period, I will focus on:
1. Improving topic selection so Solon chooses research questions that are specific, benchmarkable, and tightly scoped.
2. Improving experiment reliability so generated code is checked before expensive runs.
3. Strengthening literature and prior-work search so Solon does not rediscover obvious existing work.
4. Making the system better at staying on one research question long enough to produce useful results.
5. Producing one documented research artifact: a short technical write-up, negative-results report, or reproducible benchmark finding.
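As a sketch of the kind of reliability check described in item 2 (the function names, the `SMOKE_TEST` convention, and the thresholds here are hypothetical illustrations, not Solon's actual implementation), generated experiment code can be gated on a syntax parse and a cheap smoke run before any expensive training job is launched:

```python
import ast
import os
import subprocess
import sys
import tempfile

def passes_preflight(code: str, timeout_s: int = 30) -> bool:
    """Gate generated experiment code before an expensive run.

    Two cheap checks: (1) the code must parse, and (2) a smoke run
    with SMOKE_TEST=1 in the environment must exit cleanly within the
    timeout. The experiment script is assumed to read SMOKE_TEST and
    shrink its workload (tiny dataset, a single training step).
    """
    try:
        ast.parse(code)  # reject code with syntax errors outright
    except SyntaxError:
        return False

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    try:
        result = subprocess.run(
            [sys.executable, path],
            env={**os.environ, "SMOKE_TEST": "1"},
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False  # even the smoke run hung; do not launch the full job
    finally:
        os.unlink(path)
    return result.returncode == 0

# A trivially broken script is rejected at the parse step.
print(passes_preflight("def train(:"))  # False
```

The point of the smoke-run convention is that most broken experiment code fails within seconds on a tiny workload, so the expensive full run is only attempted after the cheap checks pass.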
The first research runs will focus on small machine learning questions where experiments are cheap enough to run repeatedly and where results can be checked against public benchmarks or baselines. This makes the project realistic for a solo student researcher and reduces the risk of spending time on questions that require frontier-scale compute.
The expected output is not just a polished demo. I want to produce a transparent record of what Solon did: what question it chose, what prior work it found, what experiments it ran, what failed, what succeeded, and whether the final result is actually worth believing.
Funding will support development time, API usage, and compute for an 8–12 week research sprint.
The money will be used for:
- LLM/API credits for research planning, code generation, literature search, and write-up assistance.
- Search and scholarly API usage for prior-work discovery.
- Cloud or GPU compute for running small ML experiments and reruns.
- Development of reliability checks around generated experiment code.
- Documentation, logs, and preparation of a public demo or write-up.
- Open-source release preparation for the parts of the system that are safe and useful to share.
The minimum version of success is a public demo showing Solon completing one constrained research loop:
question → prior-work search → hypothesis → experiment → result → write-up.
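The loop above can be sketched as a pipeline of stages, each producing an artifact the next stage consumes and every output landing in a log (the stage callables and the `ResearchRun` record here are hypothetical placeholders, not Solon's real interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchRun:
    """Record of one constrained research loop, stage by stage."""
    question: str
    prior_work: list = field(default_factory=list)
    hypothesis: str = ""
    result: dict = field(default_factory=dict)
    write_up: str = ""
    log: list = field(default_factory=list)

def run_loop(question, search, hypothesize, experiment, write_report):
    """Drive one question → search → hypothesis → experiment → write-up pass.

    Each stage is injected as a callable so it can be swapped for a
    stub in tests. Every stage's output is appended to the log, so
    the finished run is a transparent record, including failures.
    """
    run = ResearchRun(question=question)
    run.prior_work = search(question)
    run.log.append(("prior_work", run.prior_work))
    run.hypothesis = hypothesize(question, run.prior_work)
    run.log.append(("hypothesis", run.hypothesis))
    run.result = experiment(run.hypothesis)
    run.log.append(("result", run.result))
    run.write_up = write_report(run)
    return run
```

Keeping each stage behind a plain function boundary also makes it cheap to rerun a single stage (for example, the experiment) without repeating the prior-work search.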
The stronger version of success is a short, reproducible research artifact with code, logs, and an honest explanation of what the system found and what it failed to do.
I am a second-year AI student at UC Berkeley and currently a solo developer.
I have been building Solon as a serious autonomous AI research-agent project. The current prototype already includes a working research-agent loop, external research tools, experiment-code checks, a dashboard, and mechanisms for making the system more conservative when its own outputs look too optimistic.
I am early in my research career, but I have been actively building and testing this system rather than only writing about the idea. My goal with this grant is to narrow the project into a concrete, testable research sprint with a clear output.
I will remain the main developer and researcher. If funded, I may also seek feedback from university staff, open-source researchers, or AI safety researchers, but the project is currently solo-led.
The most likely failure mode is that Solon remains an interesting prototype but does not produce a genuinely useful research finding during the grant period.
The main risks are:
- The system may choose research questions that are too vague or not useful.
- Generated experiment code may still require too much human correction.
- The system may run experiments successfully but produce uninteresting or inconclusive results.
- The final write-up may be technically correct but not novel enough to matter.
- I may discover that the current system needs more engineering work before it can produce research outputs reliably.
To reduce these risks, I will focus on small, benchmarkable ML questions rather than broad open-ended research. I will also treat negative results as valid outputs if they are well-tested and informative.
Even if Solon does not produce a strong research finding in this grant period, the project can still produce useful open-source infrastructure and lessons for autonomous research agents: better research-loop design, better experiment reliability practices, and a clearer understanding of where current AI agents fail at doing real research.
I have not raised external funding for this project in the last 12 months.