I'm a PhD student at the Australian National University; I started in August 2023. My PhD topic is abstruse and mathematical (machine learning in the absence of manifold structures), with possible long-term relevance to AI alignment and linguistic equity, so it is also important. In the course of that work, I have had to come up with statistical tests that work in universes very different from our own. (The short answer: set up a distribution of similar experiments, then train machine learning models to distinguish successful experiments from unsuccessful ones.)
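To make that parenthetical concrete, here is a minimal sketch of the simulate-and-classify idea. The feature set, effect sizes, and choice of gradient boosting are illustrative assumptions of mine, not the actual method in the superwelch repository.

```python
# Minimal sketch of "simulate a distribution of experiments, then train a
# classifier to distinguish successful from unsuccessful ones".  Every
# modelling choice here (features, effect sizes, classifier) is an
# illustrative assumption, not the superwelch implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

def simulate_experiment(n=12, effect=0.0):
    """Simulate one small two-arm experiment and return summary features."""
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(effect, 1.0, size=n)
    return [treated.mean() - control.mean(),
            control.std(ddof=1), treated.std(ddof=1), n]

# Training distribution: a mix of null (effect = 0) and real (effect = 0.8)
# experiments, labelled accordingly.
X, y = [], []
for _ in range(5000):
    effect = rng.choice([0.0, 0.8])
    X.append(simulate_experiment(effect=effect))
    y.append(int(effect > 0))

clf = GradientBoostingClassifier().fit(np.array(X), np.array(y))

# The trained model can then score a new experiment of the same shape.
print(clf.predict_proba([simulate_experiment(effect=0.8)]))
```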
When I back-ported these to some real-world statistical analysis in our universe, I was surprised to find that they outperformed some of the standard techniques everyone uses.
For example, the Welch t-test is often used in small experiments to find out whether (say) a drug treatment worked. But if the experiment has about 25 participants or fewer, it underperforms the algorithms I developed. (That is, the Welch t-test is much more likely to say that a successful experiment did nothing, and somewhat more likely to say that an unsuccessful experiment had an effect.) There are a lot of areas of science where experiments are often small: large-animal studies, drug trials, psychology, economics, just to name a few.
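For context, the kind of power comparison behind that claim can be run against the Welch baseline (scipy's ttest_ind with equal_var=False) roughly like this. The per-arm sample size and effect size below are illustrative assumptions, not the figures behind the numbers above; the real comparison harness is in the superwelch repository.

```python
# Sketch of the power comparison described above: how often does the Welch
# t-test miss a genuine effect in a small experiment?  The per-arm n and the
# effect size are illustrative assumptions, not results I am claiming.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
trials, misses = 2000, 0
for _ in range(trials):
    control = rng.normal(0.0, 1.0, size=12)
    treated = rng.normal(0.8, 1.0, size=12)   # a real effect is present
    _, p = ttest_ind(treated, control, equal_var=False)  # Welch's t-test
    if p >= 0.05:
        misses += 1

print(f"Welch t-test false-negative rate: {misses / trials:.1%}")
# A learned classifier would be scored on the same simulated experiments to
# make the head-to-head comparison fair.
```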
In another example, I have a small side job with the Grains Research and Development Corporation (GRDC), helping them modernise some legacy statistical code. When I put my techniques up against the linear mixed-effects model code they have me working on, I got a cross-validated root-mean-squared error (a measure of a statistical model's predictive error: smaller is better) well under a third of the industry standard's, while still being able to return a p-value for the model and to measure the size of the intervention effect. They have offered me a larger data set to work with, from an experiment on the effects of genes on plant growth, so this could also matter for gene-impact studies.
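To show what the baseline side of that comparison looks like (this is not the GRDC code, which I can only share privately): a linear mixed-effects model fitted with statsmodels and scored by cross-validated RMSE. The column names, formula, and synthetic data are placeholder assumptions.

```python
# Sketch of the industry-standard baseline: a linear mixed-effects model with
# a random intercept per trial block, scored by cross-validated RMSE.  Column
# names, the formula, and the synthetic data are placeholders, not GRDC data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold

def cross_validated_rmse(df, n_splits=5):
    errors = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        # Fixed effect for the intervention; random intercept for each block.
        model = smf.mixedlm("yield_t_ha ~ treatment", train, groups=train["block"]).fit()
        pred = model.predict(test)   # fixed-effects prediction on held-out plots
        errors.append(np.sqrt(np.mean((test["yield_t_ha"] - pred) ** 2)))
    return np.mean(errors)

# Synthetic stand-in data so the sketch runs end to end (purely illustrative).
rng = np.random.default_rng(2)
df = pd.DataFrame({"block": np.repeat(np.arange(8), 10),
                   "treatment": np.tile([0, 1], 40)})
df["yield_t_ha"] = (3.0 + 0.4 * df["treatment"]
                    + rng.normal(0, 0.2, 8)[df["block"]]
                    + rng.normal(0, 0.3, len(df)))

print(f"Cross-validated RMSE of the mixed-model baseline: {cross_validated_rmse(df):.3f}")
# model.pvalues["treatment"] and model.params["treatment"] supply the p-value
# and effect size that any replacement technique also needs to report.
```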
The problem is that (a) I'm not a statistician. The statisticians I have spoken to about this find it intriguing, so I could make it a significant part of my PhD, but that would take a lot of time and I don't think it matches my strengths long term: I stumbled onto this by accident, not by skill or intelligence. And (b) getting a new statistical test adopted and mainstream is mostly about influencing practitioners (the people doing experiments), rather than writing academic papers.
I would like funding so that I can take a multi-month detour within my PhD (that is, remain officially enrolled in my PhD, but with extra runway before my funding runs out). Unlike at many US universities, this arrangement can happen at ANU without the money being siphoned off by the university.
I would use this time to (1) properly write up what I've done in an easy-to-understand paper, (2) write some programs that are easy to use, and (3) find a few experimenters doing small experiments and work with them to use these back-ported techniques.
Qualifications: being at the top university in my country, I have access to people who can guide me in whatever ways are necessary.
More importantly, my qualification is that I figured out how to make this work. Here's the code I wrote for the algorithms that beat the Welch t-test: https://github.com/solresol/superwelch
The GRDC code is a little harder for me to make public (but sharing it privately wouldn't be a problem).
https://substack.com/profile/25595628-greg-baker
https://scholar.google.com/citations?hl=en&user=y10HLcAAAAAJ
USD 30,000
https://github.com/solresol/superwelch
(1) Writing up a paper: 90%. Getting it published somewhere that anyone pays attention to: 20%. (2) Writing some programs that are easy to use: 95%. (3) Finding some experimenters who will be willing to put some results into a paper: 85%.