Run a public online Turing Test with a variety of models and prompts

ACX Grants 2024

🐠

camrobjones

$1,999 raised

$4,000 valuation

Longer description of your proposed project

I recently ran a Turing Test with GPT-4 here (turingtest.live). We got around 6000 games from ~2000 participants. There's a pre-print of results from the first 2000 games here (https://arxiv.org/abs/2310.20216). The full dataset is under review (one prompt gets 49.7% after 855 games).

While the TT has important drawbacks as a test of intelligence, I think it's important as a test of deception per se. Can alert and adversarial users detect an LLM vs a human in a 5 minute text-only conversation? Which prompts and models work best? Which interrogation strategies work best? I think these are important and interesting questions to answer from a safety and sociological perspective. Plus lots of people reported finding the game very fun and interesting to play!

Games cost around $0.3 to run w/ GPT-4. We don't have specific funding for the project and we've been using a limited general experiment funding pot. The site gained popularity and we went through $500 in December so we decided to shut it down temporarily. Ideally, I'd like to revive it in 2024 but would need some dedicated funding to do this. If you'd like to test out the interface, you can do it here: turingtest.live/ai_game (please don't share this link widely though!)

As well as getting a better estimate on the success of existing models and allowing more people to play the game, there are a variety of additional questions we'd like to ask.

1. Prompts: We've tried around 60 prompts and there's a lot of variance. I'd be keen to generate more and see how well these do. A priori it seems very likely there are better prompts than the ones we've tried

2. Temperature. We've varied temperature a bit, but not very systematically. It would be useful to try the same prompt at a variety of temperatures.

3. Auxiliary infrastructure. Models often fail due to lack of real-time info. We could address this through browsing/tool-use. They also often make silly errors which we might be able to address through double-checking, and/or CoT scratchpads.

4. User-generated prompts. It would be lovely to let users generate and test their own prompts. But you probably need at least 30-50 games to reliably test a prompt. We would need a good ratio of games played:prompts created, a decent userbase, and some funding to do this well

5. Other models. I'm planning to include another couple of API model endpoints (e.g. Claude), which should be relatively easy to do. Lots of the feedback on Twitter was from e/acc folks who want to see OS/non-RLHF models tested and that seems right to me too. We could probably run some 7B models for < $2/hr and bigger ones for something like $5-10/hr (though I haven't tested this). Some fiddling with the infrastructure would be needed for this. We also might experiment with only running the game for 1-2hrs/day, to minimise server uptime & maximise concurrent human users.

Essentially, my goal would be to make some of these improvements, run several thousand more games, and publish the results.
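To illustrate the "30-50 games per prompt" estimate in point 4, here is a minimal sketch of how wide the uncertainty on a prompt's success rate is at different sample sizes. It assumes a simple normal approximation to the binomial and a true rate near 50%; the function name `ci_half_width` is hypothetical and is not part of the site's code.

```python
import math


def ci_half_width(n_games: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% half-width for a success rate estimated from n_games.

    Assumption: normal approximation to the binomial, which is rough
    for small samples but good enough to compare sample sizes.
    """
    return z * math.sqrt(p * (1 - p) / n_games)


# Compare a small per-prompt sample with the 855-game sample mentioned above.
for n in (30, 50, 855):
    print(f"{n:>4} games: +/- {ci_half_width(n) * 100:.1f} percentage points")
```

Under these assumptions, a prompt tested on 30-50 games can only be located to within roughly 14-18 percentage points, which is why user-generated prompts would need a healthy ratio of games played to prompts created.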

Describe why you think you're qualified to work on this

I am a PhD student in cognitive science at UCSD. I've implemented the first version of this site and written a paper on the results. I'm pretty familiar with the literature on the Turing Test and I've implemented a range of similar experiments over the last 4 years of my PhD.

I'll also be working with my advisor, Ben Bergen, a professor in the department who has a proven track-record of successful cognitive science research across his career (https://pages.ucsd.edu/~bkbergen/).

Other ways I can learn about you

Website: https://camrobjones.com

Twitter: @camrobjones

Github: camrobjones

Linkedin: https://linkedin.com/in/camrobjones

How much money do you need?

~$5000. At $0.3/game this would buy us ~16000 games. Some additions like browsing and double-checking might increase game cost. Most likely we would use a decent part of this to run servers for OS models (e.g. $5 * 2hr/day * 7 days * 8 weeks = $560).
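The budget arithmetic above can be checked with a short sketch. It works in integer cents to avoid floating-point rounding, and it assumes the per-game cost stays at $0.3 and that the server example ($5/hr, 2 hr/day, 8 weeks) is the only non-token expense; all variable names are illustrative.

```python
# All amounts in cents so the division is exact.
BUDGET_CENTS = 5000 * 100
COST_PER_GAME_CENTS = 30  # $0.3 per GPT-4 game, as quoted above

# Server example from the text: $5/hr * 2 hr/day * 7 days * 8 weeks.
server_cost_cents = 5 * 100 * 2 * 7 * 8

games_if_all_tokens = BUDGET_CENTS // COST_PER_GAME_CENTS
games_after_servers = (BUDGET_CENTS - server_cost_cents) // COST_PER_GAME_CENTS

print(f"Server cost: ${server_cost_cents / 100:.0f}")
print(f"Games if the whole budget goes on tokens: {games_if_all_tokens}")
print(f"Games after the server example: {games_after_servers}")
```

The actual number of games would be lower if browsing or double-checking raises the per-game cost.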

Links to any supporting documents or information

Site: turingtest.live

Demo: turingtest.live/ai_game (please don't share widely).

preprint: https://arxiv.org/abs/2310.20216

Estimate your probability of succeeding if you get the amount of money you asked for

Running ~5000 games in < 3 months: 95%

Building out auxiliary infrastructure: 90%

Building out OS model infrastructure: 85%

Running ~10000 games in < 3 months: 80%

Finding a prompt/setup that reliably "passes" (I don't know if this is 'success' but an interesting outcome. By "passes" I mean significantly > 50% success*): 40%.

* We discuss this a lot more in the preprint. This seems like the least-worst benchmark to me.
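One way to make "significantly > 50%" concrete is a one-sided exact binomial test against chance. The sketch below is an illustration under stated assumptions and not necessarily the analysis used in the preprint: the function name is hypothetical, and the count of 425 is a rounding of the 49.7% of 855 games quoted earlier, not a figure taken from the data.

```python
from math import comb


def p_value_above_chance(successes: int, games: int) -> float:
    """One-sided exact binomial p-value: P(X >= successes) if the true rate is 50%."""
    # Under a 50% rate every outcome sequence is equally likely,
    # so the tail probability is a count of sequences over 2**games.
    tail = sum(comb(games, k) for k in range(successes, games + 1))
    return tail / 2 ** games


# Assumption: 49.7% of 855 games is roughly 425 "judged human" verdicts.
print(p_value_above_chance(425, 855))

# Hypothetical count, for comparison only: 500 of 855.
print(p_value_above_chance(500, 855))
```

A result just under 50% gives a p-value above 0.5, so it provides no evidence of passing by this criterion, while a count well above half of the games gives a very small p-value.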

holds 0.05%

🐠

camrobjones

about 2 months ago

Final report

A paper about the project has now been published in PNAS: https://www.pnas.org/doi/10.1073/pnas.2524472123

The entire budget was spent on tokens. The site continues to run at turingtest.live on alternative funding.

Thank you so much to everyone who supported the project.

holds 0.05%

🐠

camrobjones

about 2 months ago

Progress update

A paper about the project has now been published in PNAS: https://www.pnas.org/doi/10.1073/pnas.2524472123

The site continues to run at turingtest.live.

Thank you so much to everyone who supported the project.

holds 0.05%

🐠

camrobjones

over 1 year ago

Progress update

Hi all,

Thank you again so much for your support! After completing the exploratory phase which was funded by this grant we completed two experimental studies based on the results.

We evaluated two prompts on 4 different LLMs and found that GPT-4.5 (with one prompt that we had validated through the exploratory phase) was judged to be human significantly more often than real humans (73% of the time). We believe this to be the first evidence that any system passes a standard (3-party) 5-minute Turing test.

You can read the rest of the results in our preprint:

Jones, C. R., & Bergen, B. K. (2025). Large Language Models Pass the Turing Test. arXiv preprint arXiv:2503.23674. (https://arxiv.org/abs/2503.23674)

The site is still live and will continue to run until we spend the rest of the money from this grant/project.

This project would most likely have languished without your support, so thanks very much for keeping it going and helping us to get to this stage! I will probably close the project soon after the funding runs out. Please let me know if you have any questions or ideas for further experiments to run.

Cameron

holds 12.5%

Austin Chen

over 1 year ago

Congrats @cameron! Great work, I'm impressed by the amount of press coverage (eg 1, 2, 3, 4) this has gotten; even though the AI research community has mostly stopped considering the Turing Test as a meaningful evaluation, I think a proper scientific paper that showcases this result to the public is super important!

I'm also grateful that Manifund got a shout-out here, it might be our first citation:

holds 0.05%

🐠

camrobjones

over 1 year ago

Thanks @Austin! It was very exciting to see the coverage ourselves, and very happy to give Manifund its first citation!

holds 0.05%

🐠

camrobjones

over 1 year ago

The site is finally back up at turingtest.live.

The new site uses a 3-party format where you chat with both a human and an LLM at the same time, and your task is to decide which is which. This setup is closer to Turing's original idea, and we think it will be much harder for the models to pass.

We've also added a range of models, including GPT-4o, Claude, and LLaMA, along with new prompting techniques that allow the models to interact in different ways.

The site will be live every day from 12–1 PM and 8–9 PM GMT (8 AM and 3 PM ET, 5 AM and 12 PM PT). We've done this in the hope of increasing the density of users online at the same time.

Thanks so much to everyone for your help, patience, and support while getting this back going. Please let me know if you have any comments or thoughts on the site as I'll be continuing to make updates. And please feel free to share the site as it works best when we get enough traffic for many people to be live at the same time.

holds 12.5%

Austin Chen

over 1 year ago

Hey @cameron, just wanted to say congrats on the launch! I'm excited to try and play this sometime.

I think restricting the time window to get real users simultaneously is an interesting design choice -- getting folks online together seems important, but it's a bit sad to have to wait to try out the game. I wonder about other viral ways of getting more users online: promote on Hacker News/LW/EA forum? launch to groups of people (eg classrooms)? "share with a friend" feature where you are facing off against the person you sent the app to?

holds 0.05%

🐠

camrobjones

over 1 year ago

@Austin Thanks so much!

This is a great point and something we went back and forth about a lot. I am going to post in more places today and hopefully we'll see a bit more traffic. If we are seeing consistently high traffic in those windows we will extend the times where it's playable.

But because you need other people to be online at the same time in order to play, and we're currently not seeing very high traffic even in those periods, hopefully this format at least maximises the chance that people can play every day.

I like the idea of adding a share button to the homepage. There is currently one after you complete a game, but it could be good to incentivise people to share so that they can play straight away.

On the point of playing vs a friend, this might undermine the Turing test to some extent. I think if you know that the other person is your friend you have a lot of insider knowledge that would allow you to beat even a very good LLM agent (or another human)!

Thanks very much for your support and these suggestions!

holds 0.05%

🐠

camrobjones

over 1 year ago

We’re planning to relaunch the site soon and we’re running a pilot test on Friday at 8pm GMT (3pm ET, 12pm PT).

You can access the new site here: 3p.turingtest.live. The pilot will be accessible here on Friday.

Any feedback you can provide about how the site and your interactions work would be greatly appreciated!

If you have any questions or comments, please let me know at cameron@ucsd.edu.

holds 0.05%

🐠

camrobjones

almost 2 years ago

Progress update

Hi all,

Thank you for your patience and apologies for the lack of updates. I have had to focus on other things over the last months including finishing my PhD and starting a postdoc. However, I am now able to put my full focus on this for the next couple of months. I'm hoping to have the updated experimental design finished by the end of the month and to start collecting data in October.

Thanks again and please let me know if you have any questions!
Cameron

holds 6.25%

🍉

Chris Leong

over 2 years ago

This is a cool project that might help improve the conversation around these issues.

Some people might be worried about hype, but there's already so much hype, the harms are likely marginal.

You may want to consider linking people to an AI Safety resource if you think your site may get a lot of traffic, then again, you might not if you think that'd make people more suspicious of the results.

Another option to consider would be an ad-supported model. I'm not suggesting Google Ad words, but you might be able to find an AI company to sponsor you.

holds 0%

Tom O’Haire

over 2 years ago

I’m glad to see this reach the threshold.

@camrobjones where will be best to monitor progress and see results?

holds 19.5%

🥨

Dony Christie

over 2 years ago

This sounds really cool and the only AI related ACX Grants cert project I could evaluate to have some legible chance at an impact, potentially a viral one. It already had some success apparently and just needs more funding. The Turing Test is a pretty fundamental concept in AI lore and we should have at least one running.

holds 10%

🌽

Dominic de Bettencourt

over 2 years ago

The Turing Test is definitely the most publicly well-known test of AI abilities; it has always seemed strange to me that the Loebner Prize stopped being awarded in 2020, right before AI started to reach a level where it could potentially get close to passing. I think something like this should definitely exist. I remember playing with it a bit when it was initially released and it was pretty cool.

holds 0.05%

🐠

camrobjones

over 2 years ago

@dominic Thanks very much, Dominic! I'm glad you had a chance to try it out and I appreciate the support!

🐞

Alyssa Riceman

over 2 years ago

This is neat! I'm not hugely expecting it to move the needle of popular understanding of AI deceptiveness very much, but the possibility of its doing so strikes me as sufficiently non-negligible that it nonetheless seems worth tossing some money at just in case.

holds 0.05%

🐠

camrobjones

over 2 years ago

@Alyssa Thanks so much, Alyssa!

holds 5%

🥭

Harvey Powers

over 2 years ago

Similar reasoning to my support here. Cool project, and please share the outcome / data if possible. @Alyssa @camrobjones

holds 31.7%

Anton Makiievskyi

over 2 years ago

Oh, what a cool project!
A few questions:
1. Who does the job of the human witness in this test? How do you make sure that there is a human online when someone wants to play the "interrogation game"?
2. Have you applied for OpenAI or Claude credits?
3. How about asking users to input their chatgpt API key to play?

In any case, I'm happy to offer money to get this project over a minimum funding bar

Can we expect an update here after a month or two? If it goes well, I will likely be glad to provide more funding

holds 0.05%

🐠

camrobjones

over 2 years ago

@AntonMakiievskyi Thanks so much Anton! I really appreciate the support.

1. Participants are randomly assigned to be witnesses or interrogators. The lack of humans online is a definite issue, as there were periods where a player would be repeatedly matched with AI if no humans were online. I'm considering only making the game available for ~1hr a day to maximise the density of humans online while keeping costs down.
2. I have applied for OpenAI but didn't hear back. Will try Claude & OpenAI again.
3. This could be a good backstop if we run out of credits again. I'm a little nervous about handling the data but I'm sure there's a secure way to do this.

Yes, I will probably take a couple of weeks to make changes to the site and then hopefully update just before we relaunch. Thanks again and let me know if you have more questions or would like to chat more.

holds 12.5%

Austin Chen

over 2 years ago

I really like that Cam has already built & shipped this project, and it appears to have gotten viral traction and had to be shut down due to costs; rare qualities for a grant proposal! The project takes a very simple premise and executes well on it; playing with the demo makes me want to poke at the boundaries of AI, and made me a bit sad that it was just an AI demo (no chance to test my discernment skills); I feel like I would have shared this with my friends, had this been live.

Research on AI deception capabilities will be increasingly important, but I also like that Cam created a fun game that interactively helps players think a bit about how far the state of the art has come, esp with the proposal to let users generate prompts too!

holds 0.05%

🐠

camrobjones

over 2 years ago

@Austin Thanks very much for your support & thoughtful comments, Austin!