We want to prove one thing: expert-level creative judgment can be transferred to a small language model with very little data. Specifically, we use 100 expert-annotated samples to LoRA fine-tune a 7B open-source model, then test whether it can match or beat GPT-4 on creative quality evaluation. 8 weeks. Clear result.
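Why LoRA makes a 100-sample experiment plausible: instead of updating a full weight matrix, it trains only a low-rank correction. A minimal sketch of the idea, with illustrative layer shapes (not our actual training code):

```python
# LoRA in miniature: a frozen weight W (d_out x d_in) is adapted by a
# low-rank pair B (d_out x r) and A (r x d_in), giving W + (alpha/r) * B @ A.
# With r << d, the trainable parameter count shrinks by orders of magnitude.
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass through a LoRA-adapted linear layer."""
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 4096, 4096, 8            # one 7B-scale attention layer, rank 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                  # B starts at zero: adapter is a no-op

full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tune params: {full_params:,}")   # 16,777,216
print(f"LoRA trainable params: {lora_params:,}")   # 65,536 (~0.4%)
```

Fewer trainable parameters means less capacity to memorize, which is exactly what a 100-sample regime needs.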
Current LLMs have a known weakness in creative quality (see The Atlantic, "The Human Skill That Eludes AI"). We think the first step is not teaching AI to write better. The first step is giving AI the judgment of a senior human editor. Can it tell what is good, why it is good, and where it falls short? This judgment is the core input for reward models and RLHF/DPO pipelines. If the judge is weak, the whole pipeline is weak.
Our framework is called Creative Calibration. We do not ask "is this beautiful?" We ask: does each creative choice hit its intended function? Think of model calibration in ML, which measures a model's stated confidence against its true accuracy. Creative Calibration measures creative execution against expert-judged effectiveness.
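To make the framework concrete, here is a sketch of what one annotation record could look like. All field names and the example passage are illustrative assumptions, not the team's actual protocol:

```python
# Hypothetical shape of a single Creative Calibration annotation:
# each record ties a creative choice to its intended function and an
# expert's judgment of how well the execution lands, with reasoning.
from dataclasses import dataclass, asdict

@dataclass
class CreativeChoiceAnnotation:
    excerpt: str            # the passage being judged
    choice: str             # the creative choice the author made
    intended_function: str  # what that choice is trying to accomplish
    effectiveness: int      # expert-judged execution, e.g. 1 (misses) to 5 (lands)
    reasoning: str          # expert reasoning chain: why it works or falls short

sample = CreativeChoiceAnnotation(
    excerpt="The detective lit a third cigarette before answering.",
    choice="delay the reply with a physical beat",
    intended_function="build tension and signal reluctance",
    effectiveness=4,
    reasoning="The beat earns the pause; a fourth repetition would tip into cliché.",
)
print(asdict(sample)["effectiveness"])  # 4
```

The key design point is the pairing of `intended_function` with `effectiveness`: the model is trained to judge execution against intent, not to rate beauty in the abstract.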
We are doing systematic research in Creative Quality Alignment (CQA). This proposal covers Phase 1 — validate the core hypothesis with minimal data.
Our mid-term goal: make AI match a senior human editor in creative quality judgment. Then plug that judge into reward models. Drive real improvement in RLHF/DPO generation pipelines. Our long-term goal: improve AI's own creative ability.
More broadly, we believe this research path — capturing tacit expert judgment through structured taxonomy — can inform other domains where human expertise is considered impossible to formalize. Creative quality is the first battlefield we chose.
Annotation material: published social mystery novels. Annotation content: systematic expert diagnostics of narrative craft. Deliverables in 8 weeks:
100 golden annotated samples
A LoRA fine-tuned 7B evaluation model
Blind test results: fine-tuned 7B vs. base 7B vs. GPT-4, scored by independent creative professionals outside our team
Full experiment report and methodology documentation
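The blind comparison in the deliverables can be reduced to a simple statistic. A minimal sketch, assuming each independent judge assigns a per-item quality score to each model (all names and numbers below are hypothetical, not results):

```python
# Illustrative scoring for the blind test: compare fine-tuned 7B vs. GPT-4
# item by item on judges' scores, counting ties as half a win.
def win_rate(scores_a, scores_b):
    """Fraction of items where model A outscores model B (ties count half)."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)

tuned_7b = [7, 8, 6, 9, 7]   # hypothetical per-item judge scores
gpt4     = [8, 7, 6, 8, 7]
print(win_rate(tuned_7b, gpt4))  # 0.6
```

A win rate reliably above 0.5 across judges would support the core hypothesis; the same function applies to the fine-tuned-vs-base comparison.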
Our creative methodology, built over 13 years, is already in place and is not in this budget. This $18,000 covers the cost of turning proven expertise into a verifiable experiment.
· Researcher stipends (3 people × 8 weeks, full-time): ~$14,000
· Compute (LoRA fine-tuning + evaluation inference): ~$2,500
· Tools, API access, miscellaneous: ~$1,500
Three-person team. We cover creative production, knowledge translation, research methodology, and execution.
Me (project lead): Science communicator. Computer science background. 13 years of knowledge translation experience. My core skill is structuring complex knowledge and delivering it precisely to the target audience. In the past, the audience was the general public. Now it is AI. My work won a Top 10 Annual Works award in "Praise of Science Communication China," among other national-level awards. I build the fine-tuning scripts and evaluation pipelines with AI-assisted programming tools.
Partner: Former investigative journalist, managing editor, novelist. He has written genre fiction for many years: science fiction, social mystery, crime. He won the 23rd Chinese Science Fiction Galaxy Award for Best Short Story. In August 2025, we launched a crime mystery channel on Bilibili (China's YouTube); our top video reached 2.3 million views (https://www.bilibili.com/video/BV1ZoWEz8EmX/).
Third member: Full-time staff. Handles project execution and operations. Involved in the whole chain from annotation workflow to data processing.
Most likely failure (technical): 100 samples are not enough. The fine-tuned model shows high variance and cannot reliably beat GPT-4. → The methodology still holds. Phase 2 scales up the data, and the 100 golden samples are released as a standalone contribution. This kind of dataset — creative quality alignment data with expert reasoning chains — does not exist in the open-source community today.
Deeper failure: The model learns the annotator's writing style, not the judgment structure. It learns how to talk, not what to judge. → We adjust the annotation protocol to decouple expression from judgment. The core methodology does not need to be discarded.
We have received no external research funding related to this project in the past 12 months. Our team is a self-sustaining commercial studio. This application is the first time our team is redirecting years of creative methodology toward AI alignment research.