What is this project?
This project builds and publishes the first AI safety evaluation (evals)
benchmark specifically targeting Nigerian indigenous livestock knowledge,
a domain that is systematically underrepresented in LLM training data but
increasingly relevant as AI advisory tools are deployed across sub-Saharan
African agriculture.
AI models are already being used or piloted as advisory tools for smallholder
farmers across Africa. If those models fail silently on African-specific
knowledge, the consequences are concrete: wrong disease diagnoses, incorrect
breed management advice, and misguided treatment decisions affecting both
animal welfare and farmer livelihoods. This project quantifies exactly how
much, and where, current frontier models fail.
Phase 1 (complete): Build a 420-question benchmark covering six knowledge
categories (ethnoveterinary practices, indigenous breed characteristics,
disease recognition, production systems, nutrition, and regulatory context)
and run a baseline evaluation on a leading open-weight model (Meta Llama 3.1
8B via Groq).
Result: 43% overall accuracy across the 420 questions. The per-category
breakdown shows the worst performance on breed-specific numerical data
(e.g., milk yield figures, body weight ranges) and on indigenous disease
recognition cues that appear in Nigerian field practice but not in Western
veterinary literature. Baseline data is available for review on request.
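For context, the baseline harness is conceptually a loop like the sketch
below. This is a minimal illustration, not the exact code: the file name,
JSON schema, and keyword-match scorer are assumptions, and the real grading
may be stricter.

```python
import json

from groq import Groq  # pip install groq; reads GROQ_API_KEY from the environment

client = Groq()

def score(answer: str, expected_keywords: list[str]) -> bool:
    # Illustrative scorer: an answer counts as fully correct only if it
    # mentions every expected keyword for the question.
    text = answer.lower()
    return all(kw.lower() in text for kw in expected_keywords)

# Hypothetical benchmark file: one JSON object per line with
# "question", "category", and "expected_keywords" fields.
questions = [json.loads(line) for line in open("benchmark.jsonl")]

correct = 0
for q in questions:
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Llama 3.1 8B as served on Groq
        messages=[{"role": "user", "content": q["question"]}],
        temperature=0.0,  # keep answers stable for grading
    )
    correct += score(resp.choices[0].message.content, q["expected_keywords"])

print(f"overall accuracy: {correct / len(questions):.1%}")  # baseline: ~43%
```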
Phase 2 (this grant): Run the same benchmark against Claude Sonnet 4,
GPT-4o, and Gemini 1.5 Pro. Produce a comparative evals paper asking: do
closed frontier models perform meaningfully better than open-weight models
on African agricultural knowledge domains? And does performance variation
across categories reveal systematic gaps relevant to deployment safety?
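A sketch of how the Phase 2 comparison might be wired up, with each provider
behind the same call signature. The model IDs, helper names, and keyword
scorer are illustrative assumptions; each SDK is used only through its
documented entry points.

```python
import json
import os
from collections import defaultdict
from statistics import mean

import anthropic                     # pip install anthropic
import google.generativeai as genai  # pip install google-generativeai
from openai import OpenAI            # pip install openai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def ask_claude(prompt: str) -> str:
    msg = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed Sonnet 4 model ID
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def ask_gpt4o(prompt: str) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    return genai.GenerativeModel("gemini-1.5-pro").generate_content(prompt).text

def score(answer: str, expected_keywords: list[str]) -> bool:
    # Same illustrative keyword scorer as the Phase 1 sketch.
    return all(kw.lower() in answer.lower() for kw in expected_keywords)

questions = [json.loads(line) for line in open("benchmark.jsonl")]

for name, ask in {"claude-sonnet-4": ask_claude,
                  "gpt-4o": ask_gpt4o,
                  "gemini-1.5-pro": ask_gemini}.items():
    per_category = defaultdict(list)
    for _ in range(3):  # 3 runs x 420 questions per model
        for q in questions:
            per_category[q["category"]].append(
                score(ask(q["question"]), q["expected_keywords"])
            )
    print(name, {cat: f"{mean(hits):.0%}" for cat, hits in per_category.items()})
```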
Phase 3 (post-paper): Publish the benchmark openly on HuggingFace.
Submit findings to an AI safety or AI fairness venue (AfricaNLP, FAccT,
or an AI safety workshop). Make the benchmark available for reuse by other
researchers evaluating agricultural AI tools in African contexts.
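The Phase 3 release could be as lightweight as the snippet below; the file
name and repo ID are placeholders, and a prior `huggingface-cli login` is
assumed.

```python
from datasets import Dataset  # pip install datasets

# Load the benchmark (one JSON object per line) and push it to the Hub.
bench = Dataset.from_json("benchmark.jsonl")            # hypothetical file name
bench.push_to_hub("Fatika01/nigerian-livestock-evals")  # hypothetical repo ID
```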
| Item | Cost |
| --- | --- |
| Anthropic API (Claude Sonnet, 3 runs × 420 questions) | ~$8 |
| OpenAI API (GPT-4o, 3 runs × 420 questions) | ~$10 |
| Google Gemini API (1.5 Pro, 3 runs × 420 questions) | ~$5 |
| HuggingFace dataset hosting + versioning | $0 (free tier) |
| arXiv preprint submission | $0 |
| Buffer for additional model versions and re-runs | ~$77 |
| Total requested (minimum) | $500 |
| Additional runs, extended benchmark (goats, poultry), conference fees | ~$1,000 |
| Funding goal | $1,500 |
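The per-model API line items above are in the ballpark of a
back-of-the-envelope estimate like the one below. The average token counts
and per-million-token prices are assumptions for illustration only and
should be checked against current provider pricing.

```python
# Rough API cost estimate: 3 runs x 420 questions per model.
CALLS = 3 * 420             # 1,260 requests per model
IN_TOK, OUT_TOK = 400, 300  # assumed average input/output tokens per request

prices = {  # (input $/M tokens, output $/M tokens) -- illustrative only
    "claude-sonnet":  (3.00, 15.00),
    "gpt-4o":         (2.50, 10.00),
    "gemini-1.5-pro": (1.25, 5.00),
}

for model, (p_in, p_out) in prices.items():
    cost = CALLS * (IN_TOK * p_in + OUT_TOK * p_out) / 1e6
    print(f"{model}: ~${cost:.2f}")
# claude-sonnet: ~$7.18, gpt-4o: ~$5.04, gemini-1.5-pro: ~$2.52
```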
With minimum funding ($500): Complete Phase 2 and publish the 4-model
comparative paper.
With full funding ($1,500): Extend the benchmark to small ruminants
(West African Dwarf goat, Red Sokoto), run additional model versions
(GPT-4.1, Claude Haiku vs Sonnet comparison), and cover a conference
submission fee.
Fatika Umar Ibrahim — Veterinary student and ML researcher, Nigeria.
Working at the intersection of veterinary science and applied ML with a
focus on Nigerian and African agricultural systems.
Relevant prior work:
- Milk yield predictor: XGBoost regression model (R² = 0.977) trained
on Nigerian indigenous breed parameters (White Fulani, Sokoto Gudali,
Adamawa Gudali, Red Bororo). Deployed on HuggingFace Spaces.
- Canine tumor classifier: CNN-based image classifier for veterinary
diagnostic support.
- VetDesk AI: Fully offline AI receptionist agent for veterinary clinics
with TF-IDF RAG engine and FastAPI layer, deployed on Streamlit Cloud.
HuggingFace portfolio: huggingface.co/Fatika01
Domain expertise: 4+ years of veterinary training with specific knowledge of
Nigerian indigenous breed characteristics, ethnoveterinary systems, and the
Nigerian agricultural context, the exact domain being evaluated. This
combination of veterinary domain knowledge and ML implementation ability
is the core differentiator of this project; most AI evals work lacks the
domain expertise to know when a model's answer is subtly wrong in context.
The benchmark questions are too hard / not representative: Mitigated by
the fact that questions were drawn from Nigerian veterinary curriculum
materials, published ethnoveterinary literature, and field practice, not
constructed to trip models up. The baseline result (43%) did not come from
adversarially hard questions; it reflects genuine knowledge gaps.
Models perform well on closed APIs and the paper has no finding:
Unlikely given the baseline, but if closed models score significantly
higher (e.g., >80%), that is itself a meaningful safety finding: it
suggests open-weight models should not be deployed for African agricultural
advisory without domain-specific fine-tuning.
Paper doesn't get accepted anywhere: The benchmark dataset itself has
standalone value as an open resource regardless of where the paper lands.
HuggingFace publication + preprint ensures the work is accessible and citable.
Funding raised in the last 12 months
$0 in external funding. All prior projects were self-funded using free-tier
compute (Google Colab, HuggingFace Spaces free tier, Groq free tier).