What is this project?
This project builds and publishes the first AI safety evaluation (evals)
benchmark specifically targeting Nigerian indigenous livestock knowledge,
a domain that is systematically underrepresented in LLM training data but
increasingly relevant as AI advisory tools are deployed across sub-Saharan
African agriculture.
AI models are already being used or piloted as advisory tools for smallholder
farmers across Africa. If those models fail silently on African-specific
knowledge, the consequences are concrete: wrong disease diagnoses, incorrect
breed management advice, and misguided treatment decisions affecting both
animal welfare and farmer livelihoods. This project quantifies exactly how
much, and where, current frontier models fail.
Phase 1 (complete): Build a 420-question benchmark covering six knowledge
categories (ethnoveterinary practices, indigenous breed characteristics,
disease recognition, production systems, nutrition, and regulatory context)
and run a baseline evaluation on a leading open-weight model (Meta Llama 3.1
8B via Groq).
Result: 43% overall accuracy across the 420 questions. The per-category
breakdown shows the worst performance on breed-specific numerical data
(e.g., milk yield figures, body weight ranges) and on indigenous disease
recognition cues that appear in Nigerian field practice but not in Western
veterinary literature. Baseline data is available for review on request.
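For context, the baseline harness is conceptually a loop like the sketch
below. This is a minimal illustration, not the exact code: the file name,
JSON schema, and keyword-match scorer are assumptions, and the real grading
may be stricter.

```python
import json

from groq import Groq  # pip install groq; reads GROQ_API_KEY from the environment

client = Groq()

def score(answer: str, expected_keywords: list[str]) -> bool:
    # Illustrative scorer: an answer counts as fully correct only if it
    # mentions every expected keyword for the question.
    text = answer.lower()
    return all(kw.lower() in text for kw in expected_keywords)

# Hypothetical benchmark file: one JSON object per line with
# "question", "category", and "expected_keywords" fields.
questions = [json.loads(line) for line in open("benchmark.jsonl")]

correct = 0
for q in questions:
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Llama 3.1 8B as served on Groq
        messages=[{"role": "user", "content": q["question"]}],
        temperature=0.0,  # keep answers stable for grading
    )
    correct += score(resp.choices[0].message.content, q["expected_keywords"])

print(f"overall accuracy: {correct / len(questions):.1%}")  # baseline: ~43%
```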
Phase 2 (this grant): Run the same benchmark against Claude Sonnet 4,
GPT-4o, and Gemini 1.5 Pro. Produce a comparative evals paper asking: do
closed frontier models perform meaningfully better than open-weight models
on African agricultural knowledge domains? And does performance variation
across categories reveal systematic gaps relevant to deployment safety?
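A sketch of how the Phase 2 comparison might be wired up, with each provider
behind the same call signature. The model IDs, helper names, and keyword
scorer are illustrative assumptions; each SDK is used only through its
documented entry points.

```python
import json
import os
from collections import defaultdict
from statistics import mean

import anthropic                     # pip install anthropic
import google.generativeai as genai  # pip install google-generativeai
from openai import OpenAI            # pip install openai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def ask_claude(prompt: str) -> str:
    msg = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed Sonnet 4 model ID
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def ask_gpt4o(prompt: str) -> str:
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    return genai.GenerativeModel("gemini-1.5-pro").generate_content(prompt).text

def score(answer: str, expected_keywords: list[str]) -> bool:
    # Same illustrative keyword scorer as the Phase 1 sketch.
    return all(kw.lower() in answer.lower() for kw in expected_keywords)

questions = [json.loads(line) for line in open("benchmark.jsonl")]

for name, ask in {"claude-sonnet-4": ask_claude,
                  "gpt-4o": ask_gpt4o,
                  "gemini-1.5-pro": ask_gemini}.items():
    per_category = defaultdict(list)
    for _ in range(3):  # 3 runs x 420 questions per model
        for q in questions:
            per_category[q["category"]].append(
                score(ask(q["question"]), q["expected_keywords"])
            )
    print(name, {cat: f"{mean(hits):.0%}" for cat, hits in per_category.items()})
```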
Phase 3 (post-paper): Publish the benchmark openly on HuggingFace.
Submit findings to an AI safety or AI fairness venue (AfricaNLP, FAccT,
or an AI safety workshop). Make the benchmark available for reuse by other
researchers evaluating agricultural AI tools in African contexts.
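The Phase 3 release could be as lightweight as the snippet below; the file
name and repo ID are placeholders, and a prior `huggingface-cli login` is
assumed.

```python
from datasets import Dataset  # pip install datasets

# Load the benchmark (one JSON object per line) and push it to the Hub.
bench = Dataset.from_json("benchmark.jsonl")            # hypothetical file name
bench.push_to_hub("Fatika01/nigerian-livestock-evals")  # hypothetical repo ID
```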
| Item | Cost |
| --- | --- |
| Anthropic API (Claude Sonnet, 3 runs × 420 questions) | ~$8 |
| OpenAI API (GPT-4o, 3 runs × 420 questions) | ~$10 |
| Google Gemini API (1.5 Pro, 3 runs × 420 questions) | ~$5 |
| HuggingFace dataset hosting + versioning | $0 (free tier) |
| arXiv preprint submission | $0 |
| Buffer for additional model versions and re-runs | ~$77 |
| Total requested (minimum) | $500 |
| Additional runs, extended benchmark (goats, poultry), conference fees | ~$1,000 |
| Funding goal | $1,500 |
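The per-model API line items above are in the ballpark of a
back-of-the-envelope estimate like the one below. The average token counts
and per-million-token prices are assumptions for illustration only and
should be checked against current provider pricing.

```python
# Rough API cost estimate: 3 runs x 420 questions per model.
CALLS = 3 * 420             # 1,260 requests per model
IN_TOK, OUT_TOK = 400, 300  # assumed average input/output tokens per request

prices = {  # (input $/M tokens, output $/M tokens) -- illustrative only
    "claude-sonnet":  (3.00, 15.00),
    "gpt-4o":         (2.50, 10.00),
    "gemini-1.5-pro": (1.25, 5.00),
}

for model, (p_in, p_out) in prices.items():
    cost = CALLS * (IN_TOK * p_in + OUT_TOK * p_out) / 1e6
    print(f"{model}: ~${cost:.2f}")
# claude-sonnet: ~$7.18, gpt-4o: ~$5.04, gemini-1.5-pro: ~$2.52
```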
With minimum funding ($500): Complete Phase 2 and publish the 4-model
comparative paper.
With full funding ($1,500): Extend the benchmark to small ruminants
(West African Dwarf goat, Red Sokoto), run additional model versions
(GPT-4.1, Claude Haiku vs Sonnet comparison), and cover a conference
submission fee.
Fatika Umar Ibrahim — Veterinary student and ML researcher, Nigeria.
Working at the intersection of veterinary science and applied ML with a
focus on Nigerian and African agricultural systems.
Relevant prior work:
- Milk yield predictor: XGBoost regression model (R² = 0.977) trained
on Nigerian indigenous breed parameters (White Fulani, Sokoto Gudali,
Adamawa Gudali, Red Bororo). Deployed on HuggingFace Spaces.
- Canine tumor classifier: CNN-based image classifier for veterinary
diagnostic support.
- VetDesk AI: Fully offline AI receptionist agent for veterinary clinics
with TF-IDF RAG engine and FastAPI layer, deployed on Streamlit Cloud.
HuggingFace portfolio: huggingface.co/Fatika01
Domain expertise: 4+ years of veterinary training with specific knowledge of
Nigerian indigenous breed characteristics, ethnoveterinary systems, and the
Nigerian agricultural context, the exact domain being evaluated. This
combination of veterinary domain knowledge and ML implementation ability
is the core differentiator of this project; most AI evals work lacks the
domain expertise to know when a model's answer is subtly wrong in context.
The benchmark questions are too hard / not representative: Mitigated by
the fact that questions were drawn from Nigerian veterinary curriculum
materials, published ethnoveterinary literature, and field practice, not
constructed to trip models up. The baseline result (43%) did not come from
adversarially hard questions; it reflects genuine knowledge gaps.
Models perform well on closed APIs and the paper has no finding:
Unlikely given the baseline, but if closed models score significantly
higher (e.g., >80%), that is itself a meaningful safety finding: it
suggests open-weight models should not be deployed for African agricultural
advisory without domain-specific fine-tuning.
Paper doesn't get accepted anywhere: The benchmark dataset itself has
standalone value as an open resource regardless of where the paper lands.
HuggingFace publication + preprint ensures the work is accessible and citable.
Funding raised in the last 12 months
$0 in external funding. All prior projects were self-funded using free-tier
compute (Google Colab, HuggingFace Spaces free tier, Groq free tier).