DIMBA is an open-source, non-autoregressive language model that generates text in parallel using diffusion and Mamba-2 State Space Models—no Transformers, no token-by-token generation. Current LLMs (GPT, Claude) are bottlenecked by sequential generation: latency scales with length, errors cascade, and compute costs explode. DIMBA treats text generation like an artist sketching the whole canvas and then refining the details—iteratively denoising the entire sequence across 8-20 parallel steps. This could unlock 10-100x faster inference for long outputs and enable truly on-device AI without cloud dependency.
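To make the "sketch the whole canvas, then refine" idea concrete, here is a minimal numpy sketch of parallel iterative denoising. The `denoise_step` function is a toy stand-in for the trained Mamba-2 denoiser, and the step schedule is illustrative—it only shows the control flow (every position refined on every step), not the project's actual model.

```python
import numpy as np

def denoise_step(x, target, t, num_steps):
    """Toy denoiser: nudge EVERY position toward the clean sequence at once.
    In DIMBA this would be a learned Mamba-2 network; here it is a stand-in."""
    alpha = 1.0 / (num_steps - t)  # later steps apply stronger corrections
    return x + alpha * (target - x)

def parallel_generate(seq_len, target, num_steps=16, seed=0):
    """Start from pure noise and refine the whole sequence each step,
    instead of emitting tokens one at a time."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=seq_len)       # noisy latent for every position
    for t in range(num_steps):
        x = denoise_step(x, target, t, num_steps)
    return x

target = np.linspace(-1.0, 1.0, 8)     # stand-in for a "clean" latent sequence
out = parallel_generate(8, target)
print(np.abs(out - target).max())      # residual shrinks as steps increase
```

The latency win comes from the loop running a fixed number of steps (8-20) regardless of sequence length, whereas autoregressive decoding runs one step per token.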
1. Prove that diffusion + Mamba-2 can generate coherent text at autoregressive quality levels
2. Train using our open-source training infrastructure for non-autoregressive language models
3. Release d1-small: an edge-capable model (~100M parameters) that runs inference on consumer GPUs
4. Demonstrate that parallel generation enables better long-range coherence than autoregressive models
5. Publish reproducible training recipes so the community can build on this architecture
With Minimum Funding ($1,500):
• Complete toy-scale implementation (1-20M parameters) proving core diffusion+Mamba mechanics
• Train small VAE on text corpus and generate first coherent sequences
• Validate hybrid consensus decoding approach on synthetic tasks
• Publish technical blog post documenting architecture and early results
• Release open-source codebase for community experimentation
• Outcome: Validates the approach but does not produce a useful model for real-world tasks
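The VAE milestone above hinges on getting the KL term right. As a sketch, assuming the standard choice of a diagonal-Gaussian posterior (the project's actual VAE details are not specified here), the closed-form KL against a standard-normal prior is:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dims.
    This is the closed form used in the standard VAE objective."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

# Sanity checks: KL is zero when the posterior equals the prior,
# and grows as the posterior mean drifts away from zero.
mu = np.zeros((2, 4))
logvar = np.zeros((2, 4))
print(kl_to_standard_normal(mu, logvar))        # [0. 0.]
print(kl_to_standard_normal(mu + 1.0, logvar))  # [2. 2.]  (0.5 per dim x 4 dims)
```

A quick sanity check like this catches the most common VAE bug—sign or factor errors in the KL—before any compute is spent on training.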
With Full Funding ($50,000):
Phase 1: Foundation & Scaling
• Scale to d1-base (~1-2B parameters) using filtered web corpus + code
• Implement training efficiency optimizations (gradient checkpointing, mixed precision, distributed training)
• Multiple ablation studies on architecture choices (diffusion steps, Mamba-2 config, latent dimensions)
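The ablation studies in Phase 1 amount to sweeping a grid over architecture choices. A minimal sketch of how such a grid might be enumerated—the axes and values here are illustrative placeholders, not the project's actual sweep:

```python
from itertools import product

# Illustrative ablation axes; the real sweep values are project decisions.
grid = {
    "diffusion_steps": [8, 12, 20],
    "mamba2_layers": [12, 24],
    "latent_dim": [256, 512],
}

def ablation_configs(grid):
    """Yield one config dict per point in the Cartesian product of the axes."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(ablation_configs(grid))
print(len(configs))  # 3 * 2 * 2 = 12 runs
```

Enumerating the grid up front makes the compute budget explicit: each added axis value multiplies the number of training runs, which matters when 90% of the budget is GPU credits.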
Phase 2: Training & Post-Processing
• Full training run of d1-base with high-quality data mixture
• Stretch goal: Attempt 4B parameter model if efficiency optimizations exceed expectations
• Post-training alignment: RLHF, GRPO, or similar techniques for instruction following and reasoning
• Develop "thinking variant" with extended inference-time computation for complex reasoning tasks
• Rigorous benchmarking against GPT-2-medium, Pythia-1.4B, and other open models
• Comprehensive evaluation: perplexity, downstream tasks, inference latency, human preference eval
Phase 3: Ship & Publish
• Publish peer-reviewed research paper with full architectural details, training logs, and reproducible recipes
• Release trained model weights (d1-base, 4B if achieved, and thinking variants)
• Open-source complete training infrastructure and evaluation suite
• Write accessible blog post for broader ML community
• Build community around non-autoregressive architectures
Outcome: First open-source, production-capable non-autoregressive language model with alignment and reasoning capabilities; validated alternative to transformer autoregression
• Compute credits: 90% ($45,000) — GPU clusters for training d1-base (1-2B params), potential 4B attempt, ablation studies, RLHF/GRPO post-training, and benchmarking. Primary providers: Lambda Labs, RunPod, or CoreWeave.
• Developer tools & infrastructure: 5% ($2,500) — Cloud storage for datasets and checkpoints, experiment tracking, API access for evaluation baselines.
• Misc operational: 5% ($2,500) — Domain registration, documentation hosting, human evaluation for preference data, backup compute for debugging.
Solo founder. I'm a 13-year-old independent researcher based in UAE. I've been building systems since age 6 and focused on ML/blockchain for the past few years. This is my full attention outside of school—nights, weekends, and breaks. I handle all research, coding, and infrastructure myself. For specialized feedback, I tap into ML research Discords and academic Twitter, but no formal co-founders or employees yet.
Ghost Blockchain (2023-present, still pre-release):
• Built custom blockchain from scratch using Substrate (Rust)
• Implemented hybrid PoW/PoS consensus with post-quantum Dilithium signatures
• Created "Entropy-Steered Consensus" for finalization
• ~15,000 lines of production Rust code, fully open source (not yet released; alpha is on the way)
DIMBA Research (2024-Present):
• Published theoretical paper on diffusion-based language modeling
• Built working VAE implementation with proper KL loss and training scripts
• Created dimba-lib-exp repo with 6 experimental tools:
  • Token Decoder Laboratory (6 decoding strategies)
  • Latent Space Navigator (VAE exploration)
  • Mamba State Visualizer (SSM dynamics)
  • 3 others for training infrastructure
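Two of the simplest decoding strategies a tool like the Token Decoder Laboratory would compare are greedy argmax and temperature sampling. A toy numpy sketch—the six strategies actually implemented in dimba-lib-exp may differ:

```python
import numpy as np

def greedy(logits):
    """Pick the highest-scoring token deterministically."""
    return int(np.argmax(logits))

def temperature_sample(logits, temperature=1.0, rng=None):
    """Sample from softmax(logits / T): T < 1 sharpens, T > 1 flattens."""
    rng = rng or np.random.default_rng(0)
    z = logits / temperature
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=p))

logits = np.array([0.1, 2.5, 0.3, 1.0])
print(greedy(logits))                    # 1 (index of the largest logit)
print(temperature_sample(logits, temperature=0.5))
```

Comparing strategies side by side on the same logits is exactly the kind of experiment such a laboratory enables: greedy is reproducible but repetitive, while sampling trades determinism for diversity.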
Community:
• Active contributor in the Mamba and broader ML research communities
• Open source contributions to inference optimization tooling
What are the most likely causes and outcomes if this project fails?
1. Training instability at scale
• Likelihood: Low-Medium
• Outcome: Model fails to converge at 1B+ parameters or produces incoherent output; project pivots to publishing negative results with detailed failure analysis
• Mitigation: Extensive toy-scale (10M-100M) ablations before scaling; incremental scaling (100M → 500M → 1B); established diffusion techniques from image domain; checkpoint frequently to resume from failures
2. Quality ceiling vs autoregressive models
• Likelihood: Low-Medium
• Outcome: Non-autoregressive generation cannot match GPT-2/Pythia quality at equivalent parameter counts; architecture insight becomes "interesting but impractical for production"
• Mitigation: Hybrid decoding strategies already prototyped; ability to increase diffusion steps trades speed for quality; continuous latent space may enable better reasoning than discrete AR
3. Compute exhaustion mid-training
• Likelihood: Medium-High
• Outcome: Training run crashes or diverges at 60-80% completion with insufficient funds to restart; forced to ship smaller model (500M-700M) instead of 1B+ target
• Mitigation: Aggressive checkpointing every 10% of training; spot instance automation to reduce costs; $50k buffer includes $10k contingency for restarts; Mamba-2 efficiency keeps costs lower than transformer equivalent
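Checkpointing "every 10% of training" boils down to a save-and-resume loop. A minimal sketch using JSON state—the file name is illustrative, and a real checkpoint would also store model weights and optimizer state, not just the step counter:

```python
import json
import os

CKPT = "ckpt.json"   # illustrative path; real runs would use versioned paths
TOTAL_STEPS = 100

def load_checkpoint():
    """Resume from the last saved step, or start from scratch."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def train():
    step = load_checkpoint()
    while step < TOTAL_STEPS:
        step += 1                              # one training step (stand-in)
        if step % (TOTAL_STEPS // 10) == 0:    # save every 10% of training
            with open(CKPT, "w") as f:
                json.dump({"step": step}, f)
    return step

done = train()
print(done)          # 100
os.remove(CKPT)      # tidy up the demo file
```

With this pattern, a crash or spot-instance preemption loses at most the last 10% of progress: rerunning `train()` picks up from the most recent checkpoint instead of step zero.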
4. Post-training alignment challenges
• Likelihood: Medium (it has been done with diffusion models, but not with DIMBA)
• Outcome: RLHF/GRPO on diffusion models proves unstable or ineffective; "thinking variant" fails to show reasoning improvements; models are capable but not instruction-following
• Mitigation: Fallback to simpler SFT (supervised fine-tuning) if RL fails; release base models as research artifacts even if alignment is weak; document what doesn't work for the field
5. Research scooped
• Likelihood: Low
• Outcome: Major lab (OpenAI, DeepMind, Meta) releases diffusion LLM with similar approach before we ship
• Mitigation: Focus on open-source reproducibility niche; big labs rarely release training code or weights; community values independent replication; our "thinking variant" exploration may still be novel
If project fails entirely:
Outcomes include published negative results (valuable for field), open-sourced tooling useful for other diffusion research, validated/invalidated hypotheses about non-autoregressive architectures, and trained smaller models as artifacts for the community.
How much money have you raised in the last 12 months, and from where?
$0.
This project is entirely self-funded through personal time and free-tier resources (Google Colab, GitHub). No grants, no investors, no crowdfunding. I applied to Emergent Ventures and 1517 Fund's Medici Project in February 2026 (pending decisions).