Roadmap (Two-Track)¶

As of 2026-04-20, pccx is developed along two parallel tracks. v002 is the currently active architecture; v003 targets the next-generation model (Gemma 4 E4B) on the same KV260 platform. The two tracks share RTL assets — sparse weight fetcher, SSD dispatcher, tree mask generator, EAGLE training pipeline.

The long-term stretch goal is an Auto-Porting Pipeline α — a compiler that emits a pccx ISA stream from an arbitrary transformer config. It begins in Year 2 once both tracks are stable.

1. Common assumptions¶

Platform: Xilinx Kria KV260 (Zynq UltraScale+ ZU5EV), bare-metal
Quantization: W4A8 (INT4 weights × INT8 activations), KV cache INT4
Clock: AXI 250 MHz, core 400 MHz
VLIW ISA: 5 base opcodes (GEMV, GEMM, MEMCPY, MEMSET, CVO) + SPEC extensions
L2 URAM 2.25 MB budget: 1.0 MB activation pinning, 0.5 MB KV prefetch, 0.25 MB tree mask + scheduler state, 0.5 MB hot-attention tile

2. Track 1 — v002 Extended (Gemma 3N E4B, 20 tok/s)¶

Layers FFN sparsity and a speculative-decoding stack on top of the active v002 to reach the promised 20 tok/s measured throughput.

Tiered targets¶

Tier	tok/s	Gate
Baseline	5–6	Phase A–F done
Viable	10–12	Phase G + H done
Promise	20	Phase G–K done
Stretch	25+	Add Tree EAGLE (Phase J)

Phase plan¶

Phase	Weeks	Main work	Target tok/s
A–F	1–26	Re-parameterization → driver → Gemma 3N app → verification → synthesis → board bring-up	5–6
G	27–30	All-layer Gaussian Top-K sparsity (BW 1.95 → 1.36 GB/token)	8–9
H	31–32	Gemma 3 1B vanilla drafter (fast path, tokenizer-compatible)	11–14
H+	33–38	EAGLE-3 head training & swap-in ($20–30 on Vast.ai RTX 4090)	14–16
I	39–42	SSD async overlap (draft/verify pipelining)	17–19
J	43–46	Tree EAGLE (optional, stretch)	20–23
K	47–49	Final tuning & official benchmark	20

Why the schedule changed: the original Phase H trained EAGLE-3 first. Current plan attaches Gemma 3 1B as a vanilla drafter first (fast path) and swaps in a trained EAGLE-3 head only if the measured acceptance rate is insufficient.

Decision points & fallbacks¶

        flowchart TD
    W26{Week 26:<br>baseline < 5 tok/s?}
    W26 -- Yes --> RCA[Root-cause analysis, hold G–K]
    W26 -- No --> W36{Week 36:<br>EAGLE acc. < 2.0x?}
    
    W36 -- Yes --> PI[Run Phase I only, drop J]
    W36 -- No --> W40{Week 40:<br>< 15 tok/s?}
    
    W40 -- Yes --> PJ[Phase J becomes mandatory]
    W40 -- No --> W47{Week 47:<br>< 20 tok/s?}
    
    W47 -- Yes --> Settle[Settle for 15–18 tok/s, push 20 to v003]
    W47 -- No --> Success([Goal 20 tok/s Met!])

Week	Condition	Action
26	baseline < 5 tok/s	Root-cause analysis, hold G–K
36	EAGLE acceptance < 2.0×	Run Phase I only, drop J
40	< 15 tok/s	Phase J becomes mandatory
47	< 20 tok/s	Settle for 15–18 tok/s, push 20 tok/s to v003

3. Track 2 — v003 (Gemma 4 E4B, 12–15 tok/s)¶

Gemma 4 E4B (42 layers, MQA, sliding + full attention, 128 K context) on the same KV260 platform. Reuses v002 RTL assets to cut Phase 2+ implementation cost by ~30 %.

Tiered targets¶

Tier	tok/s	Gate
Minimum viable	10	Phase 2 done
Acceptable	12	Phase 3 done
Target	12–15	Phase 5 done
Stretch	15+	Add experimental techniques (e.g. DEER)

Phase plan¶

Phase	Weeks	Main work	Target tok/s
1	16–26	Foundation — extend `quantize_and_save`, vocab trim 262K → 50K, re-parameterize RTL	7
2	27–34	EAGLE-3 linear chain baseline ($30–50 or TRC TPU if granted)	10
3	35–39	Tree EAGLE verify (acceptance 3.5–4×)	12
4	40–43	SSD async overlap (reuse v002 Phase I RTL)	13–14
5	44–52	P-EAGLE + LTD (dynamic K, RL policy)	15

Cross-track asset reuse¶

        flowchart LR
  subgraph v002["v002 Extended — Gemma 3N E4B"]
    A[A–F baseline] --> G[G: sparsity] --> H[H/H+: EAGLE-3] --> I[I: SSD] --> J[J: Tree] --> K[K: 20 tok/s benchmark]
  end
  subgraph v003["v003 — Gemma 4 E4B"]
    P1[1: foundation] --> P2[2: EAGLE linear] --> P3[3: Tree] --> P4[4: SSD] --> P5[5: P-EAGLE + LTD]
  end
  G -. sparse weight fetcher .-> P1
  H -. EAGLE training pipeline .-> P2
  J -. tree mask generator .-> P3
  I -. SSD dispatcher .-> P4

4. Integrated timeline (52 weeks)¶

        gantt
    title Integrated Timeline (52 Weeks)
    dateFormat  YYYY-MM-DD
    
    section v002 (Track 1)
    Phase A–C               :v2_1, 2026-01-01, 105d
    Phase D                 :v2_2, after v2_1, 21d
    Phase E–F (5-6 tok/s)   :v2_3, after v2_2, 56d
    Phase G (Sparsity)      :v2_4, after v2_3, 28d
    Phase H / H+ (EAGLE)    :v2_5, after v2_4, 56d
    Phase I (SSD)           :v2_6, after v2_5, 28d
    Phase J (Tree)          :v2_7, after v2_6, 28d
    Phase K (Tuning)        :v2_8, after v2_7, 21d
    20 tok/s benchmark      :milestone, m1, after v2_8, 0d
    
    section v003 (Track 2)
    Phase 1 (Foundation)    :v3_1, after v2_1, 77d
    Phase 2 (EAGLE linear)  :v3_2, after v3_1, 56d
    Phase 3 (Tree EAGLE)    :v3_3, after v3_2, 35d
    Phase 4 (SSD overlap)   :v3_4, after v3_3, 28d
    Phase 5 (P-EAGLE)       :v3_5, after v3_4, 63d
    12-15 tok/s benchmark   :milestone, m2, after v3_5, 0d

Week	v002	v003
1–15	Phase A–C	—
16–18	Phase D	Phase 1 starts
19–26	Phase E–F (baseline 5–6 tok/s)	Phase 1 continues
27–30	Phase G	Phase 2 starts
31–38	Phase H / H+	Phase 2 completes
39–42	Phase I	Phase 3
43–46	Phase J (optional)	Phase 4
47–49	Phase K (20 tok/s benchmark)	Phase 4 continues
50–52	—	Phase 5 (15 tok/s)

Total 52 weeks (~12 months) assuming full-time solo work. Double the duration for a part-time schedule.

5. Compute budget¶

        pie title Estimated Compute Budget Allocation (Total ~$100)
    "EAGLE-3 Gemma 3N (v002)" : 30
    "EAGLE Tree variant (v002)" : 15
    "EAGLE-3 Gemma 4 (v003)" : 45
    "Initial deposits" : 10

Item	Cost	Window
Vast.ai / RunPod sign-up	$10 minimum deposit	Week 0
v002 Phase H+ EAGLE-3 (Gemma 3N)	$20–30	Week 33–38
v002 Phase J EAGLE tree variant	$10–15	Week 43–46
v003 EAGLE-3 (Gemma 4)	$30–50 (or $0 with TRC)	Week 27–34
Total	$70–100 ($40 with TRC)	—

Submit TRC TPU application at Phase 0 (1–2 week approval).
First 3–4 weeks need no cloud compute — local development only.
GPU becomes necessary starting at Phase H+.

6. Year 2 — Auto-Porting Pipeline α (stretch)¶

Kicks off once v002 + v003 are stable. Takes an arbitrary transformer config.json + weight safetensors and emits C code that calls the pccx driver API, plus a quantized weight binary.

Technical shape¶

        flowchart TD
  CFG[Transformer config.json] --> P[Config Parser]
  P --> R[Architecture Resolver]
  R --> F[Special Feature Detector]
  F --> I[ISA Code Generator]
  I --> W[Weight Layout Planner]
  W --> C[C Stub Generator]
  C --> V[Validator: Python golden vs RTL]

Target priority¶

Tier 1: regenerate hand-ported models — Gemma 3N E4B, Gemma 4 E4B
Tier 2: standard architectures — Llama 3.x, Qwen3, Mistral 7B
Tier 3: complex architectures — DeepSeek-V3 (MoE), Gemma 4 26B A4B (MoE), Phi-3/4

Schedule (Week 53+)¶

Week 53–76 (24 weeks / 6 months), full-time:

        gantt
    title Year 2 — Auto-Porting Pipeline α (Weeks 53–76)
    dateFormat  YYYY-MM-DD
    
    section Core Pipeline
    Parser & Resolver       :y2_1, 2027-01-01, 42d
    Tier 2 (Llama, Mistral) :y2_2, after y2_1, 42d
    MoE & Plugins           :y2_3, after y2_2, 42d
    E2E UI / CLI / Docs     :y2_4, after y2_3, 42d

Week 53–58 — Parser + Resolver + Gemma 3N/4 regeneration check
Week 59–64 — Tier 2 support (Llama, Qwen, Mistral)
Week 65–70 — Feature plugin system + MoE support
Week 71–76 — E2E automation, web UI / CLI, docs

Publication angle¶

“Auto-Compilation of Transformer Inference Workloads to Custom NPU ISAs” — target venues ISCA / MICRO / HPCA / FCCM / FPGA.

7. Milestones¶

Year 1 KPIs¶

Week 26 — Coherent Gemma 3N E4B decode on board, 5+ tok/s
Week 38 — EAGLE-3 Gemma 3N checkpoint released on HF (first public)
Week 47 — Gemma 3N E4B 20 tok/s officially measured ← promise met
Week 52 — Gemma 4 E4B at 12+ tok/s
Blog post / paper draft (v002 results)

Year 2 KPIs (Auto-Porting α)¶

Week 76 — Llama 3.1 8B auto-generated + running on KV260
Year 2 end — 5+ model families supported
Academic publication

8. RTL repository¶

Both tracks are implemented in hwkim-dev/pccx-FPGA-NPU-LLM-kv260. At v002 freeze the codes/v002/ snapshot is pinned into this docs repo (see the version-cutover ceremony), then v003 branches off.

Document version: 2026-04-20, first edition. Compiled from local plan drafts (pccx_master_roadmap_final.md, pccx_v002_extended_20toks_plan.md, tinynpu_v003_gemma4_e4b_plan.md). Next update: at v002 Phase F completion (~ Week 26).