Glossary¶

This page defines project terms as they are used in the pccx public documentation. It is intentionally conservative: planned work, throughput targets, and board measurements are labelled as such.

Project And Release Lines¶

pccx: Parallel Compute Core eXecutor. A hardware-software co-design project for NPU architectures targeting edge inference workloads.
v001: Archived experimental pccx architecture line. It remains in the docs as historical context and should not be treated as the active RTL target.
v002: Active KV260 LLM architecture line. In this docs site, v002 usually means the public architecture, ISA, driver, RTL-reference, and verification pages for the current pccx-FPGA-NPU-LLM-kv260 line.
v002.0: Baseline v002 integration line on KV260. Throughput language for this line is measured-only until release evidence is published.
v002.1: Planned continuation of v002 on the same RTL repository. The roadmap scopes sparsity and speculative-decoding work to this line. The 20 tok/s number is a target for this line, not a reported board result.
v003.x: Planned LLM continuation in a separate RTL repository. Public documentation treats v003 as a future line until its repository and release branches are stabilized.
vision-v001: Parallel CNN inference track that reuses the KV260 substrate but targets vision workloads rather than autoregressive LLM decoding.
pccx-lab: Companion verification and profiling environment for pccx traces, reports, and workflow automation. Public claims derived from lab output still need the release evidence gates described in the roadmap.
pccx-llm-launcher: Companion launcher repository for model preparation, runtime contracts, and KV260-facing orchestration. Current public launcher pages describe scaffold, mock, and contract surfaces unless they cite board evidence.

Hardware Target¶

KV260: Xilinx Kria KV260 Starter Kit, based on the Zynq UltraScale+ ZU5EV device. It is the primary board target for v002 public documentation.
kv260: Lowercase slug used in repository names, branch names, build directories, or scripts when a filesystem-safe target identifier is needed.
Zynq UltraScale+: AMD/Xilinx SoC family that combines a Processing System and Programmable Logic fabric. The KV260 target uses a ZU5EV part.
PS: Processing System. The Arm-based host side of the Zynq device.
PL: Programmable Logic. The FPGA fabric side where the pccx NPU RTL is implemented.
AXI: Arm AMBA interconnect protocol family used for host, memory, and streaming interfaces in the design.
AXI-HP: High-Performance AXI ports from the PS to PL. In v002 documentation these ports are used for high-bandwidth weight traffic into the NPU.
ACP: Accelerator Coherency Port. In pccx docs, ACP refers to the coherent path used for activation/result traffic between host memory and the accelerator.
DSP48E2: Xilinx DSP slice available in UltraScale+ devices. pccx v002 uses DSP48E2 packing for the W4A8 GEMM datapath.
BRAM: Block RAM in the FPGA fabric. pccx uses BRAM for smaller local buffers and per-core storage structures.
URAM: UltraRAM in the FPGA fabric. pccx v002 uses URAM for the shared L2 cache and weight buffering structures described in the architecture docs.
CDC: Clock-domain crossing. Used where data moves between the AXI/control clock domain and the core compute clock domain.
Vivado block design: Xilinx Vivado IP-integrator design graph. In the v002.1 docs, a block-design scaffold is build setup material, not proof that implementation or timing has completed.
bitstream: FPGA configuration artifact produced after synthesis and implementation. Public pccx docs should call a bitstream deployable only when the matching evidence page or release checklist links the build, timing, and board artefacts.
SD staging: Packaging step that prepares files for booting or testing the KV260 from SD media. It is a deploy-preparation step and does not by itself establish a hardware run.

Data Types And Numeric Formats¶

W4A8: Weight-4, Activation-8 quantization. In pccx v002 this means INT4 weights multiplied by INT8 activations on the main integer compute path.
W4A8KV4: Shorthand used for an evidence-gated Gemma 3N E4B target configuration: W4A8 compute with 4-bit KV-cache storage. Treat it as a target configuration label unless a page cites measured evidence.
INT4: Signed 4-bit integer value, used for quantized weights in the W4A8 path.
INT8: Signed 8-bit integer value, used for quantized activations in the W4A8 path.
BF16: Brain floating point format with an 8-bit exponent and 7-bit mantissa. pccx docs use BF16 for activation, KV-cache, or SFU paths where integer-only arithmetic is not the intended representation.
FP32: IEEE single-precision floating point. Public docs mention FP32 only where the operation needs a higher-precision software or SFU-side representation.
Precision promotion: Conversion from the integer compute path to BF16 or FP32 for non-linear or numerically sensitive operations such as softmax, RMSNorm, GELU, and RoPE.
Sign recovery: The correction step used when signed low-bit operands are packed into a wider multiply datapath. In pccx docs the term is tied to W4A8 DSP packing, not to model-level accuracy claims.
Activation quantization: Policy for converting activation values into the representation consumed by the integer datapath. The v002.1 decision page names the default policy but does not claim final model accuracy.
e_max: Maximum-exponent summary used by the v002.1 activation-scale policy. Public docs describe it as a scale-selection mechanism, not as measured accuracy or throughput evidence.
BFP: Block floating point. In the v002.1 activation policy, BFP refers to a shared power-of-two activation scale for a block of values.
symmetric INT8: Reviewed activation-scale mode that uses symmetric signed INT8 quantization. The design-decision page keeps it as a mode under review rather than the v002.1 default.
constant-cache scale: Driver-provided activation-scale table or constant path. It remains a reviewed mode until the hardware/software interface and tests make it the chosen default.
ACT_SCALE_POLICY: Public parameter handle for the v002.1 activation scaling policy.
ACT_SCALE_EMAX_BFP: Default v002.1 activation-scale mode named by the design-decision page: e_max plus BFP power-of-two scaling.

Compute Blocks¶

GEMM: General Matrix-Matrix Multiply. In v002 it is the matrix core used mainly for prefill and other matrix-heavy work. The architecture docs describe a 32 x 32 systolic array for the KV260 configuration.
GEMV: General Matrix-Vector Multiply. In v002 it is the vector core used for decode-dominant work where a new token repeatedly multiplies an activation vector by streamed weights.
CVO: Complex Vector Op. ISA opcode family for non-linear vector operations and reductions that execute on the SFU path.
SFU: Special Function Unit. The backend that executes CVO operations such as exp, sqrt, GELU, sin/cos, reduce-sum, scale, and reciprocal.
PE: Processing Element. A compute cell in the systolic array or related datapath.
Systolic array: Regular grid of PEs that moves operands through a fixed pattern. In pccx v002 public docs, this term usually refers to the GEMM array.
Weight Stationary: GEMM dataflow where a weight tile is loaded into the array and reused across many activation steps.
Weight Streaming: GEMV dataflow where weights stream through the vector datapath because each weight is used once for the current token step.
LUT: Lookup table. In the FPGA sense, LUTs are logic resources. In the algorithmic sense, pccx docs also use lookup tables for some dequantization or SFU helper paths; read the local context.
CORDIC: Iterative coordinate-rotation method used for selected transcendental functions. pccx docs mention CORDIC as part of the SFU implementation path.
K-split: Division of the reduction dimension into chunks. v002.1 docs discuss it with drain cadence and accumulator bounds, not as a completed scheduler claim.
drain cadence: Frequency at which partial accumulators are drained from a K-split path. The current v002.1 default is parameterized rather than hardwired into a public performance claim.
K_DRAIN_LIMIT: Public parameter handle for the v002.1 K-split accumulator drain limit. The documented default is 1024.
DSP accounting baseline: Convention for reporting intended compute-core DSP usage separately from implementation extras. Actual utilization still comes from synthesis reports.
DSP_BASELINE_GEMM: GEMM compute-core DSP baseline parameter. The v002.1 decision page sets it to 1024 for the 32 x 32 PE grid.
DSP_BASELINE_GEMV: GEMV compute-core DSP baseline parameter. The v002.1 decision page sets it to 64 for four 16-DSP vector lanes.
DSP_BASELINE_ALPHA: Accounting bucket for implementation extras outside the GEMM/GEMV baseline.

ISA And Runtime Terms¶

ISA: Instruction Set Architecture. pccx v002 uses a custom fixed-width 64-bit ISA for compute, memory, and CVO instructions.
VLIW: Very Long Instruction Word. In pccx docs this describes the fixed-width instruction format and explicit fields used by the NPU dispatcher.
opcode: Operation-code field in an instruction. The v002 ISA pages are the source of truth for opcode values and instruction field layouts.
GEMM instruction: v002 ISA compute instruction that dispatches matrix-matrix work to the GEMM backend.
GEMV instruction: v002 ISA compute instruction that dispatches matrix-vector work to the GEMV backend.
MEMCPY instruction: v002 ISA memory movement instruction. See the ISA reference for supported source and destination paths.
MEMSET instruction: v002 ISA instruction used to write shape or constant-table state rather than to run arithmetic.
CVO instruction: v002 ISA instruction that dispatches an SFU function over a vector or reduction operand.
HAL: Hardware Abstraction Layer. The C/C++ driver layer that wraps register, memory, and instruction-dispatch details for host software.
Sail: ISA-specification language used by the pccx formal model. In pccx docs, Sail models are used to check instruction semantics and field widths against the intended ISA structure.
launcher contract: Data-only interface between the planned KV260 runtime path and launcher software. A contract page describes shapes and guardrails; it is not board execution evidence.
readiness scaffold: Typed placeholder or adapter surface that makes a future hardware path reviewable before device access is implemented.
AXI command/status shapes: Launcher-side data structures for command and status exchange over the future KV260 boundary. Shape validation is contract evidence, not a live MMIO run.
result streaming: Runtime path for returning generated tokens or accelerator results. Public docs should distinguish mock streams, serial test framing, and captured board streams.
serial TTY: Character-device path used by launcher or lab tooling to exchange framed records with a connected target. Tests that skip without a device are not board evidence.
TraceStream: pccx-lab iterator contract for trace records. File replay and serial TTY sources can share this surface while still having different evidence status.
KVFPGA_TTY: Environment or configuration path naming the serial device used by the KV260 trace source.
newline JSON framing: Trace framing style where one JSON payload is carried per line between begin/end markers.
CRC: Cyclic redundancy check. In pccx-lab trace framing docs it is used to detect corrupted payloads; skipped bad frames should not be counted as valid hardware evidence.
sequence gap: Missing trace-frame sequence number reported by the lab pipeline. It is a diagnostic signal that the captured stream may be incomplete.

Memory And Model Terms¶

L1: Local per-core memory or buffer close to a compute backend.
L2: Shared on-chip cache in the v002 architecture. It is backed by URAM and is shared by GEMM, GEMV, SFU, and memory-dispatch paths.
Weight Buffer: On-chip FIFO/buffer path for model weights arriving from external memory. GEMM uses it for weight preload/reuse; GEMV uses it for streaming.
KV cache: Attention key/value storage retained across autoregressive decoding steps. pccx docs distinguish KV-cache design targets from measured board capacity or throughput claims.
Attention Sink: KV-cache policy term for retaining the first tokens of a prompt while using a sliding local window for recent tokens.
Local Window: KV-cache policy term for the recent-token region retained during long-context decoding.
RoPE: Rotary Position Embedding. pccx maps RoPE-related sine and cosine work to CVO operations in the SFU path.
RMSNorm: Root Mean Square Layer Normalization. In pccx docs this is one of the non-linear or reduction-heavy operations associated with the SFU path.
Softmax: Normalization used in attention. pccx docs map its exponential, reduction, reciprocal, and scale steps to CVO/SFU operations.
GELU: Gaussian Error Linear Unit activation. pccx docs map GELU to the CVO/SFU path.
Gemma 3N E4B: Target LLM family named in the v002 public docs. Claims about token rate or board execution remain evidence-gated unless the page cites published verification data.
GemmaArchSpec: Launcher-side configuration object for Gemma shape metadata and packed-size checks. It is a spec-validation surface, not a model execution claim.
W4 prep: Launcher-side preparation of signed W4 packed weights and related metadata. Current docs treat it as a software contract until hardware handoff evidence lands.
manifest metadata: Structured metadata that records prepared weight shapes, scales, packed sizes, or related handoff fields for the launcher path.
tokenizer contract: Offline tokenizer interface used by the launcher scaffold. Placeholder fixtures do not claim real Gemma tokenizer assets.
token streaming: Movement of prompt or generated-token data across a runtime boundary. In the current software-path docs, serial and mock streaming are scaffold evidence until board captures are published.
marker-wrapped chunks: Token-transport records delimited by explicit markers, sometimes with length prefixes. They define framing behavior rather than hardware throughput.
mock orchestration: End-to-end software path that joins prompt encode, W4 prep, mock command polling, output receive, and decode without a real board run.
AltUp: Gemma-specific multi-stream state item named in v002.1 FAQ material. Its effect on throughput or memory pressure still needs measured evidence before public claims.
LAuReL: Gemma-specific mechanism named in model and FAQ pages. Public docs may describe the mapping, but speedup or accuracy claims need evidence.
PLE: Per-Layer Embedding mechanism referenced by Gemma model docs. Treat PLE-related scheduling text as design mapping unless an evidence page links a measurement.
grouped-query attention: Attention variant that shares key/value projections across query groups. pccx docs discuss it as part of the Gemma mapping and KV-cache traffic budget.
cross-layer KV sharing: Gemma-specific KV reuse pattern that affects cache residency and traffic. Public docs should keep it separate from measured throughput claims.
EAGLE-3: Speculative-decoding technique named in the v002.1 roadmap scope. In this repo it is planned work, not a completed v002.0 feature.
SSD: Speculative-decoding roadmap item in the v002.1 scope. Expand or redefine the acronym at the point of use when adding detailed public documentation.
J Tree: Roadmap shorthand associated with the v002.1 speculative-decoding stack. Treat it as planned scope until a design page defines and verifies it.
G sparsity: Roadmap lane for v002.1 sparsity work. It should be described as ramp scope until implementation and evidence pages say more.
H/H+: Roadmap shorthand for EAGLE-3 speculative-decoding phases in the v002.1 ramp.
I SSD: Roadmap shorthand for the SSD phase in the v002.1 ramp.
K benchmark: Roadmap shorthand for benchmark/evidence work after the v002.1 mechanism lanes. Benchmarks become public claims only through the evidence gates.

Metrics And Evidence¶

tok/s: Tokens per second. pccx uses this as the primary user-visible decoding throughput unit.
TT: Throughput target. This is planning shorthand for a target token rate, not a measurement. Public pages should prefer spelling out “throughput target” on first use.
measured-only: Documentation posture for the v002.0 release line: do not quote throughput, timing closure, or board-run claims until the evidence checklist admits those measurements.
bring-up: Hardware integration phase where the bitstream, board setup, host driver, and smoke tests are made to run together. Bring-up logs are evidence inputs, not automatically release claims.
release evidence: Checklist-gated artifacts used to decide whether timing, throughput, or board-execution statements are allowed in public docs.
evidence inventory: Public list of measured, reproducible artefacts and pending gates. It is the place to check whether a value is measured, pending, or only a target.
claim guard: Review rule or scan that prevents public docs from turning targets, scaffolds, mocks, or pending gates into completed hardware claims.
pre-flight: Preparatory state for build, launcher, or deploy work before the full command sequence has been run and evidence has been captured.
smoke capture: Small board or tool run used to collect initial logs. It can support bring-up evidence, but it does not replace release evidence for timing or throughput.
timing report: Vivado report used to justify timing wording. A docs page should not claim timing closure without a linked report or release evidence entry.
utilization report: Vivado report used to justify FPGA resource wording such as DSP, LUT, BRAM, or URAM counts.
throughput target: Planned token-rate goal. It must remain distinct from measured throughput in public wording.
board run: Execution against a connected KV260 or other named target board. Mock tests, type checks, and local software orchestration are not board runs.
trace replay: Analysis of an existing .pccx trace file through pccx-lab tooling. Replay can validate analysis paths without proving new hardware execution.

Documentation And Release Terms¶

spec resolution: Reader step that separates architecture intent, model mapping, ISA source of truth, and measured evidence before quoting a claim.
runbook: Step-by-step command record for a build, local docs check, deploy, or hardware procedure. A runbook is procedure evidence only after the commands and results are captured.
deploy runbook: Documentation path for publishing the Sphinx site through GitHub Pages. A deploy check proves publication, not hardware performance.
release status: Label such as draft, prerelease, latest release, or archived release used by release notes. It should not be overloaded with hardware readiness.
pre-release: GitHub Release state for work that is published before being treated as a final release.
validation status: Release-note field that records which checks passed, failed, or were not run. It should name commands or CI runs where useful.
known limitations: Release-note section for caveats, missing evidence, or deferred capability.
release checklist: Maintainer checklist for release hygiene. For pccx ISA PDF changes, the checklist includes rebuilding the PDF from main.tex.
GitHub Pages deploy: Publication workflow for the documentation site. Passing deploy does not convert a target, mock, or pending gate into measured evidence.
contributors acknowledgement: Public recognition of people who contribute documentation, reviews, bug reports, diagrams, examples, or related code after maintainers accept the entry for publication.
news section: Placeholder area for future project updates, release announcements, and community news. It should not carry release claims without the same evidence gates as the rest of the docs.