A GPU-resident VM for LLM decoding
When an LLM needs to compute something exactly—arithmetic, search, code execution—most systems use tool calls. The model pauses generation, emits a structured request, a host process runs the tool, and generation resumes with the result injected into context.
This works. It's how most production systems handle computation. But every tool call requires parsing, dispatch (possibly over a network), serialization, and a resumed forward pass. For a single call, the overhead is negligible. For tight loops—verification chains, multi-step proofs, iterative refinement—it accumulates.
```
// Typical tool-use flow
LLM → {"tool":"calc","expr":"847*293"} → Host → Tool → Host → LLM
                                                 ↑
                              parse, dispatch, serialize, resume
```
There's also a subtler cost: tool use treats computation as something outside the model. The model speaks; the host computes; the model continues. This is fine when computation is occasional. It's awkward when you want the model to interleave reasoning and execution fluidly.
Ruffian embeds a small VM into the sampling path of an LLM runner (currently a patched llama.cpp). When the model emits tokens that form a marked code region, the GPU compiles and runs the code before the next token is sampled. The result re-enters the token stream as if the model had generated it.
```
LLM outputs: "847 × 293 = [C: 847*293]"
                    ↓
VM compiles & executes on GPU (same GPU doing inference)
                    ↓
Token stream continues: "... = [C: 847*293] = VM[248171]"
```
The model still decides what to compute. But the execution is deterministic and local—no RPCs, no tool servers, no host-side orchestration. The VM runs to completion under a bounded instruction budget (currently 4 billion instructions per evaluation).
Two syntax variants trigger the VM. Inline expressions—[C: 847*293]—work for quick computations. Standard markdown code fences with a c language tag handle multi-line code:
```c
int main() {
    int sum = 0;
    for (int i = 1; i <= 100; i++) sum += i;
    printf("%d", sum);
    return 0;
}
```
// VM output: 5050
Code goes through four stages, each a separate CUDA kernel:

```
Source text → [Tokenize] → [Parse] → [Codegen] → [Execute]
                                                     ↓
                                              result on stack
```
The tokenizer converts C source into a token list. The parser builds an AST. The code generator emits stack-machine bytecode. The executor runs the bytecode on a stack VM with locals, a call stack, and bounded memory. Each kernel is single-threaded and deterministic.
Functions require two passes: tokenize→parse→codegen for the definition (registers the function), then again for the call expression (emits the invocation). The executor runs the combined bytecode once at the end.
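To make the execution model concrete, here is a minimal plain-C sketch of a stack machine evaluating bytecode for `847*293`. The opcode names and encoding are illustrative only, not Ruffian's actual instruction format:

```c
#include <stdio.h>

// Hypothetical opcodes for illustration; Ruffian's real bytecode differs.
enum Op { PUSH, ADD, MUL, HALT };

typedef struct { enum Op op; long imm; } Instr;

long run(const Instr* code) {
    long stack[64];
    int sp = 0;                                // stack pointer
    for (int pc = 0; ; pc++) {                 // fetch-decode-execute loop
        switch (code[pc].op) {
            case PUSH: stack[sp++] = code[pc].imm; break;
            case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case HALT: return stack[sp - 1];   // result is left on top of the stack
        }
    }
}

int main(void) {
    // Bytecode a codegen pass might emit for the expression 847*293
    Instr prog[] = { {PUSH, 847}, {PUSH, 293}, {MUL, 0}, {HALT, 0} };
    printf("%ld\n", run(prog));                // prints 248171
    return 0;
}
```

The real executor adds locals, a call stack, function calls, and bounded memory, but the loop has the same shape: decode an opcode, manipulate the stack, and leave the result on top.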
The VM implements a substantial subset of C. This isn't an expression evaluator—it's a compiler that runs on GPU:
| Feature | Status | Example |
|---|---|---|
| Functions & recursion | PASS | int fib(int n) { ... } |
| While/for/do-while loops | PASS | while(i < n) i++; |
| Arrays (up to 2M elements) | PASS | int arr[2000000]; |
| Pointers & subscript | PASS | *p = 42; p[i] = x; |
| Structs | PASS | struct pt { int x, y; }; |
| Floats & math library | PASS | sin(x) + cos(y) |
| 64-bit integers | PASS | long n = 142913828922; |
| Printf/sprintf | PASS | printf("%d, %s", n, s); |
| Switch/case, enums | PASS | switch(x) { case 1: ... } |
| Bitwise ops, hex literals | PASS | x & 0xFF \| (y << 8) |
| String operations | PASS | strlen, strcmp, strcpy, strcat |
| Malloc (bump allocator) | PASS | char* buf = malloc(64); |
Project Euler #10—sum of all primes below 2 million—runs on-GPU during inference using a sieve of Eratosthenes with a 2M-element array. It returns the exact answer: 142,913,828,922. The point isn't performance. The point is that a sieve, with 2 million array elements and nested loops, can execute inside a single token-generation step.
```c
int main() {
    int limit = 2000000;
    int* sieve = malloc(limit * 4);
    for (int i = 0; i < limit; i++) sieve[i] = 1;
    sieve[0] = 0; sieve[1] = 0;
    for (int i = 2; i * i < limit; i++)
        if (sieve[i])
            for (int j = i*i; j < limit; j += i)
                sieve[j] = 0;
    long sum = 0;
    for (int i = 2; i < limit; i++)
        if (sieve[i]) sum += i;
    printf("%ld", sum);
    return 0;
}
```
// VM output: 142913828922
Because the VM runs on the same GPU during inference, it has access to the model's live internal state. This isn't a separate tool call—the C code reads directly from the same memory that holds the model's activations.
| API | What it accesses |
|---|---|
| logit_get(i) / logit_set(i, v) | Next-token logits (full vocabulary, read/write) |
| logit_argmax() | Index of the highest-logit token |
| top_k_token(rank) / top_k_prob(rank) | Top-K token IDs and softmax probabilities |
| vocab_entropy() / vocab_top1_prob() | Distribution entropy and top-1 probability |
| context_token(pos) / context_len() | Token IDs from the conversation |
| tok_str(id, buf) / tok_count() | Decode token IDs to text (full vocabulary) |
| kv_read(layer, head, pos, is_key, dim) | KV-cache keys and values |
| emb_read(layer, pos, dim) | Hidden-state embeddings |
Verified output from Qwen 2.5 Coder 14B running on an NVIDIA GPU (CUDA):
```c
// What the model predicts next (real logits from live inference)
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("top: '%s' logit=%d", buf, logit_get(top));
// → top token decoded with logit score

// Model architecture visible at runtime
printf("layers=%d heads=%d dim=%d seq=%d",
       kv_layers(), kv_heads(), kv_head_dim(), kv_seq_len());
// → layers=4 heads=8 dim=128 seq=128 (sampled KV cache)

// Full vocabulary: 151,936 tokens decoded to text
printf("vocab=%d", tok_count()); // → vocab=151936

// Direct logit manipulation before sampling
logit_set(unwanted_token, -100000); // suppress a token
```
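A hedged sketch of how these APIs could be combined for uncertainty-aware steering. The API names come from the table above; the entropy threshold and the top-K depth are arbitrary choices for illustration, not project defaults:

```c
// Illustrative only: suppress low-ranked candidates when the distribution is flat.
float H = vocab_entropy();            // entropy of the next-token distribution
if (H > 3.0f) {                       // threshold chosen arbitrarily for this example
    int keep = logit_argmax();        // current greedy choice
    for (int r = 1; r < 5; r++) {     // walk a few ranks below the top-1 candidate
        int t = top_k_token(r);
        if (t != keep) logit_set(t, -100000);   // same suppression idiom as above
    }
}
printf("entropy=%f top1=%f", H, vocab_top1_prob());
```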
The model's C code executes on the same GPU doing inference, reads live internal state, and returns structured data. No round trips, no mocking, no simulation.
The VM is deliberately limited. No double type—use float. No nested switch. No goto or global variables. String literals are capped at 256 bytes. malloc() uses a bump allocator with no free()—memory is not reclaimed within a single evaluation.
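A short illustrative snippet that stays inside those limits (the limits are from the project; the code itself is only a sketch):

```c
float x = 3.14159f;        // no double type, so use float literals and float math
char* buf = malloc(64);    // bump allocator: no free(); the bytes live until the
                           // evaluation finishes
sprintf(buf, "pi ~= %f", x);
printf("%s", buf);
```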
2,116 tests pass on CUDA and CPU. The test suite covers 46 algorithms, Project Euler problems, LeetCode solutions, string parsing, math puzzles, and all introspection APIs. Live inference experiments have verified introspection, self-modification, and algorithm generation with Qwen 2.5 Coder 14B on NVIDIA GPUs.
Current work: scaling to Qwen 30B, few-shot prompt engineering, self-aware branching, and QLoRA training. Next: fine-tuning with execution tokens and grammar constraints for guaranteed-valid C output.