Computation, In-Band

A GPU-resident VM for LLM decoding

Ruffian Project

The overhead of tool use

When an LLM needs to compute something exactly—arithmetic, search, code execution—most systems use tool calls. The model pauses generation, emits a structured request, a host process runs the tool, and generation resumes with the result injected into context.

This works. It's how most production systems handle computation. But every tool call requires parsing, dispatch (possibly over a network), serialization, and a resumed forward pass. For a single call, the overhead is negligible. For tight loops—verification chains, multi-step proofs, iterative refinement—it accumulates.

// Typical tool-use flow
LLM → {"tool":"calc","expr":"847*293"} → Host → Tool → Host → LLM
                                                ↑
                                    parse, dispatch, serialize, resume

There's also a subtler cost: tool use treats computation as something outside the model. The model speaks; the host computes; the model continues. This is fine when computation is occasional. It's awkward when you want the model to interleave reasoning and execution fluidly.

Execution inside decoding

Ruffian embeds a small VM into the sampling path of an LLM runner (currently a patched llama.cpp). When the model emits tokens that form a marked code region, the GPU compiles and runs the code before the next token is sampled. The result re-enters the token stream as if the model had generated it.

LLM outputs: "847 × 293 = [C: 847*293]"
                              ↓
         VM compiles & executes on GPU (same GPU doing inference)
                              ↓
Token stream continues: "... = [C: 847*293] = VM[248171]"

The model still decides what to compute. But the execution is deterministic and local—no RPCs, no tool servers, no host-side orchestration. The VM runs to completion under a bounded instruction budget (currently 4 billion instructions per evaluation).

Two syntax variants trigger the VM. Inline expressions—[C: 847*293]—work for quick computations. Standard markdown code fences with a c language tag handle multi-line code:

```c
int main() {
    int sum = 0;
    for (int i = 1; i <= 100; i++) sum += i;
    printf("%d", sum);
    return 0;
}
```
// VM output: 5050

The GPU pipeline

Code goes through four stages, each a separate CUDA kernel:

Source text → [Tokenize] → [Parse] → [Codegen] → [Execute]
                                                            ↓
                                                     result on stack

The tokenizer converts C source into a token list. The parser builds an AST. The code generator emits stack-machine bytecode. The executor runs the bytecode on a stack VM with locals, a call stack, and bounded memory. Each kernel is single-threaded and deterministic.
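
As a rough sketch of the execution model (illustrative only; the opcode names, struct fields, sizes, and the demo budget below are assumptions, not Ruffian's actual structures), the executor stage amounts to an opcode-dispatch loop over bytecode, with an instruction counter enforcing the per-evaluation bound:

```c
#include <stdio.h>

// Illustrative stack-machine executor in the spirit of the fourth stage.
// Opcode names, struct fields, and sizes are assumptions made for this sketch.
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

typedef struct {
    long stack[256];   // operand stack
    int  sp;           // stack pointer
    long steps;        // instruction counter enforcing the per-evaluation bound
} VM;

long run(VM *vm, const long *code, long max_steps) {
    for (int pc = 0; ; pc++) {
        if (++vm->steps > max_steps) return -1;                 // budget exhausted
        switch (code[pc]) {
        case OP_PUSH: vm->stack[vm->sp++] = code[++pc]; break;  // push immediate
        case OP_ADD:  vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; break;
        case OP_MUL:  vm->sp--; vm->stack[vm->sp - 1] *= vm->stack[vm->sp]; break;
        case OP_HALT: return vm->stack[vm->sp - 1];              // result left on the stack
        }
    }
}

int main(void) {
    VM vm = {0};
    long code[] = { OP_PUSH, 847, OP_PUSH, 293, OP_MUL, OP_HALT };
    printf("%ld\n", run(&vm, code, 1000000));   // prints 248171
    return 0;
}
```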

Functions require two passes: tokenize→parse→codegen for the definition (registers the function), then again for the call expression (emits the invocation). The executor runs the combined bytecode once at the end.
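
Roughly, for a region like the sketch below (written for illustration here, not taken from the test suite), the definition of fib is compiled first, registering the function, and the code that calls it is compiled afterwards; only then does the executor run.

```c
int fib(int n) {                // compiled first: registers fib()
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main() {
    printf("%d", fib(20));      // compiled against the registered fib()
    return 0;
}
```
// expected VM output: 6765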

What the VM compiles

The VM implements a substantial subset of C. This isn't an expression evaluator—it's a compiler that runs on the GPU:

| Feature | Status | Example |
| --- | --- | --- |
| Functions & recursion | PASS | int fib(int n) { ... } |
| While/for/do-while loops | PASS | while(i < n) i++; |
| Arrays (up to 2M elements) | PASS | int arr[2000000]; |
| Pointers & subscript | PASS | *p = 42; p[i] = x; |
| Structs | PASS | struct pt { int x, y; }; |
| Floats & math library | PASS | sin(x) + cos(y) |
| 64-bit integers | PASS | long n = 142913828922; |
| Printf/sprintf | PASS | printf("%d, %s", n, s); |
| Switch/case, enums | PASS | switch(x) { case 1: ... } |
| Bitwise ops, hex literals | PASS | x & 0xFF \| (y << 8) |
| String operations | PASS | strlen, strcmp, strcpy, strcat |
| Malloc (bump allocator) | PASS | char* buf = malloc(64); |
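
To give a sense of how these features compose (a sketch written for this description, not a test case from the suite), a single region can mix structs, malloc, sprintf, and the string builtins:

```c
struct pt { int x, y; };

int main() {
    struct pt p;
    p.x = 3;
    p.y = 4;
    char* buf = malloc(64);               // bump-allocated; never freed
    sprintf(buf, "pt(%d,%d)", p.x, p.y);  // formatted into VM memory
    int len = strlen(buf);
    printf("%s len=%d", buf, len);
    return 0;
}
```
// expected VM output: pt(3,4) len=7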

Project Euler #10—sum of all primes below 2 million—runs on-GPU during inference using a sieve of Eratosthenes with a 2M-element array. It returns the exact answer: 142,913,828,922. The point isn't performance. The point is that a sieve, with 2 million array elements and nested loops, can execute inside a single token-generation step.

```c
int main() {
    int limit = 2000000;
    int* sieve = malloc(limit * 4);
    for (int i = 0; i < limit; i++) sieve[i] = 1;
    sieve[0] = 0; sieve[1] = 0;
    for (int i = 2; i * i < limit; i++)
        if (sieve[i])
            for (int j = i*i; j < limit; j += i)
                sieve[j] = 0;
    long sum = 0;
    for (int i = 2; i < limit; i++)
        if (sieve[i]) sum += i;
    printf("%ld", sum);
    return 0;
}
```
// VM output: 142913828922

The model inspects itself

Because the VM runs on the same GPU during inference, it has access to the model's live internal state. This isn't a separate tool call—the C code reads directly from the same memory that holds the model's activations.

| API | What it reads |
| --- | --- |
| logit_get(i) / logit_set(i, v) | Next-token logits (full vocabulary, read/write) |
| logit_argmax() | Index of highest-logit token |
| top_k_token(rank) / top_k_prob(rank) | Top-K token IDs and softmax probabilities |
| vocab_entropy() / vocab_top1_prob() | Distribution entropy and top-1 probability |
| context_token(pos) / context_len() | Token IDs from the conversation |
| tok_str(id, buf) / tok_count() | Decode token IDs to text (full vocabulary) |
| kv_read(layer, head, pos, is_key, dim) | KV cache attention values |
| emb_read(layer, pos, dim) | Hidden-state embeddings |

Verified output from Qwen 2.5 Coder 14B running on an NVIDIA GPU (CUDA):

```c
// What the model predicts next (real logits from live inference)
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("top: '%s' logit=%d", buf, logit_get(top));
// → top token decoded with logit score

// Model architecture visible at runtime
printf("layers=%d heads=%d dim=%d seq=%d",
       kv_layers(), kv_heads(), kv_head_dim(), kv_seq_len());
// → layers=4 heads=8 dim=128 seq=128  (sampled KV cache)

// Full vocabulary: 151,936 tokens decoded to text
printf("vocab=%d", tok_count());  // → vocab=151936

// Direct logit manipulation before sampling
logit_set(unwanted_token, -100000);  // suppress a token
```

The model's C code executes on the same GPU doing inference, reads live internal state, and returns structured data. No round trips, no mocking, no simulation.
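
One way these calls compose (hypothetical code, not one of the verified transcripts; the float return types, the %f formatting, and the 0.5 threshold are assumptions): check how confident the model is about its next token, and dump the leading candidates when it is not.

```c
int main() {
    float p1 = vocab_top1_prob();   // probability of the current argmax token
    float h  = vocab_entropy();     // entropy of the next-token distribution
    printf("top1=%f entropy=%f\n", p1, h);

    if (p1 < 0.5) {                 // low confidence: inspect the top candidates
        char* buf = malloc(64);
        for (int r = 0; r < 3; r++) {
            tok_str(top_k_token(r), buf);
            printf("rank %d: '%s' p=%f\n", r, buf, top_k_prob(r));
        }
    }
    return 0;
}
```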

Limitations

The VM is deliberately limited. No double type—use float. No nested switch. No goto or global variables. String literals are capped at 256 bytes. malloc() uses a bump allocator with no free()—memory is not reclaimed within a single evaluation.
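
A minimal sketch of working within those limits (assumed behavior, not a documented example): use float where you would reach for double, and treat every malloc() as permanent for the duration of the evaluation.

```c
int main() {
    float x = 1.5;           // use float; there is no double type
    char* a = malloc(16);    // bump allocator: each call just advances a cursor
    char* b = malloc(16);    // no free(); nothing is reclaimed until the next evaluation
    strcpy(a, "hello");
    strcpy(b, "world");
    printf("%s %s", a, b);
    return 0;
}
```
// expected VM output: hello world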

Status

2,116 tests pass on CUDA and CPU. The test suite covers 46 algorithms, Project Euler problems, LeetCode solutions, string parsing, math puzzles, and all introspection APIs. Live inference experiments have verified introspection, self-modification, and algorithm generation with Qwen 2.5 Coder 14B on NVIDIA GPUs.

Current work: scaling to Qwen 30B, few-shot prompt engineering, self-aware branching, and QLoRA training. Next: fine-tuning with execution tokens and grammar constraints for guaranteed-valid C output.

The experiments →