Computation, In-Band

A GPU-resident VM for LLM decoding

Ruffian Project

The overhead of tool use

When an LLM needs to compute something exactly—arithmetic, search, code execution—most systems use tool calls. The model pauses generation, emits a structured request, a host process runs the tool, and generation resumes with the result injected into context.

This works. It's how most production systems handle computation. But every tool call requires parsing, dispatch (possibly over a network), serialization, and a resumed forward pass. For a single call, the overhead is negligible. For tight loops—verification chains, multi-step proofs, iterative refinement—it accumulates.

// Typical tool-use flow
LLM → {"tool":"calc","expr":"847*293"} → Host → Tool → Host → LLM
                                                ↑
                                    parse, dispatch, serialize, resume

There's also a subtler cost: tool use treats computation as something outside the model. The model speaks; the host computes; the model continues. This is fine when computation is occasional. It's awkward when you want the model to interleave reasoning and execution fluidly.

Execution inside decoding

Ruffian embeds a small VM into the sampling path of an LLM runner (currently a patched llama.cpp). When the model emits tokens that form a marked code region, the GPU compiles and runs the code before the next token is sampled. The result re-enters the token stream as if the model had generated it.

LLM outputs: "847 × 293 = [C: 847*293]"
                              ↓
         VM compiles & executes on GPU (same GPU doing inference)
                              ↓
Token stream continues: "... = [C: 847*293] = VM[248171]"

The model still decides what to compute. But the execution is deterministic and local—no RPCs, no tool servers, no host-side orchestration. The VM runs to completion under a bounded instruction budget (currently 4 billion instructions per evaluation).

Two syntax variants trigger the VM. Inline expressions—[C: 847*293]—work for quick computations. Standard markdown code fences with a c language tag handle multi-line code:

```c
int main() {
    int sum = 0;
    for (int i = 1; i <= 100; i++) sum += i;
    printf("%d", sum);
    return 0;
}
```
// VM output: 5050

The GPU pipeline

Code goes through four stages, each a separate CUDA kernel:

Source text → [Tokenize] → [Parse] → [Codegen] → [Execute]
                                                            ↓
                                                     result on stack

The tokenizer converts C source into a token list. The parser builds an AST. The code generator emits stack-machine bytecode. The executor runs the bytecode on a stack VM with locals, a call stack, and bounded memory. Each kernel is single-threaded and deterministic.
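
As a rough sketch of the execution model (illustrative only; the opcode names, struct fields, sizes, and the demo budget below are assumptions, not Ruffian's actual structures), the executor stage amounts to an opcode-dispatch loop over bytecode, with an instruction counter enforcing the per-evaluation bound:

```c
#include <stdio.h>

// Illustrative stack-machine executor in the spirit of the fourth stage.
// Opcode names, struct fields, and sizes are assumptions made for this sketch.
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

typedef struct {
    long stack[256];   // operand stack
    int  sp;           // stack pointer
    long steps;        // instruction counter enforcing the per-evaluation bound
} VM;

long run(VM *vm, const long *code, long max_steps) {
    for (int pc = 0; ; pc++) {
        if (++vm->steps > max_steps) return -1;                 // budget exhausted
        switch (code[pc]) {
        case OP_PUSH: vm->stack[vm->sp++] = code[++pc]; break;  // push immediate
        case OP_ADD:  vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; break;
        case OP_MUL:  vm->sp--; vm->stack[vm->sp - 1] *= vm->stack[vm->sp]; break;
        case OP_HALT: return vm->stack[vm->sp - 1];              // result left on the stack
        }
    }
}

int main(void) {
    VM vm = {0};
    long code[] = { OP_PUSH, 847, OP_PUSH, 293, OP_MUL, OP_HALT };
    printf("%ld\n", run(&vm, code, 1000000));   // prints 248171
    return 0;
}
```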

Functions require two passes: tokenize→parse→codegen for the definition (registers the function), then again for the call expression (emits the invocation). The executor runs the combined bytecode once at the end.
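
Roughly, for a region like the sketch below (written for illustration here, not taken from the test suite), the definition of fib is compiled first, registering the function, and the code that calls it is compiled afterwards; only then does the executor run.

```c
int fib(int n) {                // compiled first: registers fib()
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main() {
    printf("%d", fib(20));      // compiled against the registered fib()
    return 0;
}
```
// expected VM output: 6765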

What the VM compiles

The VM implements a substantial subset of C. This isn't an expression evaluator—it's a compiler that runs on the GPU:

| Feature | Status | Example |
| --- | --- | --- |
| Functions & recursion | PASS | int fib(int n) { ... } |
| While/for/do-while loops | PASS | while(i < n) i++; |
| Arrays (up to 2M elements) | PASS | int arr[2000000]; |
| Pointers & subscript | PASS | *p = 42; p[i] = x; |
| Structs | PASS | struct pt { int x, y; }; |
| Floats & math library | PASS | sin(x) + cos(y) |
| 64-bit integers | PASS | long n = 142913828922; |
| Printf/sprintf | PASS | printf("%d, %s", n, s); |
| Switch/case, enums | PASS | switch(x) { case 1: ... } |
| Bitwise ops, hex literals | PASS | x & 0xFF \| (y << 8) |
| String operations | PASS | strlen, strcmp, strcpy, strcat |
| Malloc (bump allocator) | PASS | char* buf = malloc(64); |
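
To give a sense of how these features compose (a sketch written for this description, not a test case from the suite), a single region can mix structs, malloc, sprintf, and the string builtins:

```c
struct pt { int x, y; };

int main() {
    struct pt p;
    p.x = 3;
    p.y = 4;
    char* buf = malloc(64);               // bump-allocated; never freed
    sprintf(buf, "pt(%d,%d)", p.x, p.y);  // formatted into VM memory
    int len = strlen(buf);
    printf("%s len=%d", buf, len);
    return 0;
}
```
// expected VM output: pt(3,4) len=7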

Project Euler #10—sum of all primes below 2 million—runs on-GPU during inference using a sieve of Eratosthenes with a 2M-element array. It returns the exact answer: 142,913,828,922. The point isn't performance. The point is that a sieve, with 2 million array elements and nested loops, can execute inside a single token-generation step.

```c
int main() {
    int limit = 2000000;
    int* sieve = malloc(limit * 4);
    for (int i = 0; i < limit; i++) sieve[i] = 1;
    sieve[0] = 0; sieve[1] = 0;
    for (int i = 2; i * i < limit; i++)
        if (sieve[i])
            for (int j = i*i; j < limit; j += i)
                sieve[j] = 0;
    long sum = 0;
    for (int i = 2; i < limit; i++)
        if (sieve[i]) sum += i;
    printf("%ld", sum);
    return 0;
}
```
// VM output: 142913828922

The model inspects itself

Because the VM runs on the same GPU during inference, it has access to the model's live internal state. This isn't a separate tool call—the C code reads directly from the same memory that holds the model's activations.

| API | What it reads |
| --- | --- |
| logit_get(i) / logit_set(i, v) | Next-token logits (full vocabulary, read/write) |
| logit_argmax() | Index of highest-logit token |
| top_k_token(rank) / top_k_prob(rank) | Top-K token IDs and softmax probabilities |
| vocab_entropy() / vocab_top1_prob() | Distribution entropy and top-1 probability |
| context_token(pos) / context_len() | Token IDs from the conversation |
| tok_str(id, buf) / tok_count() | Decode token IDs to text (full vocabulary) |
| kv_read(layer, head, pos, is_key, dim) | KV cache attention values |
| emb_read(layer, pos, dim) | Hidden-state embeddings |

Verified output from Qwen 2.5 Coder 14B running on an NVIDIA GPU (CUDA):

```c
// What the model predicts next (real logits from live inference)
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("top: '%s' logit=%d", buf, logit_get(top));
// → top token decoded with logit score

// Model architecture visible at runtime
printf("layers=%d heads=%d dim=%d seq=%d",
       kv_layers(), kv_heads(), kv_head_dim(), kv_seq_len());
// → layers=4 heads=8 dim=128 seq=128  (sampled KV cache)

// Full vocabulary: 151,936 tokens decoded to text
printf("vocab=%d", tok_count());  // → vocab=151936

// Direct logit manipulation before sampling
logit_set(unwanted_token, -100000);  // suppress a token
```

The model's C code executes on the same GPU doing inference, reads live internal state, and returns structured data. No round trips, no mocking, no simulation.
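
One way these calls compose (hypothetical code, not one of the verified transcripts; the float return types, the %f formatting, and the 0.5 threshold are assumptions): check how confident the model is about its next token, and dump the leading candidates when it is not.

```c
int main() {
    float p1 = vocab_top1_prob();   // probability of the current argmax token
    float h  = vocab_entropy();     // entropy of the next-token distribution
    printf("top1=%f entropy=%f\n", p1, h);

    if (p1 < 0.5) {                 // low confidence: inspect the top candidates
        char* buf = malloc(64);
        for (int r = 0; r < 3; r++) {
            tok_str(top_k_token(r), buf);
            printf("rank %d: '%s' p=%f\n", r, buf, top_k_prob(r));
        }
    }
    return 0;
}
```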

Limitations

The VM is deliberately limited. No double type—use float. No nested switch. No goto or global variables. String literals are capped at 256 bytes. malloc() uses a bump allocator with no free()—memory is not reclaimed within a single evaluation.
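
A minimal sketch of working within those limits (assumed behavior, not a documented example): use float where you would reach for double, and treat every malloc() as permanent for the duration of the evaluation.

```c
int main() {
    float x = 1.5;           // use float; there is no double type
    char* a = malloc(16);    // bump allocator: each call just advances a cursor
    char* b = malloc(16);    // no free(); nothing is reclaimed until the next evaluation
    strcpy(a, "hello");
    strcpy(b, "world");
    printf("%s %s", a, b);
    return 0;
}
```
// expected VM output: hello world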

Status

2,116 tests pass on CUDA and CPU. The test suite covers 46 algorithms, Project Euler problems, LeetCode solutions, string parsing, math puzzles, and all introspection APIs. Live inference experiments have verified introspection, self-modification, and algorithm generation with Qwen 2.5 Coder 14B on NVIDIA GPUs.

Current work: scaling to Qwen 30B, few-shot prompt engineering, self-aware branching, and QLoRA training. Next: fine-tuning with execution tokens and grammar constraints for guaranteed-valid C output.

The experiments →