Computation, in-band.

A GPU-resident virtual machine that executes inside the LLM sampling loop. No tool-call protocol. No network round trips. Deterministic execution where the model already is.


The model writes C. The GPU runs it.

When a model outputs code in a marked region, the VM compiles and executes it on the same NVIDIA GPU doing inference. The result re-enters the token stream as if the model had generated it.

// Ruffian: execution inside the decode loop
User: What is 847 times 293?

// Model writes C code inline:
LLM outputs: "Let me compute that: [C: 847*293] = VM[248171]"

// Or uses a code fence for complex programs:
LLM outputs: ```c
#include <stdio.h>
#include <stdlib.h>
int main() {
    int* sieve = malloc(2000000 * sizeof(int));
    long sum = 0;
    // ... sieve of Eratosthenes over primes below 2,000,000 ...
    printf("%ld", sum);  // VM output: 142913828922
    return 0;
}
```

Compiled and executed on CUDA. No tool server. No host round trip.

The VM compiles a substantial subset of C to GPU bytecode: functions, loops, arrays up to 2 million elements, pointers, structs, floats, recursion, 64-bit integers. Project Euler #10—a sieve of Eratosthenes—runs on-GPU during inference and returns the exact answer: 142,913,828,922.
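
For reference, a complete sieve of the kind sketched above might look like the following. This is a plain-C sketch using only features from the listed subset (loops, static arrays, 64-bit integers); it is illustrative, not a program taken from the Ruffian test suite.

```c
#include <stdio.h>

#define LIMIT 2000000

/* 0 = possibly prime, 1 = composite; a static array keeps this
   within the 2-million-element array limit mentioned above. */
static char composite[LIMIT];

int main(void) {
    long sum = 0;
    for (long i = 2; i < LIMIT; i++) {
        if (composite[i]) continue;
        sum += i;                          /* i is prime */
        for (long j = i * i; j < LIMIT; j += i)
            composite[j] = 1;              /* mark multiples of i */
    }
    printf("%ld\n", sum);                  /* expected: 142913828922 */
    return 0;
}
```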

  • 2,116 tests passing
  • CUDA GPU platform
  • 4B max instructions
  • 0 round trips

Articles

Where this is going

The runtime is the foundation. The hard part is training: will a model learn to use execution tokens well, and will it learn to do so selectively?

Done

  • 2,116 tests passing on CUDA and CPU
  • C compiler on GPU: functions, arrays, pointers, structs, floats, enums
  • LLM introspection: logits, KV cache, embeddings, vocabulary, entropy (see the sketch after this list)
  • Live experiments: self-evaluation, self-modification, algorithm generation
  • Qwen 2.5 Coder 14B running on NVIDIA GPUs via CUDA
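
The introspection hooks listed above could be exercised from a guest program roughly like this. Every `vm_*` name below is a hypothetical placeholder, stubbed with fake values so the sketch compiles on its own; it does not describe Ruffian's real interface.

```c
#include <stdio.h>
#include <math.h>

/* Hypothetical introspection hooks -- placeholder names and signatures,
   not Ruffian's actual API. Stubbed here so the sketch is self-contained;
   inside the VM they would be supplied by the runtime. */
static int   vm_vocab_size(void)   { return 4; }
static float vm_prob(int token_id) {
    float p[] = {0.7f, 0.2f, 0.05f, 0.05f};
    return p[token_id];
}

int main(void) {
    /* Entropy of the current next-token distribution (in nats),
       plus the argmax token, computed from the stubbed probabilities. */
    float entropy = 0.0f;
    int best = 0;
    for (int i = 0; i < vm_vocab_size(); i++) {
        float p = vm_prob(i);
        if (p > 0.0f) entropy -= p * logf(p);
        if (p > vm_prob(best)) best = i;
    }
    printf("argmax token: %d, entropy: %f nats\n", best, entropy);
    return 0;
}
```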

Current

  • Scaling to Qwen 30B on upgraded RunPod infrastructure
  • Few-shot + self-aware branching (model reads its own entropy)
  • QLoRA training pipeline for 14B–30B models