A GPU-resident virtual machine that executes inside the LLM sampling loop. No tool-call protocol. No network round trips. Deterministic execution where the model already is.
When a model outputs code in a marked region, the VM compiles and executes it on the same NVIDIA GPU doing inference. The result re-enters the token stream as if the model had generated it.
````
// Ruffian: execution inside the decode loop

User: What is 847 times 293?

// Model writes C code inline:
LLM outputs: "Let me compute that: [C: 847*293] = VM[248171]"

// Or uses a code fence for complex programs:
LLM outputs:
```c
int main() {
    int* sieve = malloc(2000000 * 4);
    // ... sieve of Eratosthenes ...
    printf("%ld", sum);  // VM output: 142913828922
}
```
````

Compiled and executed on CUDA. No tool server. No host round trip.
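To illustrate the splice mechanism shown above, here is a minimal host-side sketch. It is not Ruffian's implementation: the scan-and-evaluate logic is hypothetical, and the toy evaluator handles only a single multiplication where the real VM compiles full C.

```c
/* Sketch: detect a [C: expr] region in decoded text, evaluate it,
   and splice "= VM[<result>]" back in as if the model had generated
   those tokens. Hypothetical, illustrative only. */
#include <stdio.h>
#include <string.h>

static long eval_expr(const char *expr) {
    /* Toy stand-in for the VM: parse "a*b" and multiply. */
    long a, b;
    if (sscanf(expr, "%ld*%ld", &a, &b) == 2) return a * b;
    return 0;
}

int main(void) {
    /* Pretend this is the text the model has decoded so far. */
    const char *stream = "Let me compute that: [C: 847*293]";
    const char *open = strstr(stream, "[C:");
    const char *close = open ? strchr(open, ']') : NULL;
    if (open && close) {
        char expr[64] = {0};
        size_t n = (size_t)(close - open - 3);
        if (n >= sizeof expr) n = sizeof expr - 1;
        memcpy(expr, open + 3, n);
        /* The result re-enters the stream as ordinary tokens. */
        printf("%s = VM[%ld]\n", stream, eval_expr(expr));
    }
    return 0;
}
```

Running this prints `Let me compute that: [C: 847*293] = VM[248171]`, matching the inline example above.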
The VM compiles a substantial subset of C to GPU bytecode: functions, loops, arrays up to 2 million elements, pointers, structs, floats, recursion, 64-bit integers. Project Euler #10—a sieve of Eratosthenes—runs on-GPU during inference and returns the exact answer: 142,913,828,922.
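To make the claim concrete, here is one complete program in that subset: a sieve of Eratosthenes summing the primes below 2,000,000. This is a sketch of the kind of C the VM accepts, not necessarily the exact program from the demo.

```c
/* Project Euler #10: sum of all primes below 2,000,000.
   Exercises arrays, loops, pointers, and 64-bit integers. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int N = 2000000;
    char *sieve = calloc(N, 1);   /* 0 = still a prime candidate */
    long long sum = 0;
    for (int i = 2; i < N; i++) {
        if (sieve[i]) continue;   /* already marked composite */
        sum += i;                 /* i is prime */
        for (long long j = (long long)i * i; j < N; j += i)
            sieve[j] = 1;         /* mark multiples composite */
    }
    printf("%lld\n", sum);        /* prints 142913828922 */
    free(sieve);
    return 0;
}
```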
Why put a VM in the sampling loop: the GPU pipeline, what the C compiler supports, and how the model inspects its own internal state.
Self-evaluation, token suppression, algorithm generation: all executing on the NVIDIA GPU during decoding with Qwen 2.5 Coder 14B.
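As one example of what a demo like token suppression involves, here is a hedged sketch. It assumes, hypothetically, that a VM program can write into the logits buffer before the next sampling step; the function name and data layout are illustrative, not Ruffian's API.

```c
/* Sketch: suppressed tokens can never be sampled, since
   exp(-inf) contributes zero weight to the softmax. */
#include <math.h>
#include <stdio.h>

static void suppress_tokens(float *logits, const int *banned, int n) {
    for (int i = 0; i < n; i++)
        logits[banned[i]] = -INFINITY;
}

int main(void) {
    float logits[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    int banned[2] = {1, 3};            /* hypothetical token ids */
    suppress_tokens(logits, banned, 2);
    for (int i = 0; i < 4; i++)
        printf("logit[%d] = %f\n", i, logits[i]);
    return 0;
}
```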
The runtime is the foundation. The hard part is training: will a model learn to use execution tokens well, and will it learn to do so selectively?