Computation, in-band.

A GPU-resident virtual machine that executes inside the LLM sampling loop. No tool-call protocol. No network round trips. Deterministic execution where the model already is.


The model writes C. The GPU runs it.

When a model outputs code in a marked region, the VM compiles and executes it on the same NVIDIA GPU doing inference. The result re-enters the token stream as if the model had generated it.

// Ruffian: execution inside the decode loop
User: What is 847 times 293?

// Model writes C code inline:
LLM outputs: "Let me compute that: [C: 847*293] = VM[248171]"

// Or uses a code fence for complex programs:
LLM outputs: ```c
#include <stdio.h>
#include <stdlib.h>
int main() {
    int* sieve = malloc(2000000 * sizeof(int));
    long sum = 0;
    // ... sieve of Eratosthenes over primes below 2,000,000 ...
    printf("%ld", sum);  // VM output: 142913828922
    return 0;
}
```

Compiled and executed on CUDA. No tool server. No host round trip.

The VM compiles a substantial subset of C to GPU bytecode: functions, loops, arrays up to 2 million elements, pointers, structs, floats, recursion, 64-bit integers. Project Euler #10—a sieve of Eratosthenes—runs on-GPU during inference and returns the exact answer: 142,913,828,922.
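
For reference, a complete sieve of the kind sketched above might look like the following. This is a plain-C sketch using only features from the listed subset (loops, static arrays, 64-bit integers); it is illustrative, not a program taken from the Ruffian test suite.

```c
#include <stdio.h>

#define LIMIT 2000000

/* 0 = possibly prime, 1 = composite; a static array keeps this
   within the 2-million-element array limit mentioned above. */
static char composite[LIMIT];

int main(void) {
    long sum = 0;
    for (long i = 2; i < LIMIT; i++) {
        if (composite[i]) continue;
        sum += i;                          /* i is prime */
        for (long j = i * i; j < LIMIT; j += i)
            composite[j] = 1;              /* mark multiples of i */
    }
    printf("%ld\n", sum);                  /* expected: 142913828922 */
    return 0;
}
```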

  • 2,116 tests passing
  • CUDA GPU platform
  • 4B max instructions
  • 0 round trips

Articles

Where this is going

The runtime is the foundation. The hard part is training: will a model learn to use execution tokens well, and will it learn to do so selectively?

Done

  • 2,116 tests passing on CUDA and CPU
  • C compiler on GPU: functions, arrays, pointers, structs, floats, enums
  • LLM introspection: logits, KV cache, embeddings, vocabulary, entropy (see the sketch after this list)
  • Live experiments: self-evaluation, self-modification, algorithm generation
  • Qwen 2.5 Coder 14B running on NVIDIA GPUs via CUDA
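
The introspection hooks listed above could be exercised from a guest program roughly like this. Every `vm_*` name below is a hypothetical placeholder, stubbed with fake values so the sketch compiles on its own; it does not describe Ruffian's real interface.

```c
#include <stdio.h>
#include <math.h>

/* Hypothetical introspection hooks -- placeholder names and signatures,
   not Ruffian's actual API. Stubbed here so the sketch is self-contained;
   inside the VM they would be supplied by the runtime. */
static int   vm_vocab_size(void)   { return 4; }
static float vm_prob(int token_id) {
    float p[] = {0.7f, 0.2f, 0.05f, 0.05f};
    return p[token_id];
}

int main(void) {
    /* Entropy of the current next-token distribution (in nats),
       plus the argmax token, computed from the stubbed probabilities. */
    float entropy = 0.0f;
    int best = 0;
    for (int i = 0; i < vm_vocab_size(); i++) {
        float p = vm_prob(i);
        if (p > 0.0f) entropy -= p * logf(p);
        if (p > vm_prob(best)) best = i;
    }
    printf("argmax token: %d, entropy: %f nats\n", best, entropy);
    return 0;
}
```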

Current

  • Scaling to Qwen 30B on upgraded RunPod infrastructure
  • Few-shot + self-aware branching (model reads its own entropy)
  • QLoRA training pipeline for 14B–30B models