Live Inference Experiments

Qwen 2.5 Coder 14B writes C code that reads its own internal state during decoding

Ruffian Project

We ran six rounds of live inference experiments to test whether a language model could learn to use the Ruffian VM during generation—writing C code that reads its own logits, modifies its next-token distribution, and generates algorithms. All experiments ran on NVIDIA GPUs via CUDA, using Qwen 2.5 Coder 14B.

The model was not fine-tuned. Everything here is pure in-context learning: a few examples in the prompt, and the model figures out the rest.

Self-evaluation

The model can read its own confidence during generation. vocab_entropy() returns the Shannon entropy of the next-token distribution (in millinats), and vocab_top1_prob() returns the probability of the most likely token (scaled ×10000). The model uses these to branch:

int e = vocab_entropy();
int p = vocab_top1_prob();
if (e < 2000)
    printf("CONFIDENT (entropy=%d, top1=%d)", e, p);
else
    printf("UNCERTAIN (entropy=%d, top1=%d)", e, p);
// → CONFIDENT (entropy=1247, top1=4891)

This is the model reading its own uncertainty and making a decision based on it. Low entropy (a concentrated distribution) means the model is confident about what comes next; high entropy (a flat distribution) means it isn't. In the example output above, 1247 millinats is about 1.25 nats, an effective branching factor of roughly e^1.25 ≈ 3.5 tokens, and top1=4891 means the leading token holds about 49% of the probability mass. The branching happens inside a single VM evaluation during token generation.

Multi-function status reports

A single VM call can execute multiple functions. The model generates a 9-function status report that queries context length, vocabulary size, top predictions, entropy, KV cache architecture, and overall health—all in one shot:

Tokens in context: 482
Vocab size: 151936
Top prediction: ' the' (p=1847)
Entropy: 5765 (millinats)
Status: ACTIVE, HEALTHY
Layers: 4, Heads: 8, Dim: 128

The model discovers its own architecture at runtime. It reads KV cache dimensions via the introspection API—the sampled view shows 4 evenly-spaced layers and 8 KV heads (Qwen uses grouped-query attention). The model didn't know this from training; it read it from the live KV cache on the GPU.
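For concreteness, here is a sketch of what a multi-function call like that can look like. vocab_entropy() and vocab_top1_prob() were introduced above, logit_argmax() and tok_str() appear in the next section, and ctx_len(), vocab_size(), kv_layers(), kv_heads(), and kv_head_dim() are placeholder names standing in for the introspection API's context and KV-cache queries, not its actual function names:

// Sketch only: ctx_len, vocab_size, and the kv_* calls are placeholders,
// not the real introspection API.
char name[64];
int top = logit_argmax();
tok_str(top, name);
printf("Tokens in context: %d\n", ctx_len());
printf("Vocab size: %d\n", vocab_size());
printf("Top prediction: '%s' (p=%d)\n", name, vocab_top1_prob());
printf("Entropy: %d (millinats)\n", vocab_entropy());
printf("Status: ACTIVE, HEALTHY\n");  // the real report derives this from its health query
printf("Layers: %d, Heads: %d, Dim: %d\n", kv_layers(), kv_heads(), kv_head_dim());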

Token suppression

Because the VM can write to the logit buffer before sampling, the model can suppress its own next token:

// Identify the most likely next token
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("suppressing: '%s'", buf);
free(buf);

// Set its logit to -100000 (effectively zero probability after softmax)
logit_set(top, -100000);
// → Model forced to pick its second-choice token

This is self-modification during inference. The model writes C code that changes its own output distribution before the sampler runs. It's not a filter applied externally—the model itself chose to suppress the token.
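To see why -100000 acts as a hard mask rather than a soft penalty, here is a small standalone softmax calculation. This is ordinary host-side C with made-up toy logits, not VM code:

#include <math.h>
#include <stdio.h>

int main(void) {
    // Three toy logits; the first has been suppressed the way logit_set does.
    double logits[3] = { -100000.0, 4.1, 3.9 };
    double m = 4.1;  // max of the three, subtracted for numerical stability
    double z = 0.0, p[3];
    for (int i = 0; i < 3; i++) z += exp(logits[i] - m);
    for (int i = 0; i < 3; i++) p[i] = exp(logits[i] - m) / z;
    // The suppressed entry underflows to zero; the former runner-up now wins.
    printf("%.6f %.6f %.6f\n", p[0], p[1], p[2]);  // → 0.000000 0.549834 0.450166
    return 0;
}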

Algorithm generation

With two worked examples in the prompt, Qwen generates correct algorithms on the first try. No fine-tuning—just few-shot prompting. Given examples of factorial and Fibonacci, the model independently writes a prime sieve:

// Model generates this from 2 examples in the prompt:
int main() {
    int s[1000]; int c = 0;
    for (int i = 2; i < 1000; i++) s[i] = 1;
    for (int i = 2; i < 1000; i++)
        if (s[i]) {
            c++;
            for (int j = i*2; j < 1000; j += i) s[j] = 0;
        }
    printf("%d", c);  // → 168 (correct: 168 primes below 1000)
    return 0;
}

The model doesn't need to know that there are 168 primes below 1000. It writes the sieve; the VM computes the answer. The model just needs to write valid C.

Results

Round | Focus               | Key Result                                                     | Status
1–2   | Basic introspection | Top-K tokens, entropy, logit reads work during live inference  | PASS
3     | Intrinsics + CUDA   | top_k_token, vocab_entropy intrinsics verified on GPU          | PASS
4     | VM_ON detection     | Prompt examples no longer trigger VM; few-shot factorial works | PASS
5     | Self-modification   | First successful logit_set suppression during live inference   | PASS
6     | Self-evaluation     | 9-function status report, entropy-based confidence branching   | PASS

What we learned

Few-shot beats system prompts. Two worked examples in the prompt reliably teach the model to write algorithms. A system prompt that describes the syntax but shows no examples fails consistently. The model needs to see the pattern, not read the documentation.

Bracket syntax is more reliable than fences. The inline syntax [C: ...] works more consistently during live inference than markdown code fences. Our hypothesis: brackets are more compact and less ambiguous to the tokenizer. Fences sometimes get split across multiple tokens in ways that confuse the pattern detector.
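Putting those two findings together, the worked-example portion of the prompt looks roughly like the sketch below. The exact wording and the factorial/Fibonacci bodies are illustrative reconstructions, not the actual prompt; what matters is the [C: ...] bracket wrapping each example:

Compute 6 factorial.
[C: int r = 1; for (int i = 2; i <= 6; i++) r *= i; printf("%d", r);]
→ 720

Compute the 10th Fibonacci number.
[C: int a = 0, b = 1; for (int i = 0; i < 10; i++) { int t = a + b; a = b; b = t; } printf("%d", a);]
→ 55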

Self-awareness is straightforward. The model branches on its own entropy, reads its KV cache dimensions, and reports on its own architecture—all from generated C code. There's nothing exotic about it: the VM just reads from the same GPU memory that holds the model's state. The interesting part is that the model learns to use these APIs from examples alone.

Zero-return confabulation. When a VM call returns 0—from a void function, a missing return statement, or an actual zero result—the model sometimes confabulates a plausible-looking answer instead of using the VM output. Non-zero returns are reliable. This is a training signal: fine-tuning should teach the model to trust VM results unconditionally.
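The ambiguity is easy to see in C terms: all three cases look identical from the caller's side. A sketch, with made-up function names:

#include <stdio.h>

int  actual_zero(void)    { return 0; }                   // a genuine zero result
void no_result(void)      { puts("side effects only"); }  // void: nothing to return, reads as 0
int  missing_return(void) { puts("forgot the return"); }  // falls off the end: value is indeterminate,
                                                          // and in practice often comes back as 0
int main(void) {
    printf("%d\n", actual_zero());  // the only zero here that is actually data
    no_result();
    missing_return();               // return value intentionally ignored (it is undefined)
    return 0;
}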