# LLAMA CHICKEN Scheme

A high-performance LLAMA2 inference implementation in CHICKEN Scheme, based on Andrej Karpathy's [llama2.c](https://github.com/karpathy/llama2.c) and its OCaml port [llama2.ml](https://github.com/jackpeck/llama2.ml).

### System Dependencies

- **CHICKEN Scheme 5.0+**: [The Scheme implementation](https://call-cc.org/)
- **BLAS Library**: For optimized linear algebra (OpenBLAS, Intel MKL, or system BLAS)
- **C Compiler**: GCC or Clang for compiling extensions

## 🛠️ Installation

### 1. Install CHICKEN Scheme

```bash
# Ubuntu/Debian
sudo apt-get install chicken-bin libchicken-dev

# macOS with Homebrew
brew install chicken

# From source
wget https://code.call-cc.org/releases/5.3.0/chicken-5.3.0.tar.gz
tar xzf chicken-5.3.0.tar.gz
cd chicken-5.3.0
make PLATFORM=linux PREFIX=/usr/local
sudo make PLATFORM=linux PREFIX=/usr/local install
```

### 2. Install BLAS Library

```bash
# Ubuntu/Debian
sudo apt-get install libopenblas-dev

# macOS with Homebrew
brew install openblas

# CentOS/RHEL
sudo yum install openblas-devel
```

### 3. Install Required CHICKEN Extensions

```bash
chicken-install llama
```

## Quick Start

### Model Checkpoint

Download this 15M-parameter model trained on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~60MB download):

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
```

### Basic Text Generation

Ensure that the file `tokenizer.bin` is in the current directory. Then run:

```bash
# Generate text with default settings
llama-cli -c stories15M.bin -p "Once upon a time"

# Creative generation with temperature
llama-cli -c stories15M.bin -t 0.8 -s 100 -p "The meaning of life is"

# Deterministic generation
llama-cli -c stories15M.bin -t 0.0 -s 50 -p "To be or not to be"
```

### Verify Model Checkpoint

```bash
llama-cli -c stories15M.bin --verify-checkpoint
```

## API Documentation

### Core Data Types

#### `config`

Model configuration parameters.

```scheme
(make-config dim hidden-dim n-layers n-heads n-kv-heads vocab-size seq-len shared-weights)
```

**Fields:**

- `dim`: Model embedding dimension
- `hidden-dim`: FFN hidden layer dimension
- `n-layers`: Number of transformer layers
- `n-heads`: Number of attention heads
- `n-kv-heads`: Number of key-value heads
- `vocab-size`: Vocabulary size
- `seq-len`: Maximum sequence length
- `shared-weights`: Whether to share input/output embeddings

#### `transformer-weights`

Container for all model parameters.

```scheme
(make-transformer-weights token-embedding-table rms-att-weight wq wk wv wo rms-ffn-weight w1 w2 w3 rms-final-weight freq-cis-real freq-cis-imag wcls)
```

#### `run-state`

Runtime state for transformer computation.

```scheme
(make-run-state x xb q k v att key-cache value-cache xb2 hb hb2 logits)
```

**Fields:**

- `x`: Current hidden state
- `xb`, `xb2`: Temporary buffers
- `q`, `k`, `v`: Query, Key, Value vectors
- `att`: Attention scores
- `key-cache`, `value-cache`: Attention caches
- `hb`, `hb2`: FFN hidden buffers
- `logits`: Output logits

### High-Level Functions

#### `(run args)`

Main inference function.

```scheme
(define args (make-args "model.bin" "tokenizer.bin" 0.8 100 "Hello world" #f))
(run args)
```

#### `(bpe-encode text vocab vocab-scores)`

Tokenize text using Byte-Pair Encoding.

```scheme
(bpe-encode "Hello world" vocab vocab-scores)
;; => (15496 1776)
```
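A quick sketch of round-tripping a prompt through the tokenizer. This assumes `vocab` is a vector of token strings indexed by token id; the in-memory representation in this port may differ.

```scheme
;; Hedged sketch: encode a prompt, then print it back by looking up
;; each id in the vocabulary. Assumes vocab is a vector of strings
;; (an assumption about this port's representation).
(define ids (bpe-encode "Once upon a time" vocab vocab-scores))
(for-each
  (lambda (id) (display (vector-ref vocab id)))
  ids)
(newline)
```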
#### `(transformer token pos config state weights)`

Run transformer forward pass.

```scheme
(transformer token-id position config state weights)
;; => updated state with new logits
```

### Transformer Components

The modular architecture provides fine-grained control over transformer computation:

#### Token Processing

```scheme
;; Load token embedding
(token-embedding-lookup state weights token-id)

;; Get positional frequencies
(let-values (((freq-real freq-imag)
              (get-rope-frequencies weights position head-size)))
  ...)
```

#### Attention Components

```scheme
;; Attention normalization
(attention-rmsnorm state weights layer-idx config)

;; Compute Q, K, V matrices
(compute-qkv state weights layer-idx config)

;; Apply rotary position embedding
(apply-rope state config freq-real freq-imag)

;; Cache key-value pairs
(cache-kv state layer-idx position config)

;; Compute attention scores and apply
(compute-attention state layer-idx position config)

;; Output projection
(attention-output state weights layer-idx config)
```

#### Feed-Forward Network

```scheme
;; FFN normalization
(ffn-rmsnorm state weights layer-idx config)

;; Compute W1 and W3 projections
(compute-ffn-w1w3 state weights layer-idx config)

;; Apply SwiGLU activation
(apply-swiglu state config)

;; Final projection
(ffn-output state weights layer-idx config)
```

#### Layer Processing

```scheme
;; Process complete transformer layer
(process-transformer-layer state weights layer-idx position config freq-real freq-imag)
```

### Utility Functions

#### Vector Operations

```scheme
;; RMS normalization
(rmsnorm output input weights)

;; Matrix-vector multiplication
(matmul output input matrix rows cols)

;; Softmax activation
(softmax output input size)

;; Vector accumulation (residual connections)
(accum target source)
```

#### Sampling Functions

```scheme
;; Greedy sampling (argmax)
(argmax logits-vector)

;; Probabilistic sampling
(sample probability-vector random-state)
```

### CLI Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--help` | `-h` | Show help message | - |
| `--checkpoint` | `-c` | Model checkpoint file (required) | - |
| `--tokenizer` | `-k` | Tokenizer file | `tokenizer.bin` |
| `--temperature` | `-t` | Sampling temperature (0.0-2.0) | `0.0` |
| `--steps` | `-s` | Number of tokens to generate | `256` |
| `--prompt` | `-p` | Input prompt text | `""` |
| `--seed` | | Random seed for sampling | Random |
| `--verify-checkpoint` | | Verify checkpoint integrity | `false` |

## 🔧 Configuration

### Model Files

- **Checkpoint**: Binary file containing model weights (`.bin`)
- **Tokenizer**: Binary file containing vocabulary and BPE merge rules

### Temperature Guidelines

- **0.0**: Deterministic (greedy sampling)
- **0.1-0.3**: Focused, coherent output
- **0.5-0.8**: Balanced creativity and coherence
- **0.9-1.2**: Creative, diverse output
- **1.5+**: Highly random, experimental

## Examples

### Interactive REPL Usage

```scheme
(import llama)

;; Load model
(define config (make-config 512 2048 8 8 8 32000 2048 #t))
(define weights (load-checkpoint "model.bin"))
(define state (make-run-state ...))

;; Generate single token
(transformer 1 0 config state weights)
(argmax (run-state-logits state))

;; Custom sampling
(define probs (softmax (make-f32vector 32000) (run-state-logits state) 32000))
(sample probs random-state)
```

### Batch Processing

```scheme
;; Process multiple prompts
(define prompts '("Hello world" "The meaning of life" "Once upon a time"))

(for-each
  (lambda (prompt)
    (printf "Prompt: ~A\n" prompt)
    (let ((args (make-args "model.bin" "tokenizer.bin" 0.5 50 prompt #f)))
      (run args)
      (newline)))
  prompts)
```
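### Manual Layer Stepping

The Transformer Components API above exposes each stage of the forward pass, so a pass can also be driven by hand. This is only a sketch, assuming the module is loaded as in the REPL example: the `config-*` accessor names are assumptions about the record definition, and the final normalization and classifier projection are left out.

```scheme
;; Hedged sketch: run the per-layer pipeline for one token by hand,
;; mirroring part of what (transformer ...) does internally.
;; config-dim, config-n-heads, and config-n-layers are assumed
;; record accessors; adjust to the actual definitions in this port.
(define (forward-one-token token pos config state weights)
  (let ((head-size (quotient (config-dim config)
                             (config-n-heads config))))
    (token-embedding-lookup state weights token)
    (let-values (((freq-real freq-imag)
                  (get-rope-frequencies weights pos head-size)))
      (do ((layer 0 (+ layer 1)))
          ((= layer (config-n-layers config)))
        (process-transformer-layer state weights layer pos config
                                   freq-real freq-imag))
      ;; the final rmsnorm and wcls projection are omitted here;
      ;; see (transformer ...) for the complete pass.
      state)))
```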
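### Temperature Sampling

The utility functions can also be combined into a temperature-controlled sampler. The sketch below is a minimal, hedged example: dividing each logit by the temperature before the softmax mirrors the standard llama2.c behavior, but the exact scaling inside this port is an assumption, and `random-state` is whatever state object `sample` expects.

```scheme
(import srfi-4)  ; f32vector operations

;; Hedged sketch: pick the next token from the current logits,
;; greedy at temperature 0.0, otherwise scaled-softmax sampling.
(define (sample-next-token state temperature vocab-size random-state)
  (let ((logits (run-state-logits state)))
    (if (zero? temperature)
        (argmax logits)                          ; greedy decoding
        (let ((scaled (make-f32vector vocab-size))
              (probs  (make-f32vector vocab-size)))
          ;; divide each logit by the temperature before softmax
          (do ((i 0 (+ i 1))) ((= i vocab-size))
            (f32vector-set! scaled i
                            (/ (f32vector-ref logits i) temperature)))
          (softmax probs scaled vocab-size)
          (sample probs random-state)))))
```

Called after `(transformer token pos config state weights)`, this yields the next token id for the chosen temperature.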
## License

MIT License - see LICENSE file for details.

## Acknowledgments

- Original LLAMA2 paper and implementation by Meta AI
- Andrej Karpathy's C implementation of LLAMA2 [llama2.c](https://github.com/karpathy/llama2.c)
- The LLAMA2 Common Lisp port [llama.cl](https://github.com/snunez1/llama.cl)
- The LLAMA2 OCaml port [llama2.ml](https://github.com/jackpeck/llama2.ml)
- BLAS library maintainers for high-performance linear algebra
- CHICKEN Scheme community for excellent libraries

## Original README.md

For instructions on conversions to/from `.bin` format, training, and other background, see the [original repo](https://github.com/karpathy/llama2.c).