# NanoGrad: Automatic Differentiation Framework for CHICKEN Scheme

A lightweight, YASOS-based automatic differentiation and neural network framework for CHICKEN Scheme, featuring BLAS-accelerated operations, batch processing support, and a clean functional API.

## Features

- **Automatic Differentiation**: Reverse-mode autodiff with topological sorting for correct gradient computation
- **Batch Processing**: Native support for batched operations across layers and loss functions
- **BLAS Integration**: High-performance linear algebra operations using CBLAS
- **YASOS Object System**: Clean, polymorphic object-oriented abstractions
- **Mixed Precision**: Support for both 32-bit (f32) and 64-bit (f64) floating-point
- **Neural Network Layers**: Dense layers with batch support, convolutional layers (3D/4D), batch normalization, and sequential containers
- **Activation Functions**: ReLU, Tanh, Sigmoid, Softmax (with batch support), LeakyReLU, Softplus, SiLU, GeLU
- **Optimizers**: SGD (with momentum), Adam, RMSprop
- **Loss Functions**: MSE, Cross-Entropy (with batch support)
- **Advanced Operations**: Convolution, RMSNorm (1D/2D), Layer Normalization, Batch Normalization (3D/4D), Global Pooling
- **Tensor Operations**: Reduction operations, slicing, reshaping with full gradient support

## Installation

```bash
# Install dependencies
chicken-install yasos blas mathh srfi-1 srfi-4 srfi-42 srfi-69

# Clone the repository
git clone https://github.com/iraikov/nanograd.git
cd nanograd
chicken-install
```

## Quick Start

### Basic Tensor Operations

```scheme
(import nanograd-autograd)

;; Create tensors with automatic differentiation
(define x (make-tensor32 (f32vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
(define y (make-tensor32 (f32vector 4.0 5.0 6.0) '(3) requires-grad?: #t))

;; Element-wise operations
(define z (add x y))   ; z = x + y
(define w (mul x y))   ; w = x * y

;; Matrix operations
(define A (make-tensor32 (f32vector 1.0 2.0 3.0 4.0) '(2 2)))
(define b (make-tensor32 (f32vector 1.0 2.0) '(2)))
(define result (matmul-op A b))  ; Matrix-vector multiplication

;; Compute gradients
(backward! result)
(print-tensor (tensor-grad A))
```

### Batch Processing

```scheme
;; Batch matrix multiplication
(define X (make-tensor32 (make-f32vector 60) '(10 2 3)))  ; 10 samples, 2x3 each
(define W (make-tensor32 (make-f32vector 12) '(3 4)))     ; Weight matrix 3x4

;; Each of the 10 samples is multiplied by W
(define Y (matmul-op X W))  ; Shape: (10, 2, 4)

;; Batch normalization
(define features (make-tensor32 (make-f32vector (* 32 64 8 8)) '(32 64 8 8)))
(define bn-layer (make-batch-norm-2d 64))

;; Training mode: uses batch statistics
(set-training-mode! bn-layer #t)
(define normalized (forward bn-layer features))  ; Normalized across batch

;; Evaluation mode: uses running statistics
(set-eval-mode! bn-layer)
(define test-normalized (forward bn-layer test-features))
```

### Reduction Operations

```scheme
;; Sum all elements
(define total (sum-tensor x))

;; Compute mean
(define avg (mean-tensor x))

;; Compute product
(define prod (product-tensor x))

;; Custom reduction with gradient
(define custom-result
  (reduce-tensor x max
                 compute-gradient:
                 (lambda (grad-out idx val all-values)
                   ;; Custom gradient logic
                   (if (= val (apply max all-values)) grad-out 0.0))))
```

### Tensor Slicing

```scheme
;; Extract slice along first dimension
(define batch (make-tensor32 (make-f32vector 100) '(10 10)))
(define slice (slice-tensor batch 2 5))  ; Extract elements 2-6 along first dim

;; Gradients flow back correctly
(backward! (sum-tensor slice))
(print-tensor (tensor-grad batch))  ; Only positions 2-6 have gradients
```
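### Reshaping and Flattening

Reshaping is also part of the autograd graph. The sketch below is a minimal illustration that assumes the `reshape` and `flatten-tensor` signatures listed in the API Reference; a reshape must preserve the total element count.

```scheme
;; Reinterpret a length-12 vector as a 3x4 matrix
(define v (make-tensor32 (make-f32vector 12 1.0) '(12)))
(define m (reshape v '(3 4)))      ; Shape: (3, 4), same 12 elements

;; Collapse back to 1D
(define flat (flatten-tensor m))   ; Shape: (12,)

;; Gradients flow through both views
(backward! (sum-tensor flat))
(print-tensor (tensor-grad v))     ; All ones
```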
### Building a Neural Network with Batch Support

```scheme
(import nanograd-layer nanograd-optimizer)

;; Define a simple classification network
(define model
  (make-sequential
   (list
    (make-dense-layer 784 128 activation: (make-relu) name: "Hidden1")
    (make-dense-layer 128 64 activation: (make-relu) name: "Hidden2")
    (make-dense-layer 64 10 activation: (make-identity) name: "Output"))
   name: "Classifier"))

;; Create optimizer
(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training loop with batches
(do ((epoch 1 (+ epoch 1)))
    ((> epoch 10))
  ;; Training mode
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))           ; Shape: (batch_size, 784)
            (target (cdr batch))      ; Shape: (batch_size, 10), one-hot
            (pred (forward model x))  ; Shape: (batch_size, 10)
            ;; Softmax and cross-entropy handle batches automatically
            (probs (softmax pred axis: -1))  ; Softmax along last axis
            (loss (cross-entropy-loss probs target reduction: 'mean)))
       ;; Backward pass and optimize
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   training-data)
  ;; Evaluation mode
  (set-eval-mode! model)
  (evaluate-model model validation-data))
```

### Convolutional Neural Network with Batch Normalization

```scheme
(define cnn
  (make-sequential
   (list
    ;; Handles both 3D (C,H,W) and 4D (N,C,H,W) inputs
    (make-conv2d-layer 3 32 3 stride: 1 padding: 1
                       activation: (make-relu) name: "Conv1")
    (make-batch-norm-2d 32 name: "BN1")   ; Normalizes across batch
    (make-conv2d-layer 32 64 3 stride: 1 padding: 1
                       activation: (make-relu) name: "Conv2")
    (make-batch-norm-2d 64 name: "BN2")
    ;; Flatten: (N,64,H,W) -> (N, 64*H*W) or (64,H,W) -> (64*H*W,)
    (make-flatten name: "Flatten")
    (make-dense-layer (* 64 32 32) 128 activation: (make-relu) name: "FC1")
    (make-dense-layer 128 10 activation: (make-identity) name: "Output"))
   name: "CNN"))

;; Forward pass with batched 32x32 images (the 3x3 convolutions with
;; padding 1 preserve the spatial size, so the flattened width is 64*32*32)
(define batch-images (make-tensor32 batch-data '(32 3 32 32)))  ; 32 RGB images
(define predictions (forward cnn batch-images))  ; Shape: (32, 10)
```

## Architecture

### Module Structure

- **`nanograd-autograd`**: Core automatic differentiation engine
  - Tensor abstraction with YASOS
  - Arithmetic operations (add, sub, mul, div)
  - BLAS operations (matmul, dot, scale) with batch support
  - Activation functions (including batched softmax/log-softmax)
  - Loss functions with batch reduction
  - Reduction operations (sum, mean, product, custom reductions)
  - Tensor manipulation (slice, reshape, flatten)
  - Gradient computation with cycle detection

- **`nanograd-layer`**: Neural network layer abstractions
  - Dense (fully connected) layers with 1D/2D input support
  - Convolutional layers (2D) with 3D/4D input support
  - Batch normalization (2D) with 3D/4D input support
  - Global average pooling with 3D/4D support
  - Sequential containers
  - Activation function objects
  - Training/evaluation mode control

- **`nanograd-optimizer`**: Optimization algorithms
  - SGD with momentum and Nesterov
  - Adam with bias correction
  - RMSprop with momentum

### Design Principles

1. **Functional Programming**: Immutable tensors, pure operations where possible
2. **YASOS Objects**: Clean polymorphic dispatch for operations
3. **BLAS Efficiency**: Leverage optimized linear algebra for performance
4. **Batch-First Design**: Native batch support throughout the stack
5. **Explicit Gradient Management**: Manual control over backward passes
6. **Mixed Precision**: First-class support for both f32 and f64 (see the sketch below)
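As a brief illustration of principles 5 and 6, the following sketch builds f64 tensors and drives the backward pass by hand. It uses only the constructors and gradient operations documented in the API Reference below and is not specific to any one layer type.

```scheme
;; 64-bit tensors use the same operations as 32-bit ones
(define p (make-tensor64 (f64vector 1.0 2.0 3.0) '(3) requires-grad?: #t))
(define q (make-tensor64 (f64vector 0.5 0.5 0.5) '(3)))

;; Explicit gradient management: backward passes and gradient clearing
;; are always triggered by the user, never implicitly
(define loss (sum-tensor (mul p q)))
(backward! loss)
(print-tensor (tensor-grad p))   ; d(loss)/dp = q = 0.5 0.5 0.5
(zero-grad! p)                   ; Clear gradients before the next backward pass
```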
## API Reference

### Tensor Operations

#### Constructors

```scheme
(make-tensor32 data shape #:key (requires-grad? #t))
(make-tensor64 data shape #:key (requires-grad? #t))
```

#### Accessors

```scheme
(tensor-data tensor)           ; Get underlying data vector
(tensor-grad tensor)           ; Get gradient vector
(tensor-shape tensor)          ; Get shape list
(tensor-dtype tensor)          ; Get dtype ('f32 or 'f64)
(tensor-requires-grad? t)      ; Check if gradients enabled
```

#### Arithmetic

```scheme
(add a b)   ; Element-wise addition
(sub a b)   ; Element-wise subtraction
(mul a b)   ; Element-wise multiplication
(div a b)   ; Element-wise division
(safe-div a b #:key (epsilon 1e-8))
```

#### Linear Algebra

```scheme
(matmul-op a b)          ; Matrix multiplication (batch-aware)
(dot-op a b)             ; Dot product
(scale-op tensor scalar) ; Scalar multiplication
```

#### Reduction Operations

```scheme
(reduce-tensor tensor reducer #:key (compute-gradient #f))
                          ; Generic reduction with custom gradient
(sum-tensor tensor)       ; Sum all elements
(mean-tensor tensor)      ; Mean of all elements
(product-tensor tensor)   ; Product of all elements
```

#### Tensor Manipulation

```scheme
(slice-tensor tensor start length) ; Extract slice along first dimension
(reshape tensor new-shape)         ; Reshape tensor
(flatten-tensor tensor)            ; Flatten to 1D
```

#### Activations (Batch-Aware)

```scheme
(relu tensor)             ; ReLU activation
(tanh-op tensor)          ; Hyperbolic tangent
(sigmoid tensor)          ; Sigmoid (logistic)
(sigmoid-stable tensor)   ; Numerically stable sigmoid

;; Batch-aware softmax
(softmax tensor #:key (axis -1))
  ; 1D: (n_classes,) -> standard softmax
  ; 2D: (batch_size, n_classes) -> softmax along axis
(log-softmax tensor #:key (axis -1))
  ; More stable than log(softmax(x))

(silu tensor)             ; SiLU
(gelu tensor)             ; GeLU
(leaky-relu tensor #:key (alpha 0.01))
(softplus tensor #:key (beta 1.0))
```

#### Loss Functions (Batch-Aware)

```scheme
(mse-loss pred target #:key (reduction 'mean))
  ; reduction: 'mean (average over batch) or 'sum

(cross-entropy-loss pred target #:key (reduction 'mean) (from-logits #f))
  ; Supports both:
  ;  - 1D: (n_classes,) for single sample
  ;  - 2D: (batch_size, n_classes) for batches
  ; target can be one-hot or class indices
  ; from-logits: if true, applies log-softmax first
```

#### Normalization (Batch-Aware)

```scheme
(rmsnorm x weight #:key (epsilon 1e-5))
  ; 1D: (d_model,) -> standard RMSNorm
  ; 2D: (batch_size, d_model) -> RMSNorm per batch element

(l2-normalize tensor #:key (axis #f) (epsilon 1e-8))
  ; axis=#f: normalize entire tensor
  ; axis=n: normalize along specific axis (for 2D tensors)
```

#### Gradient Operations

```scheme
(zero-grad! tensor)         ; Zero out gradients
(backward! tensor)          ; Compute gradients via backprop
(add-to-grad! tensor delta) ; Accumulate gradients
```
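A common way to exercise these operations is a finite-difference gradient check. The sketch below is illustrative only: the helpers `loss-graph`, `loss-value`, and `h` are ad-hoc names, and it assumes that `sum-tensor` yields a single-element tensor whose value can be read back through `tensor-data`.

```scheme
;; Gradient check for f(x) = sum(x * x); the analytic gradient is 2x
(import srfi-4 nanograd-autograd)

;; Rebuild the graph for a given input vector; return (input-tensor . loss)
(define (loss-graph vec)
  (let ((t (make-tensor32 vec '(3) requires-grad?: #t)))
    (cons t (sum-tensor (mul t t)))))

;; Autograd gradient at x = (1 2 3)
(define g (loss-graph (f32vector 1.0 2.0 3.0)))
(backward! (cdr g))
(print-tensor (tensor-grad (car g)))       ; Expect 2.0 4.0 6.0

;; Central finite difference for the first component
(define (loss-value vec)
  (f32vector-ref (tensor-data (cdr (loss-graph vec))) 0))
(define h 1e-3)
(print (/ (- (loss-value (f32vector (+ 1.0 h) 2.0 3.0))
             (loss-value (f32vector (- 1.0 h) 2.0 3.0)))
          (* 2 h)))                        ; Should be close to 2.0
```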
### Layer API

#### Layer Construction

```scheme
(make-dense-layer input-size output-size
                  #:key (activation (make-identity)) (dtype 'f32) (name "Dense"))
  ; Supports:
  ;  1D input: (input_size,) -> (output_size,)
  ;  2D input: (batch_size, input_size) -> (batch_size, output_size)

(make-conv2d-layer in-channels out-channels kernel-size
                   #:key (stride 1) (padding 0)
                         (activation (make-identity)) (dtype 'f32) (name "Conv2D"))
  ; Supports:
  ;  3D input: (C, H, W) -> (C_out, H_out, W_out)
  ;  4D input: (N, C, H, W) -> (N, C_out, H_out, W_out)

(make-batch-norm-2d num-features
                    #:key (epsilon 1e-5) (momentum 0.1) (dtype 'f32) (name "BatchNorm2d"))
  ; Supports:
  ;  3D input: (C, H, W) - treats as batch of 1
  ;  4D input: (N, C, H, W) - normalizes across batch dimension

(make-sequential layers #:key (name "Sequential"))
```

#### Global Average Pooling (Batch-Aware)

```scheme
(global-avg-pool2d input)
  ; 3D: (C, H, W) -> (C,)
  ; 4D: (N, C, H, W) -> (N, C)
  ; Averages over spatial dimensions
```

#### Layer Operations

```scheme
(forward layer input)       ; Forward pass (batch-aware)
(parameters layer)          ; Get trainable parameters
(zero-grad-layer! layer)    ; Zero all parameter gradients

;; Training/Evaluation Mode Control
(set-training-mode! layer training?) ; Set training mode
(set-eval-mode! layer)               ; Set evaluation mode
```

#### Activation Objects

```scheme
(make-relu)      ; ReLU activation
(make-tanh)      ; Tanh activation
(make-sigmoid)   ; Sigmoid activation
(make-silu)      ; SiLU activation
(make-gelu)      ; GeLU activation
(make-identity)  ; No activation
```

### Optimizer API

#### Optimizer Construction

```scheme
(make-sgd parameters
          #:key (learning-rate 0.01) (momentum 0.0)
                (weight-decay 0.0) (nesterov #f))

(make-adam parameters
           #:key (learning-rate 0.001) (beta1 0.9) (beta2 0.999)
                 (epsilon 1e-8) (weight-decay 0.0))

(make-rmsprop parameters
              #:key (learning-rate 0.01) (alpha 0.99) (epsilon 1e-8)
                    (weight-decay 0.0) (momentum 0.0))
```

#### Optimizer Operations

```scheme
(step! optimizer)                   ; Apply parameter updates
(get-learning-rate optimizer)       ; Get current learning rate
(set-learning-rate! optimizer lr)   ; Update learning rate
(optimizer-state optimizer)         ; Get optimizer configuration
```
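The learning-rate accessors make simple schedules easy to express by hand. A minimal step-decay sketch, assuming an `epoch` counter maintained by the surrounding training loop (the helper name `decay-learning-rate!` is illustrative):

```scheme
;; Halve the learning rate every 10 epochs
(define (decay-learning-rate! optimizer epoch)
  (when (and (> epoch 0) (zero? (modulo epoch 10)))
    (set-learning-rate! optimizer (* 0.5 (get-learning-rate optimizer)))))

;; Call once per epoch inside the training loop:
;; (decay-learning-rate! optimizer epoch)
```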
## Examples

### Batch Processing with Dense Layers

```scheme
(import nanograd-autograd nanograd-layer)

;; Create a batch of inputs
(define batch-size 32)
(define input-dim 784)
(define batch-data (make-f32vector (* batch-size input-dim)))
;; ... fill with data ...
(define batch-input (make-tensor32 batch-data (list batch-size input-dim)))

;; Dense layer automatically handles batches
(define layer (make-dense-layer input-dim 128 activation: (make-relu)))
(define output (forward layer batch-input))  ; Shape: (32, 128)

;; RMS normalization example
(define features (make-tensor32 (make-f32vector (* 32 128)) '(32 128)))
(define gamma (make-tensor32 (make-f32vector 128 1.0) '(128)))  ; Scale weights
(define normalized (rmsnorm features gamma))  ; Normalized per batch element
```

### Batched Softmax and Cross-Entropy

```scheme
;; Batch of logits
(define logits (make-tensor32 (make-f32vector (* 32 10)) '(32 10)))
(define targets (make-tensor32 target-data '(32 10)))  ; One-hot encoded

;; Softmax along the class dimension (last axis)
(define probs (softmax logits axis: -1))  ; Shape: (32, 10), sums to 1 per row

;; Cross-entropy handles batches automatically
(define loss (cross-entropy-loss probs targets reduction: 'mean))

;; Alternative: use from-logits for numerical stability
(define loss-stable
  (cross-entropy-loss logits targets from-logits: #t reduction: 'mean))
```

### Complete Training Example with Batches

```scheme
(import nanograd-autograd nanograd-layer nanograd-optimizer)

;; Define model
(define model
  (make-sequential
   (list
    (make-dense-layer 784 256 activation: (make-relu))
    (make-dense-layer 256 128 activation: (make-relu))
    (make-dense-layer 128 10 activation: (make-identity)))
   name: "BatchMLP"))

(define optimizer (make-adam (parameters model) learning-rate: 0.001))

;; Training with batches
(define (train-epoch train-batches)
  (set-training-mode! model #t)
  (for-each
   (lambda (batch)
     (let* ((x (car batch))   ; Shape: (batch_size, 784)
            (y (cdr batch))   ; Shape: (batch_size, 10)
            (logits (forward model x))
            (loss (cross-entropy-loss logits y from-logits: #t reduction: 'mean)))
       (backward! loss)
       (step! optimizer)
       (zero-grad-layer! model)))
   train-batches))

;; Evaluation
(define (evaluate test-batches)
  (set-eval-mode! model)
  (let ((total-correct 0)
        (total-samples 0))
    (for-each
     (lambda (batch)
       (let* ((x (car batch))
              (y (cdr batch))
              (batch-size (car (tensor-shape x)))
              (logits (forward model x))
              (probs (softmax logits axis: -1)))
         ;; Count correct predictions per batch
         ;; (implementation details omitted)
         ))
     test-batches)
    (/ total-correct total-samples)))
```

### Convolutional Network with Batch Support

```scheme
(define cnn
  (make-sequential
   (list
    (make-conv2d-layer 3 32 3 padding: 1 activation: (make-relu))
    (make-batch-norm-2d 32)
    (make-conv2d-layer 32 64 3 padding: 1 activation: (make-relu))
    (make-batch-norm-2d 64)
    (make-flatten)
    (make-dense-layer (* 64 32 32) 10))
   name: "CNN"))

;; Process batch of images
(define batch-images (make-tensor32 image-data '(16 3 32 32)))  ; 16 images
(set-training-mode! cnn #t)
(define predictions (forward cnn batch-images))  ; Shape: (16, 10)
```

## Performance Notes

- NanoGrad uses BLAS for matrix operations, including batched GEMM
- Batch operations are significantly more efficient than processing samples individually
- Use f32 (32-bit) tensors when 64-bit precision is not required
- The framework detects computation graph cycles
- Batch normalization adds minimal overhead and improves training stability
- Global average pooling reduces parameter count without sacrificing accuracy

## Batch Processing Best Practices

1. Always use batches during training for better performance and more stable gradients (see the batching sketch below)
2. Set appropriate batch sizes (typically 16-256, depending on memory)
3. Use batch normalization for deeper networks (>10 layers)
4. Switch to eval mode during validation/testing to use running statistics
5. Prefer global average pooling over large fully-connected layers
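Since batching is manual (see Limitations below), one way to form batches is to slice a dataset tensor along its first dimension. This is a sketch under assumptions: `train-x` is a hypothetical `(600, 784)` input tensor, `make-batches` is an ad-hoc helper, and `slice-tensor` behaves as documented above (slicing along the first dimension). The same approach applies to the target tensor.

```scheme
;; Split a (n_samples, 784) dataset tensor into batches along the first axis;
;; the last batch may be smaller than batch-size
(define (make-batches dataset n-samples batch-size)
  (let loop ((start 0) (acc '()))
    (if (>= start n-samples)
        (reverse acc)
        (let ((len (min batch-size (- n-samples start))))
          (loop (+ start len)
                (cons (slice-tensor dataset start len) acc))))))

(define batches (make-batches train-x 600 32))  ; List of (32, 784) tensors
```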
## Limitations

- CPU-only (no GPU support)
- No automatic batching (batches must be created manually)
- Limited set of built-in layer types
- Single-threaded execution

## Dependencies

- **yasos**: Object system
- **blas**: BLAS bindings for CHICKEN
- **mathh**: Extended math functions
- **srfi-1**: List utilities
- **srfi-4**: Homogeneous numeric vectors
- **srfi-42**: Eager comprehensions
- **srfi-69**: Hash tables

## License

LGPLv3 - see the LICENSE file for details.

## Acknowledgments

This framework is inspired by:

- **PyTorch**: Dynamic computation graphs, autograd design, and batch-first conventions
- [micrograd](https://github.com/karpathy/micrograd): Minimalistic autograd engine by Andrej Karpathy
- [tinygrad](https://github.com/tinygrad/tinygrad): Small neural network framework

Built with CHICKEN Scheme and powered by YASOS and BLAS.