PhotonInfer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

🚀 Performance Highlights

PhotonInfer delivers production-grade inference performance for LLMs with advanced batching capabilities. Supports Llama-3.2 and Qwen3 models.

Single Request Inference

| Model | PhotonInfer | llama.cpp | Speedup | |-------|-------------|-----------|---------| | Llama 3.2 1B | 185 tok/s | 252 tok/s | 0.73× (llama.cpp faster) |

TTFT (Time To First Token): 387ms @ 100-token prompt (INT8 quantization)

Batched Inference Throughput

| Batch Size | PhotonInfer | llama.cpp | Speedup | |------------|-------------|-----------|---------| | 4 | 410 tok/s | 252 tok/s | 1.63× | | 8 | 720 tok/s | 255 tok/s | 2.82× | | 16 | 787 tok/s | 253 tok/s | 3.07× |

Tested on: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization

✨ Key Features

🎯 vLLM-Style Continuous Batching

Token-level dynamic scheduling: Add new requests mid-generation without waiting for batch completion
Two-phase scheduler: Seamlessly continue running requests while admitting new ones
Request state tracking: Precise num_computed_tokens management for efficient resume
Perfect for production: High-concurrency inference services with real-time responsiveness

⚡ GPU-Optimized Kernels

Batched Paged Attention: Block-level KV cache management with efficient memory utilization
Vectorized Memory Access: float4 loads for 2-4× bandwidth efficiency
Fused Operations: Zero-copy GPU sampling, batched RoPE, and fused normalization
INT8 Quantization: Group-wise quantization with cuBLASLt INT8 GEMM support
Optimized Softmax: CUB BlockReduce for numerically stable attention computation

🏗️ Modern C++20 Architecture

Type-Safe Error Handling: Rust-inspired Result type for explicit error propagation
Zero-Copy Design: Extensive use of std::span and move semantics
Device Agnostic: Unified interface for CPU and CUDA backends
Concepts & Ranges: Compile-time constraints and expressive type safety

🚀 Quick Start

Prerequisites

Compiler: GCC 12+ (C++20 support required)
CMake: 3.20+
CUDA Toolkit: 12.0+ (tested on 12.5)
GPU: NVIDIA GPU with Compute Capability 7.0+

Download Model

Download a pre-quantized model to get started quickly:

https://huggingface.co/Lummy666/llama-3.2-1B-Instruct

Build

#### Option 1: Build from Source

# Clone repository
cd photon_infer
Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..
Build
cmake --build . -j$(nproc)
Install (optional)
sudo cmake --install .

After installation, you can run the web server directly from anywhere:

photon_web_server \
    --port 5728 \
    --model /path/to/llama-3.2-1B-Instruct \
    --tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json

The installation will place:

photon_web_server → /usr/local/bin/
Static web files → /photon_infer/web/static/
Core library → /usr/local/lib/

To uninstall:

cd build
sudo cmake --build . --target uninstall

#### Option 2: Use Docker (Recommended)

# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest
Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest

The web interface will be available at http://localhost:5728

🔬 Technical Details

INT8 Quantization

Group-wise quantization: Configurable group size (32, 64, 128)
cuBLASLt integration: Hardware-accelerated INT8 GEMM
Minimal accuracy loss: < 1% perplexity degradation on Llama models

Paged Attention

Block-level KV cache: Efficient memory allocation without fragmentation
Dynamic sequence management: Per-sequence cache offsets for flexible scheduling
Batched cache operations: Single kernel for multi-sequence K/V writes

Continuous Batching Scheduler

Two-phase scheduling:
Phase 1: Continue all RUNNING requests (no interruption)
Phase 2: Admit WAITING requests to fill remaining capacity
Request states: WAITING → RUNNING → FINISHED (with PREEMPTED support)
Token-level granularity: num_computed_tokens tracking for precise resume

🛣️ Roadmap

[x] Core Infrastructure: Tensor, operators, memory management
[x] LLaMA Model: Full transformer implementation with CPU/GPU kernels
[x] INT8 Quantization: Group-wise quantization with cuBLASLt
[x] Paged Attention: Block-level KV cache management
[x] Continuous Batching: vLLM-style dynamic request scheduling
[ ] Flash Attention 2: IO-aware attention for long sequences
[ ] Multi-GPU Support: Tensor parallelism for large models
[ ] FP16/BF16 Mixed Precision: Enhanced throughput on modern GPUs
[ ] Speculative Decoding: Multi-token generation with draft model

📖 Documentation

🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

📝 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Architecture inspired by vLLM
Kernel optimizations reference llama.cpp
Error handling design from Rust's Result

---

Built with ❤️ for high-performance LLM inference

--- Tranlated By Open Ai Tx | Last indexed: 2026-03-22 ---