
# PhotonInfer

A high-performance LLM inference engine with vLLM-style continuous batching, by lumia431.

English | 中文 | Live Demo

License: MIT | CUDA | C++20


## 🚀 Performance Highlights

PhotonInfer delivers production-grade inference performance for LLMs with advanced batching capabilities. Supports Llama-3.2 and Qwen3 models.

### Single Request Inference

| Model | PhotonInfer | llama.cpp | Speedup |
|-------|-------------|-----------|---------|
| Llama 3.2 1B | 185 tok/s | 252 tok/s | 0.73× (llama.cpp faster) |

**TTFT (Time To First Token):** 387 ms @ 100-token prompt (INT8 quantization)
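As a rough mental model, end-to-end generation latency is the TTFT plus the remaining tokens divided by the steady-state decode rate. A sketch using the single-request numbers above (the 185 tok/s decode rate and the 100-token generation length are assumptions for illustration):

```python
def estimated_latency_s(ttft_ms: float, decode_tok_per_s: float, new_tokens: int) -> float:
    """First token arrives at TTFT; each subsequent token takes 1/throughput seconds."""
    return ttft_ms / 1000.0 + (new_tokens - 1) / decode_tok_per_s

# e.g. generating 100 tokens at the reported single-request figures:
print(f"{estimated_latency_s(387, 185, 100):.2f} s")  # ~0.92 s
```

This ignores scheduling and network overhead, so treat it as a lower bound.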

### Batched Inference Throughput

| Batch Size | PhotonInfer | llama.cpp | Speedup |
|------------|-------------|-----------|---------|
| 4 | 410 tok/s | 252 tok/s | 1.63× |
| 8 | 720 tok/s | 255 tok/s | 2.82× |
| 16 | 787 tok/s | 253 tok/s | 3.07× |

*Tested on: NVIDIA A100, Llama 3.2 1B, Q8/INT8 quantization*
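The speedup column is simply PhotonInfer's throughput divided by the llama.cpp baseline at the same batch size. A quick cross-check of the reported figures:

```python
# Throughput numbers (tok/s) taken from the tables above.
llama_cpp = {1: 252, 4: 252, 8: 255}   # batch size -> llama.cpp baseline
photon    = {1: 185, 4: 410, 8: 720}   # batch size -> PhotonInfer

speedups = {b: round(photon[b] / llama_cpp[b], 2) for b in photon}
print(speedups)  # {1: 0.73, 4: 1.63, 8: 2.82}
```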

## ✨ Key Features

### 🎯 vLLM-Style Continuous Batching

### GPU-Optimized Kernels

### 🏗️ Modern C++20 Architecture

## 🚀 Quick Start

### Prerequisites

### Download Model

Download a pre-quantized model to get started quickly:

https://huggingface.co/Lummy666/llama-3.2-1B-Instruct

### Build

#### Option 1: Build from Source

```bash
# Clone repository
cd photon_infer

# Configure with CUDA
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DPHOTON_BUILD_CUDA=ON ..

# Build
cmake --build . -j$(nproc)

# Install (optional)
sudo cmake --install .
```

After installation, you can run the web server directly from anywhere:

```bash
photon_web_server \
    --port 5728 \
    --model /path/to/llama-3.2-1B-Instruct \
    --tokenizer /path/to/llama-3.2-1B-Instruct/tokenizer.json
```

The installation will place:

To uninstall:

```bash
cd build
sudo cmake --build . --target uninstall
```

#### Option 2: Use Docker (Recommended)

```bash
# Pull the pre-built Docker image
docker pull lumia431/photon_infer:latest

# Run the container with GPU support
docker run --rm --gpus all -p 5728:5728 -e PORT=5728 lumia431/photon_infer:latest
```

The web interface will be available at http://localhost:5728.

## 🔬 Technical Details

### INT8 Quantization
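INT8 quantization typically stores weights as 8-bit integers with a per-row (or per-channel) scale and dequantizes on the fly. A minimal NumPy sketch of the general symmetric scheme (PhotonInfer's exact recipe isn't spelled out in this README):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row INT8: scale each row so its max magnitude maps to 127."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)              # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())
print(f"max abs error: {max_err:.5f}")
```

The round-trip error per element is bounded by half the row's scale, which is what makes per-row scaling attractive over a single global scale.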

### Paged Attention
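Paged attention splits each sequence's KV cache into fixed-size blocks and resolves a token's slot through a per-sequence block table, so long sequences need no contiguous allocation. A toy version of the slot mapping (the block size of 16 and the API are illustrative, not PhotonInfer's actual code):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative choice)

class BlockTable:
    """Maps a sequence's logical token index to a (physical block, offset) slot."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of free physical block ids
        self.blocks = []                # this sequence's blocks, in logical order
        self.length = 0                 # tokens stored so far

    def append_token(self):
        if self.length % BLOCK_SIZE == 0:        # current block full (or none yet)
            self.blocks.append(self.free.pop())  # grab any free physical block
        slot = (self.blocks[-1], self.length % BLOCK_SIZE)
        self.length += 1
        return slot

table = BlockTable(free_blocks=range(100))
slots = [table.append_token() for _ in range(18)]
# The first block fills completely before a second one is allocated.
print(slots[0], slots[15], slots[16])
```

Because blocks are claimed one at a time from a shared pool, fragmentation stays bounded and blocks can be shared or freed per sequence.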

### Continuous Batching Scheduler
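A continuous-batching scheduler admits waiting requests into the running batch at every decode iteration and frees slots the moment a request finishes, rather than draining the whole batch first. A schematic loop (request shapes and the batch limit are made up for illustration):

```python
from collections import deque

def run_schedule(requests, max_batch=4):
    """requests: list of (id, tokens_to_generate). Returns the running set per step."""
    waiting = deque(requests)
    running = {}          # request id -> tokens still to generate
    trace = []
    while waiting or running:
        # Admit new requests whenever a batch slot is free (the "continuous" part).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        trace.append(sorted(running))
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # finished requests free their slot immediately
    return trace

trace = run_schedule([("a", 2), ("b", 3), ("c", 1), ("d", 2), ("e", 1)], max_batch=3)
print(trace)  # [['a', 'b', 'c'], ['a', 'b', 'd'], ['b', 'd', 'e']]
```

Note how `d` enters as soon as `c` finishes: the GPU never idles waiting for stragglers, which is where the batched-throughput gains in the table above come from.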

## 🛣️ Roadmap

## 📖 Documentation

## 🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

## 📝 License

MIT License - see LICENSE for details.

## 🙏 Acknowledgments

---

Built with ❤️ for high-performance LLM inference
