
Inference Frameworks Introduction

Text-to-Text Model Inference Framework

The ChanShen community supports the following three text-to-text inference frameworks: vLLM, SGLang, and TGI. Each framework has its own strengths and use cases and is suited to different inference tasks. When selecting a framework, choose the most appropriate tool for your specific scenario and requirements.

1. vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Originally developed at UC Berkeley's Sky Computing Lab, vLLM has since evolved into a community-driven project with contributions from both academia and industry.

Features:

  • High Performance: Delivers state-of-the-art serving throughput, with efficient attention key/value memory management via PagedAttention, fast model execution with CUDA/HIP graphs, quantization support (GPTQ, AWQ, INT4, INT8, FP8), optimized CUDA kernels, and integration with FlashAttention and FlashInfer.
  • Flexibility: Supports popular Hugging Face models, a variety of decoding algorithms (e.g., parallel sampling, beam search), and tensor parallelism and pipeline parallelism for distributed inference.
  • Multi-Platform Support: Compatible with NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron platforms.
  • Model Support: Seamlessly supports most open-source models from Hugging Face, including Transformer-based LLMs (e.g., Llama), mixture-of-experts LLMs (e.g., Mixtral), and multimodal LLMs (e.g., LLaVA).

Use Cases:

Suitable for scenarios that require high throughput, high-performance inference, and multi-platform support; vLLM is especially strong in large-scale inference workloads, where it provides robust computational support.
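
As a quick illustration, the sketch below shows minimal offline batched inference with vLLM's Python API. The model name, prompts, and sampling parameters are illustrative assumptions; any Hugging Face model supported by vLLM can be substituted.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one sentence.",
    "What is PagedAttention?",
]

# Sampling settings are placeholders; tune them for your task.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# The model name is an assumption; tensor_parallel_size splits the model
# across multiple GPUs for distributed inference.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```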

2. SGLang

SGLang is a fast serving framework for large language models and vision-language models. It enables faster and more controllable interaction with models through a co-designed backend runtime and frontend language.

Features:

  • Efficient Backend Runtime: Includes RadixAttention for prefix caching, jump-forward constrained decoding, a zero-overhead CPU scheduler, continuous batching, tensor parallelism, FlashInfer kernels, and more.
  • Flexible Frontend Language: Provides an intuitive interface for programming LLM applications, including chain generation calls, complex prompts, multimodal inputs, and parallelization.
  • Wide Model Support: Supports generation models such as Llama, Gemma, Mistral, Qwen, DeepSeek, and LLaVA, as well as embedding models (e.g., e5-mistral) and reward models (e.g., Skywork).
  • Active Community: SGLang is open source and backed by an active community, with wide adoption across industry.

Use Cases:

Ideal for scenarios that require a flexible, efficient inference framework, especially multimodal applications, customized generation, and situations that need fine-grained control over the generation process.
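
As a sketch of the frontend language, the example below chains two generation calls in a single program. It assumes an SGLang server has already been launched locally (see the comment in the code); the function name, questions, and port are illustrative.

```python
import sglang as sgl

# A multi-turn program written in SGLang's frontend language.
# Each sgl.gen(...) call is executed by the backend runtime, which can
# reuse shared prompt prefixes via RadixAttention.
@sgl.function
def multi_turn_qa(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=128))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=128))

# Assumes a local SGLang server was started beforehand, e.g.:
#   python -m sglang.launch_server --model-path <model> --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(
    question1="What is prefix caching?",
    question2="Why does it help multi-turn chat?",
)
print(state["answer1"])
print(state["answer2"])
```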

3. TGI (Text Generation Inference)

TGI is a toolkit for deploying and serving large language models, designed to provide high-performance text generation services for the most popular open-source LLMs.

Features:

  • Easy Startup: Supports quick service initialization for popular LLMs like Llama, Falcon, StarCoder, and more.
  • Production Ready: Supports distributed tracing (OpenTelemetry) and Prometheus metrics, ensuring stability and observability in production environments.
  • Efficient Inference: Supports tensor parallelism, token streaming (via Server-Sent Events), and continuous batching of incoming requests to optimize inference performance.
  • Quantization and Customization: Supports various quantization methods (e.g., GPTQ, AWQ, FP8) and offers custom prompt generation and fine-tuning support.

Use Cases:

Suitable for production environments that require efficient, low-latency inference, particularly large-scale and customized text generation tasks, where TGI provides flexible configuration and efficient execution.
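
As an illustration, the sketch below queries a TGI server that is assumed to be already running on localhost:8080 (for example, started from the official Docker image). The prompt and generation parameters are placeholders; the request and response follow TGI's /generate schema.

```python
import requests

# Assumes a TGI server is already serving a model on localhost:8080.
payload = {
    "inputs": "Write a haiku about text generation.",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

# TGI's /generate endpoint returns the completed text; /generate_stream
# provides token streaming via Server-Sent Events instead.
response = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```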