Serving tools

Endpoint provider

AnyScale

Fast, cost-efficient, serverless APIs for LLM Serving and Fine Tuning.
It allows to serve and fine-tune open-models the same way as openai does.

Hugging Face endpoints

With Inference Endpoints (dedicated), easily deploy Transformers, Diffusers or any model on dedicated, fully managed infrastructure. Keep your costs low with our secure, compliant and flexible production solution.

Libraries

vLLM

This is a fast and easy-to-use library for LLM inference and serving, offering:

Text Generation Inference (TGI)

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoint.

🤗 Model Memory Calculator

This tool help calculate how much vRAM is needed to train and perform big model inference on a model.