Serving tools

Endpoint provider

AnyScale

Fast, cost-efficient, serverless APIs for LLM Serving and Fine Tuning.
It allows to serve and fine-tune open-models the same way as openai does.

Hugging Face endpoints

With Inference Endpoints (dedicated), easily deploy Transformers, Diffusers or any model on dedicated, fully managed infrastructure. Keep your costs low with our secure, compliant and flexible production solution.

Libraries

vLLM

This is a fast and easy-to-use library for LLM inference and serving, offering:

Truss

The simplest way to serve AI/ML models in production

Langcorn

LangCorn is an API server that enables you to serve LangChain models and pipelines with ease, leveraging the power of FastAPI for a robust and efficient experience.

Text Generation Inference (TGI)

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power Hugging Chat, the Inference API and Inference Endpoint.

🤗 Model Memory Calculator

This tool help calculate how much vRAM is needed to train and perform big model inference on a model.