# Hosting AI Models

Guide to hosting AI models for use with Routstr.
This guide explains how to host AI models for use with the Routstr proxy. As a provider, you’ll need to run a model server with an OpenAI-compatible API that your Routstr proxy can connect to.
## Overview
When hosting models for Routstr, you have several options:
- Self-hosted open-source models: Run open-source models locally using tools like Ollama, LM Studio, or vLLM
- Commercial API proxying: Proxy requests to commercial APIs like OpenAI, Anthropic, etc.
- Cloud-hosted models: Deploy models on cloud infrastructure for better scaling
## Self-Hosted Open-Source Models
### Ollama
Ollama provides a simple way to run open-source models locally with an OpenAI-compatible API endpoint.
Configure the Routstr proxy to connect to Ollama by setting the appropriate environment variables:
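A minimal sketch follows. The variable names `UPSTREAM_BASE_URL` and `UPSTREAM_API_KEY` are illustrative assumptions rather than confirmed Routstr settings; check the proxy’s configuration reference for the exact keys. Ollama serves its OpenAI-compatible API at `/v1` on port 11434 by default.

```bash
# Pull a model and start Ollama (listens on http://localhost:11434 by default)
ollama pull llama3
ollama serve

# Point the Routstr proxy at Ollama's OpenAI-compatible endpoint.
# Variable names are illustrative; consult the Routstr proxy docs for the exact keys.
export UPSTREAM_BASE_URL="http://localhost:11434/v1"
export UPSTREAM_API_KEY="ollama"   # Ollama ignores the key, but OpenAI-style clients expect one
```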
### vLLM
vLLM is a high-performance inference engine for large language models that supports the OpenAI API format.
Configure the Routstr proxy to connect to vLLM:
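As a sketch, assuming a recent vLLM release with the `vllm serve` entry point (the model name is only an example, and the upstream variable name is illustrative):

```bash
# Launch vLLM's OpenAI-compatible server (listens on port 8000 by default)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Point the Routstr proxy at the vLLM endpoint.
# The variable name is illustrative; check the Routstr proxy configuration reference.
export UPSTREAM_BASE_URL="http://localhost:8000/v1"
```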
### LM Studio
LM Studio provides a GUI for running models locally with an OpenAI-compatible server built-in.
1. Download and install LM Studio from their website
2. Load your chosen model
3. Click “Start Server” to initiate the API server
4. Connect your Routstr proxy to `http://localhost:1234/v1` (see the check below)
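Before wiring up the proxy, you can sanity-check that the LM Studio server is reachable:

```bash
# List the models loaded in LM Studio via its OpenAI-compatible API
curl http://localhost:1234/v1/models
```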
## Commercial API Proxying
You can also use Routstr to proxy requests to commercial AI providers while adding Cashu payments:
### OpenAI
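A minimal sketch; the variable names are illustrative assumptions rather than confirmed Routstr settings:

```bash
# Proxy to OpenAI: use the official endpoint and your own API key.
export UPSTREAM_BASE_URL="https://api.openai.com/v1"
export UPSTREAM_API_KEY="sk-..."   # your OpenAI API key
```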
### Anthropic
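Anthropic’s native Messages API differs from the OpenAI format; Anthropic offers an OpenAI SDK compatibility layer on its API, or you can place a translation layer such as LiteLLM between the proxy and Anthropic. A hedged sketch, again with illustrative variable names:

```bash
# Proxy to Anthropic via its OpenAI-compatibility layer (or a translation proxy such as LiteLLM)
export UPSTREAM_BASE_URL="https://api.anthropic.com/v1"
export UPSTREAM_API_KEY="sk-ant-..."   # your Anthropic API key
```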
## Hardware Requirements
Hardware requirements depend on the model size you’re hosting:
| Model Size | RAM Required | GPU VRAM | Recommended Hardware |
|---|---|---|---|
| 7B | 16 GB | 8+ GB | NVIDIA RTX 3060 or greater |
| 13B | 32 GB | 16+ GB | NVIDIA RTX 3080/4080 or greater |
| 70B | 64+ GB | 40+ GB | NVIDIA A100 (80 GB) or dual GPUs |
## Optimizing Model Performance
To improve performance when hosting models:
- Quantization: Use 4-bit or 8-bit quantized models to reduce memory requirements (see the example after this list)
- GPU Offloading: Configure partial GPU offloading if you have limited VRAM
- Batch Processing: If you expect high throughput, enable batch processing
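For example, quantization and GPU memory limits can be applied directly when launching the model server; the model tags and flags below are illustrative and vary by model and version:

```bash
# Ollama: pull a 4-bit quantized variant of a model (quantization is encoded in the tag)
ollama pull llama3:8b-instruct-q4_0

# vLLM: serve an AWQ-quantized model and cap the fraction of GPU memory used
vllm serve TheBloke/Llama-2-13B-AWQ --quantization awq --gpu-memory-utilization 0.90
```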
## Monitoring and Management
Monitor your model server’s performance using tools like:
- System monitoring: Check GPU memory, CPU usage, and RAM utilization (see the commands after this list)
- Request metrics: Track request latency, tokens per second, and concurrent requests
- Error rates: Monitor for failed inferences or timeouts
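For example, on an NVIDIA GPU host:

```bash
# Watch GPU memory and utilization
watch -n 2 nvidia-smi

# vLLM exposes Prometheus metrics (latency, throughput, running requests) at /metrics
curl -s http://localhost:8000/metrics | head
```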
## Connecting to Routstr
Once your model server is running, connect it to Routstr by pointing the proxy at your model endpoint:
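The following is a hedged sketch of running the proxy in Docker against a local Ollama server. The image name and environment variable names are assumptions for illustration; check the Routstr proxy README for the exact values.

```bash
# Illustrative only: image and variable names may differ in the actual Routstr proxy.
# On Linux, you may also need --add-host=host.docker.internal:host-gateway.
docker run -p 8000:8000 \
  -e UPSTREAM_BASE_URL="http://host.docker.internal:11434/v1" \
  -e UPSTREAM_API_KEY="ollama" \
  ghcr.io/routstr/proxy:latest
```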
## Security Considerations
When exposing your model endpoint:
- Network security: Only expose necessary ports (see the sketch after this list)
- API keys: Use proper authentication if your model server supports it
- Rate limiting: Implement appropriate rate limits to prevent abuse
- TLS encryption: Set up TLS for secure communications
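For example, keep the model server bound to localhost and expose only the TLS-terminated public port; a ufw sketch, to be adapted to your firewall and setup:

```bash
# Allow only the public, TLS-terminated port; keep model-server ports closed
sudo ufw allow 443/tcp
sudo ufw deny 11434/tcp   # Ollama
sudo ufw deny 8000/tcp    # vLLM or the proxy's internal port, if fronted by a reverse proxy
sudo ufw enable
```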