Taking Ollama APIs to Production: Performance and Scaling
By Marek Zebrowski (@zebrowskidev)
Summary
This follow-up post builds on the previous article about creating a FastAPI wrapper for Ollama models. It explores what's needed to move from a dev-friendly API to a more production-grade service, focusing on rate limiting, request validation, and load balancing.
APIs are a symbiotic relationship between consumers -- external engineers or a ProServ team -- and the product engineering team. Having been on both sides, I've found performance to be a key issue and low-hanging fruit. On the product engineering side, you have to protect your product and lock it down so it is used as designed, while baking in room to scale. Skipping a few small changes can lead to deadlocks, larger AWS or Azure bills, and SecOps people bugging you about availability being a security issue when your service combusts. Under the performance umbrella, I will walk through rate limiting, request validation & schema hygiene, and running multiple workers with Gunicorn.
Rate limiting 🛡️
Rate limiting doubles as a fix for both performance and security issues. It is the intentional limiting of the number of requests to a resource based on volume, load, or time, which prevents one user from hogging a resource, whether unintentionally or maliciously. For this example we will use slowapi to limit how often users can hit your endpoints.
Installing slowapi
pip install slowapi
Adding rate limits to endpoints requires a little one-time setup plus a decorator per route:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Key requests by client IP address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Without this handler, exceeding the limit surfaces as a 500 instead of a 429
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/ollama_generate/{prompt}")
@limiter.limit("10/minute")
async def limited_generate(request: Request, prompt: str):  # slowapi requires the Request argument
    return get_data()  # placeholder for the generation logic from the original post
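To sanity-check the limiter, hammer the endpoint and watch the status codes flip from 200 to 429 once the limit kicks in (a minimal sketch; it assumes the API is running locally on port 8000 and the requests package is installed):
import requests

# The first 10 calls in the window should return 200, the rest 429 (Too Many Requests)
for i in range(12):
    r = requests.get("http://127.0.0.1:8000/ollama_generate/hello")
    print(i + 1, r.status_code)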
Request Validation & Schema Hygiene ✅
Occasionally, malformed data can be sent to your endpoints, eating away at resources or, worse, causing instability. In the wild this can happen due to network issues, encoding problems on the client side, or attackers probing for weaknesses such as buffer overflows, to name a few. It can be easily mitigated with pydantic validators, building on the models from the original blog post.
Since pydantic was already installed, all that needs to be done is to append Field and field_validator to the existing import statement.
...
from pydantic import BaseModel, Field, field_validator
from typing import Optional, Dict, Any, List
...
class DataInBody(BaseModel):
    prompt: str = Field(..., min_length=1, description="The prompt for text generation")
    format: Optional[str] = Field(None, description="The format of the response (e.g., 'json')")
    options: Optional[Dict[str, Any]] = Field(None, description="Additional options for the Ollama request")

    @field_validator("format")
    @classmethod
    def validate_format(cls, value):
        if value is not None and value not in ["json", "text"]:
            raise ValueError("Format must be either 'json' or 'text'")
        return value
...
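For context, here is roughly how the model plugs into the POST endpoint from the original post (a sketch; call_ollama stands in for whatever generation logic your handler already has):
@app.post("/ollama_generate/")
async def ollama_generate(data: DataInBody):
    # FastAPI runs the pydantic validation before this body executes,
    # so malformed payloads are rejected with a 422 and never reach Ollama
    return {"response": call_ollama(data.prompt, data.format, data.options)}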
Passing empty strings may not be what we expected; we want the user to supply valid values. Consider the following request:
curl -X 'POST' \
'http://127.0.0.1:8000/ollama_generate/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "What is Json?",
"format": "",
"options": ""
}'
Before changes
Response code : 200 OK
{
"response": "<think>\n\n</think>\n\nA JSON object (or plain object in JavaScript) is a data structure that maps key-value pairs to represent information. It's often used as input or output for services due to its simplicity and portability.\n\n### Key Characteristics of JSON\n1. **Data Structure**: A JSON object consists of key-value pairs enclosed in curly braces `{}`.\n2. **Values**: Values can be any type, including strings, numbers (integers or floats), arrays, objects, null, undefined, booleans, symbols, dates, times, and nested structures.\n3. **Types**: Some types are optional; for example, a key might not have a value.\n\n### Example\nHere's an example of a JSON object:\n```json\n{\n \"name\": \"John\",\n \"age\": 25,\n \"email\": \"john@example.com\"\n}\n```\n\n### Use Cases\n- **Data Exchange**: JSON is widely used to exchange data between systems that don't support each other, like APIs and web servers.\n- **Databases**: It's often used in relational databases for query processing, as it can be parsed and re-serialized into objects.\n- **Authentication**: JSON is commonly used for authentication (e.g., OAuth flow).\n- **Education**: In many subjects like computer science and math, JSON is used to represent data structures.\n\n### Benefits\n- **Portability**: It's a universal format that works across any programming language or web server.\n- **Data Structure**: It serves as both an input and output model for services.\n- **Ease of Use**: Parsing and processing are straightforward."
}
After changes
Response code : 422 Validation Error
{
"detail": [
{
"type": "value_error",
"loc": [
"body",
"format"
],
"msg": "Value error, Format must be either 'json' or 'text'",
"input": "",
"ctx": {
"error": {}
}
},
{
"type": "dict_type",
"loc": [
"body",
"options"
],
"msg": "Input should be a valid dictionary",
"input": ""
}
]
}
TIP
Read more about pydantic validators here; there are a lot of different configurations depending on your needs.
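For instance, a validator can normalize input instead of rejecting it by running in "before" mode (a small self-contained sketch, not from the original post; the FormatOnly model is illustrative):
from pydantic import BaseModel, field_validator

class FormatOnly(BaseModel):
    format: str

    @field_validator("format", mode="before")
    @classmethod
    def normalize_format(cls, value):
        # Lowercase "JSON" or "Text" before any strict checks run
        if isinstance(value, str):
            return value.lower()
        return value

print(FormatOnly(format="JSON").format)  # -> "json"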
Load balancing & Server Config 🌐🚀
There are many ways to configure load balancing and workers, and this part will be the most challenging, both in how much is going on and in how much tooling is required to configure everything. The technologies and tools below are my preference; others will have other opinions, and a simple Google search will yield a ton of options that may better suit your project if these are not cutting it.
WARNING
Gunicorn does not natively support Windows or any other non-Unix operating system.
Using Uvicorn with Gunicorn, you can fully utilize the resources of your server by spinning up multiple instances (workers) of your API, and you can distribute the load across multiple hosts. Uvicorn -- the default server used to host FastAPI -- is an ASGI (Asynchronous Server Gateway Interface) server; ASGI is asynchronous by nature and offers better performance for I/O-bound work like LLM calls. Gunicorn is a WSGI (Web Server Gateway Interface) server that doubles as a mature process manager. Each instance of our API will be a Uvicorn worker, with Gunicorn supervising the pool of workers on each host and a reverse proxy spreading the load across hosts.
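If you prefer a config file over CLI flags, Gunicorn automatically picks up a gunicorn.conf.py from the directory it starts in (a minimal sketch mirroring the command used later in this post; the worker-count formula is Gunicorn's commonly cited starting heuristic, not a hard rule):
# gunicorn.conf.py
import multiprocessing

bind = "127.0.0.1:8000"
worker_class = "uvicorn.workers.UvicornWorker"
# (2 x cores) + 1 is a common starting point; tune with load testing
workers = multiprocessing.cpu_count() * 2 + 1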
TIP
It is HIGHLY recommended to use a reverse proxy when deploying with Gunicorn.
The 'Simplified' Setup Process
In the scenario below, assume we have four Linux VMs (Virtual Machines) that will host our API: three VMs as workers and one VM for load balancing.
- Using a CI/CD pipeline or terminal commands, deploy the same code to each host (make sure to install Python and your packages beforehand; I find it easiest to configure one VM fully and then clone it).
- Configure auto-restart for your service using any of the dozens of tools in the Linux ecosystem; I prefer systemd (a sample unit file is shown further down).
- Start the worker processes on each worker VM:
gunicorn main:app --workers <number_of_workers> --worker-class uvicorn.workers.UvicornWorker --bind 127.0.0.1:8000
- Set up a reverse proxy on each worker VM in front of its local API instances; Nginx works well here.
- Set up a reverse proxy on the fourth VM -- the load balancing VM -- to feed incoming requests to each of the three worker VMs; again, Nginx works well here.
Worker VM Reverse Proxy Config (Nginx)
server {
listen 80;
server_name <vm_public_ip_or_internal_ip>; # Or your domain name
location / {
# Proxy to the local Gunicorn instance, if you changed the port in the gunicorn command change it here too
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
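After dropping this server block into your Nginx config (the exact path, e.g. /etc/nginx/conf.d/, varies by distro), validate the syntax and reload:
sudo nginx -t
sudo systemctl reload nginx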
Load Balancing VM Reverse Proxy Config (Nginx)
upstream my_fastapi_backends {
server vm1_private_ip:80;
server vm2_private_ip:80;
server vm3_private_ip:80;
}
server {
listen 80;
server_name your_domain.com;
location / {
proxy_pass http://my_fastapi_backends;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
# ... any other configs; remember to set up your SSL termination here ...
}
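Nginx defaults to round-robin across the upstream servers. Since LLM generations vary widely in cost, the least_conn directive can balance uneven requests better (a small variant of the block above, not from the original post):
upstream my_fastapi_backends {
    least_conn;  # send each new request to the backend with the fewest active connections
    server vm1_private_ip:80;
    server vm2_private_ip:80;
    server vm3_private_ip:80;
}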
Systemd Config
[Unit]
Description=Ollama Fast API
After=network.target
[Service]
User=<svc_acct>
WorkingDirectory=/opt/my_fastapi_app
ExecStart=/opt/my_fastapi_app/venv/bin/gunicorn main:app --workers <number_of_workers> --worker-class uvicorn.workers.UvicornWorker --bind 127.0.0.1:8000
Restart=on-failure
[Install]
WantedBy=multi-user.target
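With the unit saved as, say, /etc/systemd/system/ollama_api.service (the filename is illustrative), enabling it is the usual systemd routine:
sudo systemctl daemon-reload
sudo systemctl enable --now ollama_api.service
sudo systemctl status ollama_api.service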
Number of Instances
In our 3 worker VM scenario with 3 workers on each VM, this gives us 9 effective instances of our API.
- VM 1 : 3 workers : 3 API Instances
- VM 2 : 3 workers : 3 API Instances
- VM 3 : 3 workers : 3 API Instances
Running many instances can shift the bottleneck downstream to the database, or to Ollama itself if you only have one instance. Ollama can be the easier fix: using Modelfiles as described in my book, you can run the same model file on multiple Ollama instances, each feeding different workers. Once you are running multiple Ollama instances behind multiple workers, it might be easier to switch to DNS-based load balancing, but that is beyond the scope of this post.
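One way to wire that up (a sketch; it assumes the ollama Python client and a hypothetical OLLAMA_BACKEND environment variable set differently on each worker VM):
import os
from ollama import Client

# Each worker VM points its API instances at its own local Ollama server,
# so the three hosts don't all queue on a single model backend
client = Client(host=os.getenv("OLLAMA_BACKEND", "http://127.0.0.1:11434"))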
Still want more?
- Want to know more about FastAPI? Try FastAPI: Modern Python Web Development by Bill Lubanovic
- Want to know more about DeepSeek or a more thorough Ollama guide? Try Diving Into DeepSeek: Running DeepSeek Locally using Ollama by me
- Want to know more about Nginx and load balancing? Try NGINX Cookbook: Advanced Recipes for High-Performance Load Balancing by Derek DeJonghe
- Need a more holistic book on server configuration? Try Practical Internet Server Configuration: Learn to Build a Fully Functional and Well-Secured Enterprise Class Internet Server by Robert La Lau