# distributed-llama.cpp

Runs LLM inference across multiple machines. A central server accepts OpenAI-compatible HTTP requests and routes them to client nodes, each running llama.cpp in Docker. All internal gRPC communication is secured with mutual TLS using a single shared certificate.


## How it works

### Server

The server does two things: it manages clients over gRPC, and it exposes an OpenAI-compatible HTTP API.

When a client connects, the server dials back to the client's agent port (the dial-back pattern), checks whether the client already has the correct model cached (matching by filename and file size), and if not streams the model file to it in 256 KB chunks. It then tells the client to start its llama.cpp Docker container. The gRPC connection stays open for the lifetime of the client session: the server uses it to send commands, and the client uses it to send heartbeats.

When an inference request arrives, the server polls all connected clients for their current capacity with a 1-second timeout, sorts by most available slots and lowest average latency, and forwards the request to the best one. If no client reports available slots it still routes to one, chosen at random. Requests time out after 5 minutes.

The server exposes the following HTTP endpoints:

| Endpoint                    | Description                           |
|-----------------------------|---------------------------------------|
| `POST /v1/chat/completions` | OpenAI-compatible chat completions    |
| `POST /v1/completions`      | OpenAI-compatible text completions    |
| `POST /v1/embeddings`       | OpenAI-compatible embeddings          |
| `GET /health`               | Returns `{"status":"ok","clients":N}` |

### Client

The client connects to the server, registers itself, and waits for commands. It runs a local gRPC server (the agent) on a separate port so the server can dial back for capacity queries and inference forwarding.

After receiving the model (or confirming it's already cached), the client pulls the appropriate llama.cpp Docker image and starts a container with --parallel MAX_SLOTS, mounting the model file read-only. It polls llama.cpp's /health endpoint every 2 seconds until the model is fully loaded, then sends a single hardcoded warmup prompt. Once the warmup completes the client marks itself ready and begins reporting capacity to the server.

Capacity is queried directly from llama.cpp's /slots endpoint, which returns one object per slot with a state field. The client counts idle slots (state = 0) and returns that number to the server for routing decisions.

The client sends a heartbeat to the server every 10 seconds. After 10 consecutive failures it stops the Docker container and triggers a reconnect. If the server shuts down cleanly (sends StopContainer then closes the stream), the client exits instead of reconnecting.

GPU mode is determined entirely on the client side via nvidia-smi. It can be overridden with the GPU_MODE env var. The server has no involvement in GPU selection.

### Security

All gRPC communication uses mutual TLS. Both server and clients share a single .pem file containing a self-signed certificate and its private key; the certificate acts as its own CA, so each side verifies the other's certificate against it. The server's dial-back connections to client agent addresses skip hostname verification (client IPs are not in the certificate's SANs) but still validate the certificate chain.


## Setup

### 1. Install tools

Install protoc, then:

```sh
make install-tools
```

### 2. Generate gRPC code

```sh
make proto
```

This generates Go code from src/proto/inference.proto into generated/inference/.

### 3. Generate the shared TLS certificate

```sh
make gen-cert SERVER_IP=192.168.1.100
```

Replace 192.168.1.100 with the LAN IP of the machine running the server. This creates shared.pem. Copy it to every client machine alongside the client binary.

For local testing only:

```sh
make gen-cert SERVER_IP=127.0.0.1
```

### 4. Configure

Copy the example env files and fill them in:

```sh
cp examples/env/.env.server bin/server/.env
cp examples/env/.env.client bin/client/.env
```

### 5. Build

```sh
make build
```

This produces bin/server/server(.exe) and bin/client/client(.exe).


## Configuration

### Server

| Variable     | Default        | Description                                          |
|--------------|----------------|------------------------------------------------------|
| `MODEL_PATH` | -              | Path to the .gguf model file to serve to clients     |
| `PEM_PATH`   | `./shared.pem` | Path to the shared mTLS certificate                  |
| `GRPC_PORT`  | `50051`        | Port for the gRPC coordinator (clients connect here) |
| `HTTP_PORT`  | `8181`         | Port for the OpenAI-compatible HTTP API              |

### Client

| Variable            | Default        | Description                                                                      |
|---------------------|----------------|----------------------------------------------------------------------------------|
| `SERVER_ADDR`       | -              | Server address in host:port format, e.g. 192.168.1.100:50051                     |
| `PEM_PATH`          | `./shared.pem` | Path to the shared mTLS certificate                                              |
| `MODEL_STORAGE_DIR` | `./models`     | Directory where downloaded model files are stored (created automatically)        |
| `GPU_MODE`          | `auto`         | `auto` detects an NVIDIA GPU via nvidia-smi; `gpu` forces GPU; `cpu` forces CPU  |
| `AGENT_PORT`        | `50052`        | Port the client's gRPC agent listens on (server dials back here)                 |
| `LLAMA_PORT`        | `18080`        | Host port mapped to the llama.cpp Docker container                               |
| `MAX_SLOTS`         | `16`           | Maximum number of concurrent inference requests this client will accept          |
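
Putting the client variables together, a plausible bin/client/.env for a LAN deployment might look like this (all values illustrative; only SERVER_ADDR has no default and must be set):

```
SERVER_ADDR=192.168.1.100:50051
PEM_PATH=./shared.pem
MODEL_STORAGE_DIR=./models
GPU_MODE=auto
AGENT_PORT=50052
LLAMA_PORT=18080
MAX_SLOTS=16
```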

## Running

Each binary reads .env from its working directory.

```sh
cd bin/server && ./server
cd bin/client && ./client
```

Place shared.pem at the path specified by PEM_PATH. On client machines, the model is downloaded automatically from the server on first connect.


## Testing

The e2e test sends a chat completion request to a running server and verifies that the response contains at least one character.

The server (and at least one client) must already be running before the test is executed.

```sh
go test ./tst/e2e/ -v -timeout 5m -count=1
```

By default the test hits http://localhost:8181. Override with SERVER_ADDR:

```sh
SERVER_ADDR=http://192.168.1.100:8181 go test ./tst/e2e/ -v -timeout 5m -count=1
```

The -timeout 5m flag is recommended since inference can take time depending on hardware and model size.


## Deployment layout (multi-machine)

```
Server machine
├── bin/server/server
├── bin/server/.env        (MODEL_PATH points to your .gguf file)
├── bin/server/shared.pem
└── bin/server/models/
    └── your-model.gguf

Client machine(s)
├── bin/client/client
├── bin/client/.env        (SERVER_ADDR points to server machine)
└── bin/client/shared.pem  (same file as on server)
```

The model is transferred automatically from server to client on first connect. Subsequent restarts skip the transfer if the local file matches the server's model by both filename and file size.


## License

MIT - see LICENSE.
