- Self-Hosted Qwen Coding Cluster: Run AI Coding Agents Across MacBook and Intel Laptops
- Build Your Own Local AI Coding Cluster with Qwen, Ollama, LiteLLM, and Qwen Code
- How to Use Multiple Laptops as a Self-Hosted AI Coding Cluster
- Qwen Coding Cluster Tutorial: Pool MacBook and Intel Laptops for Local AI Development
- From Spare Laptops to AI Coding Farm: Self-Hosting Qwen for Software Development
- Complete Guide to Building a Local Qwen AI Coding Agent Cluster
- Run Qwen Locally: Multi-Laptop AI Coding Cluster Setup for Developers
- Self-Hosted AI Coding Farm with Qwen: MacBook M4 + Intel Laptop Setup
- How to Create a Private AI Coding Cluster Using Qwen and Ollama
- Build a Local Coding Agent Cluster with Qwen: Step-by-Step Developer Guide
3–4 laptops ≠ one single pooled Qwen brain
3–4 laptops = multiple local Qwen model servers + one router + multiple coding agents
Below is a complete practical tutorial.
Self-Hosted Qwen Coding Cluster Using 3–4 Laptops
0. What we are building
You will have:
Controller Laptop
- LiteLLM router
- Qwen Code CLI
- Git worktrees
- task runner scripts
Worker Laptop 1
- Ollama
- Qwen model
Worker Laptop 2
- Ollama
- Qwen model
Worker Laptop 3
- Ollama
- Qwen model
Final flow:
Qwen Code / Aider / VS Code
↓
LiteLLM Router :4000
↓
┌───────────────────────┬───────────────────────┬───────────────────────┐
│ m4-1.local:11434      │ m4-2.local:11434      │ intel-1.local:11434   │
│ qwen3-coder:30b       │ qwen3-coder:30b       │ qwen2.5-coder:14b/32b │
└───────────────────────┴───────────────────────┴───────────────────────┘
This gives you a local AI coding cluster, not a fake RAM/CPU pooling trick.
1. Recommended model selection
For your hardware, do not start with Qwen 480B. Ollama lists qwen3-coder:480b as a local model requiring at least 250GB memory/unified memory, so your 24GB MacBooks are not the right machines for that. The practical target is qwen3-coder:30b or qwen2.5-coder:14b. Ollama lists qwen3-coder:30b as around 19GB with a 256K context window, while qwen2.5-coder:14b is around 9GB with 32K context. (ollama.com)
Use this:
| Machine | First model to try | Safer fallback |
|---|---|---|
| M4 MacBook Pro 24GB #1 | qwen3-coder:30b | qwen2.5-coder:14b |
| M4 MacBook Pro 24GB #2 | qwen3-coder:30b | qwen2.5-coder:14b |
| Intel i7 64GB | qwen2.5-coder:32b only if patient | qwen2.5-coder:14b |
| Extra laptop | same as above | qwen2.5-coder:7b |
Important: a 24GB Mac can load a 19GB model, but the context/KV cache also needs memory. Ollama says larger context increases memory usage, and parallel requests multiply context memory usage. For coding agents on 24GB machines, start with 8K or 16K context, not 64K/256K. (docs.ollama.com)
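To get a feel for why context length matters so much, here is a rough back-of-the-envelope sketch in Python. It uses the standard KV-cache estimate (keys plus values, per layer, per KV head, per token). The layer count, KV head count, and head dimension below are hypothetical illustration values, not the published figures for any Qwen model, so treat the output as an order-of-magnitude guide only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical architecture numbers, for illustration only."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16 cache entries


def kv_cache_bytes(shape: ModelShape, context_tokens: int, parallel: int = 1) -> int:
    # Keys and values are both cached, hence the leading factor of 2.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    # Each parallel request slot needs its own cache.
    return per_token * context_tokens * parallel


def gib(n_bytes: int) -> float:
    return n_bytes / 2**30


if __name__ == "__main__":
    shape = ModelShape(n_layers=48, n_kv_heads=8, head_dim=128)  # assumed values
    for ctx in (8192, 16384, 65536, 262144):
        print(f"{ctx:>7} tokens: {gib(kv_cache_bytes(shape, ctx)):.1f} GiB")
```

With these assumed numbers, 8K context costs about 1.5 GiB on top of the model weights, while 64K costs about 12 GiB, which is why a 19GB model on a 24GB machine leaves little room for long contexts or OLLAMA_NUM_PARALLEL above 1.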
2. Name your laptops
Use simple hostnames.
Example:
m4-1.local
m4-2.local
intel-1.local
controller.local
On macOS
Run this on each Mac, changing the name:
sudo scutil --set HostName m4-1
sudo scutil --set LocalHostName m4-1
sudo scutil --set ComputerName m4-1
dscacheutil -flushcache
hostname
For the second Mac:
sudo scutil --set HostName m4-2
sudo scutil --set LocalHostName m4-2
sudo scutil --set ComputerName m4-2
dscacheutil -flushcache
hostname
On Ubuntu/Linux Intel laptop
sudo hostnamectl set-hostname intel-1
sudo apt update
sudo apt install -y avahi-daemon
sudo systemctl enable --now avahi-daemon
hostname
Now test from your controller:
ping m4-1.local
ping m4-2.local
ping intel-1.local
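If you prefer a scripted check over three separate ping commands, the sketch below resolves each worker hostname and reports the ones that fail. It only checks name resolution, not reachability, and the `resolve_all` helper and its injectable `resolver` parameter are names made up for this example.

```python
import socket
from typing import Callable, Dict, Iterable, Optional

WORKER_HOSTS = ["m4-1.local", "m4-2.local", "intel-1.local"]


def resolve_all(hosts: Iterable[str],
                resolver: Callable[[str], str] = socket.gethostbyname
                ) -> Dict[str, Optional[str]]:
    """Map each hostname to its IPv4 address, or None when the lookup fails."""
    results: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[host] = None
    return results


def unresolved(results: Dict[str, Optional[str]]) -> list:
    """Hostnames that did not resolve, in their original order."""
    return [host for host, address in results.items() if address is None]
```

Calling `unresolved(resolve_all(WORKER_HOSTS))` from the controller should return an empty list once every laptop is named and on the same network.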
3. Install Ollama on every worker laptop
Ollama supports Apple M-series Macs with CPU/GPU support, while x86 Macs are CPU-only according to its macOS requirements. On Linux, Ollama can run as a systemd service, and NVIDIA/AMD GPU setup is optional depending on hardware. (docs.ollama.com)
macOS workers
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Verify:
ollama --version
ollama list
Ubuntu/Linux worker
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl status ollama
ollama --version
4. Configure Ollama for LAN cluster mode
By default, Ollama binds to 127.0.0.1:11434. To expose it to your LAN, set OLLAMA_HOST=0.0.0.0:11434. Ollama’s docs specifically say to use launchctl on macOS and systemd environment overrides on Linux. (docs.ollama.com)
On macOS workers
Run this on each Mac:
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"
launchctl setenv OLLAMA_KEEP_ALIVE "30m"
launchctl setenv OLLAMA_NO_CLOUD "1"
Restart Ollama:
osascript -e 'quit app "Ollama"' || true
open -a Ollama
Check:
curl http://localhost:11434/api/tags
On Linux worker
Create systemd override:
sudo systemctl edit ollama
Paste:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NO_CLOUD=1"
Restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama
Allow LAN access on Linux firewall:
sudo ufw allow from 192.168.0.0/16 to any port 11434 proto tcp
sudo ufw reload
Do not expose port 11434 directly to the public internet. Keep it LAN-only or VPN-only.
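To sanity-check which client addresses the ufw rule above would admit, a small sketch with Python's standard `ipaddress` module is enough. The function name is hypothetical, and the network constant simply mirrors the 192.168.0.0/16 range used in the firewall command. Adjust it if your LAN uses a different private range.

```python
import ipaddress

# Same range as the ufw rule above; change this if your LAN differs.
ALLOWED_NETWORK = ipaddress.ip_network("192.168.0.0/16")


def is_allowed_client(address: str) -> bool:
    """Return True when the address falls inside the allowed LAN range."""
    return ipaddress.ip_address(address) in ALLOWED_NETWORK
```

For example, `is_allowed_client("192.168.1.20")` is true, while an address on a 10.x network or a public address is rejected.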
5. Pull Qwen models on each laptop
On M4 MacBook workers
Try the stronger model first:
ollama pull qwen3-coder:30b
Also pull a safer fallback:
ollama pull qwen2.5-coder:14b
Test:
ollama run qwen3-coder:30b "Reply only: READY"
Check memory/offload:
ollama ps
If the Mac becomes slow, hot, or memory pressure goes red, use:
ollama stop qwen3-coder:30b
ollama run qwen2.5-coder:14b "Reply only: READY"
On Intel i7 64GB worker
Without an NVIDIA GPU, the Intel laptop will likely be much slower. Start with 14B:
ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b "Reply only: READY"
Then test 32B only as a deep/slow worker:
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b "Reply only: READY"
6. Test each worker from controller
From your controller laptop:
curl http://m4-1.local:11434/api/tags
curl http://m4-2.local:11434/api/tags
curl http://intel-1.local:11434/api/tags
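The same check can be scripted so it also confirms that each worker has the models you expect. This is a minimal sketch that assumes the /api/tags response is a JSON object with a `models` list whose entries carry a `name` field. The helper names are invented for this example.

```python
import json
import urllib.request
from typing import List


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from an /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def missing_models(tags_payload: dict, required: List[str]) -> List[str]:
    """Required models that the worker has not pulled yet."""
    available = set(model_names(tags_payload))
    return [name for name in required if name not in available]


def fetch_tags(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/api/tags and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)
```

A typical call would be `missing_models(fetch_tags("http://m4-1.local:11434"), ["qwen3-coder:30b", "qwen2.5-coder:14b"])`, which should return an empty list on a fully prepared M4 worker.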
Test the OpenAI-compatible chat endpoint. Ollama supports OpenAI-style /v1/chat/completions, and the API key is required by clients but ignored by Ollama. (docs.ollama.com)
curl http://m4-1.local:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder:14b",
"messages": [
{"role": "user", "content": "Write a Python hello world in one line."}
]
}'
For Qwen3-Coder:
curl http://m4-1.local:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-coder:30b",
"messages": [
{"role": "user", "content": "Explain what a Kubernetes Deployment does in 3 lines."}
]
}'
7. Install LiteLLM router on controller
LiteLLM is useful here because it gives you one OpenAI-compatible gateway and can load-balance multiple deployments under the same model group. Its docs describe LiteLLM Proxy as an LLM gateway with OpenAI-style Chat Completions and load balancing across multiple model deployments. (docs.litellm.ai)
On controller:
mkdir -p ~/qwen-cluster
cd ~/qwen-cluster
Install uv if you do not have it:
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.zshrc 2>/dev/null || source ~/.bashrc 2>/dev/null || true
Install LiteLLM Proxy:
uv tool install 'litellm[proxy]'
Verify:
litellm --version
8. Create LiteLLM cluster config
Create config:
cd ~/qwen-cluster
nano litellm-config.yaml
Paste this first version:
model_list:
  # Fast cluster: two M4 laptops running same model
  - model_name: qwen-coder-fast
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://m4-1.local:11434
      rpm: 4
  - model_name: qwen-coder-fast
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://m4-2.local:11434
      rpm: 4
  # Stronger but tighter M4 model
  - model_name: qwen-coder-30b
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://m4-1.local:11434
      rpm: 2
  - model_name: qwen-coder-30b
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://m4-2.local:11434
      rpm: 2
  # Intel deep/slow worker
  - model_name: qwen-coder-intel
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://intel-1.local:11434
      rpm: 1
router_settings:
  routing_strategy: simple-shuffle
  num_retries: 1
  timeout: 600
Why separate model groups?
qwen-coder-fast = stable daily coding
qwen-coder-30b = better coding, heavier memory
qwen-coder-intel = slow fallback / review / tests
Do not mix 14b, 30b, and 32b under the same alias at first. It makes performance unpredictable.
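One way to enforce that rule as the config grows is a small lint step. The sketch below takes the `model_list` as plain Python dictionaries (the same shape as the YAML entries) and reports any alias backed by more than one underlying model. The function name is hypothetical, and loading the YAML file itself is left out because it needs a third-party parser.

```python
from collections import defaultdict
from typing import Dict, List, Set


def mixed_aliases(model_list: List[dict]) -> Dict[str, Set[str]]:
    """Return aliases that map to more than one underlying model."""
    backing: Dict[str, Set[str]] = defaultdict(set)
    for entry in model_list:
        backing[entry["model_name"]].add(entry["litellm_params"]["model"])
    # Keep only the aliases that mix different models.
    return {alias: models for alias, models in backing.items() if len(models) > 1}
```

An empty result means every alias is homogeneous, so requests to that alias should behave similarly no matter which laptop serves them.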
9. Start LiteLLM router
cd ~/qwen-cluster
export LITELLM_MASTER_KEY="sk-local-qwen-router-change-me"
litellm --config litellm-config.yaml --host 0.0.0.0 --port 4000
In another terminal, test:
curl http://localhost:4000/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-qwen-router-change-me" \
-d '{
"model": "qwen-coder-fast",
"messages": [
{"role": "user", "content": "Write a Bash command to list files larger than 100MB."}
]
}'
Test the 30B group:
curl http://localhost:4000/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-qwen-router-change-me" \
-d '{
"model": "qwen-coder-30b",
"messages": [
{"role": "user", "content": "Review this Terraform snippet conceptually: resource aws_s3_bucket test {}"}
]
}'
Test from another laptop:
curl http://controller.local:4000/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-qwen-router-change-me" \
-d '{
"model": "qwen-coder-fast",
"messages": [
{"role": "user", "content": "Reply only: CLUSTER_OK"}
]
}'
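The curl tests above translate directly into a few lines of Python, which is handy once you start scripting against the router. This is a minimal sketch using only the standard library. It assumes the router returns the usual OpenAI-style body with `choices[0].message.content`, and the function names are made up for this example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST to <base_url>/chat/completions with a bearer token."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(base_url: str, api_key: str, model: str, prompt: str,
        timeout: float = 600.0) -> str:
    """Send one prompt through the router and return the reply text."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

For example, `ask("http://controller.local:4000", "sk-local-qwen-router-change-me", "qwen-coder-fast", "Reply only: CLUSTER_OK")` mirrors the last curl command.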
10. Install Qwen Code on controller
Qwen Code is now a proper terminal AI coding agent. Its repo says it supports OpenAI-compatible providers and local models such as Ollama/vLLM, and it has interactive, headless, IDE, daemon, SDK, and agent-team style usage. (GitHub)
Install:
brew install qwen-code
Or via npm:
npm install -g @qwen-code/qwen-code@latest
Verify:
qwen --version
11. Configure Qwen Code to use your local cluster
Qwen Code supports ~/.qwen/settings.json with modelProviders, security.auth.selectedType, and model.name. Its docs also show local/self-hosted models via OpenAI-compatible API by setting baseUrl. (Qwen)
Create config:
mkdir -p ~/.qwen
nano ~/.qwen/settings.json
Paste:
{
  "modelProviders": {
    "openai": [
      {
        "id": "qwen-coder-fast",
        "name": "Local Qwen Coder Fast Cluster",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 600000,
          "maxRetries": 1,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      },
      {
        "id": "qwen-coder-30b",
        "name": "Local Qwen3 Coder 30B Cluster",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 900000,
          "maxRetries": 1,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      },
      {
        "id": "qwen-coder-intel",
        "name": "Local Qwen Intel Worker",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 1200000,
          "maxRetries": 0,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      }
    ]
  },
  "security": {
    "auth": {
      "selectedType": "openai"
    }
  },
  "model": {
    "name": "qwen-coder-fast"
  }
}
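Because a mismatch between these fields is an easy way to end up in the wrong auth flow, a quick consistency check can help before launching. The sketch below assumes that `model.name` is meant to equal the `id` of one provider entry, as in the example above. That reading, and the `settings_problems` helper itself, are assumptions made for this example, not part of Qwen Code.

```python
from typing import List, Mapping


def settings_problems(settings: dict, env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found in a Qwen Code settings dict."""
    problems: List[str] = []
    providers = settings.get("modelProviders", {}).get("openai", [])
    provider_ids = {provider.get("id") for provider in providers}

    # The default model should point at one of the configured entries.
    selected = settings.get("model", {}).get("name")
    if selected not in provider_ids:
        problems.append(f"model.name {selected!r} does not match any provider id")

    auth_type = settings.get("security", {}).get("auth", {}).get("selectedType")
    if auth_type != "openai":
        problems.append(f"security.auth.selectedType is {auth_type!r}, expected 'openai'")

    # Every envKey named in the config must be set in the environment.
    env_keys = sorted({p["envKey"] for p in providers if p.get("envKey")})
    for key in env_keys:
        if not env.get(key):
            problems.append(f"environment variable {key} is not set")
    return problems
```

You could run it as `settings_problems(json.load(open(os.path.expanduser("~/.qwen/settings.json"))), os.environ)` after importing `json` and `os`. An empty list means the three fields agree.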
Set key:
echo 'export LOCAL_QWEN_ROUTER_KEY="sk-local-qwen-router-change-me"' >> ~/.zshrc
source ~/.zshrc
Launch:
qwen
Inside Qwen Code:
/model
Select:
Local Qwen Coder Fast Cluster
Test prompt:
Analyze this repository and tell me the tech stack, entry points, and test commands.
12. Use it for real coding
Create a test repo:
mkdir -p ~/ai-lab
cd ~/ai-lab
git clone <YOUR_REPO_URL> app
cd app
Run Qwen Code:
qwen
Good first prompt:
Do not modify files yet. First inspect the repo and tell me:
1. tech stack
2. build command
3. test command
4. risky files
5. suggested first small task
Then:
Create a plan to add unit tests for the authentication module. Do not execute commands without asking.
Then:
Implement the first small test only. Keep changes minimal.
Review:
git diff
git status
Commit only after review:
git add .
git commit -m "test: add auth unit test"
13. Turn it into a real multi-agent coding cluster
This is where your 3–4 laptops become useful.
Use Git worktrees so each AI agent works separately.
cd ~/ai-lab/app
git worktree add ../app-agent-1 -b ai/agent-1-auth-tests
git worktree add ../app-agent-2 -b ai/agent-2-docs
git worktree add ../app-agent-3 -b ai/agent-3-refactor
Open 3 terminals.
Terminal 1
cd ~/ai-lab/app-agent-1
qwen -p "You are agent 1. Add focused unit tests for the authentication module. Keep changes minimal. Run tests if possible. Do not touch unrelated files."
Terminal 2
cd ~/ai-lab/app-agent-2
qwen -p "You are agent 2. Improve README developer setup instructions based on this repository. Do not change source code."
Terminal 3
cd ~/ai-lab/app-agent-3
qwen -p "You are agent 3. Find one small refactor opportunity in the API layer. Make a minimal safe change and explain the risk."
Qwen Code’s docs list headless mode as qwen -p "...", useful for scripts, CI/CD, and batch processing. (GitHub)
Now your agents are parallel:
Agent 1 → test branch
Agent 2 → docs branch
Agent 3 → refactor branch
Check results:
cd ~/ai-lab/app-agent-1
git diff
cd ~/ai-lab/app-agent-2
git diff
cd ~/ai-lab/app-agent-3
git diff
Merge manually after review.
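If opening three terminals gets tedious, the same fan-out can be scripted. The sketch below starts one headless `qwen -p` process per worktree and waits for all of them. The `launch_agents` helper and its `build` parameter are hypothetical names for this example, and the sketch assumes the worktree directories already exist under the base directory.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


def agent_command(task: str) -> List[str]:
    """Headless Qwen Code invocation, as used in the terminals above."""
    return ["qwen", "-p", task]


def launch_agents(base_dir: Path, tasks: Dict[str, str],
                  build: Callable[[str], List[str]] = agent_command) -> Dict[str, int]:
    """Run one process per worktree and return each exit code."""
    # Start every agent first so they run concurrently, then wait for all of them.
    running = {
        name: subprocess.Popen(build(task), cwd=base_dir / name)
        for name, task in tasks.items()
    }
    return {name: process.wait() for name, process in running.items()}
```

For example, `launch_agents(Path.home() / "ai-lab", {"app-agent-1": "You are agent 1. ...", "app-agent-2": "You are agent 2. ..."})` runs both agents at once. A non-zero exit code is a hint to inspect that worktree before looking at its diff.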
14. Optional: create a simple task runner
Create file:
cd ~/qwen-cluster
nano run-agent-task.sh
Paste:
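#!/usr/bin/env bash
# Run one headless Qwen Code task in its own Git worktree and branch.
set -euo pipefail
if [ "$#" -ne 3 ]; then
  echo "Usage: $0 <repo-path> <branch> <task>" >&2
  exit 1
fi
REPO="$1"
BRANCH="$2"
TASK="$3"
BASE_DIR="$HOME/ai-lab"
# Flatten slashes so ai/foo becomes ai-foo and each worktree is a single directory.
WORKTREE="$BASE_DIR/${BRANCH//\//-}"
mkdir -p "$BASE_DIR"
cd "$REPO"
git fetch --all --prune
# Reuse an existing worktree; otherwise create it on a new branch,
# falling back to checking out the branch if it already exists.
if [ ! -d "$WORKTREE" ]; then
  git worktree add "$WORKTREE" -b "$BRANCH" || git worktree add "$WORKTREE" "$BRANCH"
fi
cd "$WORKTREE"
qwen -p "$TASK"
echo
echo "===== STATUS ====="
git status --short
echo
echo "===== DIFF ====="
git diff --stat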
Make executable:
chmod +x run-agent-task.sh
Run:
~/qwen-cluster/run-agent-task.sh \
~/ai-lab/app \
ai/add-healthcheck-tests \
"Add tests for the healthcheck endpoint. Keep the change minimal. Run the relevant test command if you can infer it."
Run another:
~/qwen-cluster/run-agent-task.sh \
~/ai-lab/app \
ai/update-deployment-docs \
"Update deployment documentation. Do not change application source code."
This is your basic AI coding farm.
15. Optional: use Aider with the same cluster
Aider can connect to any OpenAI-compatible endpoint, including local endpoints. Its docs show OPENAI_API_BASE and OPENAI_API_KEY for OpenAI-compatible APIs. (aider.chat)
Install:
python3 -m pip install aider-install
aider-install
Configure:
export OPENAI_API_BASE="http://localhost:4000"
export OPENAI_API_KEY="sk-local-qwen-router-change-me"
Run:
cd ~/ai-lab/app
aider --model openai/qwen-coder-fast
Or try 30B:
aider --model openai/qwen-coder-30b
16. Performance tuning
For 24GB M4 MacBooks
Start safe:
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"
For qwen2.5-coder:14b, you can try:
launchctl setenv OLLAMA_CONTEXT_LENGTH "16384"
For qwen3-coder:30b, stay conservative:
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
Then restart Ollama:
osascript -e 'quit app "Ollama"' || true
open -a Ollama
Check:
ollama ps
For Linux Intel
If CPU-only, do not expect magic. Use it for:
slow review
small model
test runner
Docker
database
Git worktree storage
Use:
sudo systemctl edit ollama
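Keep these lines in the override, alongside the OLLAMA_HOST line from step 4, so the worker stays reachable over the LAN: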
[Service]
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
17. Security checklist
This part matters. Local coding agents can read files, run shell commands, and modify repos.
Do this:
Use separate repo clones
Use Git worktrees
Use branch per task
Do not mount production secrets
Do not expose Ollama/LiteLLM to internet
Use LAN/VPN only
Use LiteLLM master key
Review git diff before commit
Never let agent push directly to main
Never give AWS/GCP production credentials to agent shell
Also enable local-only Ollama mode:
launchctl setenv OLLAMA_NO_CLOUD "1"
or on Linux:
Environment="OLLAMA_NO_CLOUD=1"
Ollama says local prompts/data are not sent to Ollama when running locally, and cloud features can be disabled with OLLAMA_NO_CLOUD=1 or server config. (docs.ollama.com)
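One practical way to follow the "no production credentials in the agent shell" rule is to check the environment before launching an agent. The sketch below flags variables by name prefix. The prefix list is an assumption chosen for this example and is not exhaustive, so extend it for whatever secrets your own tooling uses.

```python
from typing import Dict, List, Mapping

# Assumed prefixes for cloud credentials; extend this for your own tooling.
SENSITIVE_PREFIXES = ("AWS_", "GOOGLE_", "GCP_", "AZURE_")


def risky_variables(env: Mapping[str, str]) -> List[str]:
    """Names of environment variables that look like cloud credentials."""
    return sorted(name for name in env if name.startswith(SENSITIVE_PREFIXES))


def scrubbed_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy of the environment with the risky variables removed."""
    return {name: value for name, value in env.items()
            if not name.startswith(SENSITIVE_PREFIXES)}
```

Passing `scrubbed_env(os.environ)` as the `env` argument of `subprocess.Popen` (after `import os`) starts the agent without those variables. This reduces accidental exposure but does not replace running agents under a separate, low-privilege account.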
18. Troubleshooting
Problem: curl http://m4-1.local:11434/api/tags fails
Check that Ollama is listening:
lsof -i :11434
On macOS, confirm the bind address:
launchctl getenv OLLAMA_HOST
Restart:
osascript -e 'quit app "Ollama"' || true
open -a Ollama
On Linux:
sudo systemctl status ollama
journalctl -u ollama -n 100 --no-pager
Problem: model too slow
Use smaller model:
ollama pull qwen2.5-coder:7b
ollama run qwen2.5-coder:7b "Reply READY"
Then update LiteLLM config:
model: ollama_chat/qwen2.5-coder:7b
Restart LiteLLM.
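To decide whether a model is "too slow" with numbers rather than by feel, you can compute generation speed from the timing fields in Ollama's native API responses. This sketch assumes a response from /api/generate or /api/chat that includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function name is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's timing fields."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # eval_duration is reported in nanoseconds, so scale by 1e9.
    return response["eval_count"] * 1e9 / duration_ns
```

Comparing this figure for the 14B and 7B models on the same prompt gives a concrete basis for choosing which one each laptop should serve.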
Problem: Mac memory pressure
Stop heavy model:
ollama stop qwen3-coder:30b
Use 14B:
ollama run qwen2.5-coder:14b
Problem: LiteLLM not routing
Run debug:
export LITELLM_LOG=DEBUG
litellm --config litellm-config.yaml --host 0.0.0.0 --port 4000 --detailed_debug
LiteLLM’s docs show --detailed_debug for debug logging. (docs.litellm.ai)
Problem: Qwen Code opens wrong auth flow
Force OpenAI-compatible auth in:
nano ~/.qwen/settings.json
Ensure:
"security": {
"auth": {
"selectedType": "openai"
}
}
Qwen Code’s docs say security.auth.selectedType controls which protocol is used at startup, and model.name must match one of the configured models. (Qwen)
Final recommended setup for your exact hardware
Use this first:
M4 MacBook #1:
Ollama + qwen2.5-coder:14b
Later try qwen3-coder:30b
M4 MacBook #2:
Ollama + qwen2.5-coder:14b
Later try qwen3-coder:30b
Intel i7 64GB:
Ollama + qwen2.5-coder:14b
LiteLLM router
Git worktrees
Docker/test runner
Controller:
Prefer Intel if always powered on
Prefer M4 if you want smoother interactive Qwen Code
Best stable model group:
qwen-coder-fast = two M4 laptops running qwen2.5-coder:14b
Best experimental model group:
qwen-coder-30b = two M4 laptops running qwen3-coder:30b with 8K context
The real productivity boost will come from:
1 repo
3 worktrees
3 Qwen Code headless agents
1 LiteLLM router
2–3 local Qwen model workers
human review before merge
That is the practical self-hosted Qwen coding cluster.