How to Build a Self-Hosted Qwen AI Coding Cluster Using Multiple Laptops

3–4 laptops ≠ one single pooled Qwen brain
3–4 laptops = multiple local Qwen model servers + one router + multiple coding agents

Below is a complete practical tutorial.


Self-Hosted Qwen Coding Cluster Using 3–4 Laptops

0. What we are building

You will have:

Controller Laptop
  - LiteLLM router
  - Qwen Code CLI
  - Git worktrees
  - task runner scripts

Worker Laptop 1
  - Ollama
  - Qwen model

Worker Laptop 2
  - Ollama
  - Qwen model

Worker Laptop 3
  - Ollama
  - Qwen model

Final flow:

Qwen Code / Aider / VS Code
        ↓
LiteLLM Router :4000
        ↓
 ┌──────────────────────┬──────────────────────┬──────────────────────┐
 │ m4-1.local:11434      │ m4-2.local:11434      │ intel-1.local:11434   │
 │ qwen3-coder:30b       │ qwen3-coder:30b       │ qwen2.5-coder:14b/32b │
 └──────────────────────┴──────────────────────┴──────────────────────┘

This gives you a local AI coding cluster, not a fake RAM/CPU pooling trick.


1. Recommended model selection

For your hardware, do not start with Qwen 480B. Ollama lists qwen3-coder:480b as a local model requiring at least 250GB memory/unified memory, so your 24GB MacBooks are not the right machines for that. The practical target is qwen3-coder:30b or qwen2.5-coder:14b. Ollama lists qwen3-coder:30b as around 19GB with a 256K context window, while qwen2.5-coder:14b is around 9GB with 32K context. (ollama.com)

Use this:

Machine                   First model to try                  Safer fallback
M4 MacBook Pro 24GB #1    qwen3-coder:30b                     qwen2.5-coder:14b
M4 MacBook Pro 24GB #2    qwen3-coder:30b                     qwen2.5-coder:14b
Intel i7 64GB             qwen2.5-coder:32b only if patient   qwen2.5-coder:14b
Extra laptop              same as above                       qwen2.5-coder:7b

Important: a 24GB Mac can load a 19GB model, but the context/KV cache also needs memory on top of the weights. Ollama says larger context increases memory usage, and parallel requests multiply context memory usage. For coding agents on 24GB machines, start with 8K or 16K context, not 64K/256K. (docs.ollama.com)
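The memory reasoning above can be turned into a rough back-of-the-envelope check. This is only a sketch under stated assumptions, not an Ollama formula: estimate_mb and fits_in are hypothetical helpers, and the KV-cache cost per 8K of context is a placeholder that you would have to measure for your own model with ollama ps, since it varies by architecture and quantization.

```shell
# Rough memory estimate in MB: weights + KV cache scaled by context and parallel slots.
# Every input is an assumption you supply; kv_mb_per_8k in particular must be measured.
estimate_mb() {
  local model_mb="$1" kv_mb_per_8k="$2" ctx="$3" parallel="$4"
  echo $(( model_mb + kv_mb_per_8k * ctx * parallel / 8192 ))
}

# fits_in TOTAL_MB RESERVE_MB NEEDED_MB: succeeds when the estimate still leaves
# RESERVE_MB free for the OS and other apps.
fits_in() {
  local total_mb="$1" reserve_mb="$2" needed_mb="$3"
  [ $(( needed_mb + reserve_mb )) -le "$total_mb" ]
}

# Example with placeholder numbers: ~19 GB of weights, 1 GB of KV cache per 8K tokens,
# a 24 GB machine, and 4 GB kept free for everything else.
needed=$(estimate_mb 19000 1000 8192 1)
if fits_in 24000 4000 "$needed"; then
  echo "8K context, 1 slot: ${needed} MB fits"
else
  echo "8K context, 1 slot: ${needed} MB does not fit"
fi
```

Doubling either the context or OLLAMA_NUM_PARALLEL doubles the KV term, which is why the tutorial keeps both small on the 24GB machines.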


2. Name your laptops

Use simple hostnames.

Example:

m4-1.local
m4-2.local
intel-1.local
controller.local

On macOS

Run this on each Mac, changing the name:

sudo scutil --set HostName m4-1
sudo scutil --set LocalHostName m4-1
sudo scutil --set ComputerName m4-1
dscacheutil -flushcache
hostname

For the second Mac:

sudo scutil --set HostName m4-2
sudo scutil --set LocalHostName m4-2
sudo scutil --set ComputerName m4-2
dscacheutil -flushcache
hostname

On Ubuntu/Linux Intel laptop

sudo hostnamectl set-hostname intel-1
sudo apt update
sudo apt install -y avahi-daemon
sudo systemctl enable --now avahi-daemon
hostname

Now test from your controller:

ping m4-1.local
ping m4-2.local
ping intel-1.local

3. Install Ollama on every worker laptop

Ollama supports Apple M-series Macs with CPU/GPU support, while x86 Macs are CPU-only according to its macOS requirements. On Linux, Ollama can run as a systemd service, and NVIDIA/AMD GPU setup is optional depending on hardware. (docs.ollama.com)

macOS workers

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Verify:

ollama --version
ollama list

Ubuntu/Linux worker

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl status ollama
ollama --version

4. Configure Ollama for LAN cluster mode

By default, Ollama binds to 127.0.0.1:11434. To expose it to your LAN, set OLLAMA_HOST=0.0.0.0:11434. Ollama’s docs specifically say to use launchctl on macOS and systemd environment overrides on Linux. (docs.ollama.com)

On macOS workers

Run this on each Mac:

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"
launchctl setenv OLLAMA_KEEP_ALIVE "30m"
launchctl setenv OLLAMA_NO_CLOUD "1"

Restart Ollama:

osascript -e 'quit app "Ollama"' || true
open -a Ollama

Check:

curl http://localhost:11434/api/tags

On Linux worker

Create systemd override:

sudo systemctl edit ollama

Paste:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NO_CLOUD=1"

Restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama

Allow LAN access on Linux firewall:

sudo ufw allow from 192.168.0.0/16 to any port 11434 proto tcp
sudo ufw reload

Do not expose port 11434 directly to the public internet. Keep it LAN-only or VPN-only.


5. Pull Qwen models on each laptop

On M4 MacBook workers

Try the stronger model first:

ollama pull qwen3-coder:30b

Also pull a safer fallback:

ollama pull qwen2.5-coder:14b

Test:

ollama run qwen3-coder:30b "Reply only: READY"

Check memory/offload:

ollama ps

If the Mac becomes slow, hot, or memory pressure goes red, use:

ollama stop qwen3-coder:30b
ollama run qwen2.5-coder:14b "Reply only: READY"

On Intel i7 64GB worker

Without an NVIDIA GPU, the Intel laptop will likely be much slower than the M4 machines. Start with 14B:

ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b "Reply only: READY"

Then test 32B only as a deep/slow worker:

ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b "Reply only: READY"

6. Test each worker from controller

From your controller laptop:

curl http://m4-1.local:11434/api/tags
curl http://m4-2.local:11434/api/tags
curl http://intel-1.local:11434/api/tags

Test the OpenAI-compatible chat endpoint. Ollama supports OpenAI-style /v1/chat/completions, and the API key is required by clients but ignored by Ollama. (docs.ollama.com)

curl http://m4-1.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "messages": [
      {"role": "user", "content": "Write a Python hello world in one line."}
    ]
  }'

For Qwen3-Coder:

curl http://m4-1.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder:30b",
    "messages": [
      {"role": "user", "content": "Explain what a Kubernetes Deployment does in 3 lines."}
    ]
  }'
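The per-worker checks in this step can be wrapped in a small loop so you can re-run them whenever a laptop sleeps or changes network. The sketch below is one possible way to do it; check_worker and check_workers are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```shell
# check_worker HOST: succeeds when the worker's Ollama API answers /api/tags within 5 seconds.
check_worker() {
  curl -fsS --max-time 5 "http://$1:11434/api/tags" >/dev/null 2>&1
}

# check_workers HOST...: prints one OK/FAIL line per host.
# The exit status is the number of hosts that failed (0 means all reachable).
check_workers() {
  local failures=0 host
  for host in "$@"; do
    if check_worker "$host"; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}

# Usage on the controller:
#   check_workers m4-1.local m4-2.local intel-1.local
```

Because the exit status is zero only when every worker answers, the function can gate later steps, for example running it before starting the router.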

7. Install LiteLLM router on controller

LiteLLM is useful here because it gives you one OpenAI-compatible gateway and can load-balance multiple deployments under the same model group. Its docs describe LiteLLM Proxy as an LLM gateway with OpenAI-style Chat Completions and load balancing across multiple model deployments. (docs.litellm.ai)

On controller:

mkdir -p ~/qwen-cluster
cd ~/qwen-cluster

Install uv if you do not have it:

curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.zshrc 2>/dev/null || source ~/.bashrc 2>/dev/null || true

Install LiteLLM Proxy:

uv tool install 'litellm[proxy]'

Verify:

litellm --version

8. Create LiteLLM cluster config

Create config:

cd ~/qwen-cluster
nano litellm-config.yaml

Paste this first version:

model_list:
  # Fast cluster: two M4 laptops running same model
  - model_name: qwen-coder-fast
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://m4-1.local:11434
      rpm: 4

  - model_name: qwen-coder-fast
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://m4-2.local:11434
      rpm: 4

  # Stronger but tighter M4 model
  - model_name: qwen-coder-30b
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://m4-1.local:11434
      rpm: 2

  - model_name: qwen-coder-30b
    litellm_params:
      model: ollama_chat/qwen3-coder:30b
      api_base: http://m4-2.local:11434
      rpm: 2

  # Intel deep/slow worker
  - model_name: qwen-coder-intel
    litellm_params:
      model: ollama_chat/qwen2.5-coder:14b
      api_base: http://intel-1.local:11434
      rpm: 1

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 1
  timeout: 600

Why separate model groups?

qwen-coder-fast  = stable daily coding
qwen-coder-30b   = better coding, heavier memory
qwen-coder-intel = slow fallback / review / tests

Do not mix 14b, 30b, and 32b under the same alias at first. It makes performance unpredictable.
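Adding another laptop to a group means repeating one model_list entry with a different api_base. As a convenience, a tiny generator can print entries in the same layout as the config above. This is only a sketch: emit_deployment is a hypothetical helper, and m4-3.local is a made-up hostname for an extra worker.

```shell
# emit_deployment GROUP MODEL HOST RPM: prints one model_list entry
# using the same keys and indentation as litellm-config.yaml above.
emit_deployment() {
  local group="$1" model="$2" host="$3" rpm="$4"
  cat <<EOF
  - model_name: $group
    litellm_params:
      model: ollama_chat/$model
      api_base: http://$host:11434
      rpm: $rpm
EOF
}

# Example: a hypothetical fourth laptop joining the fast group.
emit_deployment qwen-coder-fast qwen2.5-coder:14b m4-3.local 4
```

Paste the printed entry under model_list, keep the alias identical to the other members of the group, and restart LiteLLM so the router picks it up.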


9. Start LiteLLM router

cd ~/qwen-cluster
export LITELLM_MASTER_KEY="sk-local-qwen-router-change-me"
litellm --config litellm-config.yaml --host 0.0.0.0 --port 4000

In another terminal, test:

curl http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-qwen-router-change-me" \
  -d '{
    "model": "qwen-coder-fast",
    "messages": [
      {"role": "user", "content": "Write a Bash command to list files larger than 100MB."}
    ]
  }'

Test the 30B group:

curl http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-qwen-router-change-me" \
  -d '{
    "model": "qwen-coder-30b",
    "messages": [
      {"role": "user", "content": "Review this Terraform snippet conceptually: resource aws_s3_bucket test {}"}
    ]
  }'

Test from another laptop:

curl http://controller.local:4000/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-qwen-router-change-me" \
  -d '{
    "model": "qwen-coder-fast",
    "messages": [
      {"role": "user", "content": "Reply only: CLUSTER_OK"}
    ]
  }'

10. Install Qwen Code on controller

Qwen Code is now a proper terminal AI coding agent. Its repo says it supports OpenAI-compatible providers and local models such as Ollama/vLLM, and it has interactive, headless, IDE, daemon, SDK, and agent-team style usage. (GitHub)

Install:

brew install qwen-code

Or via npm:

npm install -g @qwen-code/qwen-code@latest

Verify:

qwen --version

11. Configure Qwen Code to use your local cluster

Qwen Code supports ~/.qwen/settings.json with modelProviders, security.auth.selectedType, and model.name. Its docs also show local/self-hosted models via OpenAI-compatible API by setting baseUrl. (Qwen)

Create config:

mkdir -p ~/.qwen
nano ~/.qwen/settings.json

Paste:

{
  "modelProviders": {
    "openai": [
      {
        "id": "qwen-coder-fast",
        "name": "Local Qwen Coder Fast Cluster",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 600000,
          "maxRetries": 1,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      },
      {
        "id": "qwen-coder-30b",
        "name": "Local Qwen3 Coder 30B Cluster",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 900000,
          "maxRetries": 1,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      },
      {
        "id": "qwen-coder-intel",
        "name": "Local Qwen Intel Worker",
        "baseUrl": "http://localhost:4000",
        "envKey": "LOCAL_QWEN_ROUTER_KEY",
        "generationConfig": {
          "timeout": 1200000,
          "maxRetries": 0,
          "contextWindowSize": 8192,
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      }
    ]
  },
  "security": {
    "auth": {
      "selectedType": "openai"
    }
  },
  "model": {
    "name": "qwen-coder-fast"
  }
}

Set key:

echo 'export LOCAL_QWEN_ROUTER_KEY="sk-local-qwen-router-change-me"' >> ~/.zshrc
source ~/.zshrc

Launch:

qwen

Inside Qwen Code:

/model

Select:

Local Qwen Coder Fast Cluster

Test prompt:

Analyze this repository and tell me the tech stack, entry points, and test commands.

12. Use it for real coding

Create a test repo:

mkdir -p ~/ai-lab
cd ~/ai-lab
git clone <YOUR_REPO_URL> app
cd app

Run Qwen Code:

qwen

Good first prompt:

Do not modify files yet. First inspect the repo and tell me:
1. tech stack
2. build command
3. test command
4. risky files
5. suggested first small task

Then:

Create a plan to add unit tests for the authentication module. Do not execute commands without asking.

Then:

Implement the first small test only. Keep changes minimal.

Review:

git diff
git status

Commit only after review:

git add .
git commit -m "test: add auth unit test"

13. Turn it into a real multi-agent coding cluster

This is where your 3–4 laptops become useful.

Use Git worktrees so each AI agent works separately.

cd ~/ai-lab/app

git worktree add ../app-agent-1 -b ai/agent-1-auth-tests
git worktree add ../app-agent-2 -b ai/agent-2-docs
git worktree add ../app-agent-3 -b ai/agent-3-refactor

Open 3 terminals.

Terminal 1

cd ~/ai-lab/app-agent-1
qwen -p "You are agent 1. Add focused unit tests for the authentication module. Keep changes minimal. Run tests if possible. Do not touch unrelated files."

Terminal 2

cd ~/ai-lab/app-agent-2
qwen -p "You are agent 2. Improve README developer setup instructions based on this repository. Do not change source code."

Terminal 3

cd ~/ai-lab/app-agent-3
qwen -p "You are agent 3. Find one small refactor opportunity in the API layer. Make a minimal safe change and explain the risk."

Qwen Code’s docs list headless mode as qwen -p "...", useful for scripts, CI/CD, and batch processing. (GitHub)

Now your agents are parallel:

Agent 1 → test branch
Agent 2 → docs branch
Agent 3 → refactor branch

Check results:

cd ~/ai-lab/app-agent-1
git diff

cd ~/ai-lab/app-agent-2
git diff

cd ~/ai-lab/app-agent-3
git diff

Merge manually after review.
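Before reading three full diffs, it can help to see at a glance which agents actually changed anything. The following is a minimal sketch; changed_count and summarize_worktrees are hypothetical helper names, and the count simply reflects the lines printed by git status --short.

```shell
# changed_count DIR: number of modified or untracked paths git reports in DIR.
changed_count() {
  git -C "$1" status --short | awk 'END { print NR }'
}

# summarize_worktrees DIR...: one tab-separated line per worktree with its changed-path count.
summarize_worktrees() {
  local dir
  for dir in "$@"; do
    printf '%s\t%s changed\n' "$dir" "$(changed_count "$dir")"
  done
}

# Usage:
#   summarize_worktrees ~/ai-lab/app-agent-1 ~/ai-lab/app-agent-2 ~/ai-lab/app-agent-3
```

A worktree that reports zero changes usually means the agent stopped early or only answered in text, so it is worth checking that terminal before reviewing the others.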


14. Optional: create a simple task runner

Create file:

cd ~/qwen-cluster
nano run-agent-task.sh

Paste:

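#!/usr/bin/env bash
# Usage: run-agent-task.sh <repo-path> <branch-name> <task-prompt>
set -euo pipefail

if [ "$#" -ne 3 ]; then
  echo "Usage: $0 <repo-path> <branch-name> <task-prompt>" >&2
  exit 1
fi

REPO="$1"
BRANCH="$2"
TASK="$3"

BASE_DIR="$HOME/ai-lab"
WORKTREE="$BASE_DIR/$BRANCH"

mkdir -p "$BASE_DIR"

cd "$REPO"

git fetch --all --prune

# Reuse an existing worktree. Otherwise create one, checking out the branch
# if it already exists instead of silently ignoring the failure.
if [ ! -d "$WORKTREE" ]; then
  if git show-ref --verify --quiet "refs/heads/$BRANCH"; then
    git worktree add "$WORKTREE" "$BRANCH"
  else
    git worktree add "$WORKTREE" -b "$BRANCH"
  fi
fi

cd "$WORKTREE"

# Headless Qwen Code run inside the isolated worktree
qwen -p "$TASK"

echo
echo "===== STATUS ====="
git status --short

echo
echo "===== DIFF ====="
git diff --stat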

Make executable:

chmod +x run-agent-task.sh

Run:

~/qwen-cluster/run-agent-task.sh \
  ~/ai-lab/app \
  ai/add-healthcheck-tests \
  "Add tests for the healthcheck endpoint. Keep the change minimal. Run the relevant test command if you can infer it."

Run another:

~/qwen-cluster/run-agent-task.sh \
  ~/ai-lab/app \
  ai/update-deployment-docs \
  "Update deployment documentation. Do not change application source code."

This is your basic AI coding farm.
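The two runs above execute one after the other. To keep several workers busy at once, the same commands can be started in the background and waited on together. The sketch below assumes nothing beyond a POSIX shell; run_all is a hypothetical helper, and the log directory name is arbitrary.

```shell
# run_all LOG_DIR CMD...: starts each command line in the background with its own
# log file, waits for all of them, and prints how many exited non-zero.
run_all() {
  local log_dir="$1" pids="" failures=0 i=0 cmd pid
  shift
  mkdir -p "$log_dir"
  for cmd in "$@"; do
    i=$((i + 1))
    sh -c "$cmd" >"$log_dir/task-$i.log" 2>&1 &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || failures=$((failures + 1))
  done
  echo "$failures"
}

# Usage (each argument is one complete command line):
#   run_all "$HOME/qwen-cluster/logs" \
#     "$HOME/qwen-cluster/run-agent-task.sh $HOME/ai-lab/app ai/add-healthcheck-tests 'Add tests for the healthcheck endpoint.'" \
#     "$HOME/qwen-cluster/run-agent-task.sh $HOME/ai-lab/app ai/update-deployment-docs 'Update deployment documentation.'"
```

Since each worker is configured with OLLAMA_NUM_PARALLEL=1, starting many more tasks than you have workers mostly adds queueing, so a task count close to the number of laptops is a sensible starting point.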


15. Optional: use Aider with the same cluster

Aider can connect to any OpenAI-compatible endpoint, including local endpoints. Its docs show OPENAI_API_BASE and OPENAI_API_KEY for OpenAI-compatible APIs. (aider.chat)

Install:

python3 -m pip install aider-install
aider-install

Configure:

export OPENAI_API_BASE="http://localhost:4000"
export OPENAI_API_KEY="sk-local-qwen-router-change-me"

Run:

cd ~/ai-lab/app
aider --model openai/qwen-coder-fast

Or try 30B:

aider --model openai/qwen-coder-30b

16. Performance tuning

For 24GB M4 MacBooks

Start safe:

launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"

For qwen2.5-coder:14b, you can try:

launchctl setenv OLLAMA_CONTEXT_LENGTH "16384"

For qwen3-coder:30b, stay conservative:

launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"

Then restart Ollama:

osascript -e 'quit app "Ollama"' || true
open -a Ollama

Check:

ollama ps

For Linux Intel

If CPU-only, do not expect magic. Use it for:

slow review
small model
test runner
Docker
database
Git worktree storage

Use:

sudo systemctl edit ollama

Keep:

[Service]
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama
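To compare workers by numbers rather than by feel, Ollama's native /api/generate endpoint reports eval_count (tokens generated) and eval_duration (nanoseconds) in its final response, and dividing one by the other gives tokens per second. The helper below is a small sketch of that arithmetic; tokens_per_sec is a hypothetical name and the example numbers are made up.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS: generation speed in tokens per second.
# Prints 0.0 when the duration is missing or zero to avoid dividing by zero.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "0.0"; exit }
    printf "%.1f\n", count / ns * 1000000000
  }'
}

# Example with made-up numbers: 100 tokens generated in 2 seconds.
tokens_per_sec 100 2000000000   # prints 50.0
```

Run the same short prompt against each worker with "stream": false, read the two fields from the JSON response, and feed them to the function. If a worker is several times slower than the others, give it a smaller model or a lower rpm in the LiteLLM config.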

17. Security checklist

This part matters. Local coding agents can read files, run shell commands, and modify repos.

Do this:

Use separate repo clones
Use Git worktrees
Use branch per task
Do not mount production secrets
Do not expose Ollama/LiteLLM to internet
Use LAN/VPN only
Use LiteLLM master key
Review git diff before commit
Never let agent push directly to main
Never give AWS/GCP production credentials to agent shell
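The last checklist item is easy to violate by accident, because an agent started from your normal terminal inherits every exported variable. A small guard can refuse to launch when common cloud credentials are present. This is a sketch only: leaked_secret_vars and safe_to_launch are hypothetical helpers, and the variable list is a short illustrative sample rather than a complete inventory.

```shell
# leaked_secret_vars: prints the names of well-known cloud credential variables
# that are exported with a non-empty value in the current environment.
leaked_secret_vars() {
  local name
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN \
              GOOGLE_APPLICATION_CREDENTIALS; do
    if [ -n "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
}

# safe_to_launch: succeeds only when none of the variables above are exported.
safe_to_launch() {
  [ -z "$(leaked_secret_vars)" ]
}

# Usage:
#   safe_to_launch && qwen -p "Add tests for the healthcheck endpoint."
```

This only covers environment variables. Credential files on disk, such as those under ~/.aws, are still readable by an agent and need separate handling, for example by running agents under a dedicated user account.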

Also enable local-only Ollama mode:

launchctl setenv OLLAMA_NO_CLOUD "1"

or on Linux:

Environment="OLLAMA_NO_CLOUD=1"

Ollama says local prompts/data are not sent to Ollama when running locally, and cloud features can be disabled with OLLAMA_NO_CLOUD=1 or server config. (docs.ollama.com)


18. Troubleshooting

Problem: curl http://m4-1.local:11434/api/tags fails

Check Ollama is listening:

lsof -i :11434

On macOS:

launchctl getenv OLLAMA_HOST

Restart:

osascript -e 'quit app "Ollama"' || true
open -a Ollama

On Linux:

sudo systemctl status ollama
journalctl -u ollama -n 100 --no-pager

Problem: model too slow

Use smaller model:

ollama pull qwen2.5-coder:7b
ollama run qwen2.5-coder:7b "Reply READY"

Then update LiteLLM config:

model: ollama_chat/qwen2.5-coder:7b

Restart LiteLLM.

Problem: Mac memory pressure

Stop heavy model:

ollama stop qwen3-coder:30b

Use 14B:

ollama run qwen2.5-coder:14b

Problem: LiteLLM not routing

Run debug:

export LITELLM_LOG=DEBUG
litellm --config litellm-config.yaml --host 0.0.0.0 --port 4000 --detailed_debug

LiteLLM docs show --detailed_debug for debug logging. (docs.litellm.ai)

Problem: Qwen Code opens wrong auth flow

Force OpenAI-compatible auth in:

nano ~/.qwen/settings.json

Ensure:

"security": {
  "auth": {
    "selectedType": "openai"
  }
}

Qwen Code docs say security.auth.selectedType controls which protocol is used at startup, and model.name must match the id of one configured model. (Qwen)


Final recommended setup for your exact hardware

Use this first:

M4 MacBook #1:
  Ollama + qwen2.5-coder:14b
  Later try qwen3-coder:30b

M4 MacBook #2:
  Ollama + qwen2.5-coder:14b
  Later try qwen3-coder:30b

Intel i7 64GB:
  Ollama + qwen2.5-coder:14b
  LiteLLM router
  Git worktrees
  Docker/test runner

Controller:
  Prefer Intel if always powered on
  Prefer M4 if you want smoother interactive Qwen Code

Best stable model group:

qwen-coder-fast = two M4 laptops running qwen2.5-coder:14b

Best experimental model group:

qwen-coder-30b = two M4 laptops running qwen3-coder:30b with 8K context

The real productivity boost will come from:

1 repo
3 worktrees
3 Qwen Code headless agents
1 LiteLLM router
2–3 local Qwen model workers
human review before merge

That is the practical self-hosted Qwen coding cluster.
