Operations and Maintenance
Operational runbook for keeping KafClaw healthy in dev and production-like setups.
Daily Checks
./kafclaw status
./kafclaw doctor
For env hygiene and permissions:
./kafclaw doctor --fix
Memory + knowledge quick checks:
./kafclaw knowledge status --json
curl -s http://127.0.0.1:18791/api/v1/memory/embedding/healthz
curl -s http://127.0.0.1:18791/api/v1/memory/embedding/status
Runtime Modes and Expectations
local- bind:
127.0.0.1 - group/orchestrator: off
- bind:
local-kafka- bind:
127.0.0.1 - group/orchestrator: on
- bind:
remote- bind: typically
0.0.0.0or specific LAN IP - dashboard auth token: strongly recommended
- with auth token configured, both dashboard API and
POST /chatrequire bearer auth
- bind: typically
If you intentionally run remote, do not force loopback defaults.
Updating Config Safely
Use CLI config commands:
./kafclaw config get gateway.host
./kafclaw config set gateway.host "127.0.0.1"
./kafclaw config unset channels.telegram.token
Bracket paths are supported:
./kafclaw config set channels.telegram.allowFrom[0] "alice"
./kafclaw config get channels.telegram.allowFrom[0]
Subagent limits can be tuned via config:
./kafclaw config get tools.subagents.maxSpawnDepth
./kafclaw config set tools.subagents.maxSpawnDepth 1
./kafclaw config set tools.subagents.maxChildrenPerAgent 5
./kafclaw config set tools.subagents.maxConcurrent 8
./kafclaw config set tools.subagents.archiveAfterMinutes 60
./kafclaw config set tools.subagents.allowAgents '["agent-main","agent-research"]'
./kafclaw config set tools.subagents.model "anthropic/claude-sonnet-4-5"
./kafclaw config set tools.subagents.thinking "medium"
Guided alternative:
./kafclaw configure --subagents-allow-agents agent-main,agent-research --non-interactive
Service Operation (Linux systemd)
Preferred (CLI-managed):
sudo ./kafclaw daemon status
sudo ./kafclaw daemon restart
Direct systemctl fallback:
sudo systemctl daemon-reload
sudo systemctl enable --now kafclaw-gateway.service
sudo systemctl status kafclaw-gateway.service
Restart after config changes:
sudo systemctl restart kafclaw-gateway.service
Backup
Minimum backup set:
~/.kafclaw/config.json~/.config/kafclaw/env~/.kafclaw/timeline.db- workspace and work-repo dirs used by your config
Upgrade Flow
git pull
make build
go test ./...
Then restart runtime/service.
Troubleshooting Quick Map
- Gateway unreachable from network:
- check
gateway.host - check firewall ports
18790,18791
- check
- Remote mode blocked:
- ensure dashboard auth token exists
- verify token is sent to both dashboard API and
/chatcalls
- Kafka group issues:
- verify brokers and group config
- use
kafclaw ksharkfor diagnostics
- LLM errors:
- verify API base, model, token in config/env
- Embedding runtime unhealthy / memory recall degraded:
- run
kafclaw doctorand confirmmemory_embedding_configuredpasses - trigger bootstrap:
curl -X POST http://127.0.0.1:18791/api/v1/memory/embedding/install -d '{"model":"BAAI/bge-small-en-v1.5"}' -H 'Content-Type: application/json' - verify
/api/v1/memory/embedding/healthzreportsready=true
- run
- Embedding model switched and recall quality dropped:
- expected behavior: vector index must be reset when fingerprint changes
- reindex explicitly (destructive):
curl -X POST http://127.0.0.1:18791/api/v1/memory/embedding/reindex -d '{"confirmWipe":true,"reason":"embedding_switch"}' -H 'Content-Type: application/json' - monitor
memory_chunksrefill andknowledge factsconsistency
- Shared knowledge not converging across claws:
- confirm presence/capability heartbeats by checking runtime settings:
kafclaw config get knowledge.enabledkafclaw config get knowledge.topics.presencekafclaw config get knowledge.topics.capabilities
- validate envelope dedup and voting outcomes with:
kafclaw knowledge decisions --status approved --jsonkafclaw knowledge facts --json
- confirm presence/capability heartbeats by checking runtime settings:
- Subagent spawn denied:
- check
tools.subagents.maxSpawnDepth - check
tools.subagents.maxChildrenPerAgent - check
tools.subagents.maxConcurrent
- check
- Subagent steer did not keep old run:
- expected in phase 2: steer replaces execution by killing old run and spawning a new steered run
- Subagent target selection failed:
- use
target=last,target=<numeric index>,target=<runId prefix>,target=<label prefix>, or full child session key
- use
- Subagent controls cannot see target run:
- verify the run belongs to the same root session scope (cross-root controls are denied)
Subagent Audit Events
When trace IDs are active, subagent operations append timeline events:
subagent spawn_acceptedsubagent kill(includeskilled=true|falsein metadata)subagent steer
Runtime resilience:
- subagent runs are persisted in
~/.kafclaw/subagents/runs-<workspace-hash>.json - in-flight runs at restart are marked failed with restart reason
- ended runs are cleaned up after
tools.subagents.archiveAfterMinutes - announce retries/backoff are persisted and resumed on restart
- subagent completion announces are normalized as
Status/Result/Notes; returnANNOUNCE_SKIPto suppress
Use trace views in dashboard to inspect these lifecycle events.