Multi-Node Swarm
In this tutorial, you will set up a distributed Apiary cluster using Docker Compose with MinIO as the shared storage backend. You will write data, observe distributed query execution across multiple nodes, and add a new node to the swarm.
What you will learn:
- How to set up MinIO as a shared storage backend
- How multiple Apiary nodes form a swarm automatically
- How distributed queries work
- How to add a new node to the swarm
- How to monitor swarm health
Prerequisites:
- Docker and Docker Compose installed
- Completed Your First Apiary
- Apiary installed locally (for the client)
Step 1: Create the Docker Compose File
Create a docker-compose.yml file:
version: "3.8"
services:
minio:
image: minio/minio
command: server /data --console-address ":9001"
ports:
- "9000:9000"
- "9001:9001"
environment:
MINIO_ROOT_USER: apiary
MINIO_ROOT_PASSWORD: apiary123
volumes:
- minio-data:/data
healthcheck:
test: ["CMD", "mc", "ready", "local"]
interval: 5s
timeout: 5s
retries: 5
minio-init:
image: minio/mc
depends_on:
minio:
condition: service_healthy
entrypoint: >
/bin/sh -c "
mc alias set local http://minio:9000 apiary apiary123;
mc mb --ignore-existing local/apiary-data;
echo 'Bucket created';
"
apiary-node-1:
image: apiary:latest
depends_on:
minio-init:
condition: service_completed_successfully
environment:
AWS_ACCESS_KEY_ID: apiary
AWS_SECRET_ACCESS_KEY: apiary123
AWS_ENDPOINT_URL: http://minio:9000
APIARY_STORAGE: s3://apiary-data/swarm
apiary-node-2:
image: apiary:latest
depends_on:
minio-init:
condition: service_completed_successfully
environment:
AWS_ACCESS_KEY_ID: apiary
AWS_SECRET_ACCESS_KEY: apiary123
AWS_ENDPOINT_URL: http://minio:9000
APIARY_STORAGE: s3://apiary-data/swarm
apiary-node-3:
image: apiary:latest
depends_on:
minio-init:
condition: service_completed_successfully
environment:
AWS_ACCESS_KEY_ID: apiary
AWS_SECRET_ACCESS_KEY: apiary123
AWS_ENDPOINT_URL: http://minio:9000
APIARY_STORAGE: s3://apiary-data/swarm
volumes:
minio-data:
Step 2: Start the Cluster
docker compose up -d
Wait about 10 seconds for the nodes to start and discover each other.
You can check the MinIO Console at http://localhost:9001 (user: apiary, password: apiary123) to see the bucket and files being created.
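If you would rather verify from code than from the browser, a quick check against MinIO's S3 API works too. This is an optional sketch that assumes boto3 is installed on your machine; Apiary itself does not need it:

# Optional sanity check with boto3 (pip install boto3); not required by the tutorial.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="apiary",
    aws_secret_access_key="apiary123",
)
# head_bucket raises an error if the bucket created by minio-init is missing.
s3.head_bucket(Bucket="apiary-data")
print("apiary-data bucket is ready")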
Step 3: Connect a Client
From your local machine, set up the S3 credentials and connect:
import os
os.environ["AWS_ACCESS_KEY_ID"] = "apiary"
os.environ["AWS_SECRET_ACCESS_KEY"] = "apiary123"
os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"
from apiary import Apiary
ap = Apiary("swarm", storage="s3://apiary-data/swarm")
ap.start()
Step 4: Check the Swarm
Verify that all three nodes are visible:
import time
time.sleep(10) # Wait for heartbeats to propagate
swarm = ap.swarm_status()
print(f"Nodes alive: {swarm['alive']}")
print(f"Total bees: {swarm['total_bees']}")
for node in swarm['nodes']:
    print(f" {node['node_id']}: {node['state']} "
          f"({node['cores']} cores, {node['memory_gb']:.1f} GB)")
You should see three nodes in the swarm (four if your local client also registers as a node).
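If the fixed sleep feels fragile, you can poll swarm_status() until the expected number of nodes report in. The helper below is just a convenience sketch built on the call above (it assumes swarm['alive'] is the live node count, as printed earlier), not a built-in Apiary API:

import time

def wait_for_nodes(ap, expected, timeout=60, poll=2):
    # Poll swarm_status() until `expected` nodes report in, or give up.
    deadline = time.time() + timeout
    while True:
        swarm = ap.swarm_status()
        if swarm["alive"] >= expected:
            return swarm
        if time.time() > deadline:
            raise TimeoutError(f"only {swarm['alive']} of {expected} nodes appeared")
        time.sleep(poll)

swarm = wait_for_nodes(ap, expected=3)
print(f"Nodes alive: {swarm['alive']}")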
Step 5: Write Data
Create a frame and write a larger dataset so that distributed queries have something to work with:
import pyarrow as pa
ap.create_hive("analytics")
ap.create_box("analytics", "events")
ap.create_frame("analytics", "events", "clicks", {
    "event_id": "int64",
    "user_id": "utf8",
    "page": "utf8",
    "duration_ms": "int64",
    "country": "utf8",
}, partition_by=["country"])
# Generate data across multiple countries
countries = ["us", "uk", "de", "fr", "jp", "br", "au", "ca"]
all_rows = {
    "event_id": [],
    "user_id": [],
    "page": [],
    "duration_ms": [],
    "country": [],
}
import random
pages = ["/home", "/products", "/about", "/pricing", "/docs", "/blog"]
for i in range(10000):
    all_rows["event_id"].append(i)
    all_rows["user_id"].append(f"user_{random.randint(1, 500)}")
    all_rows["page"].append(random.choice(pages))
    all_rows["duration_ms"].append(random.randint(100, 30000))
    all_rows["country"].append(random.choice(countries))
table = pa.table(all_rows)
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
result = ap.write_to_frame("analytics", "events", "clicks", sink.getvalue().to_pybytes())
print(f"Cells written: {result['cells_written']}")
print(f"Rows written: {result['rows_written']}")
Step 6: Run Distributed Queries
With multiple nodes and multiple cells, queries can be distributed across the swarm:
# Top pages by visit count
result_bytes = ap.sql("""
SELECT page, COUNT(*) AS visits, AVG(duration_ms) AS avg_duration
FROM analytics.events.clicks
GROUP BY page
ORDER BY visits DESC
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())
# Country breakdown
result_bytes = ap.sql("""
SELECT country, COUNT(*) AS events, AVG(duration_ms) AS avg_duration
FROM analytics.events.clicks
GROUP BY country
ORDER BY events DESC
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())
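Because the frame is partitioned by country, filtering on the partition column is a natural way to narrow a query to a subset of cells. How aggressively the planner prunes is up to the engine; this query simply shows the pattern:

# Filter on the partition column so only the matching partition's cells need to be scanned.
result_bytes = ap.sql("""
    SELECT page, COUNT(*) AS visits
    FROM analytics.events.clicks
    WHERE country = 'us'
    GROUP BY page
    ORDER BY visits DESC
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())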
The query planner distributes cells across available nodes based on cache locality and capacity. You can observe the distribution by checking swarm status while queries are running:
# Check swarm status to see node utilization
swarm = ap.swarm_status()
for node in swarm['nodes']:
    print(f" {node['node_id']}: {node['state']} "
          f"(bees: {node['bees']}, idle: {node['idle_bees']})")
Step 7: Add a Fourth Node
Add a new node by starting another container:
docker compose up -d --scale apiary-node-1=2
Or add a new service entry in docker-compose.yml and run docker compose up -d.
Wait 10 seconds, then check the swarm again:
time.sleep(10)
swarm = ap.swarm_status()
print(f"Nodes alive: {swarm['alive']}")
for node in swarm['nodes']:
    print(f" {node['node_id']}: {node['state']}")
The new node appears automatically. Future queries will include it in the distribution plan.
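To exercise the expanded swarm, re-run one of the queries from Step 6; the planner can now place cells on the new node as well:

# Re-run a query now that the fourth node has joined the swarm.
result_bytes = ap.sql("""
    SELECT country, COUNT(*) AS events
    FROM analytics.events.clicks
    GROUP BY country
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())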
Step 8: Monitor the Swarm
Check colony health across the swarm:
colony = ap.colony_status()
print(f"Temperature: {colony['temperature']:.2f}")
print(f"Regulation: {colony['regulation']}")
print(f"Abandoned tasks: {colony['abandoned_tasks']}")
Step 9: Clean Up
ap.shutdown()
docker compose down -v
What You Learned
- Nodes form a swarm automatically by connecting to the same S3 bucket
- Heartbeats in storage enable node discovery without gossip or seed nodes
- Distributed queries assign cells to nodes based on cache locality and capacity
- Scaling is done by starting new nodes -- no reconfiguration needed
- Colony status provides a health overview of the entire swarm
Next Steps
- Add Nodes to the Swarm -- Add Raspberry Pis, Docker containers, or Kubernetes pods
- Swarm Coordination -- Understand the heartbeat and world view mechanism
- Query Execution -- Deep dive into how distributed queries work
- Deploy on Kubernetes -- Production deployment with Kubernetes