Your First Apiary

In this tutorial, you will install Apiary, create a namespace, write data, and query it with SQL. By the end, you will have a working local Apiary instance with data you can explore.

What you will learn:

How to install Apiary from source
How to create hives, boxes, and frames (Apiary's namespace hierarchy)
How to write data using PyArrow
How to query data with SQL
How to check node status

Prerequisites:

Rust 1.75+ (install from rustup.rs)
Python 3.9+
pip

Time: About 10 minutes (excluding build time).

Step 1: Install Apiary

Clone the repository and build:

git clone https://github.com/ApiaryData/apiary.git
cd apiary

# Build the Rust workspace
cargo build --workspace

# Install the Python build tool and package
pip install maturin
maturin develop

Verify the installation:

python -c "from apiary import Apiary; print('Apiary installed successfully')"

Step 2: Create an Apiary Instance

Open a Python interpreter or create a new .py file:

from apiary import Apiary

# Create a local Apiary instance
ap = Apiary("my_first_apiary")
ap.start()

# Check what we're working with
status = ap.status()
print(f"Node ID: {status['node_id']}")
print(f"Cores: {status['cores']}")
print(f"Memory: {status['memory_gb']:.1f} GB")
print(f"State: {status['state']}")

This creates a local Apiary instance. Data is stored at ~/.apiary/my_first_apiary/ on your filesystem.

Step 3: Create the Namespace

Apiary organizes data in a three-level hierarchy: Hive (database) > Box (schema) > Frame (table). Create one of each:

# Create a hive (like a database)
ap.create_hive("shop")

# Create a box inside the hive (like a schema)
ap.create_box("shop", "inventory")

# Create a frame inside the box (like a table)
# Define the schema and an optional partition column
ap.create_frame("shop", "inventory", "products", {
    "product_id": "int64",
    "name": "utf8",
    "price": "float64",
    "category": "utf8",
}, partition_by=["category"])

Verify the namespace:

print(ap.list_hives())                          # ["shop"]
print(ap.list_boxes("shop"))                    # ["inventory"]
print(ap.list_frames("shop", "inventory"))      # ["products"]

Step 4: Write Data

Apiary accepts data as Arrow IPC bytes. Use PyArrow to create a table and serialize it:

import pyarrow as pa

# Create sample data
table = pa.table({
    "product_id": [1, 2, 3, 4, 5, 6],
    "name": ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones", "Webcam"],
    "price": [999.99, 29.99, 79.99, 449.99, 149.99, 69.99],
    "category": ["electronics", "accessories", "accessories", "electronics", "accessories", "electronics"],
})

# Serialize to Arrow IPC format
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream_writer(sink, table.schema)
writer.write_table(table)
writer.close()

# Write to the frame
result = ap.write_to_frame("shop", "inventory", "products", sink.getvalue().to_pybytes())
print(f"Cells written: {result['cells_written']}")
print(f"Rows written: {result['rows_written']}")

Notice that cells_written is 2, not 1. Apiary partitioned the data by the category column, creating one cell (Parquet file) for electronics and one for accessories.

Step 5: Query with SQL

Now query the data using SQL:

import pyarrow as pa

# Simple SELECT
result_bytes = ap.sql("SELECT * FROM shop.inventory.products ORDER BY price DESC")
reader = pa.ipc.open_stream(result_bytes)
table = reader.read_all()
print(table.to_pandas())

Try an aggregation:

# Average price by category
result_bytes = ap.sql("""
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM shop.inventory.products
    GROUP BY category
    ORDER BY avg_price DESC
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())

Try filtering -- this uses partition pruning, reading only the electronics cells:

result_bytes = ap.sql("""
    SELECT name, price
    FROM shop.inventory.products
    WHERE category = 'electronics'
    ORDER BY price DESC
""")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())

Step 6: Use Custom SQL Commands

Apiary adds custom commands on top of standard SQL:

# Set context to avoid repeating hive.box.frame
ap.sql("USE HIVE shop")
ap.sql("USE BOX inventory")

# Now you can use short names
result_bytes = ap.sql("SELECT * FROM products LIMIT 3")

# Inspect the namespace
ap.sql("SHOW HIVES")
ap.sql("SHOW BOXES IN shop")
ap.sql("SHOW FRAMES IN shop.inventory")

# Describe a frame's schema
result_bytes = ap.sql("DESCRIBE shop.inventory.products")
reader = pa.ipc.open_stream(result_bytes)
print(reader.read_all().to_pandas())

Step 7: Check Status and Shut Down

# See per-bee (per-core) status
bees = ap.bee_status()
for bee in bees:
    print(f"Bee {bee['bee_id']}: {bee['state']}")

# Check colony health
colony = ap.colony_status()
print(f"Temperature: {colony['temperature']:.2f} ({colony['regulation']})")

# Clean shutdown
ap.shutdown()
print("Done!")

What You Learned

Apiary uses a three-level namespace: Hive > Box > Frame
Data is written as Arrow IPC bytes and stored as Parquet files
SQL queries run via Apache DataFusion with automatic partition pruning
Each CPU core is a "bee" with its own memory budget
Colony temperature tracks overall system health

Next Steps

Sensor Data Pipeline -- Build a more realistic data pipeline with partitioning and aggregations
Multi-Node Swarm -- Set up a distributed cluster with Docker Compose
Python SDK Reference -- Full API documentation
SQL Reference -- Complete SQL syntax

Step 1: Install Apiary​

Step 2: Create an Apiary Instance​

Step 3: Create the Namespace​

Step 4: Write Data​

Step 5: Query with SQL​

Step 6: Use Custom SQL Commands​

Step 7: Check Status and Shut Down​

What You Learned​

Next Steps​