SQL Reference

Apiary uses Apache DataFusion as its SQL engine. Queries run over Parquet cells with automatic pruning and projection pushdown.

Supported Features

Feature	Status	Notes
SELECT	Supported	Full expression support
WHERE	Supported	Triggers cell and partition pruning
GROUP BY	Supported	With COUNT, SUM, AVG, MIN, MAX
HAVING	Supported	Filter on aggregate results
ORDER BY	Supported	ASC/DESC
LIMIT	Supported
JOIN	Supported	Single-node; distributed joins planned for v2
INSERT	Blocked	Use `write_to_frame()` in the Python SDK
UPDATE	Blocked	Use `overwrite_frame()` in the Python SDK
DELETE	Blocked	Use `overwrite_frame()` in the Python SDK
CREATE TABLE	Blocked	Use `create_frame()` in the Python SDK
DROP TABLE	Not yet supported	Planned for future release
ALTER TABLE	Not yet supported	Planned for future release

Table References

Tables use a three-part name: hive.box.frame.

SELECT * FROM warehouse.sales.orders;

Set context with USE to shorten references:

USE HIVE warehouse;
USE BOX sales;
SELECT * FROM orders;

Two-part names also work after USE HIVE:

USE HIVE warehouse;
SELECT * FROM sales.orders;

SELECT

Standard SQL SELECT with full expression support.

SELECT order_id, customer, amount
FROM warehouse.sales.orders
WHERE amount > 100
ORDER BY amount DESC
LIMIT 10;

WHERE

SELECT * FROM warehouse.sales.orders
WHERE region = 'us' AND amount >= 50.0;

WHERE predicates on partition columns trigger cell pruning -- only matching Parquet files are read. See Query Execution for details.

GROUP BY and Aggregates

Supported aggregate functions: COUNT, SUM, AVG, MIN, MAX.

SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
FROM warehouse.sales.orders
GROUP BY customer;

SELECT region, AVG(amount) AS avg_amount
FROM warehouse.sales.orders
GROUP BY region
HAVING AVG(amount) > 100;

ORDER BY

SELECT * FROM warehouse.sales.orders
ORDER BY amount DESC, customer ASC;

LIMIT

SELECT * FROM warehouse.sales.orders
LIMIT 25;

JOIN

SELECT o.order_id, o.amount, c.name
FROM warehouse.sales.orders o
JOIN warehouse.sales.customers c ON o.customer_id = c.id;

note

JOINs currently execute on a single node. Distributed join support is planned for v2.

Custom Commands

USE HIVE

Set the current hive context. Subsequent queries can omit the hive prefix.

USE HIVE warehouse;

USE BOX

Set the current box context. Requires a hive to be set first.

USE HIVE warehouse;
USE BOX sales;

SHOW HIVES

List all hives in the apiary.

SHOW HIVES;
-- Returns: name
--          warehouse
--          analytics

SHOW BOXES

List all boxes in a hive.

SHOW BOXES IN warehouse;

If a hive context is set with USE HIVE, you can omit IN:

USE HIVE warehouse;
SHOW BOXES;

SHOW FRAMES

List all frames in a box.

SHOW FRAMES IN warehouse.sales;

If both hive and box context are set, you can omit IN:

USE HIVE warehouse;
USE BOX sales;
SHOW FRAMES;

DESCRIBE

Show metadata for a frame including schema, partitions, cell count, row count, and size.

DESCRIBE warehouse.sales.orders;
-- Returns a property/value table:
-- property     | value
-- schema       | {"order_id": "int64", "customer": "utf8", ...}
-- partition_by | ["region"]
-- cells        | 3
-- total_rows   | 1500
-- total_bytes  | 24576

Blocked Operations

These SQL statements are intentionally blocked and return clear error messages directing you to the Python SDK:

Statement	Error Message	Alternative
`INSERT`	"Use `write_to_frame()` in the Python SDK"	`write_to_frame()`
`UPDATE`	"Use `overwrite_frame()` in the Python SDK"	`overwrite_frame()`
`DELETE`	"Use `overwrite_frame()` to replace data"	`overwrite_frame()`
`CREATE TABLE`	"Use `create_frame()` in the Python SDK"	`create_frame()`
`DROP TABLE`	"DROP is not supported via SQL. Use the registry API for DDL operations."	--
`ALTER TABLE`	"ALTER is not supported via SQL. Use the registry API for DDL operations."	--

Query Execution Pipeline

Parse -- DataFusion parses the SQL statement.
Resolve -- Table names are resolved to hive.box.frame using the current USE context.
Prune -- WHERE predicates and partition filters identify which cells to read.
Plan -- DataFusion builds a physical plan with projection pushdown.
Execute -- Bees process cells in sealed chambers with memory budgets.
Return -- Results are serialized as Arrow IPC bytes.

For distributed queries across multiple nodes, the coordinator assigns cells to workers based on cache locality and capacity. See Query Execution for the full discussion.

Examples

-- Top 5 customers by revenue
USE HIVE warehouse;
USE BOX sales;
SELECT customer, SUM(amount) AS revenue
FROM orders
GROUP BY customer
ORDER BY revenue DESC
LIMIT 5;

-- Count orders per region
SELECT region, COUNT(*) AS cnt
FROM warehouse.sales.orders
GROUP BY region;

-- Inspect a frame
DESCRIBE warehouse.sales.orders;

-- Browse the namespace
SHOW HIVES;
SHOW BOXES IN warehouse;
SHOW FRAMES IN warehouse.sales;

Supported Features​

Table References​

SELECT​

WHERE​

GROUP BY and Aggregates​

ORDER BY​

LIMIT​

JOIN​

Custom Commands​

USE HIVE​

USE BOX​

SHOW HIVES​

SHOW BOXES​

SHOW FRAMES​

DESCRIBE​

Blocked Operations​

Query Execution Pipeline​

Examples​