SQL Reference

Apiary uses Apache DataFusion as its SQL engine. Queries run over Parquet cells with automatic pruning and projection pushdown.

Supported Features

| Feature | Status | Notes |
|---------|--------|-------|
| SELECT | Supported | Full expression support |
| WHERE | Supported | Triggers cell and partition pruning |
| GROUP BY | Supported | With COUNT, SUM, AVG, MIN, MAX |
| HAVING | Supported | Filter on aggregate results |
| ORDER BY | Supported | ASC/DESC |
| LIMIT | Supported | |
| JOIN | Supported | Single-node; distributed joins planned for v2 |
| INSERT | Blocked | Use write_to_frame() in the Python SDK |
| UPDATE | Blocked | Use overwrite_frame() in the Python SDK |
| DELETE | Blocked | Use overwrite_frame() in the Python SDK |
| CREATE TABLE | Blocked | Use create_frame() in the Python SDK |
| DROP TABLE | Not yet supported | Planned for future release |
| ALTER TABLE | Not yet supported | Planned for future release |

Table References

Tables use a three-part name: hive.box.frame.

SELECT * FROM warehouse.sales.orders;

Set context with USE to shorten references:

USE HIVE warehouse;
USE BOX sales;
SELECT * FROM orders;

Two-part names also work after USE HIVE:

USE HIVE warehouse;
SELECT * FROM sales.orders;

SELECT

Standard SQL SELECT with full expression support.

SELECT order_id, customer, amount
FROM warehouse.sales.orders
WHERE amount > 100
ORDER BY amount DESC
LIMIT 10;

WHERE

SELECT * FROM warehouse.sales.orders
WHERE region = 'us' AND amount >= 50.0;

WHERE predicates on partition columns trigger cell pruning -- only matching Parquet files are read. See Query Execution for details.
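The effect of pruning can be illustrated with a small sketch. This is a conceptual illustration only, assuming hive-style partition paths (`region=us/...`); the file names are invented, and the real pruning logic lives inside Apiary's query planner.

```python
# Conceptual sketch of partition pruning: given hive-style cell paths,
# a predicate on the partition column selects only the matching files.
# The paths below are illustrative, not a real Apiary storage layout.

def prune_cells(paths, column, value):
    """Keep only paths containing a partition segment column=value."""
    needle = f"{column}={value}"
    return [p for p in paths if needle in p.split("/")]

cells = [
    "orders/region=us/cell-0.parquet",
    "orders/region=us/cell-1.parquet",
    "orders/region=eu/cell-2.parquet",
]

# WHERE region = 'us' reads only the two matching cells.
print(prune_cells(cells, "region", "us"))
```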

GROUP BY and Aggregates

Supported aggregate functions: COUNT, SUM, AVG, MIN, MAX.

SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
FROM warehouse.sales.orders
GROUP BY customer;
SELECT region, AVG(amount) AS avg_amount
FROM warehouse.sales.orders
GROUP BY region
HAVING AVG(amount) > 100;
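The GROUP BY / HAVING semantics above are standard SQL, so the same query shape can be tried against any engine. A minimal sketch using Python's built-in sqlite3 module, with throwaway sample data standing in for a real frame:

```python
import sqlite3

# Standard GROUP BY / HAVING semantics demonstrated on sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("us", 150.0), ("us", 90.0), ("eu", 40.0), ("eu", 60.0)],
)

# Keep only regions whose average order amount exceeds 100.
rows = conn.execute(
    """SELECT region, AVG(amount) AS avg_amount
       FROM orders
       GROUP BY region
       HAVING AVG(amount) > 100"""
).fetchall()
print(rows)  # us averages 120.0; eu averages 50.0 and is filtered out
```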

ORDER BY

SELECT * FROM warehouse.sales.orders
ORDER BY amount DESC, customer ASC;

LIMIT

SELECT * FROM warehouse.sales.orders
LIMIT 25;

JOIN

SELECT o.order_id, o.amount, c.name
FROM warehouse.sales.orders o
JOIN warehouse.sales.customers c ON o.customer_id = c.id;
Note: JOINs currently execute on a single node. Distributed join support is planned for v2.


Custom Commands

USE HIVE

Set the current hive context. Subsequent queries can omit the hive prefix.

USE HIVE warehouse;

USE BOX

Set the current box context. Requires a hive to be set first.

USE HIVE warehouse;
USE BOX sales;

SHOW HIVES

List all hives in the apiary.

SHOW HIVES;
-- Returns: name
-- warehouse
-- analytics

SHOW BOXES

List all boxes in a hive.

SHOW BOXES IN warehouse;

If a hive context is set with USE HIVE, you can omit IN:

USE HIVE warehouse;
SHOW BOXES;

SHOW FRAMES

List all frames in a box.

SHOW FRAMES IN warehouse.sales;

If both hive and box context are set, you can omit IN:

USE HIVE warehouse;
USE BOX sales;
SHOW FRAMES;

DESCRIBE

Show metadata for a frame, including its schema, partition columns, cell count, row count, and size in bytes.

DESCRIBE warehouse.sales.orders;
-- Returns a property/value table:
-- property | value
-- schema | {"order_id": "int64", "customer": "utf8", ...}
-- partition_by | ["region"]
-- cells | 3
-- total_rows | 1500
-- total_bytes | 24576

Blocked Operations

These SQL statements are intentionally blocked and return clear error messages directing you to the Python SDK:

| Statement | Error Message | Alternative |
|-----------|---------------|-------------|
| INSERT | "Use write_to_frame() in the Python SDK" | write_to_frame() |
| UPDATE | "Use overwrite_frame() in the Python SDK" | overwrite_frame() |
| DELETE | "Use overwrite_frame() to replace data" | overwrite_frame() |
| CREATE TABLE | "Use create_frame() in the Python SDK" | create_frame() |
| DROP TABLE | "DROP is not supported via SQL. Use the registry API for DDL operations." | -- |
| ALTER TABLE | "ALTER is not supported via SQL. Use the registry API for DDL operations." | -- |

Query Execution Pipeline

  1. Parse -- DataFusion parses the SQL statement.
  2. Resolve -- Table names are resolved to hive.box.frame using the current USE context.
  3. Prune -- WHERE predicates and partition filters identify which cells to read.
  4. Plan -- DataFusion builds a physical plan with projection pushdown.
  5. Execute -- Bees process cells in sealed chambers with memory budgets.
  6. Return -- Results are serialized as Arrow IPC bytes.

For distributed queries across multiple nodes, the coordinator assigns cells to workers based on cache locality and capacity. See Query Execution for the full discussion.


Examples

-- Top 5 customers by revenue
USE HIVE warehouse;
USE BOX sales;
SELECT customer, SUM(amount) AS revenue
FROM orders
GROUP BY customer
ORDER BY revenue DESC
LIMIT 5;

-- Count orders per region
SELECT region, COUNT(*) AS cnt
FROM warehouse.sales.orders
GROUP BY region;

-- Inspect a frame
DESCRIBE warehouse.sales.orders;

-- Browse the namespace
SHOW HIVES;
SHOW BOXES IN warehouse;
SHOW FRAMES IN warehouse.sales;