Storage Layout

Apiary stores all state in the configured storage backend -- local filesystem or S3-compatible object storage. This page documents the exact directory structure.

Three-Tier Storage Model

Tier    Medium          Purpose                                                        Persistence
Tier 1  Memory          Arrow RecordBatches in bee chambers during active computation  Ephemeral
Tier 2  Local disk      Cell cache (LRU) + spill files + write buffers                 Ephemeral (cache)
Tier 3  Object storage  Canonical data, metadata, and coordination state               Durable

Tier 3 (object storage) is the source of truth. If all compute nodes disappear, the bucket contains everything needed to resume operations.

Bucket Layout

{storage_root}/
├── _registry/
│   ├── state_000001.json            # Registry version 1
│   ├── state_000002.json            # Registry version 2 (latest)
│   └── _checkpoint/
│       └── checkpoint_000100.json   # Registry checkpoint
├── _heartbeats/
│   ├── node_abc123.json             # Node heartbeat files
│   └── node_def456.json
├── _queries/
│   └── {query_id}/                  # Distributed query coordination
│       ├── manifest.json            # Query plan and cell assignments
│       └── partial_{node_id}.arrow  # Partial results from workers
└── {hive}/
    └── {box}/
        └── {frame}/
            ├── _ledger/
            │   ├── 000000.json      # Ledger entry version 0
            │   ├── 000001.json      # Ledger entry version 1
            │   └── _checkpoint/
            │       └── checkpoint_000100.json
            └── {partition_col}={value}/
                ├── cell_{uuid}.parquet   # Data cells
                └── cell_{uuid}.parquet

Directory Details

_registry/

The namespace catalog. Contains versioned JSON files tracking which hives, boxes, and frames exist. New versions are committed using conditional writes (put_if_not_exists) for concurrency control.

Checkpoints are written every 100 versions to accelerate loading.
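The optimistic commit loop over conditional writes can be sketched as follows. The backend interface (put_if_not_exists) and the in-memory stand-in are illustrative assumptions, not Apiary's actual API:

```python
import json


class VersionConflict(Exception):
    """Raised when another writer claimed the same registry version first."""


class InMemoryBackend:
    """Stand-in for an object store; real backends would use a
    conditional put (e.g. S3 If-None-Match) to get the same semantics."""

    def __init__(self):
        self.objects = {}

    def put_if_not_exists(self, key, data):
        # Succeeds only if the key is not already present.
        if key in self.objects:
            return False
        self.objects[key] = data
        return True


def commit_registry_version(backend, root, current_version, state):
    # The next version is written to a brand-new key; the conditional
    # write fails if another node already committed that version.
    next_version = current_version + 1
    key = f"{root}/_registry/state_{next_version:06d}.json"
    if not backend.put_if_not_exists(key, json.dumps(state).encode()):
        raise VersionConflict(f"registry version {next_version} already committed")
    return next_version
```

On conflict, a writer would reload the latest state file, reapply its change, and retry with the new version number.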

_heartbeats/

One JSON file per active node, updated every 5 seconds. Contains the node ID, status, core count, memory, cache contents, and last-updated timestamp. Used by the world view builder to discover nodes and detect failures.

Heartbeat files are deleted on graceful shutdown; stale files left behind by dead nodes are cleaned up after 60 seconds.
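A liveness check over these files reduces to comparing the last-updated timestamp against the 60-second window. The field name updated_at (a Unix timestamp) is an assumed schema detail for illustration:

```python
import json
import time

HEARTBEAT_TTL_SECONDS = 60  # stale heartbeats are treated as dead after this


def is_node_alive(heartbeat_json, now=None):
    """Decide liveness from a heartbeat file's last-updated timestamp."""
    now = time.time() if now is None else now
    hb = json.loads(heartbeat_json)
    return (now - hb["updated_at"]) <= HEARTBEAT_TTL_SECONDS
```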

_queries/

Temporary directory for distributed query coordination. Each query gets a subdirectory containing:

  • manifest.json -- The query plan with cell-to-node assignments
  • partial_{node_id}.arrow -- Arrow IPC files with partial results from each worker

Query directories are cleaned up after the query completes.
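A coordinator polling for completion only needs to check that every assigned worker has produced its partial file. A minimal sketch, with hypothetical helper names:

```python
import os


def query_dir(root, query_id):
    # _queries/{query_id}/ as described in the bucket layout
    return os.path.join(root, "_queries", query_id)


def all_partials_present(root, query_id, assigned_nodes):
    """True once every assigned worker has written partial_{node_id}.arrow."""
    qdir = query_dir(root, query_id)
    return all(
        os.path.exists(os.path.join(qdir, f"partial_{node_id}.arrow"))
        for node_id in assigned_nodes
    )
```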

{hive}/{box}/{frame}/

Data directory for a single frame, containing:

  • _ledger/ -- Ordered sequence of JSON files recording every mutation (AddCells, RewriteCells). The ledger is the source of truth for which cells are active. Checkpointed every 100 versions.
  • {partition_col}={value}/ -- Partition directories following Hive-style partitioning. Each directory contains Parquet cell files.
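Recovering the active cell set means folding the ordered ledger entries in sequence. The entry shape below ("action", "added", "removed") is an assumption for illustration, not the actual on-disk schema:

```python
def active_cells(ledger_entries):
    """Fold ordered ledger entries into the currently active cell set.

    AddCells introduces new cells; RewriteCells replaces a set of cells
    (e.g. after compaction) with their rewritten output.
    """
    cells = set()
    for entry in ledger_entries:
        if entry["action"] == "AddCells":
            cells |= set(entry["added"])
        elif entry["action"] == "RewriteCells":
            cells -= set(entry["removed"])
            cells |= set(entry["added"])
    return cells
```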

Cell Files

Each cell file is a Parquet file with:

  • LZ4 compression
  • Column-level min/max statistics
  • UUID-based naming (cell_{uuid}.parquet)
  • Size governed by the leafcutter sizing policy
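The min/max statistics enable cell pruning: a cell can be skipped when its value range cannot overlap the query's predicate range. A sketch of that interval test, with an assumed stats layout of column name to (min, max):

```python
def cell_may_match(stats, column, lower, upper):
    """Check whether a cell's min/max range can overlap a query range.

    'stats' maps column name -> (min, max); cells for which this returns
    False can be skipped without reading any data pages.
    """
    col_min, col_max = stats[column]
    return col_max >= lower and col_min <= upper
```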

Local Storage

When using LocalBackend (the default for solo mode), the storage root is:

~/.apiary/{apiary_name}/

The directory structure is identical to the bucket layout above.

Namespace Hierarchy

Apiary Instance
└── Hive (database)
    └── Box (schema)
        └── Frame (table)
            └── Cell (Parquet file)
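This hierarchy maps directly onto the bucket layout shown earlier. A small sketch of that mapping (helper names are illustrative; object-store keys always use forward slashes, hence posixpath):

```python
import posixpath


def frame_path(storage_root, hive, box, frame):
    """{storage_root}/{hive}/{box}/{frame}, per the bucket layout."""
    return posixpath.join(storage_root, hive, box, frame)


def cell_path(storage_root, hive, box, frame, partition_col, value, cell_uuid):
    """Full key for one data cell inside a Hive-style partition directory."""
    return posixpath.join(
        frame_path(storage_root, hive, box, frame),
        f"{partition_col}={value}",
        f"cell_{cell_uuid}.parquet",
    )
```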

See the Glossary for definitions of each term.