Block Data Delivery
For clients who have purchased block row-level data, Carbon Arc provides two access methods:
- Iceberg REST Catalog — Query data directly using industry-standard Iceberg table format (Recommended)
- Amazon S3 — Direct file access via AWS S3 buckets (Legacy)
Both methods provide access to the same underlying data. We recommend the Iceberg REST Catalog (Polaris) for new integrations because it provides a modern, query-ready interface and removes the need to manage file ingestion pipelines.
Iceberg REST Catalog (Polaris) — Recommended
Overview
Carbon Arc provides access via an Iceberg REST Catalog, allowing you to connect your data platform directly to Carbon Arc's block data warehouse. This is the recommended approach for new clients as it offers:
- No ETL required — Query tables directly without building ingestion pipelines
- Always up-to-date — Access the latest data without managing incremental updates
- Industry standard — Compatible with Snowflake, Databricks, ClickHouse, Spark, Trino, and more
- Schema evolution — Automatic handling of schema changes
Connection Details
| Parameter | Value |
|---|---|
| Catalog URI | https://bulk.apps.carbonarc.co/api/catalog |
| Warehouse | bulk |
| Auth Scope | PRINCIPAL_ROLE:ALL |
| OAuth Token Endpoint | https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens |
Credentials
Your Client ID and Client Secret will be provided via a secure 1Password link after purchase. Keep these credentials secure and do not share them.
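To check your credentials outside of a data platform, you can request a token from the OAuth Token Endpoint directly. The following is a minimal sketch, assuming the endpoint accepts the standard Iceberg REST client-credentials request (a form-encoded POST with grant_type, client_id, client_secret, and scope) and returns JSON with an access_token field. The function names are illustrative and not part of any Carbon Arc SDK.

```python
import json
import urllib.parse
import urllib.request

# Values taken from the Connection Details table.
TOKEN_ENDPOINT = "https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens"
AUTH_SCOPE = "PRINCIPAL_ROLE:ALL"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",  # assumption: standard OAuth2 grant
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": AUTH_SCOPE,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(client_id: str, client_secret: str) -> str:
    """Send the request and return the bearer token.

    Assumes the JSON response carries an `access_token` field.
    """
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

Only fetch_access_token performs a network call; build_token_request can be inspected offline to confirm what would be sent.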
Platform Connection Guides
Select your data platform below for specific connection instructions:
- Snowflake
- ClickHouse
- Databricks
- Apache Spark
- Trino / Starburst
Snowflake Integration
Step 1: Create Catalog Integration
CREATE OR REPLACE CATALOG INTEGRATION carbon_arc
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://bulk.apps.carbonarc.co/api/catalog'
    WAREHOUSE = 'bulk'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<your_client_id>'
    OAUTH_CLIENT_SECRET = '<your_client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
Step 2: Create Linked Database
CREATE DATABASE carc
LINKED_CATALOG = (
CATALOG = 'carbon_arc'
);
Step 3: Query Data
Once connected, you can query tables directly:
SELECT * FROM carc.sloth.app_performance_data_daily LIMIT 100;
Replace <your_client_id> and <your_client_secret> with the credentials provided via 1Password.
ClickHouse Integration
Step 1: Enable Experimental Feature
SET allow_experimental_database_iceberg = 1;
Step 2: Create Database Connection
CREATE DATABASE carc
ENGINE = DataLakeCatalog('https://bulk.apps.carbonarc.co/api/catalog')
SETTINGS
  catalog_type = 'rest',
  catalog_credential = '<your_client_id>:<your_client_secret>',
  warehouse = 'bulk',
  auth_scope = 'PRINCIPAL_ROLE:ALL',
  oauth_server_uri = 'https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens';
Step 3: Query Data
SELECT * FROM carc.sloth.app_performance_data_daily LIMIT 100;
Replace <your_client_id> and <your_client_secret> with the credentials provided via 1Password.
Databricks Integration
Coming soon — Documentation for Databricks integration is in progress.
For immediate assistance, please contact your Carbon Arc representative.
Apache Spark Integration
Coming soon — Documentation for Apache Spark integration is in progress.
For immediate assistance, please contact your Carbon Arc representative.
Trino / Starburst Integration
Coming soon — Documentation for Trino and Starburst integration is in progress.
For immediate assistance, please contact your Carbon Arc representative.
Tracking Data Updates with Changelog Tables
Every client-facing data table has a companion changelog table in the same namespace, named {table_name}_changelog. Carbon Arc writes a row to the changelog each time a new partition is written to the data table, so you can drive incremental ingestion from a single audit stream instead of scanning the data table itself.
If you have access to a data table, you automatically have access to its changelog — no additional setup is required.
Schema
| Column | Type | Description |
|---|---|---|
| update_id | STRING | Unique identifier for the update event |
| event_timestamp | TIMESTAMP | UTC timestamp when the partition was written |
| action | STRING | FULL_REFRESH (reinstatement) or INCREMENTAL (daily drop) |
| drop_partition | STRING | The drop_partition value written to the data table |
| dt | DATE | Date partition column; always include it in filters for efficient querying |
Example Changelog Tables
| Data Table | Changelog Table |
|---|---|
| dalmatian.clickstream_data | dalmatian.clickstream_data_changelog |
| sloth.app_performance_data_daily | sloth.app_performance_data_daily_changelog |
Recommended Ingestion Workflow
- Persist a cursor: track the maximum event_timestamp you have processed so far.
- Poll the changelog: on each run, read new rows where event_timestamp > <cursor>, filtered on dt for partition pruning.
- Handle INCREMENTAL rows: re-read only the listed drop_partition values from the data table and merge them into your downstream store.
- Handle FULL_REFRESH rows: the upstream vendor data was fully reinstated. Truncate your local copy of the table and re-ingest it from scratch.
- Advance the cursor: persist the new maximum event_timestamp once ingestion succeeds.
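The workflow above can be sketched as a small planning function that turns polled changelog rows into a list of actions. This is an illustrative sketch, not a Carbon Arc client: ChangelogRow, IngestionPlan, and plan_ingestion are hypothetical names, and the rows are assumed to have already been fetched from the changelog table by your query engine. It also assumes that INCREMENTAL rows written before the latest FULL_REFRESH are superseded by that refresh.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChangelogRow:
    update_id: str
    event_timestamp: datetime
    action: str  # "FULL_REFRESH" or "INCREMENTAL"
    drop_partition: str


@dataclass
class IngestionPlan:
    truncate_and_reload: bool        # True when a FULL_REFRESH was seen
    partitions_to_merge: list        # drop_partition values to re-read and merge
    new_cursor: Optional[datetime]   # persist only after ingestion succeeds


def plan_ingestion(rows: Iterable[ChangelogRow],
                   cursor: Optional[datetime]) -> IngestionPlan:
    """Decide what to do with changelog rows newer than the cursor."""
    new_rows = sorted(
        (r for r in rows if cursor is None or r.event_timestamp > cursor),
        key=lambda r: r.event_timestamp,
    )
    if not new_rows:
        return IngestionPlan(False, [], cursor)

    refreshes = [r.event_timestamp for r in new_rows if r.action == "FULL_REFRESH"]
    last_refresh = max(refreshes) if refreshes else None

    partitions = []
    for row in new_rows:
        if row.action != "INCREMENTAL":
            continue
        if last_refresh is not None and row.event_timestamp <= last_refresh:
            continue  # assumed superseded by the later full refresh
        if row.drop_partition not in partitions:
            partitions.append(row.drop_partition)

    return IngestionPlan(last_refresh is not None, partitions,
                         new_rows[-1].event_timestamp)
```

A caller would truncate and reload when truncate_and_reload is set, merge each listed partition, and only then store new_cursor, so that a failed run is retried from the old cursor.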
Example Queries
Fetch all updates since your cursor:
SELECT update_id, event_timestamp, action, drop_partition
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-04-01'
AND event_timestamp > TIMESTAMP '2026-04-14 00:00:00'
ORDER BY event_timestamp;
Find the most recent full refresh (use this as a hard reset point):
SELECT MAX(event_timestamp) AS last_full_refresh
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-01-01'
AND action = 'FULL_REFRESH';
List every incremental partition written since the last full refresh:
SELECT drop_partition, MIN(event_timestamp) AS written_at
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-01-01'
  AND action = 'INCREMENTAL'
  AND event_timestamp > (
    SELECT MAX(event_timestamp)
    FROM carc.sloth.app_performance_data_daily_changelog
    WHERE dt >= DATE '2026-01-01'
      AND action = 'FULL_REFRESH'
  )
GROUP BY drop_partition;
A single update event can produce multiple changelog rows — one per drop_partition written. Group by update_id if you need to collapse them back into a single event.
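If you collapse rows on the client side instead of in SQL, the grouping might look like the sketch below. It assumes each row is a tuple in the column order of the first example query (update_id, event_timestamp, action, drop_partition); collapse_by_update is a hypothetical helper name.

```python
def collapse_by_update(rows):
    """Group changelog rows into one entry per update_id.

    Returns a dict keyed by update_id, in first-seen order, where each
    value holds the action, the earliest event_timestamp, and the list
    of drop_partition values written by that update event.
    """
    events = {}
    for update_id, event_timestamp, action, drop_partition in rows:
        event = events.setdefault(
            update_id,
            {"action": action, "written_at": event_timestamp, "drop_partitions": []},
        )
        event["written_at"] = min(event["written_at"], event_timestamp)
        event["drop_partitions"].append(drop_partition)
    return events
```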
Amazon S3 — Legacy
S3 file delivery is maintained for existing integrations. For new implementations, we recommend using Polaris instead.
Overview
Block data is delivered to dedicated S3 buckets with a standardized folder structure. Your AWS IAM user or role is granted read-only access to the bucket containing your purchased data assets.
Bucket Access
After purchase, you'll receive:
- Bucket ARN: The S3 bucket location (e.g., arn:aws:s3:::carc-ext-{dataset})
- IAM Access: Your AWS principal is granted read access to the bucket
Delivery Structure
Data is organized into two delivery patterns:
Incremental Updates
For ongoing data updates, we follow a standardized incremental delivery pattern:
| Attribute | Value |
|---|---|
| Path Structure | {drop_date}/Incremental/[data_files] |
| Content | Contains only new records received from vendor |
Example path:
s3://carc-ext-sloth/20260203/Incremental/sloth_app_performance_data_daily/
Full Reinstatement Deliveries
Complete data reinstatements are delivered when the ontology changes or upstream data is significantly corrected:
| Attribute | Value |
|---|---|
| Path Structure | {drop_date}/Full/[data_files] |
| Content | Complete data asset including all historical data ingested to date |
Example paths:
s3://carc-ext-sloth/20260129/Full/sloth_app_performance_data_daily/
s3://carc-ext-sloth/20260129/Full/sloth_app_performance_data_monthly/
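Both delivery patterns share one path layout, so a single parser can classify any delivered path. The sketch below assumes, based only on the example paths, that drop_date is formatted as YYYYMMDD and that the folder directly under Incremental or Full is the table name. Delivery and parse_delivery_path are hypothetical names.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class Delivery(NamedTuple):
    bucket: str
    drop_date: date
    kind: str   # "Incremental" or "Full"
    table: str  # folder directly under the delivery type


# s3://{bucket}/{drop_date}/{Incremental|Full}/{table}/...
_DELIVERY_PATTERN = re.compile(
    r"^s3://(?P<bucket>[^/]+)/(?P<drop_date>\d{8})/"
    r"(?P<kind>Incremental|Full)/(?P<table>[^/]+)"
)


def parse_delivery_path(uri: str) -> Delivery:
    """Split a delivery path into bucket, drop date, delivery type, and table."""
    match = _DELIVERY_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a recognized delivery path: {uri!r}")
    drop_date = datetime.strptime(match["drop_date"], "%Y%m%d").date()
    return Delivery(match["bucket"], drop_date, match["kind"], match["table"])
```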
When is Data Reinstated?
We generally reinstate data on the first Monday of the month if there are changes to the ontology or significant upstream data corrections.
Data Consumption Guidelines
Recommended Approach:
- Start with Full Reinstatement: Always consume the most recent Full reinstatement as your baseline
- Append Incrementals: Apply the Incremental deliveries that occurred after the latest Full refresh
- Re-ingest on New Full: When a new Full reinstatement is available, delete your existing ingested data and re-ingest the complete Full reinstatement
Monitor the S3 bucket for new Full directories. When one appears, schedule a complete re-ingestion to ensure data consistency.
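The guidelines above amount to picking the latest Full drop for a table and then every later Incremental drop. A minimal sketch of that selection follows, assuming you have already listed the bucket-relative object keys (for example with your S3 client of choice) and that drop dates are YYYYMMDD strings as in the example paths. select_deliveries is a hypothetical helper, and treating "after" as strictly later than the Full drop date is an assumption.

```python
def select_deliveries(keys, table):
    """Return (latest_full_drop_date, incremental_drop_dates_to_apply).

    `keys` are bucket-relative object keys shaped like
    "{drop_date}/{Incremental|Full}/{table}/...". Keys for other tables
    or with an unexpected shape are ignored.
    """
    full_dates, incremental_dates = set(), set()
    for key in keys:
        parts = key.split("/")
        if len(parts) < 3 or parts[2] != table:
            continue
        drop_date, kind = parts[0], parts[1]
        if kind == "Full":
            full_dates.add(drop_date)
        elif kind == "Incremental":
            incremental_dates.add(drop_date)

    if not full_dates:
        return None, []  # no baseline available yet

    baseline = max(full_dates)  # YYYYMMDD strings sort chronologically
    return baseline, sorted(d for d in incremental_dates if d > baseline)
```

If the returned baseline differs from the one you last ingested, that is the signal to delete your local copy and re-ingest before applying the listed Incremental drops.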
Available Datasets
The specific tables and feeds available depend on your purchased data package.
Contact your Carbon Arc representative for the complete schema documentation for your purchased data assets.
Support
For questions about block data access or connection issues:
- Email: support@carbonarc.ai