
Block Data Delivery

For clients who have purchased block row-level data, Carbon Arc provides two access methods:

  1. Iceberg REST Catalog — Query data directly using industry-standard Iceberg table format (Recommended)
  2. Amazon S3 — Direct file access via AWS S3 buckets (Legacy)

Both methods provide access to the same underlying data. We recommend the Iceberg REST Catalog (Polaris) for new integrations because it provides a query-ready interface and removes the need to build and maintain file ingestion pipelines.


Iceberg REST Catalog (Recommended)

Overview

Carbon Arc provides access via an Iceberg REST Catalog, allowing you to connect your data platform directly to Carbon Arc's block data warehouse. This is the recommended approach for new clients as it offers:

  • No ETL required — Query tables directly without building ingestion pipelines
  • Always up-to-date — Access the latest data without managing incremental updates
  • Industry standard — Compatible with Snowflake, Databricks, ClickHouse, Spark, Trino, and more
  • Schema evolution — Automatic handling of schema changes

Connection Details

| Parameter | Value |
| --- | --- |
| Catalog URI | https://bulk.apps.carbonarc.co/api/catalog |
| Warehouse | bulk |
| Auth Scope | PRINCIPAL_ROLE:ALL |
| OAuth Token Endpoint | https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens |
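
For clients that do not have a dedicated guide on this page, the connection details can be mapped onto generic Iceberg REST catalog properties. The sketch below assumes PyIceberg as the client, which this page does not mention. The property names (`uri`, `warehouse`, `scope`, `credential`, `oauth2-server-uri`) follow PyIceberg's REST catalog configuration and may differ between versions, so treat them as assumptions to check against the version you install. The catalog name `carbon_arc` is only a local label.

```python
def catalog_properties(client_id: str, client_secret: str) -> dict:
    """Map the connection details table onto REST catalog properties."""
    return {
        "type": "rest",
        "uri": "https://bulk.apps.carbonarc.co/api/catalog",
        "warehouse": "bulk",
        "scope": "PRINCIPAL_ROLE:ALL",
        # PyIceberg expects the credential as "<client_id>:<client_secret>".
        "credential": f"{client_id}:{client_secret}",
        "oauth2-server-uri": "https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens",
    }


def connect(client_id: str, client_secret: str):
    """Open the catalog (assumes `pip install pyiceberg`; needs network access)."""
    # Imported lazily so catalog_properties() works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("carbon_arc", **catalog_properties(client_id, client_secret))
```

Keeping the property mapping in its own function makes it easy to inspect or reuse with another Iceberg client without opening a network connection.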

Credentials

Your Client ID and Client Secret will be provided via a secure 1Password link after purchase. Keep these credentials secure and do not share them.
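
One way to keep the credentials out of source code is to read them from environment variables and exchange them for a token at the OAuth Token Endpoint. The following is a minimal sketch, assuming the endpoint accepts the standard OAuth2 client-credentials form used by Iceberg REST catalogs. The environment variable names `CARBON_ARC_CLIENT_ID` and `CARBON_ARC_CLIENT_SECRET` are hypothetical.

```python
import os
import urllib.parse
import urllib.request

TOKEN_ENDPOINT = "https://bulk.apps.carbonarc.co/api/catalog/v1/oauth/tokens"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    form = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": "PRINCIPAL_ROLE:ALL",
        }
    ).encode("utf-8")
    return urllib.request.Request(
        TOKEN_ENDPOINT,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def request_from_env() -> urllib.request.Request:
    """Read credentials from (hypothetical) environment variables."""
    return build_token_request(
        os.environ["CARBON_ARC_CLIENT_ID"],
        os.environ["CARBON_ARC_CLIENT_SECRET"],
    )
```

Passing the request to `urllib.request.urlopen` sends it. The response is expected to be JSON containing an access token, but confirm the exact shape against a real response before relying on it.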

Platform Connection Guides

Select your data platform below for specific connection instructions:

Snowflake Integration

Step 1: Create Catalog Integration

CREATE OR REPLACE CATALOG INTEGRATION carbon_arc
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://bulk.apps.carbonarc.co/api/catalog'
    WAREHOUSE = 'bulk'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<your_client_id>'
    OAUTH_CLIENT_SECRET = '<your_client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;

Step 2: Create Linked Database

CREATE DATABASE carc
  LINKED_CATALOG = (
    CATALOG = 'carbon_arc'
  );

Step 3: Query Data

Once connected, you can query tables directly:

SELECT * FROM carc.sloth.app_performance_data_daily LIMIT 100;

Note: In Step 1, replace <your_client_id> and <your_client_secret> with the credentials provided via 1Password.

Tracking Data Updates with Changelog Tables

Every client-facing data table has a companion changelog table in the same namespace, named {table_name}_changelog. Carbon Arc writes a row to the changelog each time a new partition is written to the data table, so you can drive incremental ingestion from a single audit stream instead of scanning the data table itself.

If you have access to a data table, you automatically have access to its changelog — no additional setup is required.

Schema

| Column | Type | Description |
| --- | --- | --- |
| update_id | STRING | Unique identifier for the update event |
| event_timestamp | TIMESTAMP | UTC timestamp when the partition was written |
| action | STRING | FULL_REFRESH (reinstatement) or INCREMENTAL (daily drop) |
| drop_partition | STRING | The drop_partition value written to the data table |
| dt | DATE | Date partition column; always include it in filters for efficient querying |

Example Changelog Tables

| Data Table | Changelog Table |
| --- | --- |
| dalmatian.clickstream_data | dalmatian.clickstream_data_changelog |
| sloth.app_performance_data_daily | sloth.app_performance_data_daily_changelog |

A typical incremental ingestion loop built on the changelog:

  1. Persist a cursor — track the maximum event_timestamp you have processed so far.
  2. Poll the changelog — on each run, read new rows where event_timestamp > <cursor>, filtered on dt for partition pruning.
  3. Handle INCREMENTAL rows — re-read only the listed drop_partition values from the data table and merge them into your downstream store.
  4. Handle FULL_REFRESH rows — the upstream vendor data was fully reinstated. Truncate your local copy of the table and re-ingest it from scratch.
  5. Advance the cursor — persist the new maximum event_timestamp once ingestion succeeds.
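
The steps above can be sketched as a small planning function that turns the changelog rows read since the cursor into a list of actions. This is an illustrative sketch, not Carbon Arc's implementation: `ChangelogRow` and `plan_ingestion` are hypothetical names, and the rule that a FULL_REFRESH supersedes incremental rows written before it is an assumption drawn from step 4.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ChangelogRow:
    update_id: str
    event_timestamp: datetime
    action: str  # "FULL_REFRESH" or "INCREMENTAL"
    drop_partition: str


def plan_ingestion(
    rows: Iterable[ChangelogRow], cursor: datetime
) -> Tuple[bool, List[str], datetime]:
    """Return (needs_full_reload, partitions_to_merge, new_cursor)."""
    # Step 2: keep only rows newer than the cursor, oldest first.
    new_rows = sorted(
        (r for r in rows if r.event_timestamp > cursor),
        key=lambda r: r.event_timestamp,
    )
    if not new_rows:
        return False, [], cursor

    needs_full_reload = False
    partitions: List[str] = []
    for row in new_rows:
        if row.action == "FULL_REFRESH":
            # Step 4: truncate and reload; earlier incrementals are superseded
            # (assumption: the reload already contains them).
            needs_full_reload = True
            partitions = []
        elif row.action == "INCREMENTAL" and row.drop_partition not in partitions:
            # Step 3: remember each partition once, in arrival order.
            partitions.append(row.drop_partition)

    # Step 5: the caller persists this only after ingestion succeeds.
    return needs_full_reload, partitions, new_rows[-1].event_timestamp
```

Returning the new cursor instead of saving it inside the function leaves the caller free to persist it only after the downstream merge has committed, which keeps a failed run repeatable.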

Example Queries

Fetch all updates since your cursor:

SELECT update_id, event_timestamp, action, drop_partition
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-04-01'
AND event_timestamp > TIMESTAMP '2026-04-14 00:00:00'
ORDER BY event_timestamp;

Find the most recent full refresh (use this as a hard reset point):

SELECT MAX(event_timestamp) AS last_full_refresh
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-01-01'
AND action = 'FULL_REFRESH';

List every incremental partition written since the last full refresh:

SELECT drop_partition, MIN(event_timestamp) AS written_at
FROM carc.sloth.app_performance_data_daily_changelog
WHERE dt >= DATE '2026-01-01'
  AND action = 'INCREMENTAL'
  AND event_timestamp > (
    -- COALESCE keeps the filter valid when no full refresh exists in the dt window
    SELECT COALESCE(MAX(event_timestamp), TIMESTAMP '1970-01-01 00:00:00')
    FROM carc.sloth.app_performance_data_daily_changelog
    WHERE dt >= DATE '2026-01-01'
      AND action = 'FULL_REFRESH'
  )
GROUP BY drop_partition;

Tip: A single update event can produce multiple changelog rows, one per drop_partition written. Group by update_id if you need to collapse them back into a single event.
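
If the changelog rows have already been fetched into application code, the same grouping can be done client-side. A minimal sketch, assuming each row arrives as an `(update_id, action, drop_partition)` tuple; the function name `collapse_events` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def collapse_events(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, dict]:
    """Collapse changelog rows into one entry per update_id."""
    events: Dict[str, dict] = {}
    for update_id, action, drop_partition in rows:
        # The first row for an update_id creates the event; later rows add partitions.
        event = events.setdefault(update_id, {"action": action, "partitions": []})
        event["partitions"].append(drop_partition)
    return events
```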


Amazon S3 — Legacy

Legacy Access Method: S3 file delivery is maintained for existing integrations. For new implementations, we recommend the Iceberg REST Catalog (Polaris) instead.

Overview

Block data is delivered to dedicated S3 buckets with a standardized folder structure. Your AWS IAM user or role is granted read-only access to the bucket containing your purchased data assets.

Bucket Access

After purchase, you'll receive:

  • Bucket ARN: The S3 bucket location (e.g., arn:aws:s3:::carc-ext-{dataset})
  • IAM Access: Your AWS principal is granted read access to the bucket

Delivery Structure

Data is organized into two delivery patterns:

Incremental Updates

For ongoing data updates, we follow a standardized incremental delivery pattern:

| Attribute | Value |
| --- | --- |
| Path Structure | {drop_date}/Incremental/[data_files] |
| Content | Contains only new records received from vendor |

Example path:

s3://carc-ext-sloth/20260203/Incremental/sloth_app_performance_data_daily/

Full Reinstatement Deliveries

Complete data reinstatements are delivered when upstream data is updated:

| Attribute | Value |
| --- | --- |
| Path Structure | {drop_date}/Full/[data_files] |
| Content | Complete data asset including all historical data ingested to date |

Example paths:

s3://carc-ext-sloth/20260129/Full/sloth_app_performance_data_daily/
s3://carc-ext-sloth/20260129/Full/sloth_app_performance_data_monthly/
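
When automating ingestion, it can help to parse delivery paths into their components. The sketch below is based only on the example paths shown on this page: it assumes that `drop_date` is formatted as YYYYMMDD and that the segment after `Full` or `Incremental` names the data asset. Neither is stated explicitly, so verify both against your bucket. The names `Delivery` and `parse_delivery` are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple
from urllib.parse import urlparse


class Delivery(NamedTuple):
    bucket: str
    drop_date: date
    kind: str   # "Full" or "Incremental"
    asset: str  # directory after the delivery kind, "" if absent


def parse_delivery(uri: str) -> Delivery:
    """Split an s3:// delivery path into bucket, drop date, kind, and asset."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[1] not in ("Full", "Incremental"):
        raise ValueError(f"unexpected delivery path: {uri}")
    # Assumption: drop_date is YYYYMMDD, as in the example paths.
    drop_date = datetime.strptime(parts[0], "%Y%m%d").date()
    asset = parts[2] if len(parts) > 2 else ""
    return Delivery(parsed.netloc, drop_date, parts[1], asset)
```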

When is Data Reinstated?

We generally reinstate data on the first Monday of the month if there are changes to the ontology or significant upstream data corrections.
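
To anticipate when a reinstatement might land, the first Monday of a given month can be computed as below. The schedule is described as general practice rather than a guarantee, so this is only a hint for scheduling checks, not a substitute for monitoring actual deliveries. The function name `first_monday` is hypothetical.

```python
from datetime import date, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the date of the first Monday in the given month."""
    first = date(year, month, 1)
    # weekday() is 0 for Monday, so this is the number of days until the next Monday.
    return first + timedelta(days=(7 - first.weekday()) % 7)
```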

Data Consumption Guidelines

Recommended Approach:

  1. Start with Full Reinstatement — Always consume the most recent Full reinstatement as your baseline
  2. Append Incrementals — Apply the Incremental deliveries that occurred after the latest Full refresh
  3. Re-ingest on New Full — When a new Full reinstatement is available, delete your existing ingested data and re-ingest the complete Full reinstatement

Best Practice: Monitor the S3 bucket for new Full directories. When one appears, schedule a complete re-ingestion to ensure data consistency.
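
The recommended approach can be sketched as a function that, given the object keys in a bucket, picks the latest Full drop as the baseline and the Incremental drops that follow it. This is an illustrative sketch with assumptions: keys follow the `{drop_date}/{Full|Incremental}/...` layout, `drop_date` is YYYYMMDD (so string order matches date order), and an Incremental dated the same day as the Full is treated as already contained in it. The name `select_deliveries` is hypothetical.

```python
from typing import Iterable, List, Tuple


def select_deliveries(keys: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (latest_full_drop_date, incremental_drop_dates_after_it)."""
    fulls = set()
    incrementals = set()
    for key in keys:
        parts = key.strip("/").split("/")
        if len(parts) < 2:
            continue  # not a delivery path
        drop_date, kind = parts[0], parts[1]
        if kind == "Full":
            fulls.add(drop_date)
        elif kind == "Incremental":
            incrementals.add(drop_date)
    if not fulls:
        raise ValueError("no Full reinstatement found; cannot establish a baseline")
    baseline = max(fulls)  # YYYYMMDD strings sort chronologically
    # Assumption: only drops strictly after the baseline date need to be applied.
    return baseline, sorted(d for d in incrementals if d > baseline)
```

If the returned baseline differs from the one used in the previous run, a new Full reinstatement has arrived and the local copy should be deleted and re-ingested. The keys themselves could be listed with any S3 client, for example a paginated `list_objects_v2` call in boto3.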


Available Datasets

The specific tables and feeds available depend on your purchased data package.

Info: Contact your Carbon Arc representative for the complete schema documentation for your purchased data assets.


Support

For questions about block data access or connection issues: