Skip to main content

Databricks

Overview

Extract connects to Databricks as a source by reading data from tables in your Databricks workspace. The connector uses the Databricks SQL Statement Execution API to query tables through a SQL warehouse and stream results as datasets.

You can configure the connector in two modes:

  • Table Discovery Mode: When no datasets are configured, Extract automatically discovers all accessible catalogs, schemas, and tables in your workspace and creates a stream for each table.
  • Explicit Datasets Mode: Configure specific tables to extract by providing catalog, schema, and table for each dataset. You can also optionally provide a primary_key and an incremental_timestamp_field per dataset.

Queries can run in Full Refresh mode or Incremental Changes mode when you specify an incremental timestamp column.

Prerequisites

  1. A Databricks workspace with a SQL warehouse.
  2. A Databricks Personal Access Token (PAT) for authentication.
  3. Permissions to connect to the SQL warehouse and query the catalogs, schemas, and tables you want to extract.
  4. Network access from Extract to your Databricks workspace. If your workspace restricts inbound traffic, allowlist the required Extract IPs or use approved private networking.

Authentication

Databricks authentication uses a Personal Access Token (PAT) provided as the access_token. For production workloads, Databricks recommends using tokens owned by a service principal rather than a personal user account.

You’ll also need:

  • Your Databricks workspace URL (for example, https://<your-workspace>.cloud.databricks.com) as workspace_url.
  • A Databricks SQL warehouse ID as warehouse_id.

The connector sends the token as a Bearer token to the Databricks SQL Statements API and validates credentials by executing a simple test query during login.

Setup Guide

In your Databricks workspace:

  1. Go to SQL Warehouses in the sidebar.
  2. Select the warehouse you want Extract to use.
  3. Click on the Connection details tab.
  4. Copy the following values:
    • Server Hostname (e.g., adb-1234567890123456.7.azuredatabricks.net)
    • HTTP Path (e.g., /sql/1.0/warehouses/abc123def456)

The Workspace URL for the connector is https://<Server Hostname>.

The Warehouse ID is the last segment of the HTTP Path (e.g., abc123def456).

Step 2 - Create a Personal Access Token

  1. In Databricks, click your username in the top right and select User Settings.
  2. Go to the Developer tab (or Access Tokens in older workspaces).
  3. Click Manage next to Access tokens, then Generate new token.
  4. Give the token a description (e.g., "Extract connector") and set an expiration.
  5. Copy the token immediately — you won't be able to see it again.

For automated production workloads, consider creating a service principal and generating a PAT for that identity.

Step 3 - Grant warehouse and data permissions

The identity associated with your PAT must have:

  • CAN USE permission on the SQL warehouse
  • USE CATALOG on the catalogs you want to read
  • USE SCHEMA on the schemas you want to read
  • SELECT on the tables you want to extract

For Unity Catalog environments, grant permissions like:

GRANT USE CATALOG ON CATALOG <catalog> TO `<user_or_principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `<user_or_principal>`;
GRANT SELECT ON TABLE <catalog>.<schema>.<table> TO `<user_or_principal>`;

Repeat for each catalog, schema, and table that Extract needs to read. You can also grant broader access at the catalog or schema level if that matches your security model.

Step 4 - Configure the connector in Extract

Fill in the connection fields:

  • Workspace URL — Your Databricks workspace URL (e.g., https://adb-1234567890123456.7.azuredatabricks.net)
  • Access Token — The Personal Access Token created in Step 2
  • Warehouse ID — The SQL warehouse ID from the HTTP Path

Optionally, configure Datasets to restrict what Extract syncs. If you leave Datasets empty, Extract will discover available tables automatically.

If you provide Datasets, each entry should include:

  • catalog
  • schema
  • table
  • (optional) primary_key
  • (optional) incremental_timestamp_field

Save the source and run a test sync.

Configuration Parameters

ParameterTypeRequiredDescription
workspace_urlstringYesYour Databricks workspace URL (e.g., https://adb-123.4.azuredatabricks.net)
access_tokenstringYesPersonal Access Token for authentication
warehouse_idstringYesSQL warehouse ID to use for query execution
datasetsarrayNoList of specific tables to extract. If empty, Extract discovers all accessible tables.

Datasets

When configuring explicit datasets, each entry defines one stream:

ParameterTypeRequiredDescription
catalogstringYesUnity Catalog catalog name
schemastringYesSchema name within the catalog
tablestringYesTable name to extract
primary_keystringNoColumn used for deduplication and diffing
incremental_timestamp_fieldstringNoColumn used for incremental syncs

Table Discovery Mode

When no datasets are configured, Extract automatically discovers tables by:

  1. Listing all accessible catalogs (excluding the system catalog)
  2. Listing all schemas in each catalog (excluding information_schema)
  3. Listing all tables in each schema (excluding temporary tables)
  4. Introspecting each table's columns and primary keys

Each discovered table becomes a stream named <catalog>__<schema>__<table>.

Extract also detects primary keys from Unity Catalog metadata (except for tables in hive_metastore, which does not support primary key constraints) and identifies candidate incremental timestamp columns based on common naming patterns such as updated_at, last_modified, modified_at, last_updated, modified_on, last_altered, _fivetran_synced, _airbyte_extracted_at, and _airbyte_emitted_at.

Extract Modes

Full Refresh

Extract runs SELECT * FROM <catalog>.<schema>.<table> and streams all rows.

If diffing is enabled and a primary key is defined, Extract orders results by the primary key and computes changes between runs.

Incremental Changes

If you set an Incremental Timestamp Field, Extract filters results by:

SELECT * FROM <catalog>.<schema>.<table> WHERE <timestamp_field> > <last_cursor_value>

On the first run (no cursor yet), Extract fetches all rows. The cursor updates to the maximum timestamp seen in each run.

Supported incremental timestamp types

The incremental timestamp field must be one of these Databricks types:

  • DATE
  • TIMESTAMP
  • TIMESTAMP_LTZ
  • TIMESTAMP_NTZ
  • TIMESTAMP WITHOUT TIME ZONE (alias for TIMESTAMP_NTZ)

For predictable incremental behavior, prefer a UTC-normalized timestamp column whose values increase monotonically as rows are inserted or updated.

Data Type Mapping

Extract maps Databricks types to schema field types:

Databricks TypeSchema Type
TINYINT, SMALLINT, INT, INTEGER, BIGINT, BYTE, SHORT, LONGinteger
FLOAT, DOUBLE, REAL, NUMERIC, DECIMAL(...)number
BOOLEANboolean
DATEstring (format: date)
TIMESTAMP, TIMESTAMP_LTZstring (format: date-time)
TIMESTAMP_NTZstring (format: date-time, no timezone)
STRING, VARCHAR, CHAR, BINARY, VOIDstring
INTERVAL, GEOGRAPHY, GEOMETRYstring
ARRAY<...>array
MAP<...>, STRUCT<...>, VARIANT, OBJECTjson

Complex nested types (arrays of structs, maps, etc.) are preserved in their JSON representation.

Data Streams

This connector produces one stream per selected Databricks table.

Streams are named using the fully-qualified table name:

  • {catalog}.{schema}.{table}

Stream selection (datasets)

You can either:

  • Let the connector discover tables (when datasets is not provided or is empty), or
  • Explicitly define streams via datasets.

Each datasets entry maps to a single stream and supports:

  • catalog (required)
  • schema (required)
  • table (required)
  • primary_key (optional): a comma-separated list of column names to use as the stream’s primary key
  • incremental_timestamp_field (optional): a column name to use for incremental extraction

If multiple datasets entries would produce the same stream name ({catalog}.{schema}.{table}), the connector will error.

System schemas excluded from discovery

When using discovery, the connector skips system catalogs/schemas:

  • system
  • information_schema

Extract metadata field

Each record includes an additional field:

  • _singular_extract_timestamp: the UTC timestamp when the record was extracted.

Additional Information

Stream naming

Stream names follow the pattern <catalog>__<schema>__<table> using double underscores as separators. This applies to both discovered tables and explicitly configured datasets.

Extract timestamp field

Extract automatically adds a _singular_extract_timestamp field to every stream, recording when each row was extracted.

Primary key detection

In table discovery mode, Extract queries information_schema.table_constraints and information_schema.key_column_usage to detect primary key columns. Tables with a single-column primary key support sorted diffing for change detection. Primary key detection is skipped for tables in the hive_metastore catalog since it does not support primary key constraints.

Diffing support

When a stream has a single-column primary key defined, Extract can compute changes between runs using sorted diffing. The results include a _diff_result field indicating whether each row was added, modified, or removed.

Warehouse availability

If your SQL warehouse is stopped, Databricks may start it automatically when Extract runs a query. This can add latency to the first sync. To avoid delays, ensure the warehouse is running or configure auto-start policies.

Result streaming

Extract uses the Databricks SQL Statement Execution API with Arrow format and external links for efficient streaming of large result sets. Results are fetched in chunks and processed incrementally.

Troubleshooting

  • Authentication failed: Verify the workspace URL and access token are correct. Ensure the token has not expired.
  • Permission denied: Ensure the Databricks identity has warehouse access (CAN USE) plus USE CATALOG, USE SCHEMA, and SELECT permissions on the referenced objects.
  • Table not found: Verify the catalog, schema, and table names are correct. Names are case-sensitive in Unity Catalog.
  • No tables discovered: Ensure the identity has permissions to list catalogs, schemas, and tables. Check that at least one accessible table exists outside of system and information_schema.
  • Incremental sync errors: Ensure the incremental timestamp field exists in the table and is a supported timestamp type (DATE, TIMESTAMP, TIMESTAMP_LTZ, or TIMESTAMP_NTZ).
  • Warehouse timeout: If syncs time out waiting for query completion, ensure the SQL warehouse is running and appropriately sized for your workload.
  • Network connectivity issues: Confirm firewall rules allow outbound access from Extract to your Databricks workspace URL.