Databricks

Overview

Extract connects to Databricks as a source by reading data from tables in your Databricks workspace. The connector uses the Databricks SQL Statement Execution API to query tables through a SQL warehouse and stream results as datasets.

You can configure the connector in two modes:

Table Discovery Mode: When no datasets are configured, Extract automatically discovers all accessible catalogs, schemas, and tables in your workspace and creates a stream for each table.
Explicit Datasets Mode: Configure specific tables to extract by providing catalog, schema, and table for each dataset. You can also optionally provide a primary_key and an incremental_timestamp_field per dataset.

Queries can run in Full Refresh mode or Incremental Changes mode when you specify an incremental timestamp column.

Prerequisites

A Databricks workspace with a SQL warehouse.
A Databricks personal access token (PAT), Databricks user OAuth app connection, or Databricks service principal OAuth secret for authentication.
Permissions to connect to the SQL warehouse and query the catalogs, schemas, and tables you want to extract.
Network access from Extract to your Databricks workspace. If your workspace restricts inbound traffic, allowlist the required Extract IPs or use approved private networking.

Authentication

Databricks authentication supports three options:

Access Token: a Databricks personal access token (PAT).
OAuth (User): a Databricks app connection that issues OAuth tokens for SQL access after a user signs in.
OAuth M2M / Service Principal: a Databricks service principal OAuth secret for machine-to-machine authentication.

Extract determines which method to use via auth_type:

Set auth_type: access_token to use a PAT (requires access_token).
Set auth_type: oauth to use OAuth (requires client_id, client_secret, and the saved OAuth token).
Set auth_type: oauth_m2m to use service principal OAuth M2M (requires m2m_client_id and m2m_client_secret).
If auth_type is not set, Extract defaults to:
- access_token when access_token is provided
- otherwise oauth

You’ll also need:

Your Databricks workspace URL (for example, https://<your-workspace>.cloud.databricks.com) as workspace_url.
A Databricks SQL warehouse ID as warehouse_id.

Extract sends the resulting token as a Bearer token to the Databricks SQL Statements API and validates credentials by executing a simple test query during login.

Setup Guide

In your Databricks workspace:

Go to SQL Warehouses in the sidebar.
Select the warehouse you want Extract to use.
Click on the Connection details tab.
Copy the following values:
- Server Hostname (e.g., adb-1234567890123456.7.azuredatabricks.net)
- HTTP Path (e.g., /sql/1.0/warehouses/abc123def456)

The Workspace URL for the connector is https://<Server Hostname>.

The Warehouse ID is the last segment of the HTTP Path (e.g., abc123def456).

Extract supports Access Token (PAT), OAuth (User), and OAuth M2M / Service Principal authentication.

If you don’t explicitly set an authentication type:

If you provide an Access Token, Extract uses Access Token authentication.
Otherwise, Extract uses OAuth authentication.

Access Token (PAT)

In Databricks, click your username in the top right and select User Settings.
Go to the Developer tab (or Access Tokens in older workspaces).
Click Manage next to Access tokens, then Generate new token.
Give the token a description (e.g., "Extract connector") and set an expiration.
Copy the token immediately — you won't be able to see it again.

For automated production workloads, consider creating a service principal and generating a PAT for that identity.

OAuth (User)

To use OAuth, create a Databricks app connection:

Go to Databricks account settings.
Open App connections.
Click Add connection.
Choose SQL for the Access Scopes.
Set the redirect URL to https://api.extract.to/oauth/callback/databricks.
Copy the client ID and client secret into Extract, then sign in with Databricks.

OAuth M2M / Service Principal

To use OAuth M2M, create an OAuth secret for a Databricks service principal:

In Databricks, go to Settings.
Open Identity and access.
Click Add service principal.
Add an existing service principal or create a new one.
Open the service principal and go to Secrets.
Click Generate secret.
Set Lifetime to 730 days, or your organization's preferred maximum.
Copy the generated Client ID and Client Secret.
In Extract, choose Authentication Type: OAuth M2M / Service Principal.
Paste the values into Service Principal Client ID and Service Principal Client Secret.
In Databricks, go to SQL Warehouses.
Select the warehouse Extract will use.
Open Permissions.
Add the service principal from the previous steps.
Grant Can use.

Also make sure the service principal has access to the Databricks workspace, SQL warehouse, and the required Unity Catalog objects.

Step 3 - Grant warehouse and data permissions

The selected identity must have:

CAN USE permission on the SQL warehouse
USE CATALOG on the catalogs you want to read
USE SCHEMA on the schemas you want to read
SELECT on the tables you want to extract

For Unity Catalog environments, grant permissions like:

In the Databricks UI:

Go to Catalog.
Select the catalog Extract should read from.
Open Permissions.
Click Grant.
Select the required scopes, such as USE CATALOG, USE SCHEMA, and SELECT.
Select the service principal from the OAuth M2M setup.
Save the grant.

Or run SQL grants:

GRANT USE CATALOG ON CATALOG <catalog> TO `<user_or_principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `<user_or_principal>`;
GRANT SELECT ON TABLE <catalog>.<schema>.<table> TO `<user_or_principal>`;

Repeat for each catalog, schema, and table that Extract needs to read. You can also grant broader access at the catalog or schema level if that matches your security model.

Step 4 - Configure the connector in Extract

Fill in the connection fields:

Workspace URL — Your Databricks workspace URL (e.g., https://adb-1234567890123456.7.azuredatabricks.net)
Authentication Type — Choose Access Token, OAuth (User), or OAuth M2M / Service Principal (optional; see defaults in Step 2)
Access Token — The PAT created in Step 2, when using Access Token authentication
OAuth Client ID, OAuth Client Secret, and Authentication — The Databricks app connection credentials and sign-in, when using OAuth (User)
Service Principal Client ID and Service Principal Client Secret — The service principal OAuth secret values, when using OAuth M2M / Service Principal
Warehouse ID — The SQL warehouse ID from the HTTP Path

Optionally, configure Datasets to restrict what Extract syncs. If you leave Datasets empty, Extract will discover available tables automatically.

If you provide Datasets, each entry should include:

catalog
schema
table
(optional) primary_key
(optional) incremental_timestamp_field

Save the source and run a test sync.

Configuration Parameters

Parameter	Type	Required	Description
`workspace_url`	string	Yes	Your Databricks workspace URL (e.g., `https://adb-123.4.azuredatabricks.net`)
`auth_type`	option	No	Choose `access_token`, `oauth`, or `oauth_m2m`. If not set, the connector uses `access_token` when `access_token` is provided; otherwise it defaults to `oauth`.
`access_token`	password	Required for `access_token`	Personal access token for authentication.
`client_id`	string	Required for `oauth`	Client ID from the Databricks app connection.
`client_secret`	password	Required for `oauth`	Client secret from the Databricks app connection.
`token`	OAuth	Required for `oauth`	Databricks OAuth sign-in token.
`m2m_client_id`	string	Required for `oauth_m2m`	Client ID from the Databricks service principal OAuth secret.
`m2m_client_secret`	password	Required for `oauth_m2m`	Client secret from the Databricks service principal OAuth secret.
`warehouse_id`	string	Yes	SQL warehouse ID to use for query execution
`datasets`	array	No	List of specific tables to extract. If empty, Extract discovers all accessible tables.

When configuring explicit datasets, each entry defines one stream:

Parameter	Type	Required	Description
`catalog`	string	Yes	Unity Catalog catalog name
`schema`	string	Yes	Schema name within the catalog
`table`	string	Yes	Table name to extract
`primary_key`	string	No	Column used for deduplication and diffing
`incremental_timestamp_field`	string	No	Column used for incremental syncs

Table Discovery Mode

When no datasets are configured, Extract automatically discovers tables by:

Listing all accessible catalogs (excluding the system catalog)
Listing all schemas in each catalog (excluding information_schema)
Listing all tables in each schema (excluding temporary tables)
Introspecting each table's columns and primary keys

Each discovered table becomes a stream named <catalog>__<schema>__<table>.

Extract also detects primary keys from Unity Catalog metadata (except for tables in hive_metastore, which does not support primary key constraints) and identifies candidate incremental timestamp columns based on common naming patterns such as updated_at, last_modified, modified_at, last_updated, modified_on, last_altered, _fivetran_synced, _airbyte_extracted_at, and _airbyte_emitted_at.

Extract Modes

Full Refresh

Extract runs SELECT * FROM <catalog>.<schema>.<table> and streams all rows.

If diffing is enabled and a primary key is defined, Extract orders results by the primary key and computes changes between runs.

Incremental Changes

If you set an Incremental Timestamp Field, Extract filters results by:

SELECT * FROM <catalog>.<schema>.<table> WHERE <timestamp_field> > <last_cursor_value>

On the first run (no cursor yet), Extract fetches all rows. The cursor updates to the maximum timestamp seen in each run.

Supported incremental timestamp types

The incremental timestamp field must be one of these Databricks types:

DATE
TIMESTAMP
TIMESTAMP_LTZ
TIMESTAMP_NTZ
TIMESTAMP WITHOUT TIME ZONE (alias for TIMESTAMP_NTZ)

For predictable incremental behavior, prefer a UTC-normalized timestamp column whose values increase monotonically as rows are inserted or updated.

Data Type Mapping

Extract maps Databricks types to schema field types:

Databricks Type	Schema Type
`TINYINT`, `SMALLINT`, `INT`, `INTEGER`, `BIGINT`, `BYTE`, `SHORT`, `LONG`	integer
`FLOAT`, `DOUBLE`, `REAL`, `NUMERIC`, `DECIMAL(...)`	number
`BOOLEAN`	boolean
`DATE`	string (format: date)
`TIMESTAMP`, `TIMESTAMP_LTZ`	string (format: date-time)
`TIMESTAMP_NTZ`	string (format: date-time, no timezone)
`STRING`, `VARCHAR`, `CHAR`, `BINARY`, `VOID`	string
`INTERVAL`, `GEOGRAPHY`, `GEOMETRY`	string
`ARRAY<...>`	array
`MAP<...>`, `STRUCT<...>`, `VARIANT`, `OBJECT`	json

Complex nested types (arrays of structs, maps, etc.) are preserved in their JSON representation.

Data Streams

This connector produces one stream per selected Databricks table.

Streams are named using the fully-qualified table name:

{catalog}.{schema}.{table}

Stream selection (datasets)

You can either:

Let the connector discover tables (when datasets is not provided or is empty), or
Explicitly define streams via datasets.

Each datasets entry maps to a single stream and supports:

catalog (required)
schema (required)
table (required)
primary_key (optional): a comma-separated list of column names to use as the stream’s primary key
incremental_timestamp_field (optional): a column name to use for incremental extraction

If multiple datasets entries would produce the same stream name ({catalog}.{schema}.{table}), the connector will error.

System schemas excluded from discovery

When using discovery, the connector skips system catalogs/schemas:

system
information_schema

Extract metadata field

Each record includes an additional field:

_singular_extract_timestamp: the UTC timestamp when the record was extracted.

Additional Information

Stream names follow the pattern <catalog>__<schema>__<table> using double underscores as separators. This applies to both discovered tables and explicitly configured datasets.

Extract timestamp field

Extract automatically adds a _singular_extract_timestamp field to every stream, recording when each row was extracted.

Primary key detection

In table discovery mode, Extract queries information_schema.table_constraints and information_schema.key_column_usage to detect primary key columns. Tables with a single-column primary key support sorted diffing for change detection. Primary key detection is skipped for tables in the hive_metastore catalog since it does not support primary key constraints.

Diffing support

When a stream has a single-column primary key defined, Extract can compute changes between runs using sorted diffing. The results include a _diff_result field indicating whether each row was added, modified, or removed.

Warehouse availability

If your SQL warehouse is stopped, Databricks may start it automatically when Extract runs a query. This can add latency to the first sync. To avoid delays, ensure the warehouse is running or configure auto-start policies.

Result streaming

Extract uses the Databricks SQL Statement Execution API with Arrow format and external links for efficient streaming of large result sets. Results are fetched in chunks and processed incrementally.

Troubleshooting

Authentication failed: Verify the workspace URL and access token are correct. Ensure the token has not expired.
Permission denied: Ensure the Databricks identity has warehouse access (CAN USE) plus USE CATALOG, USE SCHEMA, and SELECT permissions on the referenced objects.
Table not found: Verify the catalog, schema, and table names are correct. Names are case-sensitive in Unity Catalog.
No tables discovered: Ensure the identity has permissions to list catalogs, schemas, and tables. Check that at least one accessible table exists outside of system and information_schema.
Incremental sync errors: Ensure the incremental timestamp field exists in the table and is a supported timestamp type (DATE, TIMESTAMP, TIMESTAMP_LTZ, or TIMESTAMP_NTZ).
Warehouse timeout: If syncs time out waiting for query completion, ensure the SQL warehouse is running and appropriately sized for your workload.
Network connectivity issues: Confirm firewall rules allow outbound access from Extract to your Databricks workspace URL.

Overview​

Prerequisites​

Authentication​

Setup Guide​

Access Token (PAT)​

OAuth (User)​

OAuth M2M / Service Principal​

Step 3 - Grant warehouse and data permissions​

Step 4 - Configure the connector in Extract​

Configuration Parameters​

Table Discovery Mode​

Extract Modes​

Full Refresh​

Incremental Changes​

Supported incremental timestamp types​

Data Type Mapping​

Data Streams​

Stream selection (datasets)​

System schemas excluded from discovery​

Extract metadata field​

Additional Information​

Extract timestamp field​

Primary key detection​

Diffing support​

Warehouse availability​

Result streaming​

Troubleshooting​