Databricks
Overview
Extract connects to Databricks as a source by reading data from tables in your Databricks workspace. The connector uses the Databricks SQL Statement Execution API to query tables through a SQL warehouse and stream results as datasets.
You can configure the connector in two modes:
- Table Discovery Mode: When no datasets are configured, Extract automatically discovers all accessible catalogs, schemas, and tables in your workspace and creates a stream for each table.
- Explicit Datasets Mode: Configure specific tables to extract by providing
catalog,schema, andtablefor each dataset. You can also optionally provide aprimary_keyand anincremental_timestamp_fieldper dataset.
Queries can run in Full Refresh mode or Incremental Changes mode when you specify an incremental timestamp column.
Prerequisites
- A Databricks workspace with a SQL warehouse.
- A Databricks Personal Access Token (PAT) for authentication.
- Permissions to connect to the SQL warehouse and query the catalogs, schemas, and tables you want to extract.
- Network access from Extract to your Databricks workspace. If your workspace restricts inbound traffic, allowlist the required Extract IPs or use approved private networking.
Authentication
Databricks authentication uses a Personal Access Token (PAT) provided as the access_token. For production workloads, Databricks recommends using tokens owned by a service principal rather than a personal user account.
You’ll also need:
- Your Databricks workspace URL (for example,
https://<your-workspace>.cloud.databricks.com) asworkspace_url. - A Databricks SQL warehouse ID as
warehouse_id.
The connector sends the token as a Bearer token to the Databricks SQL Statements API and validates credentials by executing a simple test query during login.
Setup Guide
In your Databricks workspace:
- Go to SQL Warehouses in the sidebar.
- Select the warehouse you want Extract to use.
- Click on the Connection details tab.
- Copy the following values:
- Server Hostname (e.g.,
adb-1234567890123456.7.azuredatabricks.net) - HTTP Path (e.g.,
/sql/1.0/warehouses/abc123def456)
- Server Hostname (e.g.,
The Workspace URL for the connector is https://<Server Hostname>.
The Warehouse ID is the last segment of the HTTP Path (e.g., abc123def456).
Step 2 - Create a Personal Access Token
- In Databricks, click your username in the top right and select User Settings.
- Go to the Developer tab (or Access Tokens in older workspaces).
- Click Manage next to Access tokens, then Generate new token.
- Give the token a description (e.g., "Extract connector") and set an expiration.
- Copy the token immediately — you won't be able to see it again.
For automated production workloads, consider creating a service principal and generating a PAT for that identity.
Step 3 - Grant warehouse and data permissions
The identity associated with your PAT must have:
- CAN USE permission on the SQL warehouse
- USE CATALOG on the catalogs you want to read
- USE SCHEMA on the schemas you want to read
- SELECT on the tables you want to extract
For Unity Catalog environments, grant permissions like:
GRANT USE CATALOG ON CATALOG <catalog> TO `<user_or_principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog>.<schema> TO `<user_or_principal>`;
GRANT SELECT ON TABLE <catalog>.<schema>.<table> TO `<user_or_principal>`;
Repeat for each catalog, schema, and table that Extract needs to read. You can also grant broader access at the catalog or schema level if that matches your security model.
Step 4 - Configure the connector in Extract
Fill in the connection fields:
- Workspace URL — Your Databricks workspace URL (e.g.,
https://adb-1234567890123456.7.azuredatabricks.net) - Access Token — The Personal Access Token created in Step 2
- Warehouse ID — The SQL warehouse ID from the HTTP Path
Optionally, configure Datasets to restrict what Extract syncs. If you leave Datasets empty, Extract will discover available tables automatically.
If you provide Datasets, each entry should include:
catalogschematable- (optional)
primary_key - (optional)
incremental_timestamp_field
Save the source and run a test sync.
Configuration Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
workspace_url | string | Yes | Your Databricks workspace URL (e.g., https://adb-123.4.azuredatabricks.net) |
access_token | string | Yes | Personal Access Token for authentication |
warehouse_id | string | Yes | SQL warehouse ID to use for query execution |
datasets | array | No | List of specific tables to extract. If empty, Extract discovers all accessible tables. |
Datasets
When configuring explicit datasets, each entry defines one stream:
| Parameter | Type | Required | Description |
|---|---|---|---|
catalog | string | Yes | Unity Catalog catalog name |
schema | string | Yes | Schema name within the catalog |
table | string | Yes | Table name to extract |
primary_key | string | No | Column used for deduplication and diffing |
incremental_timestamp_field | string | No | Column used for incremental syncs |
Table Discovery Mode
When no datasets are configured, Extract automatically discovers tables by:
- Listing all accessible catalogs (excluding the
systemcatalog) - Listing all schemas in each catalog (excluding
information_schema) - Listing all tables in each schema (excluding temporary tables)
- Introspecting each table's columns and primary keys
Each discovered table becomes a stream named <catalog>__<schema>__<table>.
Extract also detects primary keys from Unity Catalog metadata (except for tables in hive_metastore, which does not support primary key constraints) and identifies candidate incremental timestamp columns based on common naming patterns such as updated_at, last_modified, modified_at, last_updated, modified_on, last_altered, _fivetran_synced, _airbyte_extracted_at, and _airbyte_emitted_at.
Extract Modes
Full Refresh
Extract runs SELECT * FROM <catalog>.<schema>.<table> and streams all rows.
If diffing is enabled and a primary key is defined, Extract orders results by the primary key and computes changes between runs.
Incremental Changes
If you set an Incremental Timestamp Field, Extract filters results by:
SELECT * FROM <catalog>.<schema>.<table> WHERE <timestamp_field> > <last_cursor_value>
On the first run (no cursor yet), Extract fetches all rows. The cursor updates to the maximum timestamp seen in each run.
Supported incremental timestamp types
The incremental timestamp field must be one of these Databricks types:
DATETIMESTAMPTIMESTAMP_LTZTIMESTAMP_NTZTIMESTAMP WITHOUT TIME ZONE(alias forTIMESTAMP_NTZ)
For predictable incremental behavior, prefer a UTC-normalized timestamp column whose values increase monotonically as rows are inserted or updated.
Data Type Mapping
Extract maps Databricks types to schema field types:
| Databricks Type | Schema Type |
|---|---|
TINYINT, SMALLINT, INT, INTEGER, BIGINT, BYTE, SHORT, LONG | integer |
FLOAT, DOUBLE, REAL, NUMERIC, DECIMAL(...) | number |
BOOLEAN | boolean |
DATE | string (format: date) |
TIMESTAMP, TIMESTAMP_LTZ | string (format: date-time) |
TIMESTAMP_NTZ | string (format: date-time, no timezone) |
STRING, VARCHAR, CHAR, BINARY, VOID | string |
INTERVAL, GEOGRAPHY, GEOMETRY | string |
ARRAY<...> | array |
MAP<...>, STRUCT<...>, VARIANT, OBJECT | json |
Complex nested types (arrays of structs, maps, etc.) are preserved in their JSON representation.
Data Streams
This connector produces one stream per selected Databricks table.
Streams are named using the fully-qualified table name:
{catalog}.{schema}.{table}
Stream selection (datasets)
You can either:
- Let the connector discover tables (when
datasetsis not provided or is empty), or - Explicitly define streams via
datasets.
Each datasets entry maps to a single stream and supports:
catalog(required)schema(required)table(required)primary_key(optional): a comma-separated list of column names to use as the stream’s primary keyincremental_timestamp_field(optional): a column name to use for incremental extraction
If multiple datasets entries would produce the same stream name ({catalog}.{schema}.{table}), the connector will error.
System schemas excluded from discovery
When using discovery, the connector skips system catalogs/schemas:
systeminformation_schema
Extract metadata field
Each record includes an additional field:
_singular_extract_timestamp: the UTC timestamp when the record was extracted.
Additional Information
Stream naming
Stream names follow the pattern <catalog>__<schema>__<table> using double underscores as separators. This applies to both discovered tables and explicitly configured datasets.
Extract timestamp field
Extract automatically adds a _singular_extract_timestamp field to every stream, recording when each row was extracted.
Primary key detection
In table discovery mode, Extract queries information_schema.table_constraints and information_schema.key_column_usage to detect primary key columns. Tables with a single-column primary key support sorted diffing for change detection. Primary key detection is skipped for tables in the hive_metastore catalog since it does not support primary key constraints.
Diffing support
When a stream has a single-column primary key defined, Extract can compute changes between runs using sorted diffing. The results include a _diff_result field indicating whether each row was added, modified, or removed.
Warehouse availability
If your SQL warehouse is stopped, Databricks may start it automatically when Extract runs a query. This can add latency to the first sync. To avoid delays, ensure the warehouse is running or configure auto-start policies.
Result streaming
Extract uses the Databricks SQL Statement Execution API with Arrow format and external links for efficient streaming of large result sets. Results are fetched in chunks and processed incrementally.
Troubleshooting
- Authentication failed: Verify the workspace URL and access token are correct. Ensure the token has not expired.
- Permission denied: Ensure the Databricks identity has warehouse access (
CAN USE) plusUSE CATALOG,USE SCHEMA, andSELECTpermissions on the referenced objects. - Table not found: Verify the catalog, schema, and table names are correct. Names are case-sensitive in Unity Catalog.
- No tables discovered: Ensure the identity has permissions to list catalogs, schemas, and tables. Check that at least one accessible table exists outside of
systemandinformation_schema. - Incremental sync errors: Ensure the incremental timestamp field exists in the table and is a supported timestamp type (
DATE,TIMESTAMP,TIMESTAMP_LTZ, orTIMESTAMP_NTZ). - Warehouse timeout: If syncs time out waiting for query completion, ensure the SQL warehouse is running and appropriately sized for your workload.
- Network connectivity issues: Confirm firewall rules allow outbound access from Extract to your Databricks workspace URL.