ClickHouse
The ClickHouse connector can extract data from and load data into a ClickHouse database over HTTP(S).
- As a destination, it creates tables as needed and supports different extraction modes that determine how tables are created and how rows are written.
- As a source, it extracts data from one or more configured SQL datasets and supports full refresh and incremental extraction.
Prerequisites
- A reachable ClickHouse server (self-hosted or managed).
- Network access from the platform to your ClickHouse HTTP(S) endpoint.
- A ClickHouse user with permissions appropriate to your use case:
- Destination: create databases/tables (if you want the connector to create them) and insert/select.
- Source: select permissions on the referenced database(s)/table(s).
Authentication
The connector authenticates with:
- Host and Port for the ClickHouse HTTP interface
- Username and Password (optional; defaults to empty)
- Optional TLS (HTTPS)
- Optional Custom root certificate (PEM) for private CAs / self-signed certs
- Optional SSH tunnel (to reach a private ClickHouse endpoint)
Setup
- Identify your ClickHouse HTTP endpoint:
- Default HTTP port is commonly
8123(varies by deployment).
- Default HTTP port is commonly
- Ensure the user has appropriate permissions on the target database (destination) or source objects (source).
- If using TLS with a private CA, obtain the CA certificate in PEM format.
- If your ClickHouse instance is only reachable via a bastion/private network, configure an SSH tunnel.
Configuration
Common connection fields (source and destination)
| Field | Type | Required | Description |
|---|---|---|---|
host | string | ✅ | ClickHouse host name or IP address. |
port | number | ✅ | ClickHouse HTTP(S) port. |
tls | boolean | ✅ | If true, the connector uses https://. If false, it uses http://. |
username | string | ❌ | ClickHouse username. Defaults to empty. |
password | string | ❌ | ClickHouse password. Defaults to empty. |
database | string | ❌ | Database name. For destination, used as the target database to create/use tables in (if set). For source, used as the default database context (if set). |
custom_root_cert | string (PEM) | ❌ | PEM-encoded root certificate to trust when connecting via TLS (useful for private CAs). |
SSH tunnel (optional)
When enabled, the connector opens an SSH tunnel to the ClickHouse host/port via an SSH server (bastion).
| Field | Type | Required | Description |
|---|---|---|---|
use_ssh_tunnel | boolean | ✅ | If true, connect to ClickHouse through an SSH tunnel. |
ssh_host | string | ✅* | SSH server host. Required when use_ssh_tunnel=true. |
ssh_port | number | ✅* | SSH server port (typically 22). Required when use_ssh_tunnel=true. |
ssh_user | string | ✅* | SSH username. Required when use_ssh_tunnel=true. |
ssh_key.private_key | string | ✅* | Private key contents used for SSH authentication. Required when use_ssh_tunnel=true. |
* Required only when use_ssh_tunnel=true.
Destination-only fields
The destination writes records by first staging them to S3 and then loading into ClickHouse.
| Field | Type | Required | Description |
|---|---|---|---|
s3_bucket | string | ✅ | S3 bucket used for staging files before loading into ClickHouse. |
access_key_id | string | ✅ | AWS access key ID for the staging bucket. |
secret_access_key | string | ✅ | AWS secret access key for the staging bucket. |
table_prefix | string | ❌ | Prefix applied to all created destination tables. Defaults to empty. |
Source-only fields
The source extracts from one or more configured datasets. Each dataset is defined by a name and a SQL query.
| Field | Type | Required | Description |
|---|---|---|---|
datasets | array | ✅ | List of datasets to extract. Each dataset becomes a stream. |
Each dataset supports:
| Field | Type | Required | Description |
|---|---|---|---|
name | string | ✅ | Dataset name. Used as the stream name. |
sql | string | ✅ | SQL query to execute for this dataset. |
primary_key | string | ❌ | Comma-separated list of primary key column(s) (for example: "id" or "account_id, id"). Required for incremental extraction. |
incremental_timestamp_field | string | ❌ | Timestamp column used as the incremental cursor. Required for incremental extraction. |
Destination behavior
Table creation and schema behavior
The connector creates destination tables with ClickHouse MergeTree-family engines depending on the extraction mode.
Nullability rules
When creating tables, columns are Nullable by default. However, some columns must be non-nullable because ClickHouse requires certain expressions/keys to be non-nullable:
- Columns used in
PARTITION BY - Columns used in
ORDER BY - The versioning column used by
ReplacingMergeTree(...)(when applicable)
The connector automatically marks the following columns as non-nullable:
- Partition extraction: all partition keys
- Incremental changes extraction:
- all primary key columns, and
- the configured
timestamp_field(if it is not already part of the primary key)
- Full refresh: no special non-nullable columns are enforced (all columns default to Nullable)
Engine selection by extraction mode
-
Full refresh
- Engine:
MergeTree - Ordering:
ORDER BY tuple()
- Engine:
-
Partition
- Engine:
MergeTree - Partitioning:
PARTITION BY (...)- If a partition key is typed/formatted as
dateordate-time, the connector partitions bytoYYYYMMDD(<key>).
- If a partition key is typed/formatted as
- Ordering:
ORDER BY (<partition keys>)
- Engine:
-
Incremental changes
- Requires a primary key to be defined for the stream.
- Engine (Upsert load mode):
ReplacingMergeTree(<timestamp_field>) - Ordering:
ORDER BY (<primary key columns>)
Load modes
-
Upsert
- Supported for Incremental changes extraction.
- Implemented using
ReplacingMergeTree(timestamp_field)withORDER BY (primary_key...).
-
History
- Not supported for ClickHouse in this connector.
Source behavior
Dataset-based extraction
Each configured dataset is extracted by executing its sql query. The result set is emitted as records for the dataset’s stream.
Incremental extraction
For incremental extraction, configure both:
primary_key(one or more columns), andincremental_timestamp_field(a timestamp column)
The connector uses the timestamp field as the cursor to fetch new/updated rows over time. Ensure your dataset SQL includes the primary key and timestamp field in the selected columns.
Streams
Destination streams
Streams correspond to the entities/tables you configure in your pipeline. For each stream, the connector:
- Creates a table if it does not exist.
- Applies the engine and key strategy based on the stream’s extraction mode and primary key configuration.
- Loads records into the destination table.
Primary key requirements (destination)
- Incremental changes extraction requires a primary key. If a stream is configured for incremental changes without a primary key, the connector will fail the run.
Partition key typing (destination)
For partition extraction, partition keys that are typed/formatted as date or date-time are partitioned using toYYYYMMDD(key) to produce stable daily partitions. Other types are used as-is.
Source streams
Each entry in datasets becomes a stream:
- Stream name:
datasets[i].name - Stream schema: derived from the dataset query result columns
- Incremental behavior: enabled when