Skip to main content

ClickHouse

The ClickHouse connector can extract data from and load data into a ClickHouse database over HTTP(S).

  • As a destination, it creates tables as needed and supports different extraction modes that determine how tables are created and how rows are written.
  • As a source, it extracts data from one or more configured SQL datasets and supports full refresh and incremental extraction.

Prerequisites

  • A reachable ClickHouse server (self-hosted or managed).
  • Network access from the platform to your ClickHouse HTTP(S) endpoint.
  • A ClickHouse user with permissions appropriate to your use case:
    • Destination: create databases/tables (if you want the connector to create them) and insert/select.
    • Source: select permissions on the referenced database(s)/table(s).

Authentication

The connector authenticates with:

  • Host and Port for the ClickHouse HTTP interface
  • Username and Password (optional; defaults to empty)
  • Optional TLS (HTTPS)
  • Optional Custom root certificate (PEM) for private CAs / self-signed certs
  • Optional SSH tunnel (to reach a private ClickHouse endpoint)

Setup

  1. Identify your ClickHouse HTTP endpoint:
    • Default HTTP port is commonly 8123 (varies by deployment).
  2. Ensure the user has appropriate permissions on the target database (destination) or source objects (source).
  3. If using TLS with a private CA, obtain the CA certificate in PEM format.
  4. If your ClickHouse instance is only reachable via a bastion/private network, configure an SSH tunnel.

Configuration

Common connection fields (source and destination)

FieldTypeRequiredDescription
hoststringClickHouse host name or IP address.
portnumberClickHouse HTTP(S) port.
tlsbooleanIf true, the connector uses https://. If false, it uses http://.
usernamestringClickHouse username. Defaults to empty.
passwordstringClickHouse password. Defaults to empty.
databasestringDatabase name. For destination, used as the target database to create/use tables in (if set). For source, used as the default database context (if set).
custom_root_certstring (PEM)PEM-encoded root certificate to trust when connecting via TLS (useful for private CAs).

SSH tunnel (optional)

When enabled, the connector opens an SSH tunnel to the ClickHouse host/port via an SSH server (bastion).

FieldTypeRequiredDescription
use_ssh_tunnelbooleanIf true, connect to ClickHouse through an SSH tunnel.
ssh_hoststring✅*SSH server host. Required when use_ssh_tunnel=true.
ssh_portnumber✅*SSH server port (typically 22). Required when use_ssh_tunnel=true.
ssh_userstring✅*SSH username. Required when use_ssh_tunnel=true.
ssh_key.private_keystring✅*Private key contents used for SSH authentication. Required when use_ssh_tunnel=true.

* Required only when use_ssh_tunnel=true.

Destination-only fields

The destination writes records by first staging them to S3 and then loading into ClickHouse.

FieldTypeRequiredDescription
s3_bucketstringS3 bucket used for staging files before loading into ClickHouse.
access_key_idstringAWS access key ID for the staging bucket.
secret_access_keystringAWS secret access key for the staging bucket.
table_prefixstringPrefix applied to all created destination tables. Defaults to empty.

Source-only fields

The source extracts from one or more configured datasets. Each dataset is defined by a name and a SQL query.

FieldTypeRequiredDescription
datasetsarrayList of datasets to extract. Each dataset becomes a stream.

Each dataset supports:

FieldTypeRequiredDescription
namestringDataset name. Used as the stream name.
sqlstringSQL query to execute for this dataset.
primary_keystringComma-separated list of primary key column(s) (for example: "id" or "account_id, id"). Required for incremental extraction.
incremental_timestamp_fieldstringTimestamp column used as the incremental cursor. Required for incremental extraction.

Destination behavior

Table creation and schema behavior

The connector creates destination tables with ClickHouse MergeTree-family engines depending on the extraction mode.

Nullability rules

When creating tables, columns are Nullable by default. However, some columns must be non-nullable because ClickHouse requires certain expressions/keys to be non-nullable:

  • Columns used in PARTITION BY
  • Columns used in ORDER BY
  • The versioning column used by ReplacingMergeTree(...) (when applicable)

The connector automatically marks the following columns as non-nullable:

  • Partition extraction: all partition keys
  • Incremental changes extraction:
    • all primary key columns, and
    • the configured timestamp_field (if it is not already part of the primary key)
  • Full refresh: no special non-nullable columns are enforced (all columns default to Nullable)

Engine selection by extraction mode

  • Full refresh

    • Engine: MergeTree
    • Ordering: ORDER BY tuple()
  • Partition

    • Engine: MergeTree
    • Partitioning: PARTITION BY (...)
      • If a partition key is typed/formatted as date or date-time, the connector partitions by toYYYYMMDD(<key>).
    • Ordering: ORDER BY (<partition keys>)
  • Incremental changes

    • Requires a primary key to be defined for the stream.
    • Engine (Upsert load mode): ReplacingMergeTree(<timestamp_field>)
    • Ordering: ORDER BY (<primary key columns>)

Load modes

  • Upsert

    • Supported for Incremental changes extraction.
    • Implemented using ReplacingMergeTree(timestamp_field) with ORDER BY (primary_key...).
  • History

    • Not supported for ClickHouse in this connector.

Source behavior

Dataset-based extraction

Each configured dataset is extracted by executing its sql query. The result set is emitted as records for the dataset’s stream.

Incremental extraction

For incremental extraction, configure both:

  • primary_key (one or more columns), and
  • incremental_timestamp_field (a timestamp column)

The connector uses the timestamp field as the cursor to fetch new/updated rows over time. Ensure your dataset SQL includes the primary key and timestamp field in the selected columns.

Streams

Destination streams

Streams correspond to the entities/tables you configure in your pipeline. For each stream, the connector:

  • Creates a table if it does not exist.
  • Applies the engine and key strategy based on the stream’s extraction mode and primary key configuration.
  • Loads records into the destination table.

Primary key requirements (destination)

  • Incremental changes extraction requires a primary key. If a stream is configured for incremental changes without a primary key, the connector will fail the run.

Partition key typing (destination)

For partition extraction, partition keys that are typed/formatted as date or date-time are partitioned using toYYYYMMDD(key) to produce stable daily partitions. Other types are used as-is.

Source streams

Each entry in datasets becomes a stream:

  • Stream name: datasets[i].name
  • Stream schema: derived from the dataset query result columns
  • Incremental behavior: enabled when