
Azure Blob Storage


Destination Documentation: Azure Blob Storage Documentation

High-Level Information:

Azure Blob Storage is Microsoft's object storage solution for the cloud, optimized for storing massive amounts of unstructured data. Extract integrates with Azure Blob Storage using account key or SAS token authentication, enabling direct data loading to blob containers. The connector supports multiple file formats including JSONL, CSV, and Parquet, with flexible path templating for organized data storage. Data is written using Azure's append blob mechanism with efficient chunk-based uploads. The integration is ideal for data lake architectures, archival storage, and as a staging area for other Azure services.

Prerequisites

  1. An Azure account with an active subscription
  2. An Azure Storage account with Blob Storage enabled
  3. Either an Account Key or SAS token for authentication

Setup Guide

Step 1 - Create an Azure Storage Account (skip if you already have one)

  1. Sign in to the Azure Portal

  2. Click Create a resource and search for Storage account

  3. Configure your storage account:

    • Resource group: Select existing or create new
    • Storage account name: Must be globally unique, 3-24 characters
    • Region: Select a region close to your data sources/consumers
    • Performance: Standard (recommended for most workloads)
    • Redundancy: Choose based on your durability requirements (LRS, ZRS, GRS, etc.)
  4. In the Advanced tab:

    • Ensure Blob storage is enabled
    • Configure security settings as per your requirements
  5. Review and create the storage account
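A quick local check can catch an invalid storage account name before you submit the form. The sketch below assumes Azure's documented naming rules (3-24 characters, lowercase letters and digits only); the function name is hypothetical, and global uniqueness can only be confirmed by Azure itself.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the naming rules only; uniqueness is verified by Azure at creation time."""
    return _ACCOUNT_NAME_RE.fullmatch(name) is not None
```

For example, is_valid_storage_account_name("extractdata01") passes, while a name containing uppercase letters or hyphens does not.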

Step 2 - Create a Blob Container

  1. Navigate to your Storage account in the Azure Portal

  2. In the left menu, under Data storage, click Containers

  3. Click + Container to create a new container

  4. Configure the container:

    • Name: Choose a descriptive name (e.g., extract-data)
    • Public access level: Set to Private (no anonymous access) for security
  5. Click Create
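Container names have stricter rules than the portal form suggests. As a minimal sketch, assuming Azure's documented container naming rules (3-63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit; no consecutive hyphens), a name can be validated like this. The function name is hypothetical.

```python
import re

# One or more lowercase alphanumeric groups separated by single hyphens.
_CONTAINER_NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_container_name(name: str) -> bool:
    """Return True if the name satisfies Azure's container naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME_RE.fullmatch(name) is not None
```

The example name from this step, extract-data, passes this check.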

Step 3 - Choose Your Authentication Method

Extract supports two authentication methods for Azure Blob Storage:

Option A: Account Key

  • Full access to the entire storage account
  • Simpler setup process
  • Permanent access (until key is regenerated)

Option B: SAS Token (Recommended)

  • More secure with fine-grained access control
  • Time-limited access with expiry dates
  • Specific permissions per service and resource type
  • Recommended for production environments

Step 4A - Get Your Account Key (if using Account Key authentication)

  1. In your Storage account, navigate to Access keys under Security + networking

  2. Click Show keys to reveal the account keys

  3. Copy either key1 or key2 (both work identically)

  4. Construct your connection string:

DefaultEndpointsProtocol=https;AccountName=YOUR_STORAGE_ACCOUNT_NAME;AccountKey=YOUR_ACCOUNT_KEY;EndpointSuffix=core.windows.net
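If you prefer to assemble the connection string programmatically rather than by hand, the following sketch builds the same four fields in the same order. The function name is hypothetical, and the values passed in are placeholders for your own account name and key.

```python
def build_account_key_connection_string(
    account_name: str,
    account_key: str,
    endpoint_suffix: str = "core.windows.net",
) -> str:
    """Join the four fields shown above into a semicolon-delimited connection string."""
    parts = {
        "DefaultEndpointsProtocol": "https",
        "AccountName": account_name,
        "AccountKey": account_key,
        "EndpointSuffix": endpoint_suffix,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


conn_str = build_account_key_connection_string(
    "YOUR_STORAGE_ACCOUNT_NAME", "YOUR_ACCOUNT_KEY"
)
```

Treat the resulting string as a secret: it grants full access to the storage account until the key is regenerated.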

Step 4B - Generate a SAS Token (if using SAS authentication)

  1. In your Storage account, navigate to Shared access signature under Security + networking

  2. Configure the SAS parameters:

    • Allowed services: Check all checkboxes (Blob, File, Queue, Table)
    • Allowed resource types: Check all checkboxes (Service, Container, Object) - these are not checked by default
    • Allowed permissions: Check all checkboxes (Read, Write, Delete, List, Add, Create, Update, Process, Immutable storage, Permanent delete, Filter)
    • Start and expiry date/time: Set the expiry date to a few years in the future for long-term access
    • Allowed IP addresses: (Optional) Add Extract's IPs if restricting access
  3. Click Generate SAS and connection string

  4. Copy the Connection string that includes the SAS token
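Because a SAS token stops working once it expires, it can help to read the expiry back out of the connection string you copied. The sketch below assumes the usual portal format, in which the token appears in a SharedAccessSignature field and its expiry in the se query parameter; the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs


def parse_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs, keeping any '=' inside the value."""
    fields = {}
    for part in conn_str.split(";"):
        if not part.strip():
            continue
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def sas_expiry(conn_str: str) -> Optional[datetime]:
    """Return the SAS expiry as a UTC datetime, or None if no SAS token is present."""
    sas = parse_connection_string(conn_str).get("SharedAccessSignature")
    if not sas:
        return None
    expiry_values = parse_qs(sas.lstrip("?")).get("se")
    if not expiry_values:
        return None
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    expiry = datetime.fromisoformat(expiry_values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry
```

Comparing the returned value with datetime.now(timezone.utc) shows how much time remains before the token must be regenerated and the destination updated.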

Step 5 - Configure Network Access (if required)

If your storage account has network restrictions:

  1. Navigate to Networking under Security + networking

  2. Under Firewall and virtual networks, add Extract's server IPs to the allowed list:

    • 3.134.124.160
    • 3.150.64.207
    • 44.232.26.19
    • 54.214.149.234
  3. Alternatively, configure private endpoints if using Azure Private Link

Step 6 - Configure the Connector in Extract

Navigate to the Destinations tab in Extract and add a new Azure Blob Storage destination with the following parameters:

  • Connection String - The full connection string from Step 4A or 4B
  • Container Name - The name of the blob container you created in Step 2
  • File Format - Choose between JSONL, CSV, or Parquet
  • Blob Path Template - Path template for organizing blobs (see the Path Templating section below)

Click Save and verify that the connection succeeds.

Configuration Parameters

  • Connection String - Your Azure Storage connection string containing authentication credentials. This can include either an Account Key or SAS token.

  • Container Name - The Azure Blob Storage container where data will be written. This container must exist before data loading begins.

  • File Format - The format for data files:

    • JSONL - Newline-delimited JSON, ideal for semi-structured data
    • CSV - Comma-separated values, best for tabular data and Excel compatibility
    • Parquet - Columnar format, optimal for analytics workloads and compression
  • Compress - Optional. When set to true, output files are gzip-compressed and the destination appends .gz to the generated filename (for example, *.jsonl.gz). Defaults to false.

  • Blob Path Template - Template for organizing blobs within your container. See the Path Templating section for available variables.
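To make the difference between the text formats concrete, the sketch below serializes the same two records as JSONL and as CSV using only the Python standard library. The sample records and function names are hypothetical, and details such as the CSV header row are assumptions for illustration rather than a description of the connector's exact output.

```python
import csv
import io
import json

# Hypothetical sample records for illustration.
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]


def to_jsonl(rows):
    """One JSON object per line, each terminated by a newline."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def to_csv(rows):
    """Header row taken from the first record's keys, then one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

JSONL keeps each record self-describing, which suits nested or evolving schemas; CSV relies on a fixed column order shared by every row.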

Path Templating

The blob path template allows you to organize your data using dynamic variables. This helps create a logical folder structure within your container.

Available Template Variables

Stream Information:

  • {stream_name} - Name of the source stream
  • {connection_id} - Unique identifier for the connection
  • {connection_name} - Human-readable connection name

Time-based Variables:

  • {timestamp} - Unix timestamp of the sync
  • {date_time} - ISO 8601 formatted date-time (e.g., 2024-01-15T14:30:00Z)
  • {year} - 4-digit year
  • {month} - 2-digit month (01-12)
  • {day} - 2-digit day (01-31)
  • {hour} - 2-digit hour (00-23)
  • {minute} - 2-digit minute (00-59)

Cursor Variables (for incremental syncs):

  • {incremental_key} - The incremental key value
  • {cursor_value} - Current cursor position
  • {cursor_datetime} - Cursor value as datetime
  • {cursor_date} - Date portion of cursor
  • {cursor_year} - Year from cursor
  • {cursor_month} - Month from cursor
  • {cursor_day} - Day from cursor

Path Template Examples

Daily partitioning by stream:

{stream_name}/{year}/{month}/{day}/data_{timestamp}

Result: customers/2024/01/15/data_1705330200

Hourly partitioning with connection info:

{connection_name}/{stream_name}/{year}-{month}-{day}/{hour}/extract_{incremental_key}

Result: production_sync/orders/2024-01-15/14/extract_1000

Simple stream-based organization:

raw/{stream_name}/{date_time}

Result: raw/products/2024-01-15T14:30:00Z
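The substitution can be previewed locally before saving a template. The following is a minimal sketch of how the stream and time-based variables might expand, assuming simple placeholder replacement and UTC timestamps; render_blob_path is a hypothetical helper, not part of Extract, and cursor variables can be passed as extra keyword arguments.

```python
from datetime import datetime, timezone


def render_blob_path(
    template: str,
    stream_name: str,
    connection_name: str,
    sync_time: datetime,
    **extra: str,
) -> str:
    """Expand {placeholders} in the template; sync_time must be timezone-aware."""
    sync_time = sync_time.astimezone(timezone.utc)
    values = {
        "stream_name": stream_name,
        "connection_name": connection_name,
        "timestamp": str(int(sync_time.timestamp())),
        "date_time": sync_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "year": sync_time.strftime("%Y"),
        "month": sync_time.strftime("%m"),
        "day": sync_time.strftime("%d"),
        "hour": sync_time.strftime("%H"),
        "minute": sync_time.strftime("%M"),
    }
    values.update(extra)  # e.g. incremental_key="1000"
    return template.format(**values)


sync_time = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
path = render_blob_path("raw/{stream_name}/{date_time}", "products", "production_sync", sync_time)
```

With these inputs, path matches the third example above. A template that references an unknown variable raises KeyError in this sketch, which is a convenient way to spot typos in placeholder names.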

Notes

  • The destination can optionally gzip-compress output files. Enable this by setting the compress configuration parameter to true. When enabled, the connector writes the same file format content (for example, JSONL or CSV) and appends a .gz suffix to the uploaded blob name.
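As an illustration of what the compress option implies for downstream readers, the sketch below gzip-compresses JSONL content and appends .gz to the blob name, then shows that decompression restores the original bytes. The function name and the sample blob name are hypothetical; this mirrors the behavior described above rather than the connector's internal code.

```python
import gzip
import json


def gzip_jsonl(rows, blob_name: str):
    """Serialize rows as JSONL, gzip the bytes, and append '.gz' to the blob name."""
    payload = "".join(json.dumps(row) + "\n" for row in rows).encode("utf-8")
    return blob_name + ".gz", gzip.compress(payload)


name, data = gzip_jsonl([{"id": 1}], "customers/data.jsonl")
original = gzip.decompress(data)  # consumers must decompress before parsing
```

Any tool that reads these blobs needs to handle gzip, either explicitly as shown or through built-in support keyed on the .gz suffix.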