Transferring data to Databricks

This guide explains how to transfer data to Databricks, where you can store and further process it.

Concept

Databricks is an Active Destination. After you set Databricks as the Destination of a Datastream, data is transferred to Databricks each time data is fetched for the Datastream. For more information, see Destination types.

You can assign multiple Destinations to a Datastream. For more information on possible limitations, see Assigning multiple Destinations to a Datastream.

Prerequisites

Before you complete the procedure in this guide, perform all of the following actions:

  • Set up instance profiles for access to S3 buckets from Databricks clusters. For more information, see the Databricks documentation.

  • Obtain an Access Key ID and a Secret Access Key. Use an AWS policy file as you would for an AWS S3 Destination. A sketch for checking these credentials follows this list.
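
The following sketch is an optional way to verify the Access Key ID and Secret Access Key outside Adverity before you use them in this guide. It assumes the boto3 library is installed; the bucket name my-databricks-bucket and the key values are placeholders for your own settings.

    # Optional sanity check for the S3 credentials used by the Destination.
    # Assumes boto3 is installed; my-databricks-bucket is a placeholder.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        aws_access_key_id="AKIA...",    # your Access Key ID
        aws_secret_access_key="...",    # your Secret Access Key
    )

    try:
        # Listing a single object confirms read access to the bucket.
        s3.list_objects_v2(Bucket="my-databricks-bucket", MaxKeys=1)
        print("Credentials can read the bucket.")
    except ClientError as error:
        print(f"Access check failed: {error}")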

Procedure

To transfer data from a Datastream to Databricks, follow these steps:

  1. Add Databricks as a Destination to the Workspace that contains the Datastream, or to one of its parent Workspaces.

  2. Assign the Databricks Destination to the Datastream.

  3. Configure transfer settings.

Adding Databricks as a Destination

To add Databricks as a Destination to a Workspace, follow these steps:

  1. In Connect, Enrich & Transfer, click the Transfer element, and select the Workspace you work with.

  2. Click + Add.

  3. Click Databricks.

    You can connect to Databricks in two different ways:

    • To connect to Databricks with REST, click Databricks (REST).

    • To connect to Databricks with SQL, click Databricks (SQL).

  4. Click Setup a new Authorization.

  5. Click Next.

  6. On the Authorization page, fill in the following fields (a connection check using these values is sketched after this procedure):

    Personal Access Token

    The personal access token generated in Databricks. For more information, see the Databricks documentation.

    Databricks Instance

    The address of the Databricks instance to which you want to connect.

    S3 Bucket

    The address of the S3 bucket that you set up for Databricks.

    Access Key ID

    The Access Key ID with which Adverity accesses the S3 bucket.

    Secret Access Key

    The Secret Access Key with which Adverity accesses the S3 bucket.

    Instance Profile ARN

    The ARN of the instance profile set up to access the S3 bucket.

    HTTP Path

    The HTTP Path for the SQL connection. For more information, see the Databricks documentation.

    You only see this field when connecting to Databricks with SQL.

  7. Click Authorize.

  8. On the Configuration page, fill in the following fields:

    Name

    (Optional) Rename the Destination.

    Database

    Specify the name of the database to which the data is transferred.

    Partition by date

    (Recommended) If selected, the target table is partitioned by a date column, and data is only replaced based on the date. This means that if you import data into a table with data already present for certain dates, the existing data for these overlapping dates is overwritten, and the data for other, unique dates remains unchanged.

    This option is only effective if you also enable the option Local Data Retention > Extract Filenames > Unique by day. For more information, see Configuring advanced Datastream settings.

  9. Click Create.
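
If you connect to Databricks with SQL, the following minimal sketch shows how the Databricks Instance, HTTP Path, and Personal Access Token values fit together, using the databricks-sql-connector Python package. This is an optional check outside Adverity, not part of the setup, and all values shown are placeholders.

    # Optional connectivity check for the SQL connection values.
    # Requires the databricks-sql-connector package; all values are placeholders.
    from databricks import sql

    with sql.connect(
        server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",  # Databricks Instance
        http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",          # HTTP Path
        access_token="dapi...",                                    # Personal Access Token
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            print(cursor.fetchone())  # (1,) confirms the connection works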

Assigning Databricks as a Destination

To assign the Databricks Destination to a Datastream, follow these steps:

  1. In Connect, Enrich & Transfer, click the Connect element, and select the Workspace you work with.

  2. Select the Datastream.

  3. In the Destinations section, click + Add Destination.

  4. Click Assign Existing Destinations.

  5. Select the Databricks checkbox in the list.

  6. Click Save.

Configuring transfer settings

To configure transfer settings, follow these steps:

  1. In Connect, Enrich & Transfer, click the Connect element, and select the Workspace you work with.

  2. Select the Datastream.

  3. In the Destinations section, find the Databricks Destination in the list, and click the menu on the right.

  4. Click Destination Settings.

  5. Fill in the following fields:

    Table name

    Specify the target table in the Destination to which data from the Datastream is transferred. The name can contain alphanumeric characters and underscores. For example, target_table. To specify a schema, use the syntax schemaName.tableName.

    By default, Adverity saves data from each Datastream in a different table named {datastream_type}_{datastream_id} (for example, mailgun_83).

    You can specify the same target table for several Datastreams. If a column is shared between Datastreams, Adverity performs a full outer join and concatenates values. If a column is not shared between Datastreams, Adverity writes null values in the relevant cells. For an illustration of this merge behavior, see the sketch after this procedure.

    Table mode

    Specify how Adverity loads data into the target table (overwrite, append, or update).

  6. Click Save.
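
To make the shared-table behavior described under Table name concrete, the following sketch approximates it in pandas. It is an illustration of the documented merge semantics, not Adverity's implementation; the Datastream and column names are invented.

    # Illustration only: approximates how data from two Datastreams writing
    # to the same target table is combined (full outer join on the shared
    # column, nulls where a column exists in only one Datastream).
    import pandas as pd

    datastream_a = pd.DataFrame({"date": ["2024-01-01"], "clicks": [120]})
    datastream_b = pd.DataFrame({"date": ["2024-01-02"], "impressions": [3400]})

    # date is shared between the Datastreams, so rows are joined on it;
    # clicks and impressions are not shared, so missing cells become null.
    combined = datastream_a.merge(datastream_b, on="date", how="outer")
    print(combined)
    #          date  clicks  impressions
    # 0  2024-01-01   120.0          NaN
    # 1  2024-01-02     NaN       3400.0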