Loading data into Databricks

This guide explains how to load data into Databricks for further analysis.

Prerequisites

Before you complete the procedure in this guide, perform all of the following actions:

  • Create a datastream whose data you want to load into Databricks. For more information on creating a datastream, see Creating a datastream.

  • Set up instance profiles for access to S3 buckets from Databricks clusters. For more information, see the Databricks documentation.

  • Obtain an Access Key ID and Secret Access Key that Adverity can use to access the S3 bucket. Use an AWS policy file as you would for an AWS S3 destination (see the sketch below).
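
    If you want to confirm that the Access Key ID and Secret Access Key can reach the S3 bucket before entering them in Adverity, the following minimal sketch may help. It assumes the boto3 package and uses a placeholder bucket name and object key, which are not values from this guide:

    ```python
    # Hypothetical check that the access keys can write to the S3 bucket
    # set up for Databricks. Bucket name and key below are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        aws_access_key_id="AKIA...",      # your Access Key ID
        aws_secret_access_key="...",      # your Secret Access Key
    )

    # Write and remove a small test object to confirm write permissions.
    s3.put_object(Bucket="my-databricks-bucket", Key="adverity/_access_check", Body=b"ok")
    s3.delete_object(Bucket="my-databricks-bucket", Key="adverity/_access_check")
    print("The access keys can write to the bucket.")
    ```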

Procedure

To load data from a datastream into Databricks, follow these steps:

  1. Add Databricks as a destination to the workspace which contains the datastream or to one of its parent workspaces.

  2. Assign the Databricks destination to the datastream.

    You can assign as many destinations to a datastream as you want.

    Some destinations, such as HubSpot and Facebook Offline Conversions, require specific Data Mapping. If these Data Mapping requirements conflict, the destinations cannot be assigned to the same datastream.

  3. Configure load settings.

Adding Databricks as a destination

To add Databricks as a destination to a workspace, follow these steps:

  1. Go to the Destinations page.

  2. Click + Create destination.

  3. Search for and click Databricks.

    You can connect to Databricks in two different ways:

    • To connect to Databricks with REST, click Databricks (REST).

    • To connect to Databricks with SQL, click Databricks (SQL).
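
    Both options connect to the same Databricks workspace, but they communicate with it differently: the REST option talks to the Databricks REST API over HTTPS, while the SQL option runs SQL statements against a SQL endpoint. As a rough illustration of the REST side only, the sketch below calls a standard Databricks REST API endpoint with a personal access token; the instance URL and token are placeholders, and Adverity's own integration may call different endpoints:

    ```python
    # Illustrative REST call to a Databricks workspace (not Adverity's internal code).
    import requests

    DATABRICKS_INSTANCE = "https://dbc-12345678-90ab.cloud.databricks.com"  # placeholder
    PERSONAL_ACCESS_TOKEN = "dapi..."                                        # placeholder

    response = requests.get(
        f"{DATABRICKS_INSTANCE}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {PERSONAL_ACCESS_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())
    ```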

  4. Choose how to authorize Adverity to access Databricks:

    • To use your details, click Access Databricks using your credentials.

    • To ask someone else to use their details, click Access Databricks using someone else's credentials.

      If you choose this option, the person you ask to create the authorization needs to complete the following steps.

  5. Click Next.

  6. On the authorization page, fill in the following fields:

    Personal Access Token

    The personal access token generated in Databricks. For more information, see the Databricks documentation.

    Databricks Instance

    The address of the Databricks instance to which you want to connect.

    S3 Bucket

    The address of the S3 bucket that you set up for Databricks.

    Access Key ID

    The Access Key ID with which Adverity accesses the S3 bucket.

    Secret Access Key

    The Secret Access Key with which Adverity accesses the S3 bucket.

    Instance Profile ARN

    The ARN of the instance profile set up to access the S3 bucket.

    HTTP Path

    The HTTP Path for the SQL connection. For more information, see the Databricks documentation.

    You only see this field when connecting to Databricks with SQL.
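
    If you connect with SQL and want to verify these values before clicking Authorize, the following minimal sketch uses the open source databricks-sql-connector package (pip install databricks-sql-connector). The hostname, HTTP Path, and token shown are placeholders:

    ```python
    # Illustrative connection test using the Databricks SQL Connector for Python.
    from databricks import sql

    with sql.connect(
        server_hostname="dbc-12345678-90ab.cloud.databricks.com",  # Databricks Instance
        http_path="/sql/1.0/warehouses/abcdef1234567890",          # HTTP Path
        access_token="dapi...",                                     # Personal Access Token
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1 AS ok")
            print(cursor.fetchall())
    ```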

  7. Click Authorize.

  8. On the Configuration page, fill in the following fields:

    Name

    (Optional) Rename the destination.

    Database

    Specify the name of the database into which you want to load the data.

    Partition by date

    (Recommended) If selected, the target table is partitioned by a date column, and data is only replaced based on the date. If you import data into a table that already contains data for certain dates, the existing data for those overlapping dates is overwritten, while the data for all other dates remains unchanged.

    This option only takes effect if you also enable Local Data Retention > Extract Filenames > Unique by day. For more information, see Configuring advanced datastream settings.

    If you select this option, also select the Partition Date Column from the drop-down list.
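
    The behaviour described above corresponds roughly to Spark's dynamic partition overwrite: only the date partitions present in the incoming data are replaced, and all other partitions keep their existing rows. The sketch below illustrates that mechanism with placeholder table and column names; it is not necessarily how Adverity writes the data internally:

    ```python
    # Illustration of dynamic partition overwrite on a date-partitioned table.
    # Only the dates present in incoming_df are replaced; other dates are untouched.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    incoming_df = spark.createDataFrame(
        [("2024-05-01", "campaign_a", 120), ("2024-05-02", "campaign_b", 90)],
        ["date", "campaign", "clicks"],
    )

    (
        incoming_df.write
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")  # replace only matching partitions
        .partitionBy("date")                          # e.g. the Partition Date Column
        .saveAsTable("analytics.mailgun_83")          # placeholder target table
    )
    ```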

  9. Click Create.

Assigning Databricks as a destination

To assign the Databricks destination to a datastream, follow these steps:

  1. Go to the Datastreams page.

  2. Open the chosen datastream by clicking on its name.

  3. In the Load section, click + Add destination.

  4. Select the Databricks checkbox in the list.

  5. Click Save.

  6. For automatically enabled destinations, click Yes, load data in the pop-up window if you want to load your previously collected data into the new destination. The following data extracts will be loaded:

    • All data extracts with the status collected if no other destinations are enabled for the datastream

    • All data extracts with the status loaded if the data extracts have already been sent to Adverity Data Storage or external destinations

    Alternatively, click Skip to continue configuring the destination settings or re-load the data extracts manually. For more information, see Re-loading a data extract.

Configuring settings for loading data into Databricks

To configure the settings for loading data into Databricks, follow these steps:

  1. Go to the Datastreams page.

  2. Open the chosen datastream by clicking on its name.

  3. In the Load section, find the Databricks destination in the list, and click Actions on the right.

  4. Click Destination settings.

  5. Fill in the following fields:

    Table name

    Specify the target table in the destination into which to load data from the datastream. The name can contain alphanumeric characters and underscores. For example, target_table. To specify a schema, use the syntax schemaName.tableName.

    By default, Adverity saves data from each datastream in a different table named {datastream_type}_{datastream_id} (for example, mailgun_83).

    You can specify the same target table for several datastreams. If a column is shared between datastreams, Adverity performs a full outer join and concatenates values. If a column is not shared between datastreams, Adverity writes null values in the relevant cells.

    • To create a new Databricks table containing the data you load into Databricks, enter a name for the new table in this field.

    Table mode

    Specify how Adverity loads data to the target table (overwrite, append, or update).
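
    As a rough illustration of what these modes mean in Spark terms: overwrite and append correspond directly to Spark write modes, while an update-style load is typically expressed as a Delta MERGE keyed on an identifying column. The table names and key column below are placeholders, and this is not necessarily how Adverity performs the load:

    ```python
    # Illustration of overwrite, append, and update-style loads into a target table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("staging.new_extract")  # placeholder: the freshly collected extract

    # Overwrite: replace the existing contents of the target table.
    df.write.mode("overwrite").saveAsTable("analytics.target_table")

    # Append: add the new rows after the existing contents.
    df.write.mode("append").saveAsTable("analytics.target_table")

    # Update: typically a Delta MERGE on a key column (here a placeholder "id").
    df.createOrReplaceTempView("new_extract")
    spark.sql("""
        MERGE INTO analytics.target_table AS t
        USING new_extract AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
    ```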

  6. Click Save.