Loading data into Databricks#
This guide explains how to load data into Databricks for further analysis.
Prerequisites#
Before you complete the procedure in this guide, perform all of the following actions:
Create a datastream whose data you want to load into Databricks. For more information on creating a datastream, see Collecting data in Adverity.
Set up instance profiles for access to S3 buckets from Databricks clusters. For more information, see the Databricks documentation.
Obtain an Access Key ID and a Secret Access Key for this S3 bucket. Use an AWS policy file as you would for an AWS S3 destination.
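To confirm that these keys grant access to the bucket before you enter them in Adverity, you can run a quick check outside Adverity. The following is a minimal sketch, not part of the Adverity setup; it assumes the boto3 library is installed, and the bucket name and credential values are placeholders:

```python
# Minimal sketch: verify that an Access Key ID and Secret Access Key can
# list and write to the S3 staging bucket. All names and values below are
# placeholders; requires the boto3 package.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",       # placeholder Access Key ID
    aws_secret_access_key="...",       # placeholder Secret Access Key
)

try:
    # Listing and writing covers the permissions needed on the bucket.
    s3.list_objects_v2(Bucket="my-databricks-staging-bucket", MaxKeys=1)
    s3.put_object(
        Bucket="my-databricks-staging-bucket",
        Key="adverity-access-check.txt",
        Body=b"access check",
    )
    print("Keys can list and write to the bucket.")
except ClientError as exc:
    print(f"Access check failed: {exc}")
```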
Procedure#
To load data from a datastream into Databricks, follow these steps:
Add Databricks as a destination to the workspace which contains the datastream or to one of its parent workspaces.
Assign the Databricks destination to the datastream.
You can assign as many destinations to a datastream as you want.
Some destinations, such as HubSpot and Facebook Offline Conversions, require specific Data Mapping. If these Data Mapping requirements conflict, the destinations cannot be assigned to the same datastream.
Adding Databricks as a destination#
To add Databricks as a destination to a workspace, follow these steps:
Go to the Destinations page.
Click + Create destination.
Search for and click Databricks.
You can connect to Databricks in two different ways:
To connect to Databricks with REST, click Databricks (REST).
To connect to Databricks with SQL, click Databricks (SQL).
Choose how to authorize Adverity to access Databricks:
To use your details, click Access Databricks using your credentials.
To ask someone else to use their details, click Access Databricks using someone else’s credentials.
If you choose this option, the person you ask to create the authorization will need to go through the following steps.
Click Next.
On the authorization page, fill in the following fields:
- Personal Access Token
The personal access token generated in Databricks. For more information, see the Databricks documentation.
- Databricks Instance
The address of the Databricks instance to which you want to connect.
- S3 Bucket
The address of the S3 bucket that you set up for Databricks.
- Access Key ID
The Access Key ID with which Adverity accesses the S3 bucket.
- Secret Access Key
The Secret Access Key with which Adverity accesses the S3 bucket.
- Instance Profile ARN
The ARN of the instance profile set up to access the S3 bucket.
- HTTP Path
The HTTP Path for the SQL connection. For more information, see the Databricks documentation.
You only see this field when connecting to Databricks with SQL.
Click Authorize.
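If authorization fails, you can rule out an invalid token or instance address with a direct call to the Databricks REST API. This optional sketch is not part of the Adverity procedure; the instance URL and token below are placeholders, and it assumes the requests library is installed:

```python
# Minimal sketch: check a Databricks personal access token and instance
# address. Both values below are placeholders; requires the requests package.
import requests

DATABRICKS_INSTANCE = "https://dbc-12345678-90ab.cloud.databricks.com"
TOKEN = "dapi..."  # placeholder personal access token

response = requests.get(
    f"{DATABRICKS_INSTANCE}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
# 200 means the token and instance are valid; 401 or 403 points to a token
# problem, and a connection error points to a wrong instance address.
print(response.status_code)
```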
On the Configuration page, fill in the following fields:
- Name
(Optional) Rename the destination.
- Database
Specify the name of the database into which you want to load the data.
- Partition by date
(Recommended) If selected, the target table is partitioned by a date column, and data is only replaced based on the date. This means that if you import data into a table with data already present for certain dates, the existing data for these overlapping dates is overwritten, and the data for other, unique dates remains unchanged.
This option is only effective if you also enable the option Local Data Retention > Extract Filenames > Unique by day. For more information, see Configuring advanced datastream settings.
Partitioning is set the first time you load data into the destination and cannot be changed through Adverity later.
If you select this option, also select the Partition Date Column from the drop-down list. For an illustration of date-based replacement, see the sketch after this procedure.
Click Create.
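For a picture of what date-based replacement means, the following PySpark sketch shows an equivalent Delta Lake operation. Adverity performs this work for you when Partition by date is selected; the sketch is only an illustration, and the table name, date column, and date range are placeholders. It assumes it runs in a Databricks notebook where df holds the newly collected extract:

```python
# Conceptual illustration of the Partition by date behavior using Delta
# Lake's replaceWhere overwrite: rows for the dates present in the new
# extract are replaced, and rows for all other dates are left untouched.
# `df` is an assumed DataFrame holding the new extract; names are placeholders.
(
    df.write.format("delta")
    .mode("overwrite")
    # Replace only the date range covered by the new extract.
    .option("replaceWhere", "event_date BETWEEN '2024-06-01' AND '2024-06-07'")
    .saveAsTable("analytics.mailgun_83")
)
```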
Assigning Databricks as a destination#
To assign the Databricks destination to a datastream, follow these steps:
Go to the Datastreams page.
Open the datastream to which you want to assign the destination.
In the Load section, click + Add destination.
Select the Databricks checkbox in the list.
Click Save.
If the destination is enabled automatically, click Yes, load data in the pop-up window to automatically load your previously collected data into the new destination. The following data extracts will be loaded:
All data extracts with the status collected if no other destinations are enabled for the datastream
All data extracts with the status loaded if the data extracts have already been sent to Adverity Data Storage or external destinations
Alternatively, click Skip to continue configuring the destination settings or re-load the data extracts manually. For more information, see Re-loading a data extract.
Configuring settings for loading data into Databricks#
To configure the settings for loading data into Databricks, follow these steps:
Go to the Datastreams page.
Open the datastream.
In the Load section, find the Databricks destination in the list, and click Actions on the right.
Fill in the following fields:
- Table name
Specify the target table in the destination into which to load data from the datastream. The name can contain alphanumeric characters and underscores, for example, `target_table`. To specify a schema, use the syntax `schemaName.tableName`.
By default, Adverity saves data from each datastream in a different table named `{datastream_type}_{datastream_id}` (for example, `mailgun_83`).
You can specify the same target table for several datastreams. If a column is shared between datastreams, Adverity performs a full outer join and concatenates values. If a column is not shared between datastreams, Adverity writes null values in the relevant cells. For an illustration of this behavior, see the sketch after this procedure.
To create a new Databricks table containing the data you load into Databricks, enter a name for the new table in this field.
- Table mode
Specify how Adverity loads data to the target table (overwrite, append, or update).
Click Save.
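To picture how several datastreams combine in one target table, the following pandas sketch illustrates the full outer join described under Table name. It is purely illustrative; the datastream, column, and value names are made up:

```python
# Illustrative sketch of loading two datastreams into one shared table:
# the shared column is combined with a full outer join, and cells for
# columns missing from a datastream are filled with nulls (NaN).
import pandas as pd

facebook = pd.DataFrame({"date": ["2024-06-01"], "clicks": [120]})
mailgun = pd.DataFrame({"date": ["2024-06-01", "2024-06-02"], "opens": [45, 50]})

# "date" is shared between the datastreams; "clicks" and "opens" are not.
combined = pd.merge(facebook, mailgun, on="date", how="outer")
print(combined)
#          date  clicks  opens
# 0  2024-06-01   120.0     45
# 1  2024-06-02     NaN     50
```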