Loading data into Databricks

Loading data into Databricks#

This guide explains how to load data into Databricks for further analysis, covering both AWS and Azure implementations.

Prerequisites#

Before you complete the procedure in this guide, perform all of the following actions:

Create a datastream whose data you want to load into Databricks. For more information on creating a datastream, see Collecting data in Adverity.
Apply the correct Data Mapping to your datastream.

AWS

Set up instance profiles for access to S3 buckets from Databricks clusters. For more information, see the Databricks documentation.
Obtain Access Key ID and Secret Access Key. Use an AWS policy file as you would for an AWS S3 destination.

Azure

Configure Azure storage authorization (File Azure or File Azure SAS)
Obtain Service Principal credentials (Client ID/Secret/Tenant ID)
Enable Unity Catalog in target workspace
Assign Storage Blob Data Contributor role to Service Principal
Generate Databricks personal access token with workspace admin privileges

Procedure#

To load data from a datastream into Databricks, follow these steps:

Add Databricks as a destination to the workspace which contains the datastream or to one of its parent workspaces.
Assign the Databricks destination to the datastream.

You can assign as many destinations to a datastream as you want.

Some destinations require specific Data Mapping. If these Data Mapping requirements conflict, the destinations cannot be assigned to the same datastream.
Configure load settings.

Adding Databricks as a destination#

To add Databricks as a destination to a workspace, follow these steps:

Go to the Destinations page.
Click + Create destination.
Search for and click Databricks.
Choose how to authorize Adverity to access Databricks:
- To use your details, click Access Databricks using your credentials.
- To ask someone else to use their details, click Access Databricks using someone else’s credentials.
  
  If you choose this option, the person you ask to create the authorization will need to go through the following steps.
Click Next.
Select the authorization type and fill in the following fields:
Databricks Access Token (PAT)
Personal Access Token
Enter the personal access token generated in Databricks. For more information, see the Databricks documentation.

Databricks hostname
Provide hostname without protocol and slashes. For example:

[instance].cloud.databricks.com for AWS

[instance].azuredatabricks.net for Azure

[instance].gcp.databricks.com for Google Cloud
Databricks OAuth Token
Client ID and Client Secret
Enter your OAuth credentials. For more information, see the Databricks documentation.

Hostname
Provide hostname without protocol and slashes. For example:

[instance].cloud.databricks.com for AWS

[instance].azuredatabricks.net for Azure

[instance].gcp.databricks.com for Google Cloud
Click Authorize.
In the Configuration page, fill in the following fields:

Name
(Optional) Rename the destination.

Warehouse
Select the SQL warehouse that will execute queries against your data. This is the compute resource that processes your SQL queries and data operations.

Unity Catalog
Select the Unity Catalog to use for data governance. Unity Catalog provides a unified governance model for data across Databricks workspaces and clouds.

Database
Specify the name of the database into which you want to load the data.

The database name can only contain alphanumeric characters, periods, and underscores.

External Location
Select the registered external location where the data files will be stored. External locations in Databricks are registered storage paths that can be referenced in table creation statements.
Click Create.

Assigning Databricks as a destination#

To assign the Databricks destination to a datastream, follow these steps:

Go to the Datastreams page.
Open the chosen datastream by clicking on its name.
In the Load section, click + Add destination.
Select the Databricks checkbox in the list.
Click Save.
For the automatically enabled destinations, in the pop-up window, click Yes, load data if you want to automatically load your previously collected data into the new destination. The following data extracts will be loaded:
- All data extracts with the status collected if no other destinations are enabled for the datastream
- All data extracts with the status loaded if the data extracts have already been sent to Adverity Data Storage or external destinations
Alternatively, click Skip to continue configuring the destination settings or re-load the data extracts manually. For more information, see Re-loading a data extract.

Configuring settings for loading data into Databricks#

To configure the settings for loading data into Databricks, follow these steps:

Go to the Datastreams page.
Open the chosen datastream by clicking on its name.
In the Load section, find the Databricks destination in the list, and click Actions on the right.
Click Destination settings.
Fill in the following fields:

Table name

Specify the target table in the destination into which to load data from the datastream. The name can contain alphanumeric characters and underscores. For example, target_table. To specify a schema, use the syntax schemaName.tableName.

For Azure Databricks with Unity Catalog, table names must follow the format {datastream_type}_{datastream_id} by default. You can use supported placeholders such as {extract_id}, {meta}, {scheduled_day}, {scheduled_month}, and {scheduled_year} to create dynamic table names.

By default, Adverity saves data from each datastream in a different table named {datastream_type}_{datastream_id} (for example, mailgun_83).

You can specify the same target table for several datastreams. If a column is shared between datastreams, Adverity performs a full outer join and concatenates values. If a column is not shared between datastreams, Adverity writes null values in the relevant cells.

To create a new Databricks spreadsheet containing the data you load into Databricks, enter a name for the new spreadsheet into this field.

Table mode

Specify how Adverity loads data to the target table (overwrite, append, or update).

When using Azure Databricks, each table mode has specific considerations:

Overwrite: Completely replaces existing table data. For Azure implementations, this may require restarting the cluster after the operation completes due to how Delta Lake handles metadata updates.
Append: Adds new data to the existing table. Azure Databricks uses Delta Lake versioning to manage these operations efficiently.
Update: Updates existing records based on matching keys. For Azure implementations, this mode requires a timestamp column in your schema to track changes.

Click Save.

Advanced Databricks tips#

Technical details of loading data into Databricks#

When loading a data extract into Databricks, Adverity creates the following additional columns in the target table:

dt_created: The date the data extract was created.
dt_extract_uuid: The UUID of the data extract.
dt_datastream_uuid: The UUID of the datastream.

Troubleshooting: Connection issues with Azure Databricks#

If you encounter issues connecting to your Databricks destination, follow these steps to resolve common problems.

Invalid Catalog errors#

If you receive Invalid Catalog errors:

Verify that your user or service principal has proper Unity Catalog privileges
Ensure the database name contains only alphanumeric characters, periods, and underscores
Check that your Unity Catalog is properly configured in the Databricks workspace

Authentication failures#

If you experience authentication issues:

Confirm that your Service Principal has the correct roles assigned
Check that your personal access token hasn’t expired (tokens typically expire after 90 days)
Verify that the Databricks hostname is correctly formatted without protocol and slashes

Storage access denied errors#

If you receive Storage access denied errors:

Confirm that your Service Principal has been assigned the Storage Blob Data Contributor role
Verify that the storage authorization is properly configured in your Adverity workspace
Check that the storage container permissions allow read/write access

Cluster timeout issues#

If you encounter cluster timeout issues:

Increase the auto-termination threshold to at least 60 minutes in your Databricks workspace settings
Stop the Databricks cluster before configuring partitioning settings to avoid resource conflicts
Restart the cluster after completing operations that use the Overwrite table mode