Databricks

Amplitude's Databricks import source enables you to import data from Databricks to your Amplitude account. Databricks import uses the Databricks Change Data Feed feature to securely access and extract live data from your Databricks workspace.

Features

  • Import all data types, including events, user properties, and group properties.
  • Support for delta sync, which ensures Amplitude imports only new or changed data (see the sketch after this list).
    • For event data, Amplitude imports rows with an insert operation.
    • For user properties and group properties, Amplitude imports rows with insert or update operations.
    • Amplitude ignores rows with delete operations for all data types.
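
As a sketch of how these operations surface in Databricks, assuming Change Data Feed has been enabled on the table since version 0 and reusing the example table name catalog.schema.table1 from later on this page, you can query the change feed with the table_changes function; the _change_type column carries each row's operation:

-- Read the Change Data Feed from table version 0 onward.
-- table_changes returns the table's columns plus _change_type,
-- _commit_version, and _commit_timestamp.
-- For property syncs, inserts and updates are kept; delete rows are ignored.
select *
from table_changes('catalog.schema.table1', 0)
where _change_type in ('insert', 'update_postimage');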

Limitations

  • The User Look-Up page doesn't display the 100 most recent events ingested.

For guided instructions on setting up this integration, view the Loom video.

Configure Databricks

Before you start to configure the Databricks source in Amplitude, complete the following tasks in Databricks.

Find or create an all-purpose compute cluster

Amplitude creates workflows in this cluster on your behalf to start sync jobs. Copy the Server hostname and HTTP path values to use in a later step. Find both values on the cluster's Configuration → JDBC/ODBC tab. For more information about cluster types, see Compute.

[Screenshot: where to find the Server hostname and HTTP path]

Note

Ensure that the cluster can run jobs by confirming that its policy does NOT include the following configuration. For details, see Databricks' article Policy definition.

1"workload_type.clients.jobs": {
2 "type": "fixed",
3 "value": false
4}

Note

Ensure that your cluster runs Python 3.9 or later. Otherwise, you may see the following error in your workflow job:

TypeError: 'type' object is not subscriptable

Cluster policies and access modes

Amplitude supports all policies and access modes. However, if your clusters use any of the following policy and access mode combinations, grant the Data Reader permission (USE CATALOG, USE SCHEMA, EXECUTE, READ VOLUME, SELECT) to the workspace user or service principal whose personal access token authenticates with Amplitude. Otherwise, Amplitude can't access the tables in your import source's Unity Catalog. For a SQL sketch of these grants, see the example after the tables below.

AWS Databricks:

Policy                  Node          Cluster Access mode
Unrestricted            Multi node    No isolation shared
Unrestricted            Single node   No isolation shared
Power User Compute      N/A           No isolation shared
Legacy Shared Compute   N/A           N/A

GCP Databricks:

Policy                  Node          Cluster Access mode
Unrestricted            Multi node    No isolation shared
Unrestricted            Single node   No isolation shared
Power User Compute      N/A           Shared
Power User Compute      N/A           No isolation shared
Legacy Shared Compute   N/A           N/A

[Screenshot: Data Reader permissions]
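
The grants above can also be applied from a notebook. The following is a minimal sketch, assuming a hypothetical catalog named my_catalog that contains your import tables and a placeholder principal (use your workspace user's email or your service principal's UUID):

-- Grant the Data Reader privileges on the catalog that holds your import tables.
-- Replace my_catalog and the principal placeholder with your own values.
GRANT USE CATALOG, USE SCHEMA, EXECUTE, READ VOLUME, SELECT
ON CATALOG my_catalog
TO `<workspace_user_or_service_principal_uuid>`;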

Authentication

Amplitude's Databricks import supports authentication with personal access tokens (PATs) for Databricks workspace users or for service principals. Choose workspace user authentication for faster setup, or service principal authentication for finer-grained control. For more information, see the Databricks article Authentication for Databricks Automation.

Create a workspace user Personal Access Token (PAT)

Amplitude's Databricks import uses personal access tokens to authenticate. For the quickest setup, create a PAT for your workspace user in Databricks. For more information, see the Databricks article Personal Access Tokens for Workspace Users.

Create a service principal Personal Access Token (PAT)

Amplitude recommends that you create a service principal in Databricks to allow for more granular control of access.

  1. Follow the Databricks instructions to create a service principal. Copy the UUID for use in a later step.

  2. Generate a PAT for this service principal.

The service principal you created above requires the following permissions in Databricks:

  • Workspace: Grants access to your Databricks workspace. In Databricks, go to Workspace → <workspace_name> → Permissions → Add permissions, add the service principal you created with the User permission, and click Save.
  • Table: Grants access to list tables and read data. Go to Catalog → pick the catalog → Permissions → Grant, then select the Data Reader permission (USE CATALOG, USE SCHEMA, EXECUTE, READ VOLUME, SELECT).
  • Cluster: Grants access to connect to the cluster and run workflows on your behalf. Go to Compute → All-purpose compute → Edit Permission, then add the Can Attach To permission to the service principal.
  • Export: Enables the service principal to unload your data through Spark and export it to S3. Run the following SQL commands in any notebook:

    GRANT MODIFY ON ANY FILE TO `<service_principal_uuid>`;
    GRANT SELECT ON ANY FILE TO `<service_principal_uuid>`;

Enable CDF on your tables

Amplitude uses Databricks' Change Data Feed (CDF) to continuously import data. To enable CDF on a Databricks table, see the Databricks article Enable change data feed.
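
As a minimal sketch, reusing the example table name catalog.schema.table1 from this page (and a hypothetical catalog.schema.new_table for the create-time variant), CDF is controlled by a Delta table property:

-- Enable the Change Data Feed on an existing Delta table.
ALTER TABLE catalog.schema.table1
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Or enable it at creation time for a new table.
CREATE TABLE catalog.schema.new_table (id STRING, name STRING)
TBLPROPERTIES (delta.enableChangeDataFeed = true);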

Configure the Amplitude Databricks source

To add Databricks as a source in Amplitude, complete the following steps.

Connect to Databricks

  1. In Amplitude Data, click Catalog and navigate to the Sources tab.
  2. Search for Databricks.
  3. On the Credentials tab of the Connect Databricks screen, enter the credentials you configured during the Databricks configuration:
    • Server hostname
    • HTTP Path
    • Personal Access Token (for the workspace user or Service Principal)
  4. Click Next to verify access.

Select data to import

  1. Select the data type for the data to import. The Databricks source supports three data types: Event, User Properties, and Group Properties.

    For the Event data type, optionally select Sync User Properties or Sync Group Properties to sync the corresponding properties within an event.

  2. Configure the SQL command that transforms data in Databricks before Amplitude imports it.

    • Amplitude treats each record in the SQL execution output as an event to import. See the Example body in the Batch Event Upload API documentation to ensure each record you import complies.
    • Amplitude can transform and import data from only the tables you select in step 1 above.
      • For example, if you have access to tables A, B, and C but selected only A in step 1, then you can import data only from A.
    • The table names you reference in the SQL command must exactly match the name of the table you select in step 1. For example, if you select catalog.schema.table1, use that exact value in the SQL.
    select
      unix_millis(current_timestamp()) as time,
      id as user_id,
      "demo" as event_type,
      named_struct('name', name, 'home', home, 'age', age, 'income', income) as user_properties,
      named_struct('group_type1', ARRAY("group_A", "group_B")) as groups,
      named_struct('group_property', "group_property_value") as group_properties
    from catalog.schema.table1;
  3. After you add the SQL, click Test SQL. Amplitude runs a test against your Databricks instance to ensure the SQL is valid. Click Next.

  4. Select the table version for the initial import. The initial import brings in everything from the table as of the selected version. Select First or Latest. To inspect the versions available for your table, see the sketch after these steps.

    • First means the first version of the table, which is version 0.
    • Latest means the most recent version.
  5. Set the sync frequency. This frequency determines the interval at which Amplitude pulls data from Databricks.

  6. Enter a descriptive name for this instance of the source.

  7. When you finish, the source appears in your workspace's Sources list.
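
To see which versions of a table exist before choosing First or Latest, a minimal sketch (again using the example table name catalog.schema.table1) is to list the table's Delta history in a notebook:

-- Each row describes one version of the table; version 0 is the first version.
DESCRIBE HISTORY catalog.schema.table1;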

Verify data import

Events that Amplitude imports take the name you assign in your SQL statement. In the example above, the events have the name demo.

To verify the data coming into Amplitude:

  • View the Events page of your Tracking Plan.
  • Create a Segmentation chart that filters on the event name you specify.
  • Go to the Ingestion Jobs tab of your source to view the status of the ingestion and, if necessary, debug with the ERROR LOG.