Amplitude's Databricks import source enables you to import data from Databricks to your Amplitude account. Databricks import uses the Databricks Change Data Feed feature to securely access and extract live data from your Databricks workspace.
For guided instructions on setting up this integration, watch the Loom video.
- This integration supports `insert` operations, and for some data types `insert` or `update` operations. It doesn't support `delete` operations for any data type.
- The User Look-Up page doesn't display the 100 most recent events ingested.
The following Databricks features don't support time travel and aren't supported by this integration:
SQL input restrictions for the Continuous Sync change data feed type:

- `UNION`, `INTERSECT`, `MINUS`, and `EXCEPT` aren't supported.
- `JOIN` clauses use mutation metadata from the main table and ignore the mutation history of the joined table. Amplitude uses the latest version of data in the joined table during data synchronization.

Before you start to configure the Databricks source in Amplitude, complete these tasks in Databricks:
Create a compute cluster. Amplitude creates workflows in this cluster on your behalf to start sync jobs. When complete, copy the Server Hostname and HTTP Path values to use in a later step. Find both values on the Configuration -> JDBC/ODBC tab. For more information about cluster types, see Compute.
Ensure the new cluster can run jobs by verifying that its cluster policy doesn't include the configuration below. For details, see Databricks' article Policy definition.
1"workload_type.clients.jobs": {2 "type": "fixed",3 "value": false4}
Ensure that your cluster runs Python 3.9 or later. Otherwise, you may see the following error in your workflow job:
```
TypeError: 'type' object is not subscriptable
```
Amplitude supports all policies and access modes. However, if your clusters have the following policies and access modes, grant the Data Reader permission (`USE CATALOG`, `USE SCHEMA`, `EXECUTE`, `READ VOLUME`, `SELECT`) to the workspace user or service principal whose personal access token is used to authenticate in Amplitude. Otherwise, you can't access the tables in the Unity Catalog of your import source. A SQL sketch of these grants follows the tables below.
AWS Databricks:

| Policy | Node | Cluster Access mode |
|---|---|---|
| Unrestricted | Multi node | No isolation shared |
| Unrestricted | Single node | No isolation shared |
| Power User Compute | N/A | No isolation shared |
| Legacy Shared Compute | N/A | N/A |
GCP Databricks:

| Policy | Node | Cluster Access mode |
|---|---|---|
| Unrestricted | Multi node | No isolation shared |
| Unrestricted | Single node | No isolation shared |
| Power User Compute | N/A | Shared |
| Power User Compute | N/A | No isolation shared |
| Legacy Shared Compute | N/A | N/A |
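If you prefer to grant these privileges with SQL instead of the Catalog Explorer UI, a minimal sketch follows. The catalog name and principal identifier are placeholders, not values from this guide; replace them with your own.

```sql
-- Grant the five privileges behind the Data Reader preset to the
-- principal whose personal access token Amplitude uses.
-- <catalog_name> and user@example.com are placeholders.
GRANT USE CATALOG ON CATALOG <catalog_name> TO `user@example.com`;
GRANT USE SCHEMA ON CATALOG <catalog_name> TO `user@example.com`;
GRANT EXECUTE ON CATALOG <catalog_name> TO `user@example.com`;
GRANT READ VOLUME ON CATALOG <catalog_name> TO `user@example.com`;
GRANT SELECT ON CATALOG <catalog_name> TO `user@example.com`;
```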
Amplitude's Databricks import authenticates with personal access tokens (PATs), issued either for a Databricks workspace user or for a service principal. Choose Workspace User authentication for faster setup, or Service Principal authentication for finer-grained control. For more information, see Authentication for Databricks Automation.
Amplitude's Databricks import uses personal access tokens to authenticate. For the quickest setup, create a PAT for your workspace user in Databricks. For more information, see Databricks' article Personal Access Tokens for Workspace Users.
Amplitude recommends that you create a service principal in Databricks to allow for more granular control of access.
Follow the Databricks instructions to create a service principal. Copy the UUID for use in a later step.
Generate a PAT for this service principal.
The service principal you created above requires the following permissions in Databricks:
| Permission | Reason | Location in Databricks |
|---|---|---|
| Workspace | Grants access to your Databricks workspace. | Workspace → <workspace_name> → Permissions → Add permissions. Add the service principal you created with the User permission, then click Save. |
| Table | Grants access to list tables and read data. | Catalog → select the catalog → Permissions → Grant. Select the Data Reader permission (`USE CATALOG`, `USE SCHEMA`, `EXECUTE`, `READ VOLUME`, `SELECT`). |
| Cluster | Grants access to connect to the cluster and run workflows on your behalf. | Compute → All-purpose compute → Edit Permission. Add the Can Attach To permission to the service principal. |
| Export | Enables the service principal to unload your data through Spark and export it to S3. | Run the SQL commands shown below this table in any notebook. |
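The Export grants from the table, as a runnable block; `<service_principal_uuid>` is the UUID you copied when creating the service principal:

```sql
-- Allow the service principal to read and write arbitrary files,
-- which the Spark export job requires.
GRANT MODIFY ON ANY FILE TO `<service_principal_uuid>`;
GRANT SELECT ON ANY FILE TO `<service_principal_uuid>`;
```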
Amplitude uses the Databricks Change Data Feed (CDF) to continuously import data. To enable CDF on a Databricks table, see Databricks | Enable change data feed.
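As a quick sketch, CDF is a Delta table property you can set from any notebook; the table names here are placeholders:

```sql
-- Enable the change data feed on an existing Delta table.
ALTER TABLE catalog.schema.table1
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Or enable it at table creation time.
CREATE TABLE catalog.schema.table2 (id STRING, name STRING)
TBLPROPERTIES (delta.enableChangeDataFeed = true);
```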
To add Databricks as a source in Amplitude, complete the following steps.
Select the data type for data to be imported. The Databricks source supports three data types:
For the `Event` data type, optionally select Sync User Properties or Sync Group Properties to sync the corresponding properties within an event.
If you selected `Event` or `Profiles` as the data type, choose the change data feed type:
Configure the SQL command that transforms data in Databricks before Amplitude imports it.
For example, if you have tables `A`, `B`, and `C` but selected only `A` in step 1, then you can import data only from `A`. If your table is `catalog.schema.table1`, use that exact value in the SQL:

```sql
select
  unix_millis(current_timestamp()) as time,
  id as user_id,
  "demo" as event_type,
  named_struct('name', name, 'home', home, 'age', age, 'income', income) as user_properties,
  named_struct('group_type1', ARRAY("group_A", "group_B")) as groups,
  named_struct('group_property', "group_property_value") as group_properties
from catalog.schema.table1;
```
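To preview what the change data feed returns before Amplitude reads it, you can query it directly in a notebook. This sketch assumes the example table above with CDF enabled since version 0:

```sql
-- Read all change rows from version 0 onward. The result includes the
-- _change_type, _commit_version, and _commit_timestamp metadata columns.
SELECT * FROM table_changes('catalog.schema.table1', 0);
```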
After you add the SQL, click Test SQL. Amplitude runs a test against your Databricks instance to ensure the SQL is valid. Click Next.
Select the table version for the initial import. The initial import brings in everything from the table as of the selected version. Select First or Latest.
- First means the first version of the table, which is version 0 (to check which versions exist, see the sketch after this step).
- Latest means the latest version.

Set the sync frequency. This frequency determines the interval at which Amplitude pulls data from Databricks.
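To see which versions your table has, and what Latest currently resolves to, you can inspect the Delta history in a notebook (using the example table name from above):

```sql
-- List the table's version history. Version 0 corresponds to First;
-- the highest version number is what Latest points to.
DESCRIBE HISTORY catalog.schema.table1;
```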
Enter a descriptive name for this instance of the source.
The source appears in the Sources list for your workspace.
Events that Amplitude imports take the name you assign in your SQL statement. In the example above, the events have the name `demo`.
To verify the data coming into Amplitude, check the Ingestion Jobs tab in your source. You can view the status of the ingestion and, if necessary, debug with the ERROR LOG.

Depending on your company's network policy, you may need to add the following IP addresses to your allowlist to allow Amplitude's servers to access your Databricks instance: