Databricks
This feature is in Early Access. During this time, aspects of the functionality may still be developed, and this documentation may not always be up to date. If you have any questions, contact Amplitude Support.
Load your Amplitude raw event data into your Databricks workspace through a Unity Catalog Delta table. The connection authenticates with OAuth Workload Identity Federation (WIF), so Amplitude never holds a long-lived client secret for your workspace.
Considerations
- Amplitude delivers exports on a best-effort basis. Expect data to land in your target Delta table within about 20 minutes of Amplitude receiving the events, though timing can vary with load and volume.
- You can create multiple Databricks exports from the same Amplitude project. Each export has its own event filter and destination table, so a single project can fan out into purpose-built tables.
Prerequisites
You need admin or manager privileges in Amplitude, and a role in Databricks that can manage service principals, create Unity Catalog grants, and create an account-level Federation Policy.
Set up the following in Databricks before you open the wizard:
- A Databricks workspace with a Unity Catalog catalog and schema. Capture the workspace host (for example,
https://dbc-xxxx.cloud.databricks.com). - A SQL warehouse. Capture the warehouse ID from the Databricks UI under SQL Warehouses → your warehouse → View JSON or the URL.
- A service principal in Settings → Identity and access → Service principals. Don't generate an OAuth secret. Capture both:
- The application_id (a UUID). The wizard calls this the
clientId, and it's the credential Amplitude uses at run time. - The numeric service principal ID. The wizard uses this to fill in the Federation Policy CLI command.
- The application_id (a UUID). The wizard calls this the
Why Workload Identity Federation?
WIF removes the shared client secret that OAuth machine-to-machine (M2M) authentication requires. It's the more secure default for this integration:
- No shared secret. Amplitude never receives a credential that authenticates as your service principal outside of a live export run.
- Nothing for you to rotate. The credential is a short-lived JWT that Amplitude's infrastructure mints and refreshes automatically. There's no password or key on your side that drifts, expires, or leaks through a backup.
- One-line revocation. Deleting the Federation Policy on your service principal ends Amplitude's ability to authenticate immediately, without disturbing other integrations on the same workspace.
The wizard generates the Federation Policy command and the Unity Catalog grants SQL for you. You run them in Databricks during setup.
Set up the integration
The setup wizard has two steps: Get started and Set up. The wizard generates the Databricks CLI command and grant SQL as you type, so you can run them in Databricks without leaving the page.
In Amplitude Data, click Catalog and select the Destinations tab.
In the Warehouse Destinations section, click Databricks.
On the Get started step, select Export events ingested today and moving forward. Set how often Amplitude exports data with the frequency picker (the default is every hour). To export only events that meet certain criteria, add a filter. Click Next.
On the Set up step, enter your Credential (OAuth WIF) details:
- Workspace host: the full workspace URL including the scheme, for example
https://dbc-xxxx.cloud.databricks.com. - Service Principal ID (numeric): the numeric ID from the Databricks UI under Settings → Identity and access → Service principals → your service principal. The wizard uses this only to fill in the Federation Policy command. Amplitude doesn't store it.
- Service Principal application_id (clientId): the UUID from the same Databricks page. This is the credential Amplitude uses at run time.
Copy the Federation Policy command on the right and run it with the Databricks CLI authenticated against your Databricks account, not a workspace. The command tells Databricks to accept JWTs from Amplitude's environment as proof of identity for the service principal. The wizard sets the OIDC issuer in the command automatically for the Amplitude environment your project runs in.
- Workspace host: the full workspace URL including the scheme, for example
In the same step, enter your Unity Catalog target:
- Catalog, Schema, Table name: the three-part Unity Catalog identifier where Amplitude writes events. Amplitude creates the table on the first run if it doesn't exist.
- SQL warehouse ID: the warehouse Amplitude uses to run the
COPY INTOstatement.
Copy the Unity Catalog grants SQL on the right and run it in a Databricks SQL editor. Then in the Databricks UI, open SQL Warehouses → your warehouse → Permissions and grant the service principal CAN USE. The warehouse permission isn't grantable through SQL.
Click Finish.
Amplitude creates the credential and the export schedule. The first run executes at the next scheduled tick on the cadence you picked, verifying the credential end to end.
Revoke Amplitude's access
To stop Amplitude from authenticating as the service principal, delete the Federation Policy you created in step 4:
databricks account service-principal-federation-policy list <sp-application-id> \
--profile <your-databricks-account-profile>
databricks account service-principal-federation-policy delete <sp-application-id> <policy-id> \
--profile <your-databricks-account-profile>
Use the same Databricks account profile you used in the Federation Policy command during setup. A workspace-scoped profile returns a 401 for these subcommands.
After you delete the policy, the next scheduled run fails with PERMISSION_DENIED and Databricks stops issuing workspace tokens for the service principal. To also block data access for any token that's still valid, remove the Unity Catalog grants from step 5.
Databricks export format
Data location
Amplitude writes events to the Unity Catalog table you named in the setup wizard. The full identifier is {catalog}.{schema}.{table_name}, using the values you entered.
Route different event slices to different tables
Amplitude's Databricks destination lets you create multiple exports from the same project. Each destination can have its own event filter and its own destination table, so you can fan a project's event stream out into purpose-built tables. For example, set up one destination that filters to checkout events and lands them in analytics.finance.checkout_events, and a second that filters to engagement events and lands them in analytics.product.engagement_events.
Event table schema
The event table uses the following Delta columns:
| Column | Type | Description |
|---|---|---|
amplitude_attribution_ids | VARIANT | Hashed attribution IDs on the event. |
amplitude_id | BIGINT | The original Amplitude ID for the user. Use this field to automatically handle merged users. Example: 2234540891. |
app | BIGINT | Project ID found in your project's Settings page. Example: 123456. |
city | STRING | City. Example: "San Francisco". |
client_event_time | TIMESTAMP | Local timestamp (UTC) of when the device logged the event. Example: 2015-08-10T12:00:00.000000. |
client_upload_time | TIMESTAMP | The local timestamp (UTC) of when the device uploaded the event. Example: 2015-08-10T12:00:00.000000. |
country | STRING | Country. Example: "United States". |
data | VARIANT | Dictionary where Amplitude stores certain fields such as first_event and merged_amplitude_id. |
device_carrier | STRING | Device carrier. Example: Verizon. |
device_family | STRING | Device family. Example: Apple iPhone. |
device_id | STRING | The device-specific identifier. Example: C8F9E604-F01A-4BD9-95C6-8E5357DF265D. |
device_type | STRING | Device type. Example: Apple iPhone 5s. |
dma | STRING | Designated marketing area (DMA). Example: San Francisco-Oakland-San Jose, CA. |
event_id | BIGINT | A counter that distinguishes events. Example: 1. |
event_properties | VARIANT | Event properties as semi-structured JSON. |
event_time | TIMESTAMP | Amplitude timestamp (UTC), which is client_event_time adjusted by the difference between server_received_time and client_upload_time: event_time = client_event_time + (server_received_time - client_upload_time). Amplitude uses this timestamp to organize events on Amplitude charts. If the difference between server_received_time and client_upload_time is less than 60 seconds, Amplitude doesn't adjust event_time and it equals client_event_time. Example: 2015-08-10T12:00:00.000000. |
event_type | STRING | Event type. |
group_properties | VARIANT | Group properties as semi-structured JSON. |
groups | VARIANT | Group types. For more information, refer to Account-level reporting. |
ip_address | STRING | IP address. Example: "123.11.111.11". |
language | STRING | Language reported by the device. |
library | STRING | The Amplitude SDK that sent the event. |
location_lat | DOUBLE | Latitude. Example: 12.3456789. |
location_lng | DOUBLE | Longitude. Example: -123.4567890. |
os_name | STRING | OS name. Example: ios. |
os_version | STRING | OS version. |
paying | STRING | True if the user has ever logged any revenue, otherwise none. You can modify the property value with the Identify API. Example: true. |
platform | STRING | Platform that sent the event. |
processed_time | TIMESTAMP | Time Amplitude processed the event. |
region | STRING | Region. Example: California. |
sample_rate | BIGINT | Event sampling rate. |
server_received_time | TIMESTAMP | Server time when Amplitude received the event. |
server_upload_time | TIMESTAMP | Amplitude timestamp (UTC) of when Amplitude servers received the event. Example: 2015-08-10T12:00:00.000000. |
session_id | BIGINT | The session start time in milliseconds since epoch. Example: 1396381378123. |
start_version | STRING | App version when Amplitude first tracked the user. Example: 1.0.0. |
user_id | STRING | A readable ID you specify. Use something that doesn't change. Avoid using the user's email address. |
user_properties | VARIANT | User properties as semi-structured JSON. |
What's coming next
Coming soon
Merged Amplitude IDs export
A second table that tracks Amplitude's user-merge ledger, so you can correctly join events back to a user's canonical Amplitude ID.
Data Configuration export
Snapshots of your event transformations and custom event definitions, so warehouse-side queries stay aligned with the semantics you build in Amplitude.
Historical backfill
Export your existing Amplitude event history into Databricks in a single job, after your recurring export is set up.
Was this helpful?