You can import historical data to Amplitude yourself using the Batch Event Upload API.
Review these considerations before backfilling data.
"$skip_user_properties_sync": true.
There are a few different limits to keep in mind when backfilling data.
exceeded_daily_quota_users or exceeded_daily_quota_devices in the response. See the Batch Event Upload API for more information.
$insert_id, while HTTP and Batch APIs use the format insert_id without the $.
insert_id field to deduplicate events.
This is Amplitude's recommendation for backfilling large amounts of data:
device_id).
device_id or user_id.
To optimize this process further, add aggressive retry logic with high timeouts. Continue to retry until you receive a 200 response. If you send an insert_id, then Amplitude deduplicates data that has the same insert_id sent within 7 days of each other.
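The retry advice above can be sketched roughly as follows. This is a minimal illustration, not Amplitude's client code: `send` is a hypothetical callable that uploads one batch and returns the HTTP status code, and the scheme for deriving an `insert_id` from event fields is an assumption made here so that retried uploads can be deduplicated.

```python
import time
import uuid


def with_insert_id(event: dict) -> dict:
    """Return a copy of the event, adding a deterministic insert_id if missing."""
    if event.get("insert_id"):
        return dict(event)
    # Assumed scheme: derive the ID from fields that identify the event, so a
    # retried upload carries the same insert_id as the first attempt.
    key = "|".join(
        str(event.get(field, ""))
        for field in ("user_id", "device_id", "event_type", "time")
    )
    return {**event, "insert_id": str(uuid.uuid5(uuid.NAMESPACE_URL, key))}


def upload_with_retry(batch, send, sleep=time.sleep, base_delay=1.0, max_delay=60.0):
    """Call send(batch) until it returns 200. Returns the number of attempts."""
    delay = base_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            status = send(batch)
        except OSError:
            # Network errors and timeouts are treated as retryable.
            status = None
        if status == 200:
            return attempts
        sleep(delay)
        # Exponential backoff, capped at max_delay seconds.
        delay = min(delay * 2, max_delay)
```

The loop follows the text literally and retries until a 200 arrives. A real backfill script would probably also cap the number of attempts or stop on responses that can never succeed; that is left out here.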
When Amplitude captures an event, it includes the current values for each user property, which can change over time. When Amplitude receives an event with user properties, it updates the existing user properties and adds any new ones. You can change this behavior by adding "$skip_user_properties_sync": true to the event payload.
When you include "$skip_user_properties_sync": true, Amplitude ignores the user properties table completely. The event has only the user properties sent with the event, doesn't update the user properties table, and doesn't display any pre-existing user properties.
For example, you send the following event to Amplitude. The user properties table already has the user property "city": "New York".

```json
{
  "api_key": "API_KEY",
  "events": [
    {
      "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
      "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
      "device_id": "",
      "event_type": "Button Clicked",
      "user_properties": {
        "subscriptionStatus": "active"
      }
    }
  ]
}
```
The event appears in Amplitude as:

```json
"events": [
  {
    "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
    "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
    "device_id": "",
    "event_type": "Button Clicked",
    "user_properties": {
      "city": "New York",
      "subscriptionStatus": "active"
    }
  }
]
```
You include "$skip_user_properties_sync": true and send the same event. The event appears in Amplitude like this:

```json
"events": [
  {
    "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
    "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
    "device_id": "",
    "event_type": "Button Clicked",
    "$skip_user_properties_sync": true,
    "user_properties": {
      "subscriptionStatus": "active"
    }
  }
]
```
Notice that it doesn't include the city property.
Next, you include "$skip_user_properties_sync": true and send this event:

```json
{
  "api_key": "API_KEY",
  "events": [
    {
      "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
      "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
      "device_id": "",
      "event_type": "Button Clicked",
      "$skip_user_properties_sync": true,
      "user_properties": {
        "city": "San Francisco"
      }
    }
  ]
}
```
Amplitude doesn't update the user properties table, and the event appears in Amplitude like this:

```json
"events": [
  {
    "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
    "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
    "device_id": "",
    "event_type": "Button Clicked",
    "user_properties": {
      "city": "San Francisco"
    }
  }
]
```
Any new event still has "city":"New York", but this event displays "city":"San Francisco".
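The behavior walked through above can be modeled with a few lines of Python. This is a toy model of the described semantics only, not Amplitude's implementation; `apply_event` and the plain dictionary standing in for the user properties table are hypothetical names chosen for illustration.

```python
def apply_event(user_table: dict, event: dict) -> dict:
    """Return the user properties displayed on the event.

    user_table stands in for the stored user properties of one user and is
    updated in place unless the event sets "$skip_user_properties_sync".
    """
    sent = event.get("user_properties", {})
    if event.get("$skip_user_properties_sync"):
        # The table is neither read nor written: the event shows only what it sent.
        return dict(sent)
    # Default behavior: update existing properties and add new ones.
    user_table.update(sent)
    return dict(user_table)


table = {"city": "New York"}

# Default sync: the event shows the stored city plus the new property.
shown = apply_event(table, {"user_properties": {"subscriptionStatus": "active"}})

# With the flag: the event shows only its own properties, and the table is untouched.
skipped = apply_event(
    table,
    {"$skip_user_properties_sync": True, "user_properties": {"city": "San Francisco"}},
)
```

After these two calls, `table` still holds "New York" as the city, which matches the statement that later events keep the stored value.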
If you send data with a timestamp that is 30 days old or older, it can take up to 48 hours to appear in some parts of the Amplitude system. Use the User Activity tab to check the events that you are sending, because it updates in real time regardless of the time of the event.
In Amplitude's ingestion system, each user's current user properties are always tracked and synced to that user's incoming events.
When sending data to Amplitude, you either send event data or send identify calls to update a user's user properties. These identify calls update a user's current user property values and affect the user properties associated with events received after the identify call.
The Datamonster user currently has one user property, 'color', and it is set to 'red'. Then, Datamonster logs a 'View Page A' event and triggers an identify that sets 'color' to 'blue'. Afterwards, they log a 'View Page B' event:
1. logEvent -> 'View Page A'
2. identify -> 'color':'blue'
3. logEvent -> 'View Page B'
If Amplitude receives events from Datamonster in that exact order, then you would expect 'View Page A' to have 'color' = 'red' and 'View Page B' to have 'color' = 'blue'. This is because Amplitude maintains the value of user properties at the time of the event. For this reason, the order in which events are uploaded is very important. If the identify was received after 'View Page B', then 'View Page B' would have 'color' = 'red' instead of 'blue'.
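The effect of arrival order can be illustrated with a small simulation. This is only a sketch of the rule described above (each event is stamped with the user properties current at the time it is received); `replay` and the tuple format for operations are hypothetical and not part of any Amplitude API.

```python
def replay(initial_props: dict, operations: list) -> list:
    """Process operations in arrival order.

    Each operation is ("identify", {props}) or ("event", event_name).
    Returns a list of (event_name, user properties attached to that event).
    """
    current = dict(initial_props)
    stamped = []
    for kind, payload in operations:
        if kind == "identify":
            # An identify call changes the user's current property values.
            current.update(payload)
        else:
            # An event gets a snapshot of the properties as they are right now.
            stamped.append((payload, dict(current)))
    return stamped


in_order = replay(
    {"color": "red"},
    [("event", "View Page A"), ("identify", {"color": "blue"}), ("event", "View Page B")],
)

late_identify = replay(
    {"color": "red"},
    [("event", "View Page A"), ("event", "View Page B"), ("identify", {"color": "blue"})],
)
```

In `in_order`, 'View Page B' carries 'blue'. In `late_identify`, both events carry 'red' because the identify arrived after them.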
Because Amplitude processes all of a user's events using the same ingestion worker, Amplitude guarantees that it processes events in the order in which they're received. In essence, all the Datamonster user's events queue up in order on a single ingestion worker. If these events were instead processed in parallel across two separate workers, it would be harder to guarantee the order. For example, one worker might be faster than another.
Because a single ingestion worker processes a user's events, a user sending an abnormally high number of events in a short period would overload their assigned worker. To avoid overloading your ingestion workers, Amplitude recommends limiting event upload to 300 events per second per device ID. Backfills can exceed 300 events per second if you iterate through historical data and send it as fast as possible in parallel.
Amplitude keeps track of each device ID's event rate. If a device ID is sending too many events, Amplitude rejects the events and returns a 429 throttling HTTP response code. If you receive a 429 in response to an event upload, the process should sleep for a few seconds and then keep retrying the upload until it succeeds. This ensures that events aren't lost in the backfill process. If you don't retry after a 429 response code, then that batch of events isn't ingested.
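One way to stay under the recommended rate on the client side is to pace uploads per device ID. The sketch below is an assumption about how a backfill script could do this, not an Amplitude-provided utility: `split_into_rounds`, `send_paced`, and the `send` callable are hypothetical, and any per-request size limits are not handled.

```python
import time
from collections import defaultdict

MAX_EVENTS_PER_SECOND_PER_DEVICE = 300  # rate recommended in the text above


def split_into_rounds(events, limit=MAX_EVENTS_PER_SECOND_PER_DEVICE):
    """Split events into rounds holding at most `limit` events per device_id.

    Each device's events keep their original relative order across rounds.
    """
    assigned = defaultdict(int)  # number of events already placed, per device
    rounds = []
    for event in events:
        device = event["device_id"]
        index = assigned[device] // limit  # round this event belongs to
        assigned[device] += 1
        while len(rounds) <= index:
            rounds.append([])
        rounds[index].append(event)
    return rounds


def send_paced(events, send, sleep=time.sleep):
    """Send one round, then wait a second, so no device exceeds the limit."""
    for batch in split_into_rounds(events):
        send(batch)  # hypothetical uploader; should itself retry on 429
        sleep(1.0)
```

Pacing does not replace the retry on 429 described above. It only makes throttling less likely when historical data is replayed as fast as possible.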
If you have preexisting users, then you should backfill the users to accurately mark when they were new users. Amplitude marks users as new based on the timestamp of their earliest event.
To backfill your preexisting users, use the Batch API. Send a "dummy event" or a signup event where the event timestamp is the actual time the user was originally created. For instance, if a user signed up on August 1, 2022, the timestamp of the event you send should be August 1, 2022.
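A signup event for a preexisting user could be built along these lines. This is a sketch under assumptions: the function name, the "Signup" event type, and the insert_id format are made up for illustration, and the event's `time` field is assumed to be expressed in milliseconds since the Unix epoch.

```python
from datetime import datetime, timezone


def signup_event(user_id: str, created_at: datetime, event_type: str = "Signup") -> dict:
    """Build a backfill event timestamped at the user's original creation time."""
    if created_at.tzinfo is None:
        # Naive datetimes would be interpreted in local time; require an explicit zone.
        raise ValueError("created_at must be timezone-aware")
    return {
        "user_id": user_id,
        "event_type": event_type,
        # Assumed unit: milliseconds since the Unix epoch.
        "time": int(created_at.timestamp() * 1000),
        # Hypothetical ID format; stable per user so re-sent events can be deduplicated.
        "insert_id": f"backfill-signup-{user_id}",
    }


event = signup_event(
    "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
    datetime(2022, 8, 1, tzinfo=timezone.utc),
)
```

The resulting dictionary can be placed in the "events" array of a Batch API payload like the ones shown earlier.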
July 25th, 2024