Data backfill

You can import historical data to Amplitude using the Batch Event Upload API.

Considerations

Review these considerations before backfilling data.

  • Consider keeping historical data in a separate Amplitude project instead of backfilling into a live production project. A separate project makes the upload easier and keeps your live Amplitude data clean and focused on current and future data. You typically don't need to check historical data often, but you still want it available. If you do backfill into a live project, historical user property values overwrite the current live values, and Amplitude then syncs the out-of-date values onto new live events. To skip user property sync, add the following to your event payload: "$skip_user_properties_sync": true.
  • To connect historical data with current data, combine the historical data and live data in the same project. To connect users from each dataset, the users need matching Amplitude user IDs in each set.
  • The new user count might change. Amplitude defines a new user based on the earliest event timestamp it sees for a given user. If Amplitude records a user as new on June 1, 2021, and you backfill data for that user from February 1, 2021, then Amplitude redefines the user as new on February 1, 2021.
  • Backfilling can compromise your app data. If a mismatch exists between the current user ID and the backfilled user ID, Amplitude interprets the two distinct User IDs as two distinct users. As a result, Amplitude double-counts users. Because Amplitude can't delete data after it's recorded, you might have to create a new project to prevent data issues.
  • Amplitude uses the Device ID and User ID fields to compute the Amplitude ID. For more information, refer to Track unique users.
  • Events in the backfill count toward your monthly event volume.

Limits

Keep these limits in mind when backfilling data.

  • Daily limit: To protect Amplitude from event spam, each project has a daily ingestion limit of 500K events per device ID (and per user ID). The limit uses a 24-hour rolling window of 1-hour intervals, so at any given time a user or device can have sent at most 500K events in the preceding 24 hours. If you hit this limit, you get exceeded_daily_quota_users or exceeded_daily_quota_devices in the response. For more information, refer to Batch Event Upload.
  • Batch limit: An upload limit of 100 batches per second and 1000 events per second applies. You can batch events into an upload, but Amplitude recommends sending no more than 10 events per batch. Amplitude throttles your upload if you send more than 10 events per second for a single device ID. For more information about throttling, refer to Batch Event Upload. To avoid overloading your ingestion workers, Amplitude recommends limiting backfill event upload to 300 events per second per device ID. Backfills can exceed 300 events per second if you iterate through historical data and send data as fast as possible in parallel.
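Before you start an upload, it can help to check whether any single device ID or user ID in the backfill would run into the daily limit. The following is a minimal sketch, not part of any Amplitude SDK: the function name ids_over_daily_limit is hypothetical, and it assumes the events are already loaded in memory as dictionaries with device_id and user_id keys.

```python
from collections import Counter
from typing import Dict, Iterable, List


def ids_over_daily_limit(events: Iterable[dict], limit: int = 500_000) -> Dict[str, List[str]]:
    """Return the device IDs and user IDs that appear in more than `limit` events.

    IDs returned here can't be uploaded within a single 24-hour window
    without hitting the daily quota, so their events need to be spread out.
    """
    device_counts: Counter = Counter()
    user_counts: Counter = Counter()
    for event in events:
        # Empty or missing IDs are ignored in this sketch.
        if event.get("device_id"):
            device_counts[event["device_id"]] += 1
        if event.get("user_id"):
            user_counts[event["user_id"]] += 1
    return {
        "device_id": sorted(k for k, n in device_counts.items() if n > limit),
        "user_id": sorted(k for k, n in user_counts.items() if n > limit),
    }
```

If the function returns any IDs, one option is to upload those users' events over several days rather than all at once.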

Backfill best practices

  • Review the documentation for the Batch API. If you exported historical data using the Export API and want to use the data to backfill, note that the exported fields aren't in the same format as the fields needed for import. For example, the Export API uses $insert_id, while HTTP and Batch APIs use the format insert_id without the $.
  • Decide which fields to send and map your historical data to Amplitude fields. Amplitude strongly recommends that you use the insert_id field to deduplicate events.
  • Because no way exists to undo an import, create a test project in Amplitude to send sample data from your backfill. Run several tests with a few days' worth of data in an Amplitude test project before the final upload to the production project.
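The field mapping step can be sketched as a small conversion function. This example handles only the one difference named above ($insert_id in exported data versus insert_id for import). The function name export_to_batch_event is hypothetical, and any other fields in your export need their own mapping based on the Batch API documentation.

```python
def export_to_batch_event(exported: dict) -> dict:
    """Return a copy of an exported event with `$insert_id` renamed to `insert_id`.

    Only the rename described in this guide is applied; all other keys are
    copied through unchanged and may still need mapping.
    """
    event = dict(exported)  # shallow copy so the input isn't mutated
    if "$insert_id" in event:
        event["insert_id"] = event.pop("$insert_id")
    return event


exported = {"$insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636", "event_type": "Button Clicked"}
converted = export_to_batch_event(exported)
```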

Amplitude recommends this approach for backfilling large amounts of data:

  1. Break up the set of events into small, non-overlapping sets (for example, partition by device_id).
  2. Have one worker per set of events run these steps:
    1. Read many events from your system.
    2. Partition those events into requests based on device_id or user_id.
    3. Send your requests concurrently or in parallel to Amplitude.
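The partitioning steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper names are hypothetical, events are plain dictionaries, and the default batch size of 10 follows the per-batch recommendation in the Limits section. Sending the resulting requests is left to your own HTTP client.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def partition_by_key(events: List[dict], key: str = "device_id") -> Dict[str, List[dict]]:
    """Group events so that each group holds the events of one device (or user)."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)
    return dict(groups)


def chunk(events: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` events, preserving order."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def build_requests(events: List[dict], api_key: str,
                   key: str = "device_id", batch_size: int = 10) -> List[dict]:
    """Build request bodies in which every batch contains a single key value."""
    requests = []
    for group in partition_by_key(events, key).values():
        for batch in chunk(group, batch_size):
            requests.append({"api_key": api_key, "events": batch})
    return requests
```

Because each request contains events for a single device_id (or user_id), a worker can send requests for different IDs in parallel while keeping each ID's events in their original order.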

To optimize further, add aggressive retry logic with high timeouts. Continue to retry until you receive a 200 response. If you send an insert_id, Amplitude deduplicates data that has the same insert_id sent within 7 days of each other.
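Retries are only safe if a resent event carries the same insert_id as the original attempt. If your historical data has no natural unique ID, one option is to derive the insert_id from the event's own fields. The sketch below is an assumption, not an Amplitude requirement: the function name and the choice of fields are hypothetical, and two events that are identical in all chosen fields get the same insert_id and are therefore treated as duplicates.

```python
import hashlib
import json
from typing import Sequence


def stable_insert_id(event: dict,
                     fields: Sequence[str] = ("user_id", "device_id", "event_type", "time")) -> str:
    """Derive a deterministic insert_id by hashing selected event fields.

    The same input always produces the same ID, so a retried upload
    reuses the ID of the first attempt.
    """
    basis = {name: event.get(name) for name in fields}
    # sort_keys and fixed separators make the serialized form stable.
    encoded = json.dumps(basis, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


event = {"user_id": "u1", "event_type": "Button Clicked", "time": 1659312000000}
event["insert_id"] = stable_insert_id(event)
```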

Skip user properties sync

When Amplitude captures an event, it includes the current values for each user property, which can change over time. When Amplitude receives an event with user properties, it updates the existing user properties and adds any new user properties. To change this behavior, add "$skip_user_properties_sync": true to the event payload.

When you include "$skip_user_properties_sync": true, Amplitude ignores the user properties table completely. The event has only the user properties sent with the event, doesn't update the user properties table, and doesn't display any preexisting user properties.

For example, you send the following event to Amplitude. The user property table already has the user property "city": "New York".

json
{
    "api_key": "API_KEY",
    "events": [
        {
            "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
            "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
            "device_id": "",
            "event_type": "Button Clicked",
            "user_properties": {
                "subscriptionStatus": "active"
            }
        }
    ]
}

The event appears in Amplitude as:

json
"events": [
    {
        "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
        "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
        "device_id": "",
        "event_type": "Button Clicked",
        "user_properties": {
            "city": "New York",
            "subscriptionStatus": "active"
        }
    }
]

You include "$skip_user_properties_sync": true and send the same event. The event appears in Amplitude like this:

json
"events": [
    {
        "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
        "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
        "device_id": "",
        "event_type": "Button Clicked",
        "$skip_user_properties_sync": true,
        "user_properties": {
            "subscriptionStatus": "active"
        }
    }
]

The event doesn't include the city property.

Next, you include "$skip_user_properties_sync": true and send this event:

json
{
    "api_key": "API_KEY",
    "events": [
        {
            "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
            "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
            "device_id": "",
            "event_type": "Button Clicked",
            "$skip_user_properties_sync": true,
            "user_properties": {
                "city": "San Francisco"
            }
        }
    ]
}

Amplitude doesn't update the user properties table, and the event appears in Amplitude like this:

json
"events": [
    {
        "user_id": "b4ee5d78-e1b6-11ec-8fea-0242ac120002",
        "insert_id": "97b74bc6-a8c8-48f3-bbc7-de9f95aea636",
        "device_id": "",
        "event_type": "Button Clicked",
        "user_properties": {
            "city": "San Francisco"
        }
    }
]

Any new event still has "city":"New York", but this event displays "city":"San Francisco".
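The behavior in this walkthrough can be summarized with a toy model. The code below is not Amplitude's implementation. It is a simplified sketch in which a plain dictionary stands in for the user properties table and a hypothetical function, displayed_user_properties, returns the properties an event would display.

```python
def displayed_user_properties(event: dict, user_table: dict) -> dict:
    """Toy model of user property sync for a single user.

    Without the flag, the sent properties update the table and the event
    shows the merged result. With the flag, the table is neither read nor
    written, and the event shows only what was sent.
    """
    sent = dict(event.get("user_properties", {}))
    if event.get("$skip_user_properties_sync"):
        return sent
    user_table.update(sent)
    return dict(user_table)


table = {"city": "New York"}

# 1. No flag: the event shows city plus the new property.
first = displayed_user_properties(
    {"user_properties": {"subscriptionStatus": "active"}}, table)

# 2. With the flag: only the sent property appears, and the table is unchanged.
second = displayed_user_properties(
    {"$skip_user_properties_sync": True,
     "user_properties": {"city": "San Francisco"}}, table)
```

In this model, second contains only "city": "San Francisco", while table still holds "city": "New York" for later events.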

Timing

If you send data with a timestamp 30 days or older, it can take up to 48 hours to appear in some parts of Amplitude. Use the User Activity tab to check the events you're sending, because that tab updates in real time regardless of the event time.

Resources

Data ingestion system

Amplitude's ingestion system tracks each user's current user properties and syncs them to the user's incoming events.

Diagram of user properties synced to each incoming event in ingestion

When sending data to Amplitude, you either send event data or send identify calls to update a user's user properties. These identify calls update a user's current user property values and affect the user properties attached to events received after the identify call.

For example, a user named Datamonster has one user property, 'color', set to 'red'. Datamonster logs a 'View Page A' event and then triggers an identify that sets 'color' to 'blue'. Afterward, Datamonster logs a 'View Page B' event:

  1. logEvent -> 'View Page A'
  2. identify -> 'color':'blue'
  3. logEvent -> 'View Page B'

If Amplitude receives events from Datamonster in that exact order, you'd expect 'View Page A' to have 'color' = 'red' and 'View Page B' to have 'color' = 'blue'. Amplitude maintains the value of user properties at the time of the event. For this reason, the order in which events are uploaded matters. If the identify arrives after 'View Page B', then 'View Page B' has 'color' = 'red' instead of 'blue'.
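The effect of arrival order can be reproduced with a small simulation. This is a toy model for illustration only, not Amplitude's ingestion code: the replay function and the (kind, value) call format are hypothetical.

```python
from typing import Dict, List, Tuple


def replay(calls: List[Tuple[str, object]], initial: dict) -> Dict[str, dict]:
    """Process calls in arrival order and record the properties each event gets.

    Each call is ("identify", {property: value}) or ("logEvent", event_name).
    """
    current = dict(initial)
    properties_on_event: Dict[str, dict] = {}
    for kind, value in calls:
        if kind == "identify":
            current.update(value)  # changes the user's current properties
        else:
            # The event is stamped with the properties current at arrival time.
            properties_on_event[value] = dict(current)
    return properties_on_event


in_order = replay(
    [("logEvent", "View Page A"), ("identify", {"color": "blue"}), ("logEvent", "View Page B")],
    {"color": "red"},
)
identify_late = replay(
    [("logEvent", "View Page A"), ("logEvent", "View Page B"), ("identify", {"color": "blue"})],
    {"color": "red"},
)
```

With the calls in order, 'View Page B' is recorded with 'color' = 'blue'. When the identify arrives last, both events keep 'color' = 'red'.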

Because Amplitude processes all of a user's events using the same ingestion worker, Amplitude guarantees that it processes events in the order it receives them. All of Datamonster's events queue in order on a single ingestion worker. If two separate workers processed these events in parallel, ordering would be harder to guarantee, because one worker might run faster than the other.

Because a single ingestion worker processes a user's events, a user sending an abnormally high number of events in a short period can overload that worker. To avoid overloading your ingestion workers, Amplitude recommends limiting event upload to 300 events per second per device ID. Backfills can exceed 300 events per second if you iterate through historical data and send data as fast as possible in parallel.

Amplitude tracks each device ID's event rate and rejects events with a 429 throttling HTTP response code if a device ID sends too many events. If you receive a 429 in response to an event upload, have the process sleep for a few seconds and then retry the upload until it succeeds. Retrying ensures that events aren't lost in the backfill process. If you don't retry after a 429 response code, Amplitude doesn't ingest that batch of events.
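A sleep-and-retry loop for throttled uploads might look like the sketch below. It makes several assumptions that aren't part of this guide: the function name upload_with_retry is hypothetical, send is a callable you provide that performs the HTTP request and returns the status code, the delay doubles after each failed attempt, and 5xx responses are retried while other non-200 responses are treated as errors to fix rather than retry.

```python
import time
from typing import Callable


def upload_with_retry(payload: dict,
                      send: Callable[[dict], int],
                      sleep: Callable[[float], None] = time.sleep,
                      base_delay: float = 2.0,
                      max_delay: float = 60.0) -> int:
    """Call send(payload) until it returns 200 and return the attempt count."""
    attempts = 0
    delay = base_delay
    while True:
        attempts += 1
        status = send(payload)
        if status == 200:
            return attempts
        if status == 429 or status >= 500:
            # Throttled or server-side failure: wait, then try the same batch again.
            sleep(delay)
            delay = min(delay * 2, max_delay)
            continue
        # Assumption: other statuses (for example 400) won't succeed on retry.
        raise ValueError(f"upload rejected with status {status}")
```

Passing send and sleep as parameters keeps the loop independent of any particular HTTP library and makes it easy to test without network access.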

Backfill preexisting users

If you have preexisting users, backfill them to accurately mark when they became new users. Amplitude marks users as new based on the timestamp of their earliest event.

To backfill your preexisting users, use the Batch API. Send a placeholder event or a signup event whose timestamp is the time the user was originally created. For example, if a user signed up on August 1, 2022, the timestamp of the event you send should be August 1, 2022.
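A placeholder signup event can be built as in the following sketch. The function name signup_event and the "Signup" event type are hypothetical. The sketch also assumes the event timestamp goes in a time field expressed in milliseconds since the Unix epoch, so confirm the field name and unit against the Batch API documentation before you rely on it.

```python
from datetime import datetime, timezone


def signup_event(user_id: str, signed_up_at: datetime, event_type: str = "Signup") -> dict:
    """Build a placeholder event timestamped at the user's original signup time."""
    if signed_up_at.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        signed_up_at = signed_up_at.replace(tzinfo=timezone.utc)
    return {
        "user_id": user_id,
        "event_type": event_type,
        "time": int(signed_up_at.timestamp() * 1000),  # milliseconds since epoch
    }


event = signup_event("b4ee5d78-e1b6-11ec-8fea-0242ac120002", datetime(2022, 8, 1))
payload = {"api_key": "API_KEY", "events": [event]}
```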
