Amazon S3 Import

With Amplitude's Amazon S3 Import, you can import and mutate event data, and sync user or group properties into your Amplitude projects from an AWS S3 bucket. Use Amazon S3 Import to backfill large amounts of existing data, connect existing data pipelines to Amplitude, and ingest large volumes of data that need high throughput where latency is less of a concern.

During setup, you configure conversion rules to control how Amplitude instruments events. After you set up and enable Amazon S3 Import, Amplitude's ingestion service continuously discovers data files in S3 buckets and then converts and ingests events.

Amplitude regional IP addresses

Depending on your company's network policy, you may need to add these IP addresses to your allowlist so Amplitude's servers can access your S3 buckets:

Region | IP Addresses
US     | 52.33.3.219, 35.162.216.242, 52.27.10.221
EU     | 3.124.22.25, 18.157.59.125, 18.192.47.195

Prerequisites

Before you start, make sure you meet the following prerequisites.

  • An Amplitude project exists to receive the data. If not, create a new project.
  • You're an Admin or Manager of the Amplitude project.
  • Your S3 bucket has data files ready for Amplitude to ingest. They must conform to the mappings that you outline in your converter file.
  • The data to import must have a unique and immutable insert_id for each row. This helps prevent data duplication if unexpected issues arise. For more information, refer to Deduplication with insert_id.
  • Mirror Syncs require a user ID. If a row doesn't contain a user ID, Amplitude drops the event.

File requirements

The files you want to send to Amplitude must follow some basic requirements:

  • Files contain events, with one event per line.
  • Upload files in the events' chronological order.
  • Filenames are unique.
  • File size must be greater than 1MB and smaller than 5GB. For customers with large event volumes, Amplitude recommends file sizes close to 500MB for optimal performance.
  • Files are compressed or uncompressed JSON, CSV, or Parquet files.
  • For Mirror Sync, which supports mutations, the following constraints apply:
    • Mutations to events require a user ID. If a row doesn't contain a user ID, Amplitude drops the event. If you have a high volume of anonymous events, Amplitude recommends against using this mode.
    • Amplitude permits the following mutation types: INSERT, UPDATE, and DELETE. If you don't provide a mutation type, the process defaults to UPDATE.
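As an illustration of these requirements, here's a minimal Python sketch that writes a gzipped newline-delimited JSON file with one event per line and a timestamped, unique filename. The field names follow the HTTP V2 event format; the filename scheme is just an example.

```python
import gzip
import json
from datetime import datetime, timezone

def write_event_file(events, directory="."):
    """Write events as gzipped newline-delimited JSON, one event per line.

    The filename embeds a UTC timestamp so each upload is unique, as the
    S3 Import source requires unique filenames.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    path = f"{directory}/events_{stamp}.json.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for event in events:  # write (and upload) in chronological order
            f.write(json.dumps(event) + "\n")
    return path

path = write_event_file([
    {"user_id": "user-1", "event_type": "Sign Up", "time": 1700000000000},
    {"user_id": "user-1", "event_type": "Purchase", "time": 1700000001000},
])
```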

File processing

Amplitude processes each file exactly once. Don't edit a file after you upload it to the S3 bucket; if you do, there's no guarantee that Amplitude processes the most recent version of the file.

After an S3 Import source ingests a file, the same source doesn't process that file again, even if the file receives an update.
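The exactly-once behavior can be pictured as the source keeping a record of object keys it has already ingested. This hypothetical sketch shows why a re-uploaded file with the same name is never picked up again:

```python
processed = set()

def discover(keys):
    """Return only keys this (hypothetical) source hasn't ingested yet.

    Once a key is recorded as processed, later versions of the same
    object are never picked up again, which is why edited files may be
    ignored.
    """
    new = [k for k in sorted(keys) if k not in processed]
    processed.update(new)
    return new

first = discover(["data/2024-01-01.json.gz"])
second = discover(["data/2024-01-01.json.gz"])  # same key, even if re-uploaded
```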

Deduplication with insert_id

For ingestion syncs only, Amplitude uses a unique identifier, insert_id, to match against incoming events and prevent duplicates. Within the same project, if Amplitude receives an event with insert_id and device_id values that match a different event received within the last 7 days, Amplitude drops the most recent event.

Amplitude recommends that you set a custom insert_id for each event to prevent duplication. To set a custom insert_id, create a field that holds unique values, like random alphanumeric strings, in your dataset. Map the field as an extra property named insert_id in the guided converter configuration.
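One way to get a unique, immutable insert_id is to derive it deterministically from fields that already identify the row, so re-running a backfill reproduces the same IDs and the 7-day (insert_id, device_id) window can drop the duplicates. A sketch; the choice of fields and the 36-character truncation are assumptions for illustration:

```python
import hashlib

def make_insert_id(row):
    """Derive a stable insert_id from fields that uniquely identify a row.

    Re-ingesting the same row produces the same insert_id, so Amplitude's
    deduplication window can drop it instead of double-counting.
    """
    key = "|".join(str(row[k]) for k in ("device_id", "event_type", "time"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:36]

row = {"device_id": "d-42", "event_type": "Purchase", "time": 1700000000000}
row["insert_id"] = make_insert_id(row)
```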

Give Amplitude access to your S3 bucket

Follow these steps to give Amplitude read access to your AWS S3 bucket.

  1. Create a new IAM role, for example: AmplitudeReadRole.

  2. Go to Trust Relationships for the role and add Amplitude's account to the trust relationship policy, using the following examples, so that Amplitude can assume the role.

    • external_id: a unique identifier that Amplitude presents when it assumes the role. You can generate one with a third-party tool. An example external ID is vzup2dfp-5gj9-8gxh-5294-sd9wsncks7dc.

    • Trust policy for the Amplitude US region:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::358203115967:role/k8s_prod_cargo",
              "arn:aws:iam::358203115967:role/k8s_prod_falcon",
              "arn:aws:iam::358203115967:role/vacuum_iam_role"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<external_id>"
            }
          }
        }
      ]
    }
    ```
    • Trust policy for the Amplitude EU region:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::202493300829:role/k8s_prod-eu_cargo",
              "arn:aws:iam::202493300829:role/k8s_prod-eu_falcon",
              "arn:aws:iam::202493300829:role/vacuum_iam_role"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<external_id>"
            }
          }
        }
      ]
    }
    ```
  3. Create a new IAM policy, for example, AmplitudeS3ReadOnlyAccess. Use the entire example code that follows, and replace the placeholder values in angle brackets (<>):

    • <bucket_name>: the S3 bucket name where your data comes from.
    • <prefix>: the optional prefix of files that you want to import, for example filePrefix. For folders, make sure prefix ends with /, for example folder/. For the root folder, keep prefix empty.

    Example 1: IAM policy without prefix:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowListingOfDataFolder",
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::<bucket_name>"
          ],
          "Condition": {
            "StringLike": {
              "s3:prefix": [
                "*"
              ]
            }
          }
        },
        {
          "Sid": "AllowAllS3ReadActionsInDataFolder",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>/*"
          ]
        },
        {
          "Sid": "AllowUpdateS3EventNotification",
          "Effect": "Allow",
          "Action": [
            "s3:PutBucketNotification",
            "s3:GetBucketNotification"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>"
          ]
        }
      ]
    }
    ```

    Example 2: IAM policy with a prefix. For a folder, make sure the prefix ends with /, for example folder/:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowListingOfDataFolder",
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::<bucket_name>"
          ],
          "Condition": {
            "StringLike": {
              "s3:prefix": [
                "<prefix>*"
              ]
            }
          }
        },
        {
          "Sid": "AllowAllS3ReadActionsInDataFolder",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>/<prefix>*"
          ]
        },
        {
          "Sid": "AllowUpdateS3EventNotification",
          "Effect": "Allow",
          "Action": [
            "s3:PutBucketNotification",
            "s3:GetBucketNotification"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket_name>"
          ]
        }
      ]
    }
    ```
  4. Go to Permissions for the role. Attach the policy created in step 3 to the role.
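If you script your IAM setup, you can render the trust policy from step 2 programmatically and pass the result to, for example, aws iam create-role --assume-role-policy-document. A sketch using the US-region principals shown above (swap in the EU ARNs for EU projects):

```python
import json

# Amplitude US-region principals from the trust policy in step 2.
AMPLITUDE_US_PRINCIPALS = [
    "arn:aws:iam::358203115967:role/k8s_prod_cargo",
    "arn:aws:iam::358203115967:role/k8s_prod_falcon",
    "arn:aws:iam::358203115967:role/vacuum_iam_role",
]

def trust_policy(external_id):
    """Render the step 2 trust policy with your external ID filled in."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": AMPLITUDE_US_PRINCIPALS},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }, indent=2)

policy = trust_policy("vzup2dfp-5gj9-8gxh-5294-sd9wsncks7dc")
```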

Set up the integration

Complete the following steps to configure the Amazon S3 source:

  1. Configure and verify the connection
  2. Select the file
  3. Create the converter configuration
  4. Enable the source

Configure and verify the connection

In Amplitude, create the S3 Import source.

Amplitude recommends that you create a test project or development environment for each production project to test your instrumentation.

To create the data source in Amplitude, gather information about your S3 bucket:

  • IAM role ARN: The IAM role that Amplitude uses to access your S3 bucket. This is the role created in Give Amplitude access to your S3 bucket.
  • IAM role external ID: The external ID for the IAM role that Amplitude uses to access your S3 bucket. This is the external ID created in Give Amplitude access to your S3 bucket.
  • S3 bucket name: The name of the S3 bucket with your data.
  • S3 bucket prefix: The S3 folder with your data.
  • S3 bucket region: The region where the S3 bucket resides.

When you have your bucket details, create the Amazon S3 Import source.

  1. In Amplitude Data, click Catalog and select the Sources tab.

  2. In the Warehouse Sources section, click Amazon S3.

  3. Select Amazon S3, then click Next. If this source doesn't appear in the list, contact your Amplitude Solutions Architect.

  4. Complete the Configure S3 location section on the Set up S3 Bucket page:

    • Bucket Name: Name of the bucket you created to store the files, for example, com-amplitude-vacuum-<customername>. This tells Amplitude where to look for your files.
    • Prefix: Prefix of the files to import. If it's a folder, the prefix must end with /, for example, dev/event-data/. For the root folder, leave it empty.
    • AWS Role ARN: Required.
    • AWS External ID: Required.
    • AWS Region: Required.
  5. Optional: enable S3 Event Notification.

    • Event notification lets Amplitude's ingestion service discover data in your S3 bucket faster. Instead of scanning buckets, the ingestion service discovers new data from notifications that S3 publishes, which reduces the time it takes to find new data.
    • Use this feature if you want near-real-time import. With notifications enabled, Amplitude discovers new data within 30 seconds.
    • Before you enable notifications, note the following:
      • The IAM role you use must have permission to configure bucket event notifications.
      • The bucket can't have existing event notifications. This is a limit that Amazon imposes on S3 buckets.
      • Notifications don't apply retroactively.

  6. Click Test Credentials after you fill out all the values. You can't edit these values in the UI after you create the source, so make sure all the information is correct before you click Next.

  7. Enter a Data Source Name and an optional Description, then save your source. You can edit these details later from Settings.

Next, create your converter configuration.

Amplitude continuously scans buckets to discover new files as they're added.

Select the file

  1. Specify the file type, compression type, and regular expression pattern for your files. Amplitude prepopulates the boilerplate of your converter file based on the selections you make in this step. Click See Preview to test the configuration.
  2. Click Next.

If you add new fields or change the source data format, update your converter configuration.

Create the converter configuration

The converter configuration gives the S3 vacuum this information:

  • A pattern that tells Amplitude what a valid data file looks like. For example: \w+\_\d{4}-\d{2}-\d{2}.json.gz.
  • Whether the file is compressed, and if so, how.
  • The file's format. For example: CSV (with a particular delimiter), or lines of JSON objects.
  • How to map each row from the file to an Amplitude event or mutation.
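You can sanity-check your filename pattern locally before saving the converter. A sketch using Python's re module; the dots are escaped here, which is slightly stricter than the example pattern above:

```python
import re

# Pattern modeled on the example above, with the dots escaped so that
# only literal ".json.gz" filenames match.
FILE_PATTERN = re.compile(r"\w+_\d{4}-\d{2}-\d{2}\.json\.gz")

def is_valid_data_file(name):
    """Return True if the object key matches the converter's file pattern."""
    return FILE_PATTERN.fullmatch(name) is not None

ok = is_valid_data_file("events_2024-01-01.json.gz")
```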

Select the data type

You can import event, user property, and group property data.

Data Type        | Description
Event            | Includes user actions associated with either a user ID or a device ID, and may also include event properties.
User Properties  | Includes dictionaries of user attributes you can use to segment users. Each property is associated with a user ID.
Group Properties | Includes dictionaries of group attributes that apply to a group of users. Each property is associated with a group name.
Profiles         | Includes dictionaries of properties that relate to a user profile. Profiles display the most current data synced from your warehouse, and are associated with a user ID.

Select the import strategy

Select from the following strategies, depending on your data type selection.

Strategy         | Description
Mirror Sync      | Directly mirrors the data in S3 with INSERT, UPDATE, and DELETE operations. To keep the data in sync with your source of truth, this strategy deactivates Amplitude's enrichment services like user property syncing, group property syncing, and taxonomy validation.
Append Only Sync | Imports new rows with Amplitude's standard enrichment services.

See the following table to understand which data types are compatible with which import strategies.

Data type        | Supported import strategies
Event            | Mirror and Append Only
User properties  | Append Only
Group properties | Append Only
Profiles         | Mirror

Mutations and event volume

When you use mutations, Amplitude doesn't merge INSERT, UPDATE, or DELETE operations to per-row mutations based on your sync frequency. When more than one operation applies to an event during the sync window, the operations may apply out of order. Each operation also counts toward your event volume. As a result, you may use your existing event volume more quickly than you otherwise would. Contact sales to purchase additional event volume.

Event streaming destinations

Amplitude can't export events ingested through mutation-based imports (Mirror Sync) through event streaming destinations. If you need to export events to streaming destinations, use Append Only Sync instead of Mirror Sync.

Find a list of supported fields for events in the HTTP V2 API documentation and for user properties in the Identify API documentation. Add any columns not in those lists to either event_properties or user_properties; otherwise, Amplitude ignores them.
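A sketch of that mapping rule: recognized top-level fields pass through, and everything else lands in event_properties. KNOWN_FIELDS below is a small illustrative subset of the real HTTP V2 field list:

```python
# A small illustrative subset of the top-level fields Amplitude recognizes.
KNOWN_FIELDS = {"user_id", "device_id", "event_type", "time", "insert_id"}

def row_to_event(row):
    """Map a flat source row to an event, moving unrecognized columns
    into event_properties so Amplitude doesn't silently drop them."""
    event = {k: v for k, v in row.items() if k in KNOWN_FIELDS}
    extras = {k: v for k, v in row.items() if k not in KNOWN_FIELDS}
    if extras:
        event["event_properties"] = extras
    return event

event = row_to_event({
    "user_id": "user-1",
    "event_type": "Purchase",
    "time": 1700000000000,
    "plan": "pro",  # not a recognized field, so it goes to event_properties
})
```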

After you add all the fields you want to import, view samples of this configuration in the Data Preview section. Data Preview automatically updates as you include or remove fields and properties. In Data Preview, you can review sample records from your source and how Amplitude imports that data. This ensures that you bring in all the data points you need. You can review 10 sample source records and their corresponding Amplitude events.

The group properties import feature requires that groups are set in the HTTP API event format. The converter expects a groups object and a group_properties object.
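For example, a group-data record might look like the following sketch. The exact nesting of group_properties here is illustrative only; check the HTTP API format for the authoritative shape:

```python
# Illustrative record with the two objects the converter expects:
# "groups" (group type -> group name) and "group_properties".
record = {
    "user_id": "user-1",
    "groups": {"org": "Acme"},                  # group type -> group name
    "group_properties": {"plan": "enterprise"}, # properties for that group (shape is illustrative)
}
```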

Manual converter creation

The converter file tells Amplitude how to process the ingested files. Create it in two steps: first, configure the compression type, file name, and escape characters for your files. Then, use JSON to describe the rules your converter follows.

The converter language describes extraction of a value given a JSON element. You specify this with a SOURCE_DESCRIPTION, which includes:

  • BASIC_PATH.
  • LIST_OPERATOR.
  • JSON_OBJECT.

Example converters

Refer to the Converter Configuration reference for more help.

Enable the source

Enabling the source requires a successful test of your converter. You can save your changes and return later, but the option to enable the source is available only after the converter completes a test successfully.

When you configure your converter, click Save and Enable to enable the source.

Troubleshooting

  • Make sure you give access to the correct Amplitude account. Use the same data center as your organization. For more information, refer to Give Amplitude access to your S3 bucket.
  • Amplitude doesn't support dot characters in bucket names. Ensure your bucket names consist of lowercase letters, numbers, and dashes.
  • You can use an existing bucket that you own. Update the bucket's policy with the output from the Amplitude wizard to ensure compatibility.
