Google Cloud Storage (GCS)
Amplitude's GCS Import feature lets you import event or user properties into your Amplitude projects from a GCS bucket. This article helps you configure this data source within Amplitude.
Getting started
Prerequisites
Before you start, make sure you've taken care of some prerequisites.
- Make sure you have a GCS service account with the appropriate permissions. Learn more.
- Make sure that an Amplitude project exists to receive the data. If not, create a new project.
- Make sure you are an Admin or Manager of the Amplitude project.
- Make sure your GCS bucket has data files ready for ingestion. They must conform to the mappings that you outline in your converter file.
- Make sure the data in your GCS bucket follows the format outlined in Amplitude's HTTP API v2 spec.
File requirements
The files you want to send to Amplitude must follow some basic requirements:
- Files contain events, with one event per line.
- Files upload in the events' chronological order.
- Filenames are unique.
- File size must be greater than 1MB and smaller than 5GB. For customers with large event volumes, Amplitude recommends file sizes close to 500MB for optimal performance.
- Files are compressed or uncompressed JSON, CSV, or parquet files.
Create a GCS service account and set permissions
If you haven't already, create a service account for Amplitude within the Google Cloud console. This lets Amplitude export your data to your Google Cloud project.
After you create a service account, generate and download the service account key file and upload it to Amplitude. Make sure you export Amplitude's account key in JSON format.
Add this service account as a member to the bucket you'd like to export data to. Make sure to give this member the storage admin role so that Amplitude has the necessary permissions to export the data to your bucket.
You can also create your own role, if you prefer.
Note that the export process requires, at a minimum, the following permissions:
storage.buckets.get.storage.objects.get.storage.objects.create.storage.objects.delete.storage.objects.list.
Add a new GCS source
To add a new GCS data source for Amplitude to draw data from, follow these steps:
- In Amplitude Data, click Catalog and select the Sources tab.
- In the Warehouse Sources section, click GCS.
- Upload your Service Account Key file. This gives Amplitude the permissions to pull data from your GCS bucket.
- After you've uploaded the Service Account Key file, enter the bucket name and folder where the data resides.
- Click Next to test the credentials. If all your information checks out, Amplitude displays a success message. Click Next > to continue the process.
- In the Enable Data Source panel, name your data source and give it a description. You can edit this information later through Settings. Then click Save Source. Amplitude confirms that you've created and enabled your source.
- Click Finish to go back to the list of data sources. If you've already configured the converter, the data import starts in a few moments. Otherwise, it's time to create your data converter.
Create the converter configuration
The final step in setting up Amplitude's GCS ingestion source is creating the converter file. Your converter configuration gives the integration this information:
- A pattern that tells Amplitude what a valid data file looks like. For example: "\w+_\d{4}-\d{2}-\d{2}.json.gz".
- Whether the file is compressed, and if so, how.
- The file's format. For example: CSV (with a particular delimiter), or lines of JSON objects.
- How to map each row from the file to an Amplitude event.
The converter file tells Amplitude how to process the ingested files. Create it in two steps: first, configure the compression type, file name, and escape characters for your files. Then use JSON to describe the rules your converter follows.
Guided converter creation
You can create converters through Amplitude's new guided converter creation interface. This lets you map and transform fields visually, removing the need to manually write a JSON configuration file. Internally, the UI compiles down to the existing JSON configuration language used at Amplitude.
First, take a look at the different data types you can import: Event, User Property, and Group Property data.
Amplitude recommends selecting preview in step 1 of the Data Converter, where you see a sample source record before moving to the next step.
After you select a particular field, you can choose to transform the field in your database. Do this by clicking on "Transform" and choosing the kind of transformation you want to apply. You can find a short description for each transformation.
After you select a field, you can open the transformation modal and choose from a variety of Transformations.
Depending on the transformation you select, you may need to include more fields.
After you have all the fields needed for the transformation, you can save it. You can update these if your requirements change.
Although Amplitude needs certain fields to bring data in, it also supports extra fields which you can include by clicking the "Add Mapping" button. Here, Amplitude supports 4 kinds of mappings: Event properties, User Properties, Group Properties, and Additional Properties.
Find a list of supported fields for events in the HTTP V2 API documentation and for user properties in the Identify API documentation. Add any columns not in those lists to either event_properties or user_properties, otherwise Amplitude ignores them.
After you add all the fields you want to bring into Amplitude, you can view samples of this configuration in the Data Preview section. Data Preview updates automatically as you include or remove fields and properties. In Data Preview, you can look at a few sample records based on the source records along with how Amplitude imports that data. This ensures that you bring in all the data points you need into Amplitude. You can look at 10 different sample source records and their corresponding Amplitude events.
The converter language describes extraction of a value given a JSON element. Specify this using a SOURCE_DESCRIPTION, which includes:
- BASIC_PATH.
- LIST_OPERATOR.
- JSON_OBJECT.
Example converters
Refer to the Converter Configuration reference for more help.
Configure converter in Amplitude
- Click Edit Import Config to configure the compression type, file name, and escape characters for your files. The boilerplate of your converter file pre-populates based on the selections made during this step. You can also test whether the configuration works by clicking Pull File.
- Click Next.
- Enter your converter rules in the text editor.
- Test your conversion. Click Test Convert. Examine the conversion preview. Make adjustments to your converter configuration as needed.
- Click Finish.
If you add new fields or change the source data format, you need to update your converter configuration. Note that the updated converter only applies to files discovered_after_converter updates are saved.
Misleading "Malformed File" Error during Preview
If you see an error during preview with converter, it does not always mean your original GCS files are malformed. Amplitude first copies files from GCS to an internal S3. The Preview button reflects files in the S3, not directly in GCS. If your files do not match the required folder structure, Amplitude doesn't copy them into the Amplitude S3, resulting in nothing imported or shown in the preview.
Storage organization requirements
After the initial ingestion, your data organization must conform to this standard for subsequent imports:
{bucket name}/{GCSPrefix}/{YYYY}/{MM}/{DD}/{HH}/{optional}/{additional}/ {folder}/{structure}/{file name}
where:
{bucket name}is the name of your GCS bucket.{GCSPrefix}is the source prefix folder specified in your source setup configuration.{YYYY}/{MM}/{DD}/{HH}is the required date prefix format to upload new files. Organize files according to the time they upload to the bucket, not when your system generates them. Also, you must always use two digits (as opposed to one) to represent the month, day, and hour.{optional}/{additional}/{folder}/{structure}is where you can add additional folder structure details. These details are strictly optional. If you do include them, an example file path might look like{bucket name}/{GCSPrefix}/{YYYY}/{MM}/{DD}/{HH}/**cluster-01/node-25**/{file name}.
These organizational requirements apply only to new data you want to import after the source is enabled. You don't have to reorganize any pre-existing files, because Amplitude's GCS Import captures the data they contain on the first ingestion scan. After the initial scan, new data uploaded to the bucket must conform to the requirements outlined here.
File Search Time Window (48 Hours)
The Amplitude GCS Import job only searches for files from the past 48 hours based on the folder date in the path.
Example: If the batch job starts at 2024/09/03 00:00, it only searches in:
{bucket name}/{GCSPrefix}/2024/09/01/00/
to
{bucket name}/{GCSPrefix}/2024/09/03/00/
Ensure your data files upload into a folder that falls within this 2-day window, so the Amplitude service can detect and import them.
Was this helpful?