
Optimal Streaming Histograms

How can you create a bucketing algorithm for an arbitrary dataset you don't know in advance?
Insights

Aug 6, 2014

7 min read

Alicia Shiu

Former Growth Product Manager, Amplitude

We get a constant stream of numerical data from our customers. We’re not sure what the range of this data might be or how many orders of magnitude it may span. The distribution of the data is constantly changing as new data is added.

It’s not feasible to keep track of all of these distinct values, both on the database end and for data visualization — we need to sort them into buckets. How can we decide on a reasonable bucketing algorithm, one that produces sensible histogram boundaries, for all of these constantly changing datasets?

Basically, this boils down to 2 key requirements. Our bucketing algorithm needs to generate bucket boundaries that:

1. are useful & reasonable for any range of data

2. remain useful even if the distribution of the data changes

We went through a few iterations of bucketing algorithms before we landed on our current solution.

Technique #1: Save the first 1000 distinct values, and then make evenly spaced buckets.

This is the most naive way to bucket. We made 50 evenly spaced buckets across the range of the first 1000 values, and included 1 bucket for everything below that range, and 1 bucket for everything above it.
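To make this concrete, here's a minimal Python sketch of the idea (illustrative only, not our production code): build 50 evenly spaced boundaries from the first values seen, plus an underflow and an overflow bucket.

```python
def even_boundaries(first_values, num_buckets=50):
    """Evenly spaced bucket boundaries spanning the range of the first values seen."""
    lo, hi = min(first_values), max(first_values)
    width = (hi - lo) / num_buckets  # assumes the first values aren't all identical
    return [lo + i * width for i in range(num_buckets + 1)]

def bucket_of(value, boundaries):
    """Index of the bucket holding value, or the dedicated under/overflow bucket."""
    lo, hi = boundaries[0], boundaries[-1]
    if value < lo:
        return "underflow"
    if value >= hi:
        return "overflow"
    width = boundaries[1] - boundaries[0]
    return int((value - lo) // width)
```

Once the boundaries are set, every later value is simply dropped into whichever bucket it falls in.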

Problems with Technique #1:

1. Resolution

Say you have a highly skewed distribution of values with a long tail:

[Figure: histogram of a highly skewed, long-tailed distribution, with 10 evenly spaced bucket boundaries shown as dotted lines]

With evenly spaced buckets (dotted lines – in this example, 10 buckets), you lose resolution exactly where the majority of your values are.

2. Fixed buckets

If the distribution changes over time, the chosen buckets won’t be applicable anymore. We did our best to get around this issue by re-bucketing each day.

3. Strange (arbitrary) spacing

With even spacing, our buckets each had a width of: [range of values]/50. This resulted in weird bucket ranges that didn’t correspond to anything meaningful.

Technique #2: Iteratively merge the closest values into buckets

We tried Technique #2 to solve the resolution problem. We took the first 1000 distinct values and merged the 2 values that were closest together into 1 bucket. We then did this over and over again until we had 50 buckets.

Problems with Technique #2:

Although this made resolution better, it made the strange spacing problem even worse. Not only did bucket ranges seem arbitrary, but all of the bins were of different widths. This was especially problematic when visualizing the data in a histogram.

Let’s illustrate with a simple example. We have this list of 20 numbers, and we want to put them in 10 buckets:

3, 6, 7, 8, 13, 20, 23, 28, 31, 42, 45, 50, 63, 70, 74, 85, 88, 93, 95, 98

We run the merging algorithm and end up with these bins:

[Table: the 10 resulting buckets and their value ranges]

[Figure: histogram of the 10 variable-width buckets]

The buckets are all of varying width, which makes the x-axis deceptive. And again, this technique runs into the problem of having to update the bucketing scheme whenever the data’s distribution changes.
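For reference, here is a rough Python sketch of the merging procedure (our simplified reading of it, not the exact implementation), run on the same 20 values:

```python
def merge_closest(values, target_buckets):
    """Merge the closest values together until only target_buckets buckets remain."""
    # Start with each distinct value in its own bucket, in sorted order.
    buckets = [[v] for v in sorted(set(values))]
    while len(buckets) > target_buckets:
        # Find the pair of adjacent buckets separated by the smallest gap...
        gaps = [buckets[i + 1][0] - buckets[i][-1] for i in range(len(buckets) - 1)]
        i = gaps.index(min(gaps))
        # ...and merge them into a single bucket.
        buckets[i:i + 2] = [buckets[i] + buckets[i + 1]]
    # Report each bucket as (min value, max value, count).
    return [(b[0], b[-1], len(b)) for b in buckets]

values = [3, 6, 7, 8, 13, 20, 23, 28, 31, 42, 45, 50,
          63, 70, 74, 85, 88, 93, 95, 98]
print(merge_closest(values, 10))  # 10 buckets of varying widths and counts
```

Each merge is cheap, but because the widths come out of the data itself, the resulting bin edges land on arbitrary values.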

Interlude: the Equal Width vs. Equal Height Conflict

When deciding on bucket size, there’s a constant struggle between equal height buckets and equal width buckets. Technique #1 had equal width buckets, which are nice and neat, but ran into the resolution problem. Technique #2, which iteratively merged the closest distinct values into buckets, was an attempt at equal height buckets — i.e., buckets that each held about the same number of values, no matter the distribution of those values. However, we got weird spacing issues.

This brings us to Technique #3:

Technique #3: V-optimal bucketing

V-optimal histograms try to strike a balance between optimizing for equal height and equal width buckets. The idea is to minimize the cumulative weighted variance, which is defined as:

Σ nⱼ · Vⱼ  (summed over j = 1, …, J)

where J is the number of buckets, nⱼ is the number of items in the jth bin, and Vⱼ is the variance of the values in the jth bin. V-optimal histograms are better at estimating bucket contents than either equal width or equal height histograms.
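As a quick illustration (our own sketch, nothing we actually shipped), here is that objective in Python, scored for two candidate partitions of the same values. A V-optimal histogram searches over all possible bucket boundaries for the partition that minimizes it, and that search is what gets expensive.

```python
import statistics

def cumulative_weighted_variance(buckets):
    """Sum over buckets of n_j * V_j, where V_j is the variance within bucket j."""
    total = 0.0
    for bucket in buckets:
        n = len(bucket)
        v = statistics.pvariance(bucket) if n > 1 else 0.0  # population variance
        total += n * v
    return total

# Two ways to split the same six values into two buckets: the tighter
# grouping scores lower, so V-optimal prefers its boundaries.
good_split = [[1, 2, 3], [100, 101, 102]]   # score: 3*(2/3) + 3*(2/3) = 4.0
bad_split = [[1, 2], [3, 100, 101, 102]]    # score: in the thousands
print(cumulative_weighted_variance(good_split))
print(cumulative_weighted_variance(bad_split))
```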

Problems with Technique #3:

The key disadvantage of this technique is that choosing the optimal bucket boundaries gets very complex as the range of values and the number of buckets increase. This also means it’s difficult to update (again, the problem of fixed buckets as distributions change over time).

We never actually implemented V-optimal bucketing, but we did consider it. Ultimately, we decided on technique #4:

Technique #4: Pre-create buckets on a logarithmic scale

For a given range of values, we used a pre-determined bucket size: 10% increments for every order of magnitude. The table below shows our bucket sizes for a given range of values:

[Table: pre-determined bucket widths for each order of magnitude; for example, values between 100 and 1,000 fall into buckets of width 10]

Here’s an example from the Amplitude dashboard. We’re looking at the responseTime property of an event called Server Response. Since the data falls in the 100-1000 range, the buckets are in increments of 10:

[Screenshot: Amplitude dashboard histogram of the responseTime property of the Server Response event, bucketed in increments of 10]
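Here's a minimal sketch of that rule in Python (illustrative; the function name is ours, and zero or negative values would need special handling): the bucket width is 10% of the value's order of magnitude, which amounts to grouping values by their first two significant figures.

```python
import math

def log_bucket(value):
    """(lower, upper) bounds of the bucket containing value, for value > 0.

    The bucket width is 10% of the value's order of magnitude, so values
    are grouped by their first two significant figures.
    """
    magnitude = math.floor(math.log10(value))   # e.g. 347 -> 2
    width = (10 ** magnitude) / 10              # e.g. 10 for values in 100-1000
    lower = math.floor(value / width) * width   # e.g. 347 -> 340
    return (lower, lower + width)               # e.g. (340, 350)

# The same rule applies at every scale: 347 falls in [340, 350)
# and 8,200 falls in [8200, 8300), with no re-bucketing ever required.
```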

Problems that Technique #4 SOLVES:

1. Resolution: This method ensures that no matter what a value is, its bucket spans at most 10% of the true value, i.e., two significant figures of resolution. That’s plenty for almost all histogramming needs.

2. Fixed width buckets: With this method, having buckets of fixed width is a viable option. We no longer have to worry about adjusting the buckets every time the distribution changes. This bucketing scheme works for a wide variety of datasets and across any range of values.

3. Strange spacing: With these pre-determined buckets that can be applied to any range of values, we no longer have weird buckets like 63.47 to 84.19. Everything is nice and neat.

Admittedly, this method is still not ideal for data visualization if your data spans several orders of magnitude. Buckets have different widths at different orders of magnitude, but at least the width progresses in a predictable way that a visualization can account for.

Ultimately, we decided on this bucketing algorithm because it allows us to efficiently collect continuous-value data and organize it with minimal loss of resolution. It meets our 2 key requirements: being useful for any range of data and remaining useful even if the distribution of the data changes.

Thoughts? Join the discussion on Hacker News.

About the author

Alicia Shiu

Former Growth Product Manager, Amplitude

Alicia is a former Growth Product Manager at Amplitude, where she worked on projects and experiments spanning top of funnel, website optimization, and the new user experience. Prior to Amplitude, she worked on biomedical & neuroscience research (running very different experiments) at Stanford.

More from Alicia