Building Reliable AI Infrastructure: What We Learned Scaling AI Visibility
Releasing AI Visibility exposed some reliability gaps. Early report failures ultimately led to a more stable product.
Amplitude AI Visibility measures what LLMs say about your brand. Every week, it generates fresh reports so you can track how your marketing efforts influence AI answers over time. Thousands of customers have already generated and used these reports to understand their brand's presence in AI-driven search. The most innovative brands are hungry for data about LLM performance, and AI Visibility is exactly what they need to effectively reach modern customers.
Running AI Visibility at scale has taught us a lot about building reliable AI infrastructure, and the feedback from our users has taught us even more. Here is a summary of the lessons from our first few months and how they have helped us continually improve the product.
Generating reports is incredibly complex
Creating AI Visibility reports involves coordination between multiple internal and external services. For each brand that uses AI Visibility, we send thousands of prompts to multiple AI models, extract brand mentions and citations, determine competitors, collect sentiment, and analyze cited URLs.
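To make the shape of that pipeline concrete, here is a heavily simplified sketch of a single per-brand run. Every name and step is an illustrative stand-in rather than our production code, but it shows how one report fans out into thousands of external calls before the post-processing stages even begin.

```python
# A heavily simplified, hypothetical sketch of a per-brand report run.
# Names and steps are illustrative stand-ins, not Amplitude's actual code.
import asyncio
from dataclasses import dataclass, field


@dataclass
class Report:
    brand: str
    mentions: list[str] = field(default_factory=list)


async def ask_model(model: str, prompt: str) -> str:
    """Stand-in for a call to an external LLM provider."""
    await asyncio.sleep(0)  # a real call would await the provider's API here
    return f"{model}: answer to '{prompt}'"


async def generate_report(brand: str, prompts: list[str], models: list[str]) -> Report:
    # Fan out every prompt to every model; a real report sends thousands of these.
    answers = await asyncio.gather(*(ask_model(m, p) for m in models for p in prompts))

    report = Report(brand=brand)
    # Post-processing stages such as mention extraction, competitor detection,
    # sentiment scoring, and cited-URL analysis would each follow here, often
    # calling further external services.
    report.mentions = [a for a in answers if brand.lower() in a.lower()]
    return report


if __name__ == "__main__":
    result = asyncio.run(
        generate_report("ExampleBrand", ["What is the best analytics tool?"], ["model_a", "model_b"])
    )
    print(f"{len(result.mentions)} answers mention the brand")
```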
This process has a lot of moving parts, but it works well most of the time. It does, however, depend on multiple third-party services, each with its own reliability characteristics. Over the past few months, we saw more failures than we found acceptable: reports didn't update on schedule, data went stale, and some reports came back incomplete.
We heard your feedback and took it seriously, and we dug into our infrastructure to find out what needed to change.
What we learned about orchestrating AI workflows
The core lesson of our analysis is that in a large-scale system with many external dependencies, failure modes compound in ways that never show up during testing.
For example, when one LLM provider experienced an outage, our system retried the failed requests, which is normally the right behavior. But during a sustained outage, those retries only added load. Other reports started timing out, which triggered yet more retries. The retries burned through our usage limits with providers, and once those limits were hit, a new class of failures appeared that persisted even after the original outage was resolved.
One problem became three. Three became ten. By the time we noticed the symptoms, the root cause was buried.
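The arithmetic behind that amplification is simple. With purely hypothetical numbers (not our real traffic figures), a retry policy that is harmless in normal operation can quadruple provider traffic during a sustained outage:

```python
# Back-of-the-envelope illustration with hypothetical numbers, not our real
# traffic figures: a sustained outage turns a normal retry policy into a
# traffic amplifier that also burns through provider usage limits.
reports_in_flight = 500        # concurrent weekly reports
requests_per_report = 2_000    # prompts sent to providers per report
retry_attempts = 3             # extra attempts per failed request

normal_load = reports_in_flight * requests_per_report
outage_load = normal_load * (1 + retry_attempts)  # every request fails and is retried

print(f"normal: {normal_load:,} provider requests")  # 1,000,000
print(f"outage: {outage_load:,} provider requests")  # 4,000,000
```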
Finding the root cause
After months of patching symptoms, we traced the root cause to how we had implemented rate limiting.
AI Visibility reports run on Temporal, a workflow orchestration engine for long-running, multi-step jobs. We added rate limiters to our workflows, expecting them to be shared across all reports. They weren't. Because Temporal can execute workflow code in isolated environments, each report created its own instance of the limiter. When hundreds of reports ran simultaneously, the effective limit was hundreds of times higher than intended.
The result was exactly the chain of problems we had been fighting: we overloaded providers, triggered failures, and created cascading retries that made the system unstable under heavy load.
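In code terms, the failure mode looks roughly like the sketch below. This is an illustrative reconstruction, not our actual implementation: an in-process limiter enforces its cap only within the environment that created it, so when each report runs in its own isolated environment, the intended fleet-wide limit silently becomes a per-report limit.

```python
# Illustrative reconstruction of the bug pattern (not our actual code).
import asyncio


class InProcessLimiter:
    """Allows at most `max_concurrent` in-flight calls, but only within
    the process or isolated environment that constructed it."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def __aenter__(self):
        await self._sem.acquire()
        return self

    async def __aexit__(self, *exc):
        self._sem.release()


async def call_provider(limiter: InProcessLimiter, prompt: str) -> str:
    async with limiter:
        return f"answer to '{prompt}'"  # stand-in for the real provider call


# What we intended: one limiter shared by every report.
# What actually happened: because each report's workflow code ran in its own
# isolated environment, each report constructed its own InProcessLimiter, so
# the effective cap was (intended limit) x (number of concurrent reports).
# The durable fix is to enforce the limit outside the workflow code, for
# example with a central dispatch queue or a limiter backed by shared state,
# so the cap holds no matter how many isolated workers are running.
```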
What we built to fix it
While tracking down the root cause, we made a series of improvements to make the system more durable. When errors inevitably occur, the system can now recover or pause without causing cascading failures, regardless of what caused the original error:
- Smarter retry behavior. We added guardrails so reports do not endlessly retry when a dependency is clearly unhealthy. The workflow now detects when failure rates are too high and aborts early, rather than burning compute and usage limits on work that is unlikely to succeed (a sketch of this guardrail follows this list).
- Partial success handling. Previously, small failures could cause the entire report to fail. We changed the workflow to tolerate a limited amount of failure in each step and still complete a report when the majority of the data is available. This reduced the number of missing weekly updates and made the system more resilient to intermittent issues.
- Better load distribution. We improved how work is distributed so the system does not swing between overloaded and idle. This reduced peak-time congestion and made report completion more predictable.
- Faster execution. We redesigned parts of report generation to run more work in parallel and batch external calls more efficiently. Faster reports mean fewer timeouts, fewer retries, and fewer opportunities for partial failures.
- Clearer status reporting. When a report fails, users should not have to guess whether the data is fresh. We improved how failures are surfaced so customers do not accidentally make decisions based on incomplete data.
- Real-time monitoring. We added better internal monitoring and alerting so we can detect drops in completion rates quickly, identify the most common failure modes, and respond before customers notice.
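To make the first two items concrete, here is a minimal sketch of an early-abort guardrail combined with a partial-success threshold. The structure, names, and thresholds are illustrative assumptions, not our production implementation:

```python
# A minimal, illustrative sketch of the retry guardrail and partial-success
# handling described above. Thresholds and names are assumptions, not
# Amplitude's production values.
import asyncio

ABORT_FAILURE_RATE = 0.50    # stop early if more than half of calls are failing
ACCEPT_FAILURE_RATE = 0.10   # still complete a step if at most 10% of calls failed


class DependencyUnhealthy(Exception):
    """Raised to abort a report early instead of retrying into an outage."""


async def run_step(calls, min_observed_before_abort: int = 50):
    """Run one report step (a batch of external calls) with both guardrails."""
    completed, failed, results = 0, 0, []
    for future in asyncio.as_completed(list(calls)):
        try:
            results.append(await future)
        except Exception:
            failed += 1
        completed += 1
        # Guardrail: if the dependency is clearly unhealthy, abort early rather
        # than burning compute and usage limits on work unlikely to succeed.
        if completed >= min_observed_before_abort and failed / completed > ABORT_FAILURE_RATE:
            raise DependencyUnhealthy(f"aborting early: {failed}/{completed} calls failed")
    # Partial success: tolerate a limited amount of failure and still complete
    # the step, so one flaky dependency does not sink the whole report.
    if failed > ACCEPT_FAILURE_RATE * max(completed, 1):
        raise DependencyUnhealthy(f"too many failures to complete: {failed}/{completed}")
    return results
```

The point of the early abort is the same lesson from the incident described earlier: when a provider is down, more retries only deepen the cascade, so pausing and resuming after recovery is the safer default.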
How Amplitude will continue to improve AI Visibility
Going forward, we are treating AI Visibility report generation as critical infrastructure. That means installing strong guardrails against cascades, improving failure visibility, and detecting problems early. We are also prioritizing stability whenever we make significant changes.
If you have not tried AI Visibility recently, now is a good time to try it out for free. You can use it to see how your brand appears across leading LLMs and track how your position changes from week to week. For those of you who have already tried AI Visibility, check out our latest updates and let us know what you think.

Leo Jiang
Head of Engineering, AI Products, Amplitude
Leo Jiang is the Head of Engineering, AI Products at Amplitude, focused on building new AI and marketing products. He has helped build Ask Amplitude, Agents, and AI Visibility.