This is our first post in a series we’re writing on the Evolution of Analytics, from the early web days to today. Check in next week for our second article, in which we’ll discuss data practices and stories from a pioneer of data-driven products and analytics: Zynga. (update 06/24/15 - read ‘Zynga Analytics at its Peak’)
Early analytics tools didn’t always work.
In the earliest days of the Internet, there was one relatively simple method to measure the traffic your website saw: use the server log.
Every time someone views your web page, their browser makes an HTTP – Hypertext Transfer Protocol – request for every file on the page to the server where each file is hosted. A record of each request is kept in the server log.
Server log records keep track of many useful things: the person’s IP address, a timestamp of the request, the status of the request (whether it was successful, timed out, etc.), the number of bytes transferred, and the referring URL – the page the browser displayed before making a request for your file.
Breakdown of a fictional line of server log via jasoft
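Entries like the one above follow a fixed, line-oriented layout, so they’re easy to pull apart programmatically. Here’s a minimal sketch in Python, assuming the widely used “combined” log format; the sample line and all values in it are invented for illustration:

```python
import re

# One regex for the "combined" log format: IP, identity, user, timestamp,
# request line, status code, bytes sent, referring URL, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it doesn't parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# An invented sample entry:
sample = ('203.0.113.7 - - [10/Oct/1997:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://example.com/links.html" "Mozilla/3.0"')
entry = parse_log_line(sample)
```

Early log analyzers were, at heart, this loop run over millions of lines, with counting and grouping layered on top.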
The earliest, highest-tech web analytics solutions were programs that took every entry in the server log and helped you interpret them. The log could tell you whether your server saw much heavier traffic at certain points in the day – peaks you needed to do a better job of accommodating if many requests were failing. Good log analysis software could also help you pick out the requests made by bots and crawlers from those made by human visitors – useful both if you were worried about being crawled, and if you wanted accurate data on the people who viewed your site. The log could also tell you who drove the most traffic to your site, or whether somebody new was linking to you.
Though the server log has potential for sophisticated analysis, in the early 1990s, few sites took full advantage of it. Web traffic analysis typically fell under the responsibility of the IT department: The server log was especially handy when managing performance issues and keeping websites running smoothly. “Companies typically didn’t have defined site goals and success metrics,” a writer at marketing blog ClickZ observes. “Even if they did, very rarely was the tool set up to track those key metrics.”
The log got harder and harder to interpret meaningfully as the web developed. In the earliest days, most URLs pointed to a single HTML file of plain hypertext, so one pageview generated one HTTP request. But once pages started to have other files embedded in them – images, audio files, video – a single page visit generated multiple HTTP requests. And as browsers became more sophisticated, they developed a technique called caching: temporarily storing a version of a file in the system cache, to avoid issuing the same requests over and over on repeat visits. Because no new HTTP request is issued, those repeat views don’t show up in the server log at all.
In 1997, with existing tools, it could take up to 24 hours for a big company to process its website tracking data. Server log analytics companies hustled to keep up with the rate at which the web was growing, with varied success. Others looked elsewhere – and in the late 1990s, a solution arose from an unlikely place: the Internet of amateurs.
Hit Counters and Server Logs
A site that was taken down in 2009 when Geocities shut down, via One Terabyte of Kilobyte Age
Not everybody with a website had access to their server log, and not everybody who did knew what to do with it. The mid-to-late ’90s saw the rise of the extreme-amateur web developer: X-Files fans who filled pages with photos of every moment of visible romantic tension between Scully and Mulder; teenagers with LiveJournal accounts about being teenagers with LiveJournal accounts; retired conspiracy theorists.
These web developers weren’t about to pour a lot of money into their sites, and they often hosted their sites with free providers that didn’t give them access to their server logs. Being amateurs, many of them were OK with this. Their pages were relatively simple, and their traffic relatively low, so they weren’t too worried about monitoring performance. Even if they had access to their server logs, there wasn’t much incentive for them to fork over a couple hundred dollars for log analysis software, or to learn to use the complicated free versions.
However, this did leave these amateur web developers with one, obvious, burning question: How many people are looking at my website?
Enter the hit counter: basically the bluntest web analytics solution ever provided.
A subset of the variety of hit counter styles still available to you, if you want one for some reason.
Hit counters – also known as web counters – were chunks of code that used a simple PHP script to display an image of a number, which incremented every time the script ran. It wasn’t a sophisticated metric, and often a downright obtuse one, but hit counters were damn easy to use, even if you knew next to nothing about the web. A user could simply select the style they wanted from the hit counter website, and then copy-paste the generated code into their page.
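The server side of a hit counter was about this simple. Here’s a minimal sketch in Python rather than the PHP of the era – the file name and function are illustrative, and a real counter would render the digits as an image instead of returning a number:

```python
import os

COUNTER_FILE = "hits.txt"  # illustrative; real counters kept a file or DB row per site

def record_hit(path=COUNTER_FILE):
    """Increment and return the stored hit count: one call per page view."""
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(path, "w") as f:
        f.write(str(count))
    return count
```

The user-facing half was just the copy-pasted snippet: an `<img>` tag pointing at the counter script, so every page load fetched – and incremented – the image.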
Some hit counters were better than others. Later counters came with a hosted service that let you log in to see slightly more sophisticated website stats than the basic hit count; others were not as fancy. They all did a variable job of filtering out HTTP requests from bots. And a lot of hit counters doubled as SEO spam: they would hide links in the copy-paste code to rack up link referrals.
Despite their shortcomings, hit counters were an improvement on server log analysis in a few very important ways.
First of all, server log analysis required website owners to have access to their server log files, and to know how to manage that data and interpret it. Hit counters introduced a way to automatically send the data to somebody else to analyze. Because the hit counter images were hosted on a website owned by the people who made the hit counter, the HTTP requests for these images ended up in their server logs. The data was now successfully being gathered, stored, and interpreted by a third party.
Second, a website owner did not need to know what they were doing to read a hit counter. They might misinterpret the data, or miss a lot of its nuance, but it was clear what a hit counter was at least supposed to estimate: how many times people have visited this website. Most of the time this was not useful data, but it made many an amateur feel like they were part of the World Wide Web. And when their site did experience a relative surge in traffic, they had a means to discover it.
A tagged butterfly, photo by Derek Ramsey.
Hit counters were a primitive example of web page tagging, a technique employed by most analytics software today. In field biology, tagging is when an animal is outfitted with a chip, which can be monitored remotely as a means of tracking the animal. Web analytics tagging is similar. The “tag” is a file – the images in a hit counter script, for example – embedded in the web page’s HTML. When a browser requests the page, it also requests the tag, sending data about the user and the request to whoever is collecting the data for analysis.
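The classic form of such a tag is a tiny transparent image served by the analytics party, whose request carries the visitor’s details. Here’s a minimal sketch using Python’s standard `http.server` – the handler name is illustrative, and a real service would write records to a database rather than print them:

```python
from http.server import BaseHTTPRequestHandler

# Smallest transparent 1x1 GIF – the classic "tracking pixel" payload.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00"
         b"\x01\x00\x00\x02\x02D\x01\x00;")

class TagHandler(BaseHTTPRequestHandler):
    """Serve the pixel and record who asked for it."""

    def do_GET(self):
        # The analytics payload: requester's IP, the tag path, the page that
        # embedded the tag (Referer), and the browser (User-Agent).
        print(self.client_address[0], self.path,
              self.headers.get("Referer", "-"),
              self.headers.get("User-Agent", "-"))
        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.send_header("Content-Length", str(len(PIXEL)))
        self.end_headers()
        self.wfile.write(PIXEL)
```

Embedding the tag is then a one-line `<img>` (or, today, a `<script>`) pointing at this endpoint – structurally the same trick the hit counters pulled.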
In the late 1990s, tagging-based analytics companies started to proliferate. Some of these sold software similar to the old server log analytics packages: they used tags to track user behavior, but sent this data to a client-hosted and client-managed database. More common, though, were web-hosted-tag-based analytics solutions – which stored your data in a web-hosted database, owned and managed by the analytics company. This was a much cheaper and easier-to-implement solution for the rapidly growing number of smaller tech companies.
These new hosted-analytics companies included Omniture, WebSideStory, and Sawmill. Their solutions also tended to come with simpler, less-technical interfaces, as analytics itself was undergoing a metamorphosis.
“Marketers [were] starting [to think] of ways they could use this data if they had better access to Web analytics,” ClickZ chronicles. But, starting out, they usually didn’t know what questions to ask, or how to answer them. Their companies’ understaffed IT departments often ran out of the resources and patience it took to help them. The new tagging-based analytics tools removed the reliance on IT, the servers, and log file processing. From ClickZ:
“Suddenly interested marketers could go to one of those companies, have some basic tags placed on the site, and get more accurate data. It looked like all the issues were solved, and [marketing departments] could finally get the information they needed – still rarely tied to overall business goals, but a step in the right direction.”
Google Buys Urchin, and a New Era is Born
The founding team of Urchin, an early web analytics company.
One of the best and biggest of the companies providing server log solutions was Urchin. Back when some companies took 24 hours to parse their server logs, the first version of Urchin’s software could do it in 15 minutes. This gave them a huge advantage. Some of their first customers were major web hosts, which meant that soon they were the standard analytics solution for anybody who used those hosts – including NBC, NASA, AT&T, and about one-fifth of the Fortune 500. In addition to their client-hosted server log software, Urchin, they also offered a web-hosted, tag-based analytics program, Urchin On Demand.
Seeing the trend towards less-technical analytics users, the Urchin team strove to develop software that was powerful, but still usable by people who weren’t programmers or statisticians. As one of the founders said in an interview,
“[Urchin’s] value was to democratize the web feeling, trying to make something complex really easy to use.”
In 2004, Google representatives approached the Urchin team at a trade show. Months later, in 2005, the acquisition was announced for a rumored $30 million. Most of the founders became executives at Google, and Urchin On Demand became Google Analytics.
Google proceeded to build Google Analytics out into the most widely used analytics solution in the world. And they proceeded to let Urchin, the client-hosted server log solution, languish. Many an Urchin customer cried foul. Others observed that there was a clear trend away from server log analytics solutions in general.
That was a blow to any company invested in the client-hosted Urchin solution. But Google’s focus on Google Analytics ended up working out for them – they’re far and away the market leader in the Analytics and Tracking industry. According to BuiltWith.com, almost 28 million active websites use Google Analytics. New Relic is its nearest competitor, trailing far behind with under 1 million active websites.
Yesterday, Pageviews; Today, Events
But even Google Analytics placed a strong emphasis on everybody’s favorite vanity metric: page views. This powerful technology was being employed to answer the same old question: “How many people are looking at my website?” In some sense, it was a glorified hit counter.
The future of analytics was yet to come and, catalyzed by the rise of mobile, it would move far beyond page views. The next step in the evolution of analytics was on its way: event tracking – a shift from the pageview-centric website analytics of the early web to event-centric product analytics.
Thanks for reading! Don’t miss our next post in the Evolution of Analytics series: stories from the Zynga trenches.