Online Ad Effectiveness Research: Crisis of Control


ADOTAS – As I explained in an earlier post on InsightfulAnalytics (the blog of my company, InsightExpress), what makes online ad effectiveness measurement work is the use of an experimental design. I’ve also mentioned in earlier posts that while experimental design is a fantastic approach and one we recommend, for a variety of reasons clients prefer to run quasi-experimental studies. One of the most important aspects of putting together a good quasi-experimental design is creating a control cell that is as equivalent to the test cell as possible. Unfortunately, that equivalence is far easier to describe than to achieve; that’s just not how things work online.

When I first started doing online ad effectiveness research in 1997, there was no such thing as ad server-delivered tags. Everything we did to sample a campaign was hard-coded to a page, including the advertising. This made for an extremely easy design. Since there was no complex ad server to worry about, I could randomly redirect visitors to either the page with the test ad or the page with the control ad. It doesn’t get much better than that – pure random assignment of the respondent pool. However, with the advances in ad serving, the survey sampling code moved into the ad server, and thus began the era of the pop-up and the dreaded bonus inventory.
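To make the contrast concrete, here’s a minimal sketch (in Python, with hypothetical page URLs) of what that hard-coded era amounted to: every arriving visitor gets a fair coin flip and is redirected to either the test-ad page or the control-ad page.

```python
import random

# Hypothetical sketch of the hard-coded era: each arriving visitor is
# randomly redirected to the page carrying the test ad or the control ad.
# The URLs below are illustrative, not from any real campaign.
TEST_PAGE = "/article?creative=test"      # page with the tested ad
CONTROL_PAGE = "/article?creative=psa"    # page with the control/PSA ad

def assign_visitor(rng: random.Random) -> str:
    """Pure random assignment: a fair coin flip per visitor."""
    return TEST_PAGE if rng.random() < 0.5 else CONTROL_PAGE

rng = random.Random(42)
assignments = [assign_visitor(rng) for _ in range(10_000)]
test_share = assignments.count(TEST_PAGE) / len(assignments)
print(f"share routed to test page: {test_share:.3f}")  # close to 0.5 by design
```

Because assignment happens before anyone sees an ad, the two groups are statistically interchangeable, which is exactly the property that bonus-inventory sampling gives up.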

For those of you who don’t know how this works, let me paint a picture for you. I negotiate a buy with a publisher for 100 million premium ad impressions. It’s pretty sizable, and I want to measure the effectiveness of the campaign. Since the test ad will be measured via pop-up (or, more precisely, a DHTML fly-over) triggered by JavaScript code on the page where the ad runs, the only gap in sampling I need to fill is people who didn’t see the ad. But here’s the rub: I don’t want to spend any more money on premium impressions to run a public service ad, just to collect people who didn’t see my tested ad, so instead I ask my publisher for bonus inventory to run a PSA. Now it gets tricky. The publisher knows I want to evaluate the advertising and, indirectly, their site, so a measurement of the campaign could mean future business. It’s also likely that the agency will be annoyed if the publisher doesn’t fork over bonus inventory, so the reality is the publisher has few options. As you can imagine, there’s more downside for the publisher in this equation. As a publisher, you’re forced into giving up bonus impressions for a campaign, which means you’re giving away inventory that could be earning you money – so it’s a loss leader.

If you were a publisher and your client just bought 100 million home page impressions and was now asking for bonus inventory for a control cell, just where are you planning to source that bonus inventory from? Are you going to give the advertiser the most equitable sample of respondents, i.e., bonus inventory from home page impressions? Or are you going to find the cheapest, hardest-to-sell bonus inventory and hand that over? Obvious conclusion here, but I’ll say it anyway: You’ll get the cheap stuff. What this means is that while your test cell may have been recruited from home page visitors, your control could very well come from a niche section of the site. Or, to illustrate more concretely, if you bought impressions and collected test cell sample from the NFL home page on a sports site, your control cell respondents could be coming from cycling or figure skating – not exactly the most favorable comparison.

Of course, there are alternate methods of identifying control cell respondents. First among them is to rely on a page node. This is a JavaScript tag that lives on a website as opposed to in an ad server. This gives you access to recruit from the premium inventory sections of a site without needing to be tied to a bonus ad impression. InsightExpress deploys our own nodes, called iCompass, across a number of the comScore 250 sites, and Dynamic Logic has a similar system in their infrastructure. While a node improves the comparability between test and control, it is by no means a comprehensive solution.

The other approach that one can take is to create a model to predict results for the control cell. This is an approach that is commonly associated with comScore and their Smart Control methodology. These models are often reverse frequency models: they look at the impact of an ad campaign at a frequency of one, two, three, four and so on, and extrapolate backward to estimate what the impact would be at a frequency of zero. This novel approach wins kudos for being an innovative solution, but being a model, it’s highly susceptible to errors. Specifically, since these models forecast based on frequency, you need to be absolutely certain that two things are true: frequency counts are accurate, and there are no differences in site affinity across frequency buckets.
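Mechanically, the reverse-forecast step looks something like the sketch below. This is my own illustration with made-up numbers, not comScore’s actual code: observe a brand metric at frequencies one through four, fit a straight line, and read off the intercept at frequency zero as the stand-in control.

```python
# Illustrative reverse-frequency sketch (hypothetical numbers, not any
# vendor's actual implementation): fit a line to the metric observed at
# frequencies 1..4, then "reverse forecast" to frequency zero.
freqs = [1, 2, 3, 4]                      # exposure-frequency buckets
awareness = [0.32, 0.35, 0.38, 0.41]      # assumed aided-awareness rates

# Ordinary least-squares fit by hand (slope and intercept).
n = len(freqs)
mean_x = sum(freqs) / n
mean_y = sum(awareness) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(freqs, awareness)) \
        / sum((x - mean_x) ** 2 for x in freqs)
intercept = mean_y - slope * mean_x       # modeled metric at frequency 0

print(f"modeled control (freq 0): {intercept:.3f}")  # 0.290 with these numbers
```

Everything the rest of this section argues flows from that intercept: it is only trustworthy if the frequency labels on the buckets are accurate and the line is actually straight.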

What this means is that if there is any cookie deletion present in the campaign, you could have a number of people in your lower frequency buckets who actually had a higher number of exposures. If your model assumes that someone had one exposure when in reality they had five, your data is wrong. When this happens, you end up with garbage in, garbage out. Ironically, the data gets even worse when you apply frequency caps (as clients often do). With a frequency cap applied, viewers who don’t delete their cookies will only see the advertising as frequently as the cap allows. However, those who delete cookies are unrestricted and can end up seeing the ads more times than the cap permits, yet because of the cookie deletion the server counts them as a single exposure. When frequency capping is employed on a campaign, it is not unusual to see higher impacts in lower frequencies due to cookie deletion. This certainly makes the data impossible to model. To see how big of a deal cookie deletion can be in a campaign, check out my post on cookies.
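A toy simulation makes the mechanism easy to see. All the rates here are assumptions chosen for illustration: 30 percent of users delete cookies, so the server sees them with a fresh cookie each time and logs them at frequency one even though they really saw the ad five times.

```python
import random

# Toy simulation (all rates are hypothetical) of cookie deletion corrupting
# frequency buckets: deleters truly see five ads, but the ad server, seeing
# a fresh cookie on every visit, logs them at a counted frequency of one.
rng = random.Random(7)

def true_lift(exposures: int) -> float:
    """Assumed true probability of brand recall given actual exposures."""
    return min(0.2 + 0.05 * exposures, 0.6)

bucket_hits = {1: 0, 5: 0}   # recall "yes" counts by counted frequency
bucket_n = {1: 0, 5: 0}      # respondents by counted frequency
for _ in range(50_000):
    deleter = rng.random() < 0.3                  # 30% delete cookies
    true_exposures = 5 if deleter else rng.choice([1, 5])
    counted = 1 if deleter else true_exposures    # server undercounts deleters
    bucket_n[counted] += 1
    bucket_hits[counted] += rng.random() < true_lift(true_exposures)

for f in (1, 5):
    print(f"measured recall at counted freq {f}: {bucket_hits[f] / bucket_n[f]:.3f}")
```

The measured recall in the counted frequency-one bucket lands well above the true single-exposure rate of 0.25, because heavily exposed cookie deleters are hiding inside it. A reverse-frequency model built on these buckets starts from a contaminated baseline.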

Even more concerning is the error introduced into these models when they’re applied at a site level — specifically, to forecast effect back to a frequency of zero (or the control cell), we need to understand the impact at various frequencies. However, the fundamental audiences that make up site visitors can change dramatically as frequency increases. If you think about it, this makes total sense. A person who visits a site once in the past month is very different from a person who visits eight times in the past month. The person who only goes once might be a random consumer following a link from a Google search for the best snow tires, while the person who goes eight times a month might be a total gearhead, completely engaged in auto culture. It’s natural that the heaviest consumers of a site are the folks who have the highest affinity with the site. They’re also likely to interpret the advertising that runs on the site differently than those with the lowest frequencies. Unless you can take into account the differences between these groups (which a model can’t), you’re reverse forecasting your data on an assumption that everyone is equal, and you get inaccurate data. What’s concerning about the modeled approach is that there are no respondents, there is no easy way to refute the data, and often models produce positive results – exactly what everyone wants to see.

Most concerning about a frequency-based model is that it assumes a linear relationship between no exposure and many exposures. With these models, you’re inherently assuming that the relationship between one ad exposure and two ad exposures is similar to the relationship between zero exposures and one exposure. Of course this doesn’t make sense. No model can predict the initial effectiveness of an advertisement. Some ads might move brand metrics dramatically after the first exposure, and some might move metrics only slightly. The amount of that initial movement cannot be determined by a model, but only by empirical observation of the actual effect. So if you’re looking for the truth, you might want to try a different approach.

The final method of control cell collection that bears mentioning is our own UniversalControl methodology. Many people in this industry are evangelical about promoting the use of random experiments. This sounds good at first blush, but it really misses the mark when it comes to what truly matters in this kind of research. Sure, random assignment is great — I won’t dispute that fact. However, more importantly, studies need to be “blocked.” Talk to anyone in the medical research field, and they’ll tell you they run random block designs. For some reason, most internet researchers forget the blocking part of the design. Many of us here at InsightExpress believe that blocking is more important than random assignment.

For those of you who don’t know what blocking is, it’s a method deployed in experimental research to ensure/force equal representation across test and control cells. If I’m studying the impact of an ad for a heart medication and recruit 1,000 test and 1,000 control, I need to ensure that there is an equal number of people with heart disease in both the test and control cells. The incidence of heart disease is such that in a random sample of 1,000 test and 1,000 control, I could end up with significantly more heart disease sufferers in either my test or control cell, which would undermine the validity of my design. In fact, the same thing happens routinely for more mundane variables, such as age or income. Random designs are not a panacea, at least not without blocking or controlling for the audience.
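Here’s a minimal sketch of the idea, with a made-up 10 percent heart-disease incidence: split respondents into blocks on the variable that must be balanced, then randomize to test and control within each block, so both cells end up with an equal share by construction rather than by luck.

```python
import random

# Minimal randomized-block sketch (hypothetical data): stratify on the
# variable that must be balanced, then randomize within each stratum.
rng = random.Random(0)
respondents = [
    {"id": i, "heart_disease": rng.random() < 0.1}  # assumed 10% incidence
    for i in range(2_000)
]

def blocked_assign(people, rng):
    """Randomize to test/control separately within each block."""
    test, control = [], []
    for key in (True, False):                        # one block per stratum
        block = [p for p in people if p["heart_disease"] == key]
        rng.shuffle(block)                           # random order within block
        half = len(block) // 2
        test.extend(block[:half])
        control.extend(block[half:])
    return test, control

test, control = blocked_assign(respondents, rng)
t = sum(p["heart_disease"] for p in test)
c = sum(p["heart_disease"] for p in control)
print(f"heart-disease cases in test: {t}, in control: {c}")  # differ by at most 1
```

With simple random assignment, the test/control gap in heart-disease counts would wander with sampling noise; blocking pins it to at most one person per stratum.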

As you can no doubt tell from this detail, here at InsightExpress we put a lot of faith into the practice of blocking. For every study we run, we’ve pre-assigned (just like in a true experimental design) control cell respondents for every one that’s exposed. Each person exposed to a campaign has, in effect, a twin that serves as his or her control, and our testing has shown that this blocked approach produces much cleaner and more comparable results between test and control cells. We also love that this process runs off of our Ignite Network, which means no more pop-ups, no more bonus inventory, higher response rates and no more scrambling to find control respondents. What’s even better is that, unlike the results of a model, our control cell contains actual respondent data that can be easily verified and cross-tabbed to understand sub-segments of an audience.

So, if you skipped the bulk of this post and jumped to the end, here’s what I’d suggest you take away:

Bonus inventory control cell collection is flawed.

Nodes improve things, but are not universally available.

Models are built on data that can create erroneous output.

Blocked control cell collection most closely mirrors the true spirit of an experimental design.


