For my experiment, I crawled several thousand pages from The New York Times. The Times’ pages were a natural choice, both because many data experiments use this source and because I know The Times shares the keywords it associates with each URL as a metadata field. After some de-duplication and other cleaning, I ended up with a data set of 2,142 stories drawn from three main categories: politics, arts and entertainment, and business.
I wanted to run an A/B test on the data, modeling the association between the known category of a URL and its assigned keywords as the A side of the test. Next, I modeled the association between the known category of a URL and a feature set, where the feature set is semantically generated data about that URL in the form of entities – the people, places, things and emotions present in the content. This is the B side of the test. The difference between the two models expresses the advantage of one form of contextual targeting over the other.
By the way, the machine learning term for this is a “supervised learning” model. This means we treat the known category of each URL as “truth” and test which independent variables prove to be the best predictors for assigning categories to new URLs later.
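As a minimal sketch of the supervised-learning setup, the toy example below treats a handful of hypothetical labeled URLs as “truth” and uses a deliberately simple keyword-overlap rule (not the estimators used in the actual test) to predict the category of an unseen URL:

```python
# Toy supervised-learning setup: labeled URLs are the "truth"; we fit a
# trivial keyword-overlap classifier and predict the category of a new URL.
# All URLs and keywords here are hypothetical examples, not study data.

labeled = {
    "nyt.com/story1": ({"senate", "election", "vote"}, "politics"),
    "nyt.com/story2": ({"gallery", "film", "review"}, "arts"),
    "nyt.com/story3": ({"earnings", "stocks", "merger"}, "business"),
}

def predict(keywords):
    """Assign the category whose training keywords overlap the most."""
    best_cat, best_overlap = None, 0
    for kw, cat in labeled.values():
        overlap = len(kw & keywords)
        if overlap > best_overlap:
            best_cat, best_overlap = cat, overlap
    return best_cat  # None means the keywords failed to nominate any category

print(predict({"stocks", "merger", "ipo"}))  # -> business
print(predict({"weather", "traffic"}))       # -> None (no nomination)
```

The real test swaps this toy rule for statistical estimators, but the contract is the same: labeled examples in, a category prediction (or no prediction) out.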
I used the statistical computing environment R (http://www.r-project.org/) and the RStudio user interface (http://www.rstudio.com/). This was just one of several options out there, including Orange, which is much more GUI-driven and has a good set of estimators (http://orange.biolab.si/).
For both the A and B tests I chose an ensemble model approach. This means I used multiple sampling runs and multiple estimation models in combination, in what is termed a “machine learning” approach to finding the best fit. After lots of iterative experimentation, I chose the following estimation models: Maximum Entropy, Support Vector Machines (SVM) and Random Forests (RF).
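The ensemble idea itself can be sketched as a majority vote across models. The three classifiers below are trivial stand-ins for Maximum Entropy, SVM and Random Forests (the actual test used R implementations of those estimators); only the voting logic is the point here:

```python
from collections import Counter

# Three stand-in "models": each maps a feature set to a category guess.
# In the real test these were Maximum Entropy, SVM and Random Forests;
# here they are trivial rules, so only the ensemble voting is illustrated.
def model_a(features): return "politics" if "senate" in features else "business"
def model_b(features): return "politics" if "vote" in features else "arts"
def model_c(features): return "business" if "stocks" in features else "politics"

def ensemble_predict(features, models=(model_a, model_b, model_c)):
    """Return the majority-vote category across all models."""
    votes = Counter(m(features) for m in models)
    return votes.most_common(1)[0][0]

print(ensemble_predict({"senate", "vote"}))  # -> politics (3 of 3 votes)
print(ensemble_predict({"stocks"}))          # -> business (2 of 3 votes)
```

The appeal of the ensemble is that the models' individual mistakes tend not to overlap, so the vote is more reliable than any single estimator.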
The A Test Outcome
The table below shows the core measures of fit for the keyword A-side test. The Ensemble Recall measure captures the ability of keywords to “nominate,” or identify, a category at all. It shows that keywords can be used about 70% of the time to nominate a category. Put another way, 3 times out of 10, keywords fail to represent what an article is actually about.
The per-model accuracy measures indicate how often the nominated category is correct. Averaging the three models, keywords correctly identify the category in only about 1 out of 3 cases.
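The two measures separate cleanly in code. Under the definitions used here, recall is the share of URLs for which any category is nominated, and accuracy is the share of nominations that match the known category. The predictions below are illustrative, not the study’s data:

```python
# Hypothetical predictions: None means no category was nominated at all.
truth = ["politics", "arts", "business", "politics", "arts"]
preds = ["politics", None,   "arts",     "politics", None]

nominated = [(t, p) for t, p in zip(truth, preds) if p is not None]

recall = len(nominated) / len(truth)                           # nominated at all
accuracy = sum(t == p for t, p in nominated) / len(nominated)  # nomination correct

print(f"recall={recall:.0%} accuracy={accuracy:.0%}")  # -> recall=60% accuracy=67%
```

Note that accuracy is conditioned on a nomination having been made, which is why the two numbers have to be read together.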
The B Test Outcome
Semantic processing, whose goal is to mimic the comprehension and richness of human understanding, provides a stark contrast. On the B side of the test, semantic data identifies a category almost every time, as shown by the 95% recall measure.
In terms of accuracy, instead of 1 out of 3 stories being correct, 2 out of 3 are correct.
Thinking about the two measures together puts the A and B tests in even starker relief. It is not at all clear why anyone would use a targeting technique that fails to assign any understanding of content 30% of the time and, when it does assign one, is wrong 66% of the time. This does not seem to be a technology you can rely on to satisfy the demands of advertisers.
Semantic data, on the other hand, almost always provides an answer and gets only 1 in 3 of them wrong. In practice, the accuracy number is typically higher still once the specific needs of advertisers are understood in terms of optimization, audience characteristics and brand-safety concerns.
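A back-of-envelope multiplication makes the gap concrete: the chance that a given URL is both nominated and correctly categorized is recall times accuracy, using the round figures quoted above:

```python
# Combined hit rate = P(nominates a category) * P(nomination is correct),
# using the approximate recall and accuracy figures quoted above.
keyword_rate  = 0.70 * (1 / 3)   # keywords: ~70% recall, ~1-in-3 accuracy
semantic_rate = 0.95 * (2 / 3)   # semantic: ~95% recall, ~2-in-3 accuracy

print(f"keywords: {keyword_rate:.0%}, semantic: {semantic_rate:.0%}")
# -> keywords: 23%, semantic: 63%
```

On these rough numbers, semantic targeting correctly categorizes nearly three times as many URLs end to end.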
A successful advertisement is a vital contributor to growing your top line. With so many display ad impressions wasted and misplaced, we cannot fix all of the problems immediately. But technology gives us a chance to move in a new direction and re-establish the credibility of online ads: a fundamental step in making the economics of digital advertising work for buyer, seller and consumer.