ADOTAS – [UPDATED AT BOTTOM] According to a new study by the Stanford Law School’s Center for Internet and Society, 113 of the 185 popular Quantcast-ranked sites examined (61%) sent user names or user IDs (potentially email addresses) to third parties via cookies. And most of it appears to be unintentional, an issue that has far more to do with site data security than with online data collection practices and behavioral advertising.
Released yesterday as part of a giant privacy gala in DC that featured a keynote and Q&A session with Federal Trade Commission Chair Jon Leibowitz, the report was advertised in the press release running up to the event as a paper that would debunk “the myth that digital data collection is anonymous.” Whether it achieved that goal is definitely arguable.
What’s Being Sent
I feel pretty secure that when I log into HomeDepot.com, the website is not sending a message from the login page saying to all its third-party buddies, “Hey, guitarsexgod930, who you all know is Gavin Dunaway, just showed up! He’s looking at toilets, and I don’t want to think about what he did to the old one. Who’s going to show him an ad for the new American Standard model?”
No, a bunch of data (including my username) has been stuffed into the login URL, which then gets shared with third parties who have deals with the publisher. As will be mentioned many times, the research does not cover what happens when the data is received by those third parties.
Typically this “identifying information” — which research author and Stanford graduate student Jonathan Mayer describes as “information that with moderate probability and moderate effort can be used to identify a user” — is shoved into the URL to assist with site personalization efforts and only a little work is required to strip out the identifiable meat. Mayer uses this example:
As you can see, a site login, email address and real name can all be derived from that.
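To make the "only a little work" point concrete, here is a minimal Python sketch of pulling identifying fields out of a URL. It is an illustration only, not Mayer's example or tooling: the URL, the hostname and the parameter names (`username`, `email`, `name`) are all hypothetical, and it assumes the data travels as ordinary query-string parameters.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; real sites choose their own.
IDENTIFYING_KEYS = {"username", "email", "name"}


def identifying_fields(url):
    """Return the query parameters whose names look identifying."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each match.
    return {k: v[0] for k, v in query.items() if k.lower() in IDENTIFYING_KEYS}


# Made-up URL of the kind a third party might receive from a publisher page.
leaked = (
    "http://example.com/welcome"
    "?username=jdoe42&email=jdoe%40example.com&name=Jane+Doe&page=2"
)

print(identifying_fields(leaked))
```

Run against the made-up URL, this returns the login, the decoded email address and the real name, and ignores the harmless `page` parameter.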
The Stanford report follows a recent report (PDF) by Balachander Krishnamurthy, Craig Wills and Konstantin Naryshkin that used a similar methodology and found much the same results: 56% of sites studied leaked some kind of identifying information, with 48% leaking a user name in particular.
Mayer’s study expanded the number of sites from 120 to 185, culling them from the Quantcast 250 based on whether a site offered a signup without requiring a purchase or other qualification, along with other concerns related to the scope of the research. It also shifted the focus to “identifying data leakage” and used a public dataset.
While a complete spreadsheet of results can be downloaded here, Mayer singled out these gems:
- Viewing a local ad on the Home Depot website sent the user’s first name and email address to 13 companies.
- Entering the wrong password on the Wall Street Journal website sent the user’s email address to 7 companies.
- Changing user settings on the video sharing site Metacafe sent first name, last name, birthday, email address, physical address, and phone numbers to 2 companies.
- Signing up on the NBC website sent the user’s email address to 7 companies.
- Signing up on Weather Underground sent the user’s email address to 22 companies.
- The mandatory mailing list page during CNBC signup sent the user’s email address to 2 companies.
- Clicking the validation link in the Reuters signup email sent the user’s email address to 5 companies.
- Interacting with Bleacher Report sent the user’s first and last names to 15 companies.
- Interacting with classmates.com sent the user’s first and last names to 22 companies.
Whose Fault Is It Anyway?
All this research shows is what data is going to third parties and what identifying information can be gleaned from it. It doesn’t show what the third parties actually do with this information on reception, which Mayer points out was out of the research’s scope.
“We did not study – and cannot study – what companies do when they receive personal information. It is likely that many of the information leaks we identified were logged. Some third parties may take precautions to prevent logging of identifying information, and we certainly laud such efforts. But for policy purposes, there is a tremendous difference between a tracking ecosystem that is anonymous and a tracking ecosystem that is suffused with identity but promises to ignore it.”
No, the data is not anonymous, but these websites are also not delivering PII right into the hands of third parties — most data collectors will argue that they wouldn’t strip out the personal identification because they don’t want the PII (it causes problems). And it’s not anonymous at the source because the publishers haven’t anonymized it — 72 pubs in this study managed to have systems in place to keep user login data anonymous.
Interestingly, Mayer seems to absolve developers (and by extension publishers) of responsibility by saying this kind of thing just happens.
“Many times, developers are not thinking about privacy issues, and it’s a fact of life that information is going to leak to third parties. I think we have to recognize that’s just the way the Web works,” he said at the press conference.
Further, in the report he writes:
“The better practice for all first-party and third-party websites would be to acknowledge that identifying information leakage is a fact of life on the web, and that identifying information may be shared with third parties.”
And then in The Wall Street Journal, he said:
“The web is suffused with identity. And it’s a fact of life that that identity will get sent to third parties at some point.”
So maybe I’ve been wrong all along — it’s not that Internet privacy is an oxymoron, but that online data security offered by publishers is an oxymoron. Wow, that makes me feel so much better. I’d love to hear industry perspective on Mayer’s suggestion.
It’s kind of an end-run argument, though not a bad one, for Do Not Track functionality (the press conference appeared to be a big pep rally for DNT efforts) — there’s no helping personal information being shared with data collectors, so if you’re worried about it, flip on DNT and cut off the cookies.
At the same time, I can’t help thinking about the Facebook privacy scandal last year in which WSJ discovered social games played within the network were sending Facebook unique IDs to third-party ad servers. It’s a pretty similar case — and WSJ couldn’t find any instances of third-party ad tech firms using the data or associating the IDs with profiles (just companies that refused to).
So shouldn’t the onus fall on the publishers to tighten up the management of personal data — including logins and user names? Not to be too repetitive, but isn’t this a site security issue being stretched into a justification for DNT?
(At the same time, I’m not saying DNT is a bad idea… I’m just being critical, which is why they pay me the big bucks. Maybe some third-party data service can tell you how much.)
A Bold Accusation
Once again, data collectors (the cyberazzi, as FTC Chair Leibowitz would call them) are being vilified without a bit of proof. It’s always implied that data collectors are doing nasty things, like building profiles with PII (well, Rapleaf does that, but they’re very transparent). However, Mayer does cite an example of third-party data collectors purposefully grabbing very personal information, and it would be a damning claim if there were corroboration.
“In computer security, leakage is a term of art for an information flow – some instances of leakage are entirely intentional. For example, OkCupid, a free online dating website, appears to sell user information to the data providers BlueKai and Lotame, including gender, age, ZIP code, relationship status, and drug use frequency.”
First, Mayer seems to be confusing a data-buying agreement with data leakage. Second, BlueKai and Lotame vehemently deny this claim.
While it is contractually forbidden to disclose all the data categories it receives from a specific partner, BlueKai says it only collects general demographic and interest data (zip code, age and gender were cited) and that none of it is connected to individuals or user names. Consumers are invited to visit the BlueKai Registry to manage their interests and opt outs, as well as see what their cookies say about them.
As of press time, a representative from BlueKai said that they were “working with them to get this corrected.”
UPDATE: Oct. 12, 9:20 a.m. Mayer updated the blog posting on
[Update 10/11: The original version of this post conflated the information OkCupid provides to Lotame and BlueKai. In the interest of complete accuracy, and in response to both a deluge of questions on OkCupid’s intentional leakage and a note from BlueKai seeking clarification, I have updated this section with per-company intentional leakage. I have also included the results of a leakage test (with the methodology described below) on OkCupid. My apologies to BlueKai for the incorrect implication that it collects the same sensitive profile data that Lotame does. The ambiguous discussion was solely my error.]
He gives this list of what the companies “appear” to receive — “To learn which profile information OkCupid leaks, I modified each field of a profile and observed how values sent to the two companies changed.”
- Age – Both
- Cats – Both
- Children – Both
- Country – Both
- Dogs – Both
- Drinking Frequency – Lotame
- Drug Use Frequency – Lotame
- Education – Both
- Ethnicity – Lotame
- Gender – Both
- Income – Both
- Job Sector – Both
- Language Proficiencies – BlueKai
- Relationship Status – Lotame
- Religion – Lotame
- Smoking Frequency – Lotame
- State – Both
- ZIP Code – Both
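The test Mayer describes (change one profile field, then watch which values sent to each company change) can be sketched roughly as follows. This is a hedged illustration, not his actual tooling: it assumes the values reach the third party as query-string parameters, and the tracker hostname and the parameter names `a`, `g` and `z` are invented for the example.

```python
from urllib.parse import urlsplit, parse_qs


def changed_params(before_url, after_url):
    """Names of query parameters whose values differ between two captured requests."""
    before = parse_qs(urlsplit(before_url).query)
    after = parse_qs(urlsplit(after_url).query)
    # Compare over the union of keys so added or dropped parameters also count.
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))


# Hypothetical requests captured before and after editing only the age field.
before = "http://tracker.example/pixel?a=27&g=m&z=94305"
after = "http://tracker.example/pixel?a=28&g=m&z=94305"

print(changed_params(before, after))
```

If editing only the age field changes only `a`, that parameter plausibly carries age. Repeating the comparison per field and per company would yield a table like the one above.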