Blog Post

Understanding Google's Information Empire



 

I’m running Mac OS X 10.7.5. I like to use Chrome, Safari, and Firefox, in that order. On Linux I use IceWeasel. I speak English. I am part of Brown University. This information is technical - it’s part of the mechanics of any network, and in most situations, it’s all contained in the packets any computer sends out when it connects to the Internet. But that’s just the beginning of the information that exists online and describes my Internet identity. Google, for example, knows that I am an 18 - 24 year old male, that in the last month I haven’t left Providence, and that I have an iPhone. Based on my recent browsing and search history, Google has gotten this information about me:

 

What Google doesn’t know is that I’m a student at Brown. It doesn’t know that I’m taking Intro to Software Engineering (in Java) and Music Theory, or that I’m currently researching Google’s own user tracking practices. But it wouldn’t take a genius to connect the dots, especially if you looked at the fluctuation in these interests over time - information that Google stores. Unsurprisingly, the persistence and the accuracy of this online data are unsettling for some people. The recent NSA surveillance scandal has raised awareness of the size and quality of personal data collected by big data companies, but the average user still knows very little about how the data is collected and used. I have chosen to focus this post on Google for obvious reasons: Google started off as just a search engine, which means it specialized in data-mining from the get-go. Point 2 of the company philosophy states that: “It’s best to do one thing really well. We do search. Our dedication to improving search helps us apply what we’ve learned to new products, like Gmail and Google Maps.” It’s a little hard to believe, but the many services Google now offers are still ancillary to search - Point 7 is “There is always more information out there”. The cost of developing those services is counterbalanced by the value of the information Google acquires every time someone uses one.

Information, particularly the personal kind, is Google’s currency in a metaphorical but also in a very literal sense. In 2011, 96% of Google’s (enormous) revenue came from advertising. So how does Google go about collecting and using user data in advertising? Google’s main advertising platform is a four-headed monster consisting of Google AdWords, Google AdSense, DoubleClick, and Google Analytics. AdWords is the simplest - advertisers buy space amongst Google search results and then place ads relevant to the search terms. AdSense allows sites other than Google to host the same ads, except the ads are targeted to match the content of the host website. These two are both effective advertising tools, but neither requires enough information about individual users to make people uncomfortable. DoubleClick and Google Analytics are different: together they gather information about individual users and their internet behavior. DoubleClick attempts to build profiles of interests and behavior patterns, and Google Analytics tracks many features of users’ interactions with specific websites - like every click, hover, or mouse movement. Within these broader descriptions the following specific technologies may be used:

IP Address - a unique number assigned to each computer connected to the internet. Used heavily by DoubleClick to monitor ad occurrence frequency, and target locations.

Cookies - small (4 KB max) files which contain data about a user’s interaction with a website. One common usage is authentication - a cookie is responsible for remembering whether or not you are logged into your Gmail account.

Flash Cookies (Locally Shared Objects) - similar to a cookie, but larger and stored in a different location (relies on Flash, hence the name). Some advertising companies that relied on cookies used these as backups in case their cookies were cleared. If the cookie expired or was deleted, the company could regenerate it from the Flash cookie. They’ve acquired a bit of a bad name, and web browsers (including Google’s Chrome) have recently begun to crack down on Flash cookie storage.

Anonymous Identifier - a string of random characters for identifying unique users in the absence of cookies. More common on mobile devices which tend to have more tightly protected filesystems.

Web Beacons/Pixel Tags/Web Bugs - a 1 pixel image, usually totally invisible to the user. The 1 pixel makes a difference though - a host server will be contacted and notified if you load a page including a web beacon. Especially common in emails, where advertisers use pixel tags to identify whether or not you ever opened their email.

Browser Fingerprinting - when advertising companies such as Google are attempting to identify unique customers, they might attempt to do so based on the configuration of your browser. The Electronic Frontier Foundation offers a tool, called Panopticlick, to measure how unique your specific browser configuration is. When I used it, I found that my browser configuration was unique amongst 4.5 million others. Most browsers can be uniquely identified based on their version, operating system, plugins, fonts, screen size, and preferences. An advertising company can then drop a cookie on a user, and be reasonably sure that the user is the same one.

Content Tracking - Have a Gmail account? Use Google Docs? Google’s crawlers pore through all user-generated content searching for keywords.  Google is careful to state that no humans will see this information. Unless they have a warrant. Or work at Google. Or have your account information. Or are good at hacking.

Click Tracking - Google uses its Analytics platform to literally track every single click on Google’s websites. They can collect some very specific information too - ever tried to click something, accidentally clicked the wrong button, but then dragged your mouse away before releasing the button? Google saw it.

Server Request Tracking - a pretty obvious one, if you’re a search engine. For every Google search you do, Google stores the (1) IP address, (2) time, (3) language, (4) search query, (5) OS, and (6) browser of the user. By amassing search terms and combing through them with an algorithm, companies generate interest-profiles like in the image above.

Referrer Tracking - Referrals are the foundation of Google’s PageRank algorithm - for all intents and purposes they are the arteries of the internet, bouncing traffic around the system. Perhaps the most fundamental web technology is the hyperlink. When you click a hyperlink, your browser is redirected to another page. But what most users don’t know is that that page knows exactly where you came from. DuckDuckGo, a competing web search engine, provides a troubling example: you search for “herpes.” You click the first link that pops up. That link gets information about the referring page - your Google search for herpes. Based on a combination of IP geolocation and browser fingerprinting, that page could get a pretty good idea of who you are. A third-party ad hosted on that site might make the connection between your presence on a page which is the top result for herpes and your uniquely identifying cookies, and then serve you ads for herpes medication that follow you across their network.
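To make the referrer example a little more concrete, here is a minimal Python sketch of what a destination site could do with the referring URL it receives. The example URL and the `q` parameter name are assumptions chosen for illustration, and whether a given search engine actually passes the query along depends on how it is configured.

```python
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer):
    """Return the search query embedded in a referring URL, or None."""
    # Split the URL into parts and decode its query string into a dict.
    params = parse_qs(urlparse(referrer).query)
    # "q" is an assumed parameter name for the search terms.
    terms = params.get("q")
    return terms[0] if terms else None


# Hypothetical value of the Referer header sent with the click.
example_referrer = "https://www.google.com/search?q=herpes"
print(search_terms_from_referrer(example_referrer))  # prints "herpes"
```

Nothing here requires cooperation from the user: the browser attaches the referring URL on its own, and the receiving page only has to read it.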
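The web beacon entry in the list above can be sketched the same way. This is an illustration under stated assumptions, not any advertiser's real system: the `tracker.example` host, the `uid` and `c` parameter names, and both helper functions are hypothetical. The idea is that each email carries a 1-pixel image whose URL is unique to the recipient, so the request for the image reveals who opened the message.

```python
from html import escape
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER_URL = "https://tracker.example/open.gif"  # hypothetical tracking host


def beacon_tag(recipient_id, campaign):
    """Build the invisible 1x1 image tag embedded in one recipient's email."""
    query = urlencode({"uid": recipient_id, "c": campaign})
    # Escape the URL so the "&" is valid inside an HTML attribute.
    src = escape(f"{TRACKER_URL}?{query}")
    return f'<img src="{src}" width="1" height="1" alt="">'


def record_open(request_url, client_ip, log):
    """Server side: turn a request for the pixel into an 'email opened' record."""
    params = parse_qs(urlparse(request_url).query)
    entry = {"uid": params["uid"][0], "campaign": params["c"][0], "ip": client_ip}
    log.append(entry)
    return entry


tag = beacon_tag("user-42", "spring-sale")
opens = []
# Simulate the mail client fetching the image when the message is displayed.
record_open(f"{TRACKER_URL}?uid=user-42&c=spring-sale", "203.0.113.7", opens)
print(tag)
print(opens)
```

This is also why many mail clients offer to block remote images by default: if the pixel is never fetched, the open is never recorded.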
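For the browser fingerprinting entry, a rough sketch shows why a handful of ordinary settings is enough to recognize a returning browser. The attribute names and values are made up for illustration, and hashing them with SHA-256 is my own assumption rather than a description of how Panopticlick or any advertiser works. The second function puts the "unique amongst 4.5 million" figure in perspective by converting it to bits of identifying information.

```python
import hashlib
import math


def fingerprint(attributes):
    """Hash a browser's reported attributes into one stable identifier."""
    # Sort the keys so the same configuration always yields the same string.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def identifying_bits(population):
    """Bits of information needed to single out one browser among `population`."""
    return math.log2(population)


# Illustrative values only; a real script would read these from the browser.
browser = {
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
    "language": "en-US",
    "screen": "1440x900x24",
    "fonts": "Helvetica,Menlo,Times",
    "plugins": "Flash,QuickTime",
}
# The same settings reported in a different order on a later visit.
same_browser_later = dict(reversed(list(browser.items())))

print(fingerprint(browser) == fingerprint(same_browser_later))  # prints True
print(round(identifying_bits(4_500_000), 1))  # prints 22.1
```

Roughly 22 bits is not much: each attribute that splits the population into groups contributes a few of them, which is why a particular combination of fonts and plugins can be as identifying as a cookie.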

 

The above tracking methods are not just employed sporadically by Google, but across Google’s entire content network. Specific Google Apps use more specialized tracking methods as well. There is a startling amount of location data collected from smartphones by Google. Most people aren’t aware that if you opt in to location services, say, to use Google Maps on your iPhone, then Google will attempt to continuously monitor your location as long as you have a cell signal or wifi connection. And this is in spite of the notoriously friction-filled relationship between Google and Apple. It gets even scarier - despite the iPhone’s popularity, iOS commands only 17.3% of the smartphone operating-system market - good for second place amongst all OSs. But a whopping 75% of all smartphones use Google’s Android operating system - a system which has location tracking baked into the software. It’s possible to turn this off, but only at a cost: maps, weather, and other apps that rely on location data can no longer perform their functions.

 

On January 8th, France’s data protection authority CNIL fined Google 150,000 euros over Google’s (March 2012) unified privacy policy. Essentially, Google had released one privacy policy to stretch across all of its products and services. France and five other EU countries promptly launched investigations. Two weeks ago, France’s highest administrative appeals court denied Google’s appeal of the fine and required Google to post a notice of the sanction. In their press release, CNIL approved of Google’s stated objective - to simplify the privacy policy - but cited several legal requirements Google failed to address in the new policy.

 

The company does not sufficiently inform its users of the conditions in which their personal data are processed, nor of the purposes of this processing. They may therefore neither understand the purposes for which their data are collected, which are not specific as the law requires, nor the ambit of the data collected through the different services concerned. Consequently, they are not able to exercise their rights, in particular their right of access, objection or deletion.

The company does not comply with its obligation to obtain user consent prior to the storage of cookies on their terminals.

It fails to define retention periods applicable to the data which it processes.

Finally, it permits itself to combine all the data it collects about its users across all of its services without any legal basis. - CNIL.

 

The last two points are the particularly troubling ones - the combination of the massive amounts of data Google collects across specific user profiles could make it increasingly easy for governments, hackers, etc. to get an increasingly complete portrait of a user’s life.

The fact that Google collects and uses this information does not necessarily make it evil. Especially in Google’s case, there is a tangible value that the user pays for with his data. Google’s profit from that data is tangible as well - online advertising is accepted by most people because it keeps free services free. These compromises are relatively safe because they are explicit. But as CNIL points out, nobody fully understands what Google does with its treasure trove behind the scenes. How long does the data persist? What does Google do with it, besides advertising? Until the full extent of Google’s data processing and distribution is known, the only way users can really exercise their rights is to know exactly what information they are giving and how.


2 comments

This post does a great job of laying out the specific types of data that companies like Google can collect from you. I am most interested in a point you make in your last paragraph, "...could make it increasingly easy for governments, hackers, etc. to get an increasingly complete portrait of a user's life. The fact that Google collects and uses this information does not necessarily make it evil." I have had some questions about the fear of big data and maybe you can answer them for me. If I am a law-abiding citizen, why is it such a bad thing that a big corporation like Google knows my location, age, gender, interests, etc. if they are just using that information to target me with ads? So maybe the answer is that they can do/are doing more than just targeting me with ads. But still, so what? The fears I have in this world are of something bad happening to my loved ones. I could be sounding very naive right now, I don't know, but Google isn't trying to harm my family... So why so much secrecy?

It seems to me that big data might actually be able to fight crime. If you're not doing anything sketchy on the Internet, then why care? I know there are answers to these questions out there but after your very informative post I wonder even more about what the real fear is at the heart of big data. 


What an interesting project--this really speaks to current discussions on where the currency of personal data factors into national boundary-making and corporate fashioning of modern lives (real and perceived). Since historians and other scholars increasingly rely on Google Books or the newly relaunched News Archive to conduct their research, I wonder how Google uses that data? While this project focuses more on personal data (which is a much-needed field of study), I'd be curious to learn how Google is shaping a profile of what the modern researcher looks/acts like online. Or in other words, how does Google use all of these amazing bells and whistles to profile the "Google Scholar"? And can/will the ad wizards accurately distinguish between collecting "personal" data and "public" data, i.e. for a piece of historical scholarship?
