
What data is collected by Google Analytics (by default)

I am trying to identify what data is actually collected by the default Google Analytics script. What seems like an easy question turns out to have no clear answer.

I know that they collect (for example) the IP address, screen resolution, operating system and so forth, but I simply cannot find a complete list. I do have a list of all the possible dimensions and metrics that can be collected, but not one for the "default" analytics script.

I am asking for a list of all the data collected by default by Google Analytics.

Schuiram asked Jan 07 '15 13:01


1 Answer

... identify what data is actually collected by the default script .... I also have a list of all the possible dimensions and metrics that can be collected

Just to be clear, GA collects more information than it shares with Analytics consumers. While the client-side script may allow for additional data to be collected (like custom query string parameters), most of what it collects seems to be similar on every site, regardless of what the analytics user chooses to consume (with the exception of a few configuration items such as "anonymizeIp").

Google's policies are cleverly worded to indicate that turning on "Advertising Features" doesn't necessarily change what they collect with GA, other than the fact that a new cookie might be present:

By enabling the Advertising Features, you enable Google Analytics to collect data about your traffic via Google advertising cookies and identifiers

Knowing what GA collects (even when you don't ask it to) is particularly important given the ambiguity around whether GA is really compliant with GDPR, which treats IP addresses, cookie identifiers, and GPS locations as "personal data".

Looking at the source code

Google Analytics is a moving target, but there is value in having a snapshot of the identifying information about the client and browser that was being leaked to Google Analytics at a given point in time.

Even though it's a bit outdated, this analysis was done using a manually deobfuscated Google Analytics JavaScript file (snapshot taken Mar 27, 2018).

1. Data available in Document and Window Objects

Some key objects to look for in the analytics JS: DOCUMENT, WINDOW, NAVIGATOR, SCREEN, LOCATION

Here are the items that are utilized by GA (this doesn't necessarily mean the data is sent back to Google in raw form).

Data Utilized         |   Code Snippet
-------------         |   ------------
Url                   |   LOCATION.protocol + "//" + LOCATION.hostname + LOCATION.pathname + LOCATION.search
ReferringPage         |   DOCUMENT.referrer
PageTitle             |   DOCUMENT.title
HowLongIsPageVisible  |   DOCUMENT.visibilityState .. DOCUMENT,"visibilitychange"
DocumentSize          |   DOCUMENT.documentElement  .clientWidth && .clientHeight
ScreenResolution      |   SCREEN.width  SCREEN.height
ScreenColors          |   SCREEN.colorDepth + "-bit"
ClientSize            |   e = document.body; e.clientWidth && e.clientHeight
ViewportSize          |   ca = [documentEl.clientWidth .... : ca = [e.clientWidth .... ca.join("x")
FlashVersion          |   getFlashVersion
Encoding              |   characterSet || DOCUMENT.charset
JSONAvailable         |   window.JSON
JavaEnabled           |   NAVIGATOR.javaEnabled()
Language              |   NAVIGATOR.language || NAVIGATOR.browserLanguage
UserAgent             |   NAVIGATOR.userAgent
Timezone/LocalTime    |   c.getTimezoneOffset(), c.getYear(), c.getDate(), c.getHours(), c.getMinutes()
PerformanceData       |   WINDOW.performance || WINDOW.webkitPerformance   ... loadEventStart,domainLookupEnd,domainLookupStart,connectStart,responseStart,requestStart,responseEnd,responseStart,fetchStart,domInteractive,domContentLoadedEventStart
Plugins               |   NAVIGATOR.plugins
SignalUserLeaving     |   navigator.sendBeacon()  // how long the user was on the page
HistoryLength         |   WINDOW.history.length   // number of pages viewed with this browser tab
IsTopSiteForUser      |   navigator.loadPurpose   // "Top Sites" section of Safari
NameOfPage (JS)       |   WINDOW.name
IsFrame               |   WINDOW.top != WINDOW
IsEmbedded            |   WINDOW.external
RandomData            |   WINDOW.crypto.getRandomValues  // because of the try/catch, it doesn't appear to leak anything other than random values
ScriptTags            |   getElementsByTagName("script");  // probably for Ads, AutoLink decorating [https://support.google.com/analytics/answer/4627488?hl=en] and cross-domain tracking [https://developers.google.com/analytics/devguides/collection/analyticsjs/cross-domain]
Cookies (JS)          |   DOCUMENT.cookie.split(";")   // limited to cookies not marked as server only
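
As a rough illustration of the table above, the following sketch reads a handful of the same properties. It is not Google's code: the function name `collectClientSnapshot` and the shape of the returned object are hypothetical, and the browser objects are passed in as a parameter so the function can also be tried with stand-in objects outside a browser.

```javascript
// Hypothetical helper, not taken from analytics.js: reads a few of the
// properties listed in the table from a window-like object.
function collectClientSnapshot(env) {
  const { location, document, screen, navigator } = env;
  const root = document.documentElement;
  return {
    // Same concatenation as the "Url" row of the table.
    url: location.protocol + "//" + location.hostname + location.pathname + location.search,
    referrer: document.referrer,
    title: document.title,
    screenResolution: screen.width + "x" + screen.height,
    screenColors: screen.colorDepth + "-bit",
    viewportSize: root.clientWidth + "x" + root.clientHeight,
    encoding: document.characterSet || document.charset,
    language: navigator.language || navigator.browserLanguage,
    userAgent: navigator.userAgent,
    // Offset from UTC in minutes, as reported by the local clock.
    timezoneOffsetMinutes: new Date().getTimezoneOffset(),
    // True when the page is running inside a frame.
    isFrame: env.top !== env.self
  };
}
```

In a browser console this could be called as `collectClientSnapshot(window)`. Everything it returns is available to any script on the page without a permission prompt.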

2. Data available from the QueryString and Hash

By default, GA seems to only explicitly collect querystring parameters that are documented as specific to Google Analytics. But keep in mind that they also have the entire URL available to extract this data server-side, querystring and hash included:

_ga
_gac
gclid
gclsrc
dclid
utm_id
utm_campaign
utm_source
utm_medium
utm_term
utm_content
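
To make the list above concrete, here is a minimal sketch that picks only those parameters out of a URL using the standard `URL` API. The helper name `extractGaParams` is hypothetical, and this is not how analytics.js itself parses the query string.

```javascript
// Parameter names copied from the list above.
const GA_PARAMS = [
  "_ga", "_gac", "gclid", "gclsrc", "dclid", "utm_id",
  "utm_campaign", "utm_source", "utm_medium", "utm_term", "utm_content"
];

// Hypothetical helper: returns only the listed parameters that are
// present in the query string of the given absolute URL.
function extractGaParams(href) {
  const query = new URL(href).searchParams;
  const found = {};
  for (const name of GA_PARAMS) {
    if (query.has(name)) {
      found[name] = query.get(name);
    }
  }
  return found;
}
```

For example, `extractGaParams("https://example.com/?utm_source=news&foo=bar")` keeps `utm_source` and ignores `foo`. Anything else in the URL would still reach the server as part of the full page address.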

3. Data available in the HTTP Header

They can choose to capture anything on the request header from the browser. Most notably:

Cookies (Google)   |   for the google analytics domain, to track the user between sites
IP Address         |   (parameter "anonymizeIp" claims to anonymize the IP address)
Browser w/ version |
Operating system   |
Device Type        |   
Referer            |   (in this context, only the url of the page the client is currently on)
X-Forwarded-For    |   Is a proxy being used?  And, if not used for privacy, the actual IP address
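
The table above can be sketched from the receiving side. The helpers below are hypothetical and say nothing about Google's servers. They only show what any HTTP endpoint can read from a request, assuming lower-case header names as Node's `http` module provides them. `maskIpv4` imitates the documented effect of "anonymizeIp" on IPv4 addresses (zeroing the last octet). That is an assumption about behaviour, not Google's implementation.

```javascript
// Hypothetical helper: summarises identifying request headers.
// `headers` is a plain object with lower-case header names.
function describeRequest(headers, remoteAddress) {
  const forwarded = (headers["x-forwarded-for"] || "")
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);
  return {
    // The first X-Forwarded-For entry is conventionally the original client.
    clientIp: forwarded.length > 0 ? forwarded[0] : remoteAddress,
    viaProxy: forwarded.length > 0,
    userAgent: headers["user-agent"] || null,
    referer: headers["referer"] || null,
    hasCookies: Boolean(headers["cookie"])
  };
}

// Hypothetical helper: zeroes the last octet of an IPv4 address.
function maskIpv4(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) {
    return ip; // not dotted IPv4: left unchanged in this sketch
  }
  parts[3] = "0";
  return parts.join(".");
}
```

Note that masking of this kind happens after the full address has already arrived at the server, which is why the table says the parameter "claims" to anonymize.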

4. Other inferred data

Javascript enabled
Cookies enabled

Other identifying information they don't appear to track/utilize

Some other data points are readily available, but GA doesn't appear to access them:

Canvas Supported
CPU Architecture
CPU Number of cores
AudioContext Supported 
Bluetooth Supported
Battery Status
Memory (RAM)
Number of speakers
Number of microphones
Number of webcams
Device Orientation
Device input is Touchscreen
System Fonts
LocalStorage Data
IndexedDB Data
WebRTC Supported
WebGL Supported
WebSocket Supported

Misc Hacks

They don't appear to use any known hacks to extract additional unique user information, such as finding the video card model of the current machine using Canvas and GL. This is not too surprising, since Google can just expose any data they want in Chromium/WebKit.

However, their control of 70% of the browser market gives them the power to manipulate otherwise innocuous functions (like the random number generator) to leak data for user tracking, if they so desire.
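
For reference, the Canvas and GL technique mentioned above is usually written along these lines. This is a sketch of the commonly described approach using the `WEBGL_debug_renderer_info` extension, not something found in the GA script. The function name `getGpuRenderer` is hypothetical, and it takes a canvas-like object so it can be tried with a stand-in.

```javascript
// Hypothetical helper: asks a WebGL context for the unmasked renderer
// string. Returns null when WebGL or the extension is unavailable.
function getGpuRenderer(canvas) {
  const gl = canvas.getContext("webgl");
  if (!gl) {
    return null;
  }
  const info = gl.getExtension("WEBGL_debug_renderer_info");
  if (!info) {
    return null;
  }
  return gl.getParameter(info.UNMASKED_RENDERER_WEBGL);
}
```

In a browser it could be called as `getGpuRenderer(document.createElement("canvas"))`. Some browsers restrict or generalise the value returned.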

Summary

What you choose to see from the Google Analytics portal does not necessarily impact what they collect.

GA helps Google determine how well a site performs for Search Ranking, and creates a User Fingerprint to track what each internet user looks at and for how long. The latter helps them select ads, which is where they make the bulk of their money. Much of the data they touch in their script doesn't get sent back in raw form, but rather, is used to create said fingerprint.

James answered Jan 04 '23 10:01