How are services like Alexa and Google Analytics capable of tracking visitors' age, gender, college education, and so forth?
http://www.alexa.com/siteinfo/stackoverflow.com
Alexa definitely gets its traffic info from its toolbar users. Since that is a relatively small and self-selecting group of people, this inevitably leads to a biased sample (which is why Alexa traffic doesn't match measured traffic on the sites I run). Even with the best statistical techniques for reducing bias, you can never get rid of it entirely when the sample is not drawn uniformly from the population you care about.
It is unclear how Google does it, although it might involve tracking cookies.
A project I have been working on recently has bearing on this question.
Another way to do this (that also has biases, but different ones) would be to use an IP to location service to find the approximate latitude and longitude of each visitor to your site. Then use my project (full disclosure: I run that site and it is commercial):
http://askgeo.com
to get demographic information for that location. AskGeo actually provides demographic information on several geographic levels: state, county, county subdivision, city, ZIP code, census tract (a few thousand people), and census block group (about a thousand people). You'd presumably want to use the finest-grained level (i.e., census block group) for a given latitude and longitude.
The site returns a huge number of demographic variables. The idea would be to use soft counts from the demographic variables provided at the block group level. To take an example, suppose you are trying to track the age distribution of your users. You'd use the age ranges provided in the AskGeo response, and for each visitor you'd add a fractional soft count to each range, equal to the fraction of that block group's population that falls in that age range. For example, take my neighborhood in San Francisco. It has the following age distribution:
... (skipping a bit, as you probably get the idea) ...
If you got an IP address that you tracked to that census block group, you'd add each of those percentages (as a fraction from 0 to 1) to your (soft) counters for those age ranges. (A soft counter is just a counter that allows for non-integer counts.)
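As a rough sketch of the soft-counter idea in Python: the age ranges and fractions below are made-up placeholders (not the real figures for any block group), and `add_soft_counts` is a hypothetical helper name. In practice the fractions would come from whatever the demographic lookup returns for the visitor's block group.

```python
from collections import defaultdict

def add_soft_counts(counters, block_group_fractions):
    """Add one visitor's fractional counts to the running soft counters.

    block_group_fractions maps an age range to the fraction (0 to 1) of the
    block group's population in that range.
    """
    for age_range, fraction in block_group_fractions.items():
        counters[age_range] += fraction

# Soft counters: float-valued, so they can hold non-integer counts.
counters = defaultdict(float)

# Placeholder data: one dict per visit, standing in for the block group
# distribution looked up from the visitor's approximate location.
visits = [
    {"18-24": 0.10, "25-34": 0.40, "35-44": 0.30, "45+": 0.20},
    {"18-24": 0.30, "25-34": 0.20, "35-44": 0.20, "45+": 0.30},
]
for fractions in visits:
    add_soft_counts(counters, fractions)

# Each visit contributes a total weight of 1, so the counters sum to the
# number of visits. Normalizing gives the estimated age mix of the audience.
total = sum(counters.values())
estimated_shares = {age: count / total for age, count in counters.items()}
```

Because each visit adds fractions that sum to 1, the total across all counters equals the number of visits, which makes the normalized shares easy to interpret.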
You could do the same with race, gender, income level, house values, etc.
This method also has biases, for sure, since it assumes that all the people in a given block group are equally likely to visit your site. But it is something you can do yourself on your own site, without relying on Google or Alexa, and it would still give you a relative sense of who is visiting: if your share of soft counts in a given category is higher than the national average for that category, that group is probably over-represented among your visitors.
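One way to make the comparison against the national average concrete is to divide each category's share of your soft counts by its national share. This is only a sketch: `relative_index` is a hypothetical helper, the numbers are placeholders rather than real census figures, and it assumes every category has a non-zero national share.

```python
def relative_index(soft_counts, national_shares):
    """Ratio of each category's share of soft counts to its national share.

    A value above 1 suggests the category is over-represented among visitors
    relative to the national population; below 1, under-represented.
    """
    total = sum(soft_counts.values())
    return {
        category: (count / total) / national_shares[category]
        for category, count in soft_counts.items()
    }

# Placeholder soft counts accumulated over 100 visits.
soft_counts = {"18-24": 30.0, "25-34": 50.0, "35+": 20.0}
# Placeholder national shares (fractions that sum to 1).
national_shares = {"18-24": 0.15, "25-34": 0.25, "35+": 0.60}

index = relative_index(soft_counts, national_shares)
over_represented = sorted(c for c, ratio in index.items() if ratio > 1.0)
```

The same biases still apply, so the ratios are best read as a rough relative signal rather than a measurement.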
It is also possible that a more sophisticated technique than simple direct counts could lead to a much richer result.