Bots and Analytics

02 Apr 15

So I’ve spent the past few days adding Google Analytics tracking to the site. It’s just pageview tracking plus a small timeout snippet that sends a ‘ping’ to Analytics once someone has been on a page for longer than fifteen seconds. In the interest of transparency and full disclosure, though, I’ll be explaining exactly what the code snippets do and why we have them, as well as what kind of data is being collected from visitors.

You can opt out by disabling JavaScript from running on the relevant pages, either by changing your browser settings directly or downloading a plugin like NoScript for Firefox and putting both “random-hall.mit.edu” and “web.mit.edu/random-hall/www/” on your blacklist.

(This post obviously has nothing to do with the fact that it has now been a week or so since my last update and I have nothing new and/or exciting to write about.)

We are first going to look at all the sketchy scripts that run when a page is opened on either the Random Hall website or the blog. Pasted into the code box below is all the relevant JavaScript that shows up in the footer of both sites — the blog, of course, has an extra line of script that calls lightbox, but that’s not relevant to us right now so we can ignore it.

<script type="text/javascript" src="http://code.jquery.com/jquery-latest.min.js">
</script>
<script type="text/javascript"> 
    $(document).ready(function() {
        $('a[href^="mailto:"]').on('click', function() {
            this.href = this.href.replace('AT','@').replace('DOT','.');
        });
        
        setTimeout(function() {
            ga('send', 'event', 'page', 'read');
        }, 15000);
 
        (function(i,s,o,g,r,a,m){
            i['GoogleAnalyticsObject']=r;
            i[r]=i[r]||function(){
                (i[r].q=i[r].q||[]).push(arguments)
            },i[r].l=1*new Date();
            a=s.createElement(o), m=s.getElementsByTagName(o)[0];
            a.async=1;
            a.src=g;
            m.parentNode.insertBefore(a,m)
        })
        (window,document,'script','//www.google-analytics.com/analytics.js','ga');
        ga('require', 'linker');
        ga('linker:autoLink', ['http://random-hall.mit.edu/blog/']);
        ga('create', 'UA-61305805-1', 'auto', {
            'allowLinker': true
        });
        ga('send', 'pageview'); 
    });
</script> 

The very first line loads jQuery, a library that makes it possible to write JavaScript in fewer lines:

<script type="text/javascript" src="http://code.jquery.com/jquery-latest.min.js">
</script>

It’s basically there so I can use $(document).ready(), which makes whatever is inside it wait until the page’s DOM has finished loading before doing anything. This $(document).ready() wrapper encompasses all the other inline JavaScript in the footer, so none of it should execute until the page is ready.

Within $(document).ready() are several pieces of code, the first of which is for unmasking email addresses:

$('a[href^="mailto:"]').on('click', function() {
    this.href = this.href.replace('AT','@').replace('DOT','.');
});

Basically, all the email addresses on this site have their ‘@’ and ‘.’ replaced with ‘AT’ and ‘DOT’ to discourage web crawlers from harvesting them. This piece of code turns them back into working email addresses when the links are clicked on, so anyone who actually wants to contact the email address can simply click on it and get the right address.
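One thing worth knowing about the snippet above: when replace() is given a plain string as its pattern, it only swaps the first match, so an address with more than one ‘DOT’ would come out half unmasked. Below is a minimal sketch of a variant that handles that case. The function name unmaskMailto and the example address are made up for illustration, and this is not the code running on the site.

```javascript
// Hypothetical helper, not the site's actual code.
// Replaces the first 'AT' with '@' and every 'DOT' with '.'.
function unmaskMailto(href) {
  return href
    .replace('AT', '@')      // an address has exactly one '@', so first match only
    .replace(/DOT/g, '.');   // the g flag replaces all occurrences of 'DOT'
}

// Made-up address with two dots in the domain:
console.log(unmaskMailto('mailto:someoneATcsDOTexampleDOTedu'));
// prints "mailto:someone@cs.example.edu"

// In the browser it could be wired up the same way as the original:
// $('a[href^="mailto:"]').on('click', function () {
//   this.href = unmaskMailto(this.href);
// });
```

Both versions are case-sensitive, so lowercase ‘at’ or ‘dot’ inside a name is left alone.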

Our next set of lines is the timeout function:

setTimeout(function() {
    ga('send', 'event', 'page', 'read');
}, 15000); 

This function exists because Google Analytics assumes that anyone who genuinely visits a website will click on links or go through more than one page in a session. In most cases this may be true: a website that has all of its articles displayed on the front page, for instance, guarantees that most visitors will start at the front page and then click on a link that sends them to the relevant article. When a visitor deviates from this expected behavior, however (e.g. goes directly to a page, reads it, then closes it without clicking on any links to other pages), Google records the session as a “bounce”, reasoning that since the person hasn’t clicked on anything they must not have been interacting with the page, and that the website has therefore failed to reach them.
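To make the idea of a bounce concrete, here is a deliberately simplified model of it. It is not Google’s actual implementation. The hit objects and the isBounce name are made up for illustration, and the only rule modeled is “a session with a single interaction hit counts as a bounce”.

```javascript
// Simplified model of a bounce, NOT Google's actual implementation.
// A "hit" is a hypothetical plain object such as { type: 'pageview' }.
// Hits flagged nonInteraction are ignored when deciding.
function isBounce(hits) {
  var interactions = hits.filter(function (hit) {
    return !hit.nonInteraction;
  });
  // Only the landing pageview and nothing else: treat it as a bounce.
  return interactions.length <= 1;
}

var directVisit = [{ type: 'pageview' }];
var browsingVisit = [{ type: 'pageview' }, { type: 'pageview' }];

console.log(isBounce(directVisit));   // prints true
console.log(isBounce(browsingVisit)); // prints false
```

In this model any second interaction hit, whether another pageview or an event, is enough to stop a session from counting as a bounce.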

Both the Random Hall website and blog have many self-contained pages that can be read without consulting other pages or links. And if someone visits a page looking for something specific, finds it, then closes the tab or navigates away, we still want to count that as a successful visit.

So the timeout function starts a fifteen-second countdown the moment the page is ready. When fifteen seconds have passed without interruption, i.e. when the page has been left open for fifteen seconds for one reason or another, it automatically declares a ‘read’ event and sends it along to Google Analytics, which notes that down as an interaction and prevents the session from being labeled a bounce.
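For anyone who wants to poke at this behavior without waiting fifteen seconds or talking to Google, the same call can be wrapped so the timer and the ga() function are passed in. This is only a sketch: scheduleReadEvent and its parameters are hypothetical names, not something the site or Google Analytics provides.

```javascript
// Hypothetical wrapper around the same setTimeout call.
// `send` stands in for the global ga() function; `schedule` defaults to
// setTimeout and is a parameter only so the timer can be faked in a test.
function scheduleReadEvent(send, delayMs, schedule) {
  var timer = schedule || setTimeout;
  return timer(function () {
    // Same arguments the site passes to ga() after the countdown.
    send('send', 'event', 'page', 'read');
  }, delayMs);
}

// In the browser this would be equivalent to the original snippet:
// scheduleReadEvent(ga, 15000);
```

Passing in a fake timer that runs the callback immediately makes it easy to confirm that exactly one ‘read’ event would be sent.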

Why only fifteen seconds? Some of the informational pages on the Random Hall website and blog are short, and I figured that if you were staring at a page for longer than fifteen seconds you were probably legitimately skimming it for information, in which case I would consider you a successful visitor and give myself a pat on the back.

The third and largest chunk of code is the rest of the Google Analytics code, which was copy-pasted directly from the Google Analytics cross-domain setup guide.

This part sets up a temporary ga() command queue and asynchronously loads Google’s analytics.js script, which then processes whatever has been queued.

(function(i,s,o,g,r,a,m){
    i['GoogleAnalyticsObject']=r;
    i[r]=i[r]||function(){
       (i[r].q=i[r].q||[]).push(arguments)
    },i[r].l=1*new Date();
    a=s.createElement(o), m=s.getElementsByTagName(o)[0];
    a.async=1;
    a.src=g;
    m.parentNode.insertBefore(a,m)
})
(window,document,'script','//www.google-analytics.com/analytics.js','ga');
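The minified snippet is hard to read, so here is my attempt at spelling out the same logic with longer names. The function and variable names are my own and appear nowhere in Google’s code. Treat it as a reading aid and not as a replacement for the official snippet.

```javascript
// De-minified reading aid for the loader above; all names are hypothetical.
function installAnalyticsStub(win, doc, tagName, scriptUrl, name) {
  // Record which global name the library should use ('ga' on this site).
  win.GoogleAnalyticsObject = name;

  // Until analytics.js arrives, ga(...) just stores its arguments in a queue.
  win[name] = win[name] || function () {
    (win[name].q = win[name].q || []).push(arguments);
  };

  // Timestamp of when the stub was installed.
  win[name].l = 1 * new Date();

  // Create an async <script> tag and insert it before the first one on the page.
  var script = doc.createElement(tagName);
  var firstScript = doc.getElementsByTagName(tagName)[0];
  script.async = 1;
  script.src = scriptUrl;
  firstScript.parentNode.insertBefore(script, firstScript);
}

// In the browser, equivalent to the call at the end of the minified version:
// installAnalyticsStub(window, document, 'script',
//     '//www.google-analytics.com/analytics.js', 'ga');
```

The queue is the reason the later ga('create', ...) and ga('send', ...) lines can run before analytics.js has finished downloading.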

This part adds the cross-domain linking, so we don’t double count visitors who go from the website to the blog or vice versa.

ga('require', 'linker');
ga('linker:autoLink', ['http://web.mit.edu/random-hall/www/']);

Then this part creates the tracker for our property ID and sends a pageview to Google Analytics, which compiles the information and spits it back out on my dashboard every morning as a series of pretty graphs and tables.

ga('create', 'UA-61305805-1', 'auto', {
    'allowLinker': true
});
ga('send', 'pageview');

As far as I can tell, the amount of information Google Analytics can potentially capture is pretty scary. From several preliminary runs the other day I was able to gather general age, gender, and location data down to the city level, as well as network provider and internet browser. The data was all anonymized, of course, and presented along the lines of “~75% of your visitors were from Cambridge, Massachusetts”, but the Demographics and Interests tab was sufficiently sketchy that I ended up disabling it rather than risk gathering too much information. I also cut the tracked Events down to a single ‘ping’ sent after a 15s countdown, instead of tracking every single mouse click on a link or image in the page.

What information is being collected, then? As of this writing, the tracker takes location, time accessed, and session duration data, along with browser and new vs existing user information. It also sends ‘read’ events when a page has been open for more than 15s, allowing us to filter out who is actually reading the content and who is just passing through. From what I can tell, the data available to me contains no IP addresses or anything more personally identifiable than city and network provider. It is being collected primarily so the webmins can see the kind of traffic we are getting and decide what kind of information we want presented on both the website and the blog.

Again, you can opt out of Analytics by either disabling JavaScript (on Chrome, Settings > Show advanced settings… > Privacy > Content settings > JavaScript) or installing a plugin that blocks scripts from running on some webpages. If need be, I will write up a continuation post in a few days detailing precisely how to do this on the more popular browsers, or at the very least, edit this post to add links to the relevant guides. :)