friso.lol

I'm being serious here.

Web Analytics Privacy — A Matter of Trust

TL;DR

  • Web analytics and tracking are non-trivial.
  • The privacy aspects of a solution cannot be fully inspected from the outside; hence, privacy is ultimately a matter of trust.
  • Cookies have a bad reputation, but are actually the one thing we can inspect from the outside.
  • EU lawmakers at some point started waging war against cookies, so now we have cookie banners, but no improved privacy.
  • A fully privacy-preserving analytics solution inherently cannot know about unique users.

Prelude

A post on LinkedIn by Lukas Vermeer, asking for recommendations on a web analytics setup, prompted me to write this piece on the topic.

Lukas poses two very basic questions:

  1. Is anyone visiting my websites?
  2. What are they reading there?

There is one important caveat: we are looking for the most privacy-friendly approach to answering these.

The post on LinkedIn received about seventy comments, a handful of which even make varying degrees of sense. The responses also suggest a growing consensus amongst our tech-savvy cohort that the solution offered by Plausible Analytics is exactly what we want. Plausible Analytics is very privacy friendly, because this is stated many times on their company website. Also, they do not use cookies as part of their solution…

Why Cookies?

Web Analytics is aggregate reporting on individual visitor events. Visitor events are actions like page view, scroll to bottom, select text, etc. Most solutions report page view events out of the box, with the option of reporting other events through custom client-side code. When reporting on events in aggregate, you need some dimensions for aggregation. These dimensions come from event metadata. For example:

  • reporting the number of page views per unique user agent requires storing the User-Agent header of incoming requests.
  • reporting the number of page views per session from a visitor requires storing a session identifier with incoming requests, so we can connect individual events to a session.
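
To make "dimensions for aggregation" concrete, here is a minimal sketch in Python. The events and their field names (path, user_agent, session_id) are made up for illustration; a real solution would read them from stored request metadata.

```python
from collections import Counter

# Hypothetical page view events; the field names are illustrative assumptions.
events = [
    {"path": "/", "user_agent": "Firefox", "session_id": "s1"},
    {"path": "/about", "user_agent": "Firefox", "session_id": "s1"},
    {"path": "/", "user_agent": "Safari", "session_id": "s2"},
]


def count_by(events, dimension):
    """Aggregate page views by a single metadata dimension."""
    return Counter(event[dimension] for event in events)


views_per_user_agent = count_by(events, "user_agent")
views_per_session = count_by(events, "session_id")
```

The point is that each dimension you want to report on has to be stored with every event. If the session identifier is not there, the second aggregation is simply impossible.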

The tricky bit with the second example is that HTTP requests do not have session identifiers, as HTTP is stateless. The stateless nature of HTTP was a nuisance for many other use cases as well, so the clever people of the IETF came up with an HTTP State Management Mechanism, specified in RFC 6265, which standardises two HTTP headers: Cookie and Set-Cookie. It is the use of these two HTTP headers from that 37-page specification which, at the time of this writing, EU lawmakers have been failing to definitively regulate for several years and counting.

To create session, device, or user context, your solution needs a persistent identifier on the client device throughout the context lifecycle. Cookies are the simplest thing that is universally available, and supported by all web browsers. Because the identifiers are persistent, and controlled by the publisher, they can be used to track users.
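
As a rough illustration of how little is involved, the sketch below issues and reads such an identifier using Python's standard http.cookies module. The cookie name sid, the 30-minute lifetime and the chosen attributes are my assumptions, not a recommendation from any particular analytics product.

```python
import secrets
from http.cookies import SimpleCookie

COOKIE_NAME = "sid"  # hypothetical cookie name


def read_session_id(cookie_header):
    """Return the identifier from a Cookie request header, or None if absent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    return morsel.value if morsel is not None else None


def issue_session_cookie(max_age_seconds=1800):
    """Create a random identifier and the matching Set-Cookie header value."""
    session_id = secrets.token_urlsafe(16)
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = session_id
    cookie[COOKIE_NAME]["max-age"] = max_age_seconds  # assumed lifetime
    cookie[COOKIE_NAME]["path"] = "/"
    cookie[COOKIE_NAME]["httponly"] = True
    cookie[COOKIE_NAME]["samesite"] = "Lax"
    return session_id, cookie[COOKIE_NAME].OutputString()
```

The server sends the second return value in a Set-Cookie response header; the browser then returns the identifier in the Cookie header of every subsequent request, where read_session_id picks it up. Anyone can see this exchange in their browser's developer tools.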

No Cookies Equals Privacy, Right?

We could think about our users' privacy conceptually, and work our way towards a privacy first solution from first principles. But in reality, we tend to be much more concerned with legal definitions of privacy, and making sure to cover our behinds in terms of compliance. Let us focus on that.

In The Netherlands, and the EU, we concern ourselves with two legal frameworks: GDPR, and the Cookie Law. The former is called AVG in Dutch, and the latter is actually the EU's ePrivacy Directive (EPD). GDPR is about what you can and cannot do with PII (Personally Identifiable Information), while the EPD tells us about storing identifiers for tracking purposes (using cookies or otherwise).

One detail to note is that for EU stuff, things that end with an R (GDPR) are Regulations, meaning law. Things that end with a D (EPD) are Directives, meaning it is expected that member states encode the directive into local laws. The successor to EPD should be EPR (ePrivacy Regulation), which then would be EU law. Of course the target date for EPR going into effect was missed, and regulators have still not caught up with technology, so who knows whether this will ever happen. In the meantime, we are stuck with the cookie banner.

Details aside, simply put: if you store and process PII, look at GDPR; if you store and process identifiers of your own making or otherwise on client devices, look at EPD (Cookie Law).

One very unfortunate coincidence is that the network layer of the internet, by its very nature, requires the client's IP address to be visible to the server, and IP addresses are considered PII by EU regulation in most real-world scenarios. Hence, we have to trust the server side of our page view to handle our PII in a compliant manner. Cookies or not.

Things are not likely to ever change, because there is one group of lawmakers working very hard to prevent excessive tracking of users on the internet (privacy FTW!), while there is another group of lawmakers working very hard to prevent users on the internet from achieving complete anonymity (because, think of the children). (For the record: I have no opinion on the merit of either cause; just highlighting this reality).

How No Cookies?

Then what does Plausible Analytics do? Simple: they hash the client's IP address together with a random salt using a one-way hash function and use the resulting identifier as a sort of "one-day user ID". The random salt refreshes at the end of every day, so data from previous days can never be reversed to IP addresses. The resulting identifier links every event to the context of a unique user for that day. Quite probably, the implementation will attempt to add a bit of additional entropy, such as the user agent string and other information available from the request. Altogether, we have to trust them to forget about the IP immediately after hashing, and we have to trust them to properly rotate the salt.
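
To make the idea concrete, here is a minimal sketch in Python. It is not Plausible's actual implementation: the class name, the choice of SHA-256, the salt size and the way IP and user agent are combined are all my assumptions.

```python
import hashlib
import secrets
from datetime import date


class DailyVisitorHasher:
    """Sketch of a salted, one-way "one-day user ID" (hypothetical, not Plausible's code)."""

    def __init__(self):
        self._salt = None
        self._salt_day = None

    def _salt_for(self, day):
        # Generate a fresh salt when the day changes. The previous salt is
        # overwritten, so identifiers from earlier days cannot be recomputed.
        if day != self._salt_day:
            self._salt = secrets.token_bytes(32)
            self._salt_day = day
        return self._salt

    def visitor_id(self, ip, user_agent, day=None):
        salt = self._salt_for(day or date.today())
        digest = hashlib.sha256()
        # A zero byte separates IP and user agent so the inputs stay unambiguous.
        for part in (salt, ip.encode(), b"\x00", user_agent.encode()):
            digest.update(part)
        return digest.hexdigest()
```

Within one day, the same IP and user agent always map to the same identifier; the next day they map to a different one. Note that the privacy property rests entirely on the salt being discarded: the IPv4 address space is small enough that anyone holding the salt could brute-force the hashes.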

Whether this approach complies with EPD without user consent (a cookie banner) is beyond my knowledge of the legal details. GDPR is about PII, which, it is argued, does not apply here, as no PII is stored or obtained beyond necessity. EPD is more specifically about persistent identifiers that are not PII, but do allow tracking. This is the case here, but the persistent identifier is a derivative of an already existing identifier that is inherent to the protocols in use. It is essentially poor man's device fingerprinting, with an implied promise to use a different fingerprinting function the next day.

Other Concerns

Why do we bother with generating identifiers in the first place? What is wrong with just counting the incoming requests in the server log? Two things: de-duplication, and (partial) bot detection.

Pages get reloaded for various reasons, including scrolling too far up on iOS, and browsers suspending tabs when the system is under memory pressure (and then reloading when the user gets back to them). Without a client-side identifier, each of these reloads counts as a new visitor.

Web pages attract numerous non-human visitors. Some from well behaved crawlers with proper identification through the user-agent string, but also some from nefarious actors looking to discover if your site is prone to various types of attacks. A small fraction of bots will actually render the page, thus request all related resources such as images and scripts, while most will just grab the HTML and be done with it.

This is why all web analytics solutions rely on a secondary request initiated by the client after the web page is loaded. This can be a simple <img ...> tag, but is usually a small script that creates a tracking request on the fly. This ensures the page is in fact rendered, yielding reasonable confidence that a human initiated the page view.

Answering the Question

Getting back to the original questions: "Is anyone visiting my websites?" and "What are they reading there?" The absolute simplest thing you can do is:

  • Set up an nginx instance on a VPS.
  • Host a single 1x1 pixel transparent image from it.
  • Set up access logging to log the request with user-agent and referrer included, excluding the client's IP address.
  • Include an <img src="that 1x1 pixel image"> on all your pages.
  • Parse the access logs (the referrer will be the page that initiated the request).

Caveats:

  • The image must be hosted on a subdomain of your website domain, so the image will be first party.
  • Beware of the referrerpolicy, and loading attributes. You will want eager loading, and the unsafe-url referrer policy.
  • This approach does not de-duplicate reloads by the same user.
  • This approach cannot count unique visitors.
  • The image tag based approach will prevent most bots from polluting your log.
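
The log parsing step can be sketched in a few lines of Python. The line shape below is an assumption: it corresponds to a custom nginx log_format along the lines of '$time_iso8601 "$http_referer" "$http_user_agent"', which leaves out the client IP. Adjust the regular expression to whatever format you actually configure.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed line shape: 2024-01-01T12:00:00+00:00 "<referrer>" "<user agent>"
LINE = re.compile(r'^(?P<time>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$')


def page_views(lines):
    """Count pixel requests per referring page path."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match is None or match["referrer"] in ("", "-"):
            continue  # malformed line, or the request carried no referrer
        counts[urlsplit(match["referrer"]).path or "/"] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    '2024-01-01T12:00:00+00:00 "https://example.com/posts/a" "Mozilla/5.0"',
    '2024-01-01T12:00:05+00:00 "https://example.com/posts/a" "Mozilla/5.0"',
    '2024-01-01T12:01:00+00:00 "https://example.com/" "Mozilla/5.0"',
    '2024-01-01T12:02:00+00:00 "-" "curl/8.0"',
]
views = page_views(sample)
```

Requests without a referrer are skipped, since they cannot be attributed to a page. The two hits on /posts/a could be two readers or one reader reloading; as noted above, this approach cannot tell the difference.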

You can host a VPS on Digital Ocean for USD 4 per month.

If you do not want to set up any hosting, the other simplest thing you can do is choose one of the privacy friendly analytics solutions out there. Plausible Analytics uses IP hashing to report unique visitor count. There is also Simple Analytics, which uses a different approach to avoid IP hashing. Instead, page views with an empty referrer are counted as a unique visitor, which holds true as long as no user ever clicks an outside link to your site twice.

Further Recommendations

Personally, I am not a fan of IP hashing. Requests from a corporate controlled IT environment often all originate from the same IP and have the exact same user agent. If your audience is biased towards such users, your stats can be way off. One example I have experienced is a large national news website where the analytics would clearly show the lunch breaks at large government organisations. With IP + user agent hashing, those thousand plus users would have looked like one (very active) user. The claim that IP hashing with a rotating salt avoids EPD (cookie law) is one that should be verified with a specialist in international privacy law before being taken for granted. If you care about tracking for no longer than 24 hours, why not just set a cookie with an appropriate expiration? This is also more transparent to the client, as the cookies can be inspected.
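
The corporate network effect is easy to demonstrate. The sketch below assumes a SHA-256 hash over salt, IP and user agent (my stand-in for whatever a real product does) and compares it with handing each browser its own random cookie value. The IP address and user agent string are made up.

```python
import hashlib
import secrets


def hashed_id(salt, ip, user_agent):
    """Assumed IP + user agent hashing scheme, for illustration only."""
    payload = salt + ip.encode() + b"\x00" + user_agent.encode()
    return hashlib.sha256(payload).hexdigest()


salt = secrets.token_bytes(32)
office_ip = "198.51.100.10"  # one shared outbound address
office_ua = "Mozilla/5.0 (corporate image)"  # identical managed browsers
users = 1000

# Every employee produces the same hash input, hence the same identifier.
hashed_ids = {hashed_id(salt, office_ip, office_ua) for _ in range(users)}

# A per-browser random cookie value keeps them apart (128 random bits each,
# so a collision is possible in theory but practically never happens).
cookie_ids = {secrets.token_urlsafe(16) for _ in range(users)}

print(len(hashed_ids), len(cookie_ids))
```

A thousand readers collapse into a single hashed identifier, while the cookie-based identifiers remain distinct.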

Regardless of solution, always make sure that analytics stays a separate concern from hosting. If your analytics relies on the logs of your particular hosting solution, moving one means moving the other.

Prefer raw data over predefined dashboards. Come up with your own exploratory questions, or hypotheses. Then collect and process.

Conclusion

Tracking and web analytics are non-trivial. Above we have not yet addressed the concerns of Chrome's pre-rendering (looks like a page view, but is never shown to a user), anti-virus software that duplicates user requests to check responses, Google search results preview mode (again, looks like a page view, but only shows a non-interactive render on mouse hover), and bots that do render and evaluate JavaScript (Google).

If your business relies on accurate web analytics, take matters into your own hands and avoid third party solutions.