Data release: list of websites that have third-party “session replay” scripts

In a recent study we analyzed seven “session replay” services and revealed how they exfiltrate sensitive user data. Here we release the data behind our study: the list of websites from the Alexa top 1 million that embed scripts from analytics providers offering session recording services. The appearance of a website on this list DOES NOT necessarily mean that session recordings occur, as website developers may choose not to enable session recording functionality.

For some sites, we do have evidence of session recordings occurring. We mark these with the tag “evidence of session recording”. For these sites, our measurement bots were able to detect a recording in progress, as detailed in our detection methodology below. The absence of this tag does not mean that recordings don’t occur, only that we don’t know whether they do. That’s because many of the recording services activate their functionality only for a sample of users, either as explicitly defined by the publisher site or as enforced by a daily recording limit. Thus, it is possible that our bot was not included in the sample when it visited the site, even though other users might be.
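To illustrate why an untagged site may still record sessions, here is a minimal sketch. It assumes, purely for illustration, that a site records each visit independently with a fixed sampling rate; real services may instead use daily limits or other rules. The function name `miss_probability` and the example rates are hypothetical and are not taken from our measurements.

```python
def miss_probability(sample_rate: float, visits: int = 1) -> float:
    """Chance that none of `visits` independent visits is recorded.

    Assumes each visit is recorded independently with probability
    `sample_rate` (an illustrative model, not a measured one).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if visits < 1:
        raise ValueError("visits must be at least 1")
    return (1.0 - sample_rate) ** visits


if __name__ == "__main__":
    # Hypothetical sampling rates; one bot visit per site.
    for rate in (0.05, 0.25, 1.0):
        print(f"sample rate {rate:.2f}: miss probability {miss_probability(rate):.2f}")
```

Under this model, a site that records only a small fraction of visits would almost always go undetected by a single bot visit, which is why the missing tag is not evidence of absence.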

As such, this list provides both an upper and a lower bound on the presence of session recording companies on the web: the full list is an upper bound, and the subset of sites tagged with evidence of recording is a lower bound. Two of the 14 companies included in the data release, Yandex and Hotjar, have a diverse set of analytics services, many of which have no overlap with session recording. The remaining companies mostly offer similar services, which include session replay, heat maps, click maps, and form analytics.

The list below contains sites that are ranked in the top 10,000 according to Alexa. Download the zipped CSV file for the full list.

Update (30 November 2017): 2455 sites were added to the site list which were incorrectly excluded from the initial release. In addition, we fixed 36 records for which the displayed site name was corrupted.

The methodology for detecting evidence of session recording is given below.

Read the blog post » WebTAP Project »

Methodology

We detect evidence of session recording by combining signals from the following sources of data:

  1. We detect sites which embed scripts from session recording services using the network data from the September 2017 Princeton Web Census. The list of script URL patterns used to detect these embeddings is available here.
  2. We examine several of the recording companies to determine whether they have a unique “backend” URL which is only present when a recording is in progress. We discovered such URLs for Yandex Metrika, Hotjar, Mouseflow, Clicktale, and Decibel Insight, and use these to mark sites across the September 2017 Princeton Web Census dataset.
  3. We use a more targeted two-step crawling measurement based on OpenWPM to measure 50,000 sites sampled from the top 1 million. First, the crawler injects a unique string into the HTML of the page and searches the page traffic for evidence of that value being sent to a third party. To detect values that may be encoded or hashed, we use a detection methodology similar to previous work on email tracking. Second, after filtering out the recipients of the unique string, we isolate pages on which at least one third party receives a large amount of HTTP POST data during the visit, but for which we do not detect the unique string. On these sites, we perform a follow-up crawl which injects a 200KB chunk of data into the page and checks whether we observe a corresponding bump in the size of the data sent to the third party.
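The two checks in step 3 can be sketched roughly as follows. This is a simplified illustration, not the code used in the study: the function names, the set of encodings and hashes, the `(host, body)` request format, and the `min_fraction` threshold are all assumptions made for the example. It only looks for a single layer of encoding, whereas the actual methodology handles more cases.

```python
import base64
import hashlib
import urllib.parse


def encoded_variants(token: str) -> set:
    """Return the token plus a few common encodings and hashes of it."""
    raw = token.encode("utf-8")
    variants = {
        token,
        urllib.parse.quote(token, safe=""),     # URL-encoded
        base64.b64encode(raw).decode("ascii"),  # base64
        raw.hex(),                              # hex-encoded bytes
    }
    for name in ("md5", "sha1", "sha256"):
        variants.add(hashlib.new(name, raw).hexdigest())
    return variants


def third_parties_receiving(token, requests, first_party):
    """Return third-party hosts whose request payload contains the token.

    `requests` is an iterable of (host, body) string pairs, a hypothetical
    simplification of the crawler's recorded traffic.
    """
    needles = encoded_variants(token)
    leaks = set()
    for host, body in requests:
        # Skip the first party and its subdomains.
        if host == first_party or host.endswith("." + first_party):
            continue
        if any(needle in body for needle in needles):
            leaks.add(host)
    return leaks


def shows_size_bump(baseline_bytes, followup_bytes,
                    injected_bytes=200_000, min_fraction=0.5):
    """Check whether data sent to a third party grew after the injection.

    `min_fraction` is an illustrative threshold: the growth must be at
    least that fraction of the injected chunk to count as a bump.
    """
    return followup_bytes - baseline_bytes >= min_fraction * injected_bytes


if __name__ == "__main__":
    token = "canary-7f3a"  # hypothetical unique string injected into the page
    traffic = [
        ("example.com", "q=canary-7f3a"),
        ("replay.example.net",
         "data=" + base64.b64encode(token.encode()).decode()),
        ("cdn.example.org", "no match here"),
    ]
    print(third_parties_receiving(token, traffic, "example.com"))
    print(shows_size_bump(40_000, 250_000))
```

The threshold is kept below 1.0 because a recording script may compress or truncate page content before sending it, so the observed increase can be smaller than the injected chunk.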