We recently received notification that our work was accepted for publication at The Web Conference (WWW 2020). In this post, we provide a short summary of the work.
Title: “Apophanies or Epiphanies? How Crawlers Impact Our Understanding of the Web”
Authors: Syed Suleman Ahmad (UWisconsin), Muhammad Daniyal Dar (UIowa), Muhammad Fareed Zaffar (LUMS), Narseo Vallina-Rodriguez (IMDEA/ICSI), Rishab Nithyanand (UIowa)
Venue: The Web Conference (WWW 2020)
Data generated by web crawlers has formed the basis for much of our current understanding of the Internet. However, not all crawlers are created equal and crawlers generally find themselves trading off between computational overhead, developer effort, data accuracy, and completeness. Therefore, the choice of crawler has a critical impact on the data generated and knowledge inferred from it. In this paper, we conducted a systematic study of the trade-offs presented by different crawlers and the impact that these can have on different types of measurement studies.
Publications are often not specific enough in their crawling methodology expositions. Web crawls are a major component of research in the Internet measurement, security, and privacy communities, with over 16% of all publications in the last four editions of the premier venues relying on data from crawlers. Rather worryingly, our survey also highlights a lack of specificity in how research publications describe their crawling methodologies. In fact, over 35% of the crawl-dependent publications were not reproducible due to the absence of information about the crawl methodology. Finally, we observe that crawl reproducibility, the incidence of custom crawling solutions, and the variation across crawling tools all depend on the research domain and community. Below is a table that summarizes our survey findings.
Not all crawling tools are equal. Our analysis suggests that specific crawling tools are better suited to specific research domains and that an incorrect choice of crawler may significantly impact research results. For example, using low-level application-layer crawlers (e.g., wget and curl) to gather network traces as input to a website fingerprinting classifier might over-estimate the attack’s success rate, because these crawlers do not load dynamic content. Similarly, studies seeking to understand censorship or differential treatment of users will likely produce different results when measured using browser-layer crawlers (e.g., Selenium) than when using user-layer crawlers (e.g., OpenWPM), which incorporate bot-detection mitigation. Below is a table which compares the features available to each crawler included in our qualitative analysis.
Crawler choices can impact research inferences. We find that using different crawling tools can (1) yield different rankings, in terms of effectiveness, of website fingerprinting classifiers, (2) change our understanding of the online third-party ecosystem, and (3) impact our understanding of why websites might be inaccessible (e.g., due to censorship or server-side blocking). Our study highlights that researchers must consider not only crawler categories but also the underlying browser engine used by the crawler, as it has a significant influence on the results generated. Below are some results which highlight the differences in data generated by different crawlers.
– First, we see that there is a difference in the amount of data generated, and in its sources, when using different crawlers to load the same websites. We expect this to impact performance measurements.
– Second, different crawlers support different ciphersuites (dependent on their implementation and, if applicable, the browsers they drive). We expect this to impact security measurements.
– Looking at website fingerprinting classifier accuracy measurements, we see that, for the classifiers proposed in early literature on website fingerprinting (NB: Naive Bayes, MB: Multinomial Bayes, JS: Jaccard Similarity), using different crawlers could have resulted in a different ranking of classifier effectiveness. For example, using OpenWPM we’d see that the JS classifier is better than the NB classifier, while using Selenium would yield the opposite inference.
– Looking at measurements of the online third-party ecosystem, we see that different online trackers have different prevalence and prominence when using different crawlers. Our analysis finds that this is due to the different browsers driven by different crawlers.
– Finally, servers respond to different crawlers in different ways, as illustrated by the following table, which breaks down the incidence rates of, and reasons for, server-side blocking.
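To make the third-party comparison above concrete, here is a minimal Python sketch of how one might check whether two crawlers agree on which third parties are most prevalent. It is an illustration under stated assumptions, not the analysis pipeline used in the paper: the function names `prevalence` and `rank`, the input format (a mapping from each crawled site to the set of third-party domains seen while loading it), and all of the toy data are hypothetical. It computes only prevalence (the fraction of sites on which a domain appears), not prominence.

```python
from collections import Counter


def prevalence(crawl):
    """Fraction of crawled sites on which each third-party domain was seen.

    `crawl` maps a first-party site to the set of third-party domains
    observed while loading it (hypothetical input format).
    """
    if not crawl:
        return {}
    # Count each domain at most once per site.
    counts = Counter(d for domains in crawl.values() for d in set(domains))
    return {domain: n / len(crawl) for domain, n in counts.items()}


def rank(scores):
    """Domains ordered from most to least prevalent (ties broken by name)."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy, made-up observations for two hypothetical crawlers.
# These are NOT measurements from the paper.
crawler_a = {
    "site1.example": {"tracker-x.example", "cdn-y.example"},
    "site2.example": {"tracker-x.example"},
    "site3.example": {"cdn-y.example", "tracker-x.example"},
}
crawler_b = {
    "site1.example": {"cdn-y.example"},
    "site2.example": {"cdn-y.example", "tracker-x.example"},
    "site3.example": {"cdn-y.example"},
}

rank_a = rank(prevalence(crawler_a))
rank_b = rank(prevalence(crawler_b))
print("crawler A ranking:", rank_a)
print("crawler B ranking:", rank_b)
print("rankings agree:", rank_a == rank_b)
```

Running the same comparison on real crawl logs from two tools that loaded the same list of sites would show whether the choice of crawler changes which third parties appear most widespread.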