Web Scraping: For and Against

Web scraping is a fact of modern life. Whatever one person can put on a website, another person can find a way to copy and reuse. Is that a good thing or a bad thing? It depends. Consider these examples of how web scraping has been used both for commendable purposes and for infamous ones.

The Good: The Wayback Machine

The Wayback Machine, a service of the Internet Archive, strives to preserve, and provide access to, all the information on the World Wide Web, whether it’s still out there today or not. It’s a vast digital archive that lets users retrieve web pages that have been deleted and follow links that have since broken. Since 2001 the service has archived over 300 billion web pages. Unlike many web scrapers, it respects a site owner’s request that their site not be scraped.
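
One common way a site owner makes such a request is a robots.txt file. The article doesn't say which mechanism the Wayback Machine honors, so treat the following as a minimal sketch of how a well-behaved scraper might check permission before fetching a page, not as a description of the Archive's own code. The robots.txt content, the crawler name ExampleArchiver, and the example.com URLs are all hypothetical; the parser is the one in Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration):
# it blocks one named crawler entirely and keeps every other
# crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleArchiver", "https://example.com/page"),
        ("OtherBot", "https://example.com/page"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

A real scraper would download the site's own robots.txt before making this check. Nothing technically forces a scraper to obey the answer, which is why respecting it is a matter of good practice rather than enforcement.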

The Wayback Machine and Wikipedia

Can you believe what you read on Wikipedia? Often you can, especially where citations link to the source. Unfortunately, many of those links break over time. Also, citations sometimes refer to books rather than online sources.

After the 2016 election, Internet Archive leveraged the Wayback Machine in an effort to make the contents of hardcopy books available online. They built a library of 50,000 books and updated 130,000 Wikipedia citations to reach them. If one of these citations includes a page number, you can click through and see a two-page preview of the book. You can also borrow a copy from a digital library. This includes not only books with a Library of Congress number but also some rare books with limited availability.

This Wayback web scraping effort has made Wikipedia a more valuable tool for starting a research project or for reliably settling arguments.

The Bad: Clearview AI

The internet contains images of most of us, most frequently on social media sites such as Facebook. There may even be pictures other people have posted without our knowledge or permission. Web scrapers can capture these images and use them for purposes that invade our privacy.

A surveillance company, Clearview AI, scraped three billion images from the web and packaged them for sale to law enforcement agencies. They pulled the pictures from Facebook, YouTube and Venmo. Once police had Clearview AI’s images, they could upload a suspect photo and immediately get back all the matching internet photos along with links to where they were posted.

This is the type of “Big Brother” capability that privacy advocates are worried about. It’s unclear whether people identified and tracked this way have any kind of legal recourse.

This sort of web scraping is against Facebook’s policy, which is why the platform demanded that these scrapers cease and desist. It has also created technological barriers to protect images against scraping. However, not all sites have the resources that Facebook can bring to this fight. Also, whenever there are barriers, more and more sophisticated efforts are made to breach them.

The Mixed: Ryanair

Ryanair is an Irish airline that has seen its share of controversy since its launch in 1984. One of these controversies revolves around the web scraping of its data by travel fare aggregators such as Expedia.

Ryanair has filed lawsuits against Expedia, in both the US and Ireland, demanding that it stop scraping Ryanair’s flight and price data. Ryanair claims that such scraping is a copyright infringement and a violation of computer fraud and abuse laws.

Unlike Facebook photos, this is publicly available data that Ryanair actively presents to anyone who looks at its website. Expedia and Ryanair’s competitors would argue that Ryanair’s rights are not being infringed and that the airline is simply trying to keep customers from comparing its prices with those of other airlines.

Is Expedia wrong in scraping this data? While the data belongs to Ryanair, it’s not an airline corporate secret, and this type of web scraping is a common practice in the travel industry. Travel customers see the scraping and repackaging of the data as a consumer-friendly practice. They'd call Expedia's web scraping good, but Ryanair insists it's not.

For or Against?

Copying is as old as history. It’s reported that audiences in Shakespeare’s day wrote down lines as fast as actors could say them. More recently, concert-goers smuggled in audio equipment to record bootlegs. No one argues about whether writing implements or audio equipment are inherently good or bad.

The difference in web scraping isn’t in what it does. It’s that it can copy faster and in greater quantity than ever before. In and of itself, it’s neither good nor bad. It’s a tool.

Tools can be used for good or evil. How web scraping is used is entirely up to the user. Misuse of web scraping must be addressed the same way as misuse of any other copying tool: by laws governing copyright infringement and by measures website owners take to protect their assets.