Official Blog of the AALS Section on Contracts

What’s All the Fuss About? The Great Scrape

August 23, 2024

Occasionally, new private law scholarship posted on SSRN gets downloaded by thousands of people.  When it does, inquiring minds want to know what all the fuss is about.  This feature of the blog gives you the tl;dr on what you really ought to be reading for yourself.  Today’s subject is the most recent paper by Daniel Solove and Woodrow Hartzog, The Great Scrape: The Clash Between Scraping and Privacy, which is pushing 2000 downloads on SSRN.

Scraping, the Authors tell us, is the automated extraction of large amounts of data from the internet.  Through scraping, actors gather enormous amounts of data and personal information (worrisome) without notice or consent (troubling), and then this information provides fodder for AI tools such as facial recognition, deep fakes, and large language models (panic-inducing). (4 – parentheticals added)  The Authors concede that scraping has its socially beneficial uses, but scraping of personal data “violates nearly every key privacy principle embodied in privacy laws, frameworks, and codes” and is, in short, “antithetical to privacy.” (4) While scrapers contend that they make use of publicly available data, courts have recognized a privacy interest in publicly-available but practically obscure personal information. (4)

We need scraping to have a usable Internet, but scraping is in fundamental tension with basic privacy law.  The Authors call for responding to the Great Scrape with the Great Reconciliation of scraping and privacy norms. (5)

Part I of the Article provides a history and explanation of scraping.  We first learn that scraping, that is, online data harvesting, has been around as long as the Internet (7-9), but the power of scraping tools has grown vastly in the age of AI. (9-10) If you are on this site, statistically, it’s more likely that you are a bot scraping the blog than a human reading the blog.  Now, if you happen to be a bot, I’m not judging you.  The Authors say I can’t, because the scraping of personal data occurs in the murk of an ethical twilight zone. (11-13) Which brings us to the current conundrum of “scraping wars.”  Some of the very websites that hire scrapers to enhance their functionality now object to being scraped for other purposes. (13-14) They are fighting back against the scrapers through legal challenges with theories ranging from trespass and fraud to business torts and violations of privacy protections (14-20), and by trying to use technology so that they can fight fire with firewall. (20-21)  While scrapers are trying to buy out the resistance (23), regulatory intervention might change the market conditions for doing so. (23-27) The Authors highlight EU regulatory actions against Clearview AI. (25-26) While the FTC may have the legal means to regulate scrapers, it is not clear that it has the political clout to do so. (26-27)

In Part II, the Authors detail the fundamental tension between scraping and privacy.  Privacy law is governed by bedrock principles known as the Fair Information Practice Principles (FIPP).  FIPP comes down to three rules: only collect data when necessary, keep the data safe, and be transparent.  According to the Authors, scraping violates all of these principles. (29) The overarching goal of FIPP is fairness, but the Authors also list seven other fundamental principles. (30-38) Their conclusion is not optimistic: “It is not clear that scraping can be performed in a privacy-friendly way.” This is so because both the fundamental principles of privacy and the building blocks of privacy laws are “in dramatic conflict with scraping.” (38)

Scrapers defend themselves by claiming that they only access publicly available information.  In the next section of their paper, the Authors set out to show that the claim “that there is no privacy interest in publicly-available information is normatively and legally wrong.” (39) First, it is simplistic to think that we can categorize information as “public” or private.  People may still have an expectation of privacy in information that has been denoted “public” for certain purposes. (39-41) Some regulatory schemes and some caselaw recognize that privacy laws need to shield at least some publicly available information from scraping.  There is safety in obscurity; SCOTUS implicitly recognized this in Carpenter when it noted that “A person does not surrender all Fourth Amendment protections by venturing into the public sphere.” (44) One used to be able to make information about oneself available to the public without worrying about its dissemination, because the effort it would take to gather that information greatly exceeded its value.  But with the aid of AI, scrapers can hoover up everyone’s information with great efficiency.  Privacy law has not fully reckoned with this environmental shift.

Image by DALL-E

In Part III, the Authors introduce their proposed Great Reconciliation.  They propose that we re-conceive scraping as a form of surveillance and as a data-security violation. (45) Defenders of scraping maintain that it is just like human web browsing, which is true in the sense that a grain of sand is like a beach, or as the Authors put it, “But this ignores scraping’s incredible affordances of scale.” (47) The Authors propose that data protection authorities, like the FTC, could impose obligations on entities entrusted with people’s data to protect that data from scraping, just as they have an obligation to take measures to prevent other data-security violations. (49-50)

The Authors note that privacy law alone cannot effectuate the desired Great Reconciliation.  Some privacy approaches might lead to a total ban on scraping, which would be undesirable (52-54), but other privacy laws are too loose and too easily evaded. (51) The solution involves a broader inquiry into whether particular forms of scraping are in the public interest. (52) One helpful first step would be to require individual consent for data scraping, but as anyone who has bought anything online this century knows, there are problems with the way courts have construed consent in this country. (54-55) Moreover, powerful websites may negotiate deals to sell scraping rights and further monetize their control of data, exacerbating the yawning gap between the haves and the have-nots. (55-56)

The Authors propose a legal system that regards scraping as a privilege.  In order to exercise the privilege, the scraper must (1) have a valid justification; (2) provide substantive protections to ensure safety and avoid exploitation; and (3) provide procedural safeguards to ensure fairness and preserve the agency of the people whose information is to be scraped. (56) Their model draws on Lawrence Gostin’s model for public health. (57-58) The remainder of the paper is a detailed proposal for ensuring that scraping is conducted in a manner consistent with the public interest.  It defies easy summary and demands careful reading, so I encourage you to undertake that task. (58-64)

If you missed our previous columns in the series and still don’t know what the fuss was about, here’s what you missed: