15 September 2020

The Devil Is in the Data

By Alex Engler 

In the late 19th century, chemist Harvey W. Wiley analyzed the health effects of processed foods, alerting the nation to how contaminated they were. His 50-year campaign led to the Pure Food and Drug Act, the Meat Inspection Act, and, eventually, our modern standards of food safety. But a reformer akin to Wiley would be stymied by the technology sector today. Many observers agree that it’s long past time to impose more regulation and oversight on the tech sector. Yet the practices of these companies are hidden from reporters, researchers and regulators. This information asymmetry between the technology companies and the public is among the biggest issues in technology policy.

Users themselves often have little window into the choices that tech companies make for them. More and more of the web is bespoke: Your feeds, search results, followers, and friends are yours and yours alone. This is true offline, too, as algorithmic tools screen job applicants and personalize medical treatments. While there are advantages to these digital services, this personalization comes at a cost to transparency. Companies collect enormous volumes of data and feed them through computer programs to shape each user’s experience. These algorithms are hidden from view, so it is often impossible to know which parts of the digital world are shared.

And it’s not just users who are operating in the dark. Many of the practices of technology companies are also opaque to the government institutions that should be providing oversight. It is impossible to govern algorithms with anecdotes, so this needs to change: Government agencies need to be able to access the datasets that drive the tech sector.

A cascade of news stories detailing bad data behavior by tech companies highlights the importance of allowing the government to get a better sense of how these companies operate. Reporting has unveiled, for example, how a market of data brokers consolidates creepily specific location-tracking data, exposing the personal information of hundreds of millions of people. And this initial data collection has downstream consequences. Much of it fuels an advertising technology industry that enables gender and racial discrimination in housing and employment advertising, evaluates Airbnb and Grubhub interactions to assign secret consumer scores to individual customers, and amplifies dangerous misinformation. The effects go far beyond the internet, as algorithms prioritize health care for white patients and perpetuate biases in hiring under the pseudoscientific guise of rigor.

Academics and newsrooms are working furiously to trace these patterns, but they typically are equipped only for narrow investigations, not sectoral reviews. And there is no guarantee that the companies in the press are the worst offenders—sometimes bad actors fly just under the radar. Conversely, in some cases, reporting may be overblown, drawing our attention and outrage to the wrong places. There is also little follow-up, since journalists and scholars aren’t equipped to reinvestigate the same issues over and over. Like Wiley, Upton Sinclair would be foiled if he turned his investigative eye from the meatpacking industry to Silicon Valley—it’s hard to go undercover inside a website.

What’s missing, then? Well, that would be the federal government. 

It would be easy to blame the absence of regulatory oversight on the apathy and incompetence of the Trump administration. But these problems predated the current White House occupant and, without broader changes, will persist. This is because U.S. public institutions are disadvantaged by an enormous information asymmetry. They have neither the mandate nor the capacity to examine proprietary corporate data systems or to scrutinize the math behind the curtain.

Congressional Democrats seem aware of this quandary: The House Judiciary Committee’s laudable investigation of big technology companies netted more than 1.3 million documents from Apple, Google, Amazon and Facebook. This is progress, and the information revealed by the document dump is meaningful to the antitrust debate (as well as to ongoing administrative and civil investigations) around the big four tech companies.

However, the problems in big tech go beyond antitrust issues and beyond these four companies. 

The 1.3 million documents, already enough to strain a congressional committee’s staff, are a drop in the bucket compared to the exabytes of data being created by the digital economy. Just as our individual experiences cannot reveal an algorithm’s inner workings, neither can the tech sector be fully understood by reading its emails. 

To be clear, there are plenty of barriers to effective information sharing with the government. 

Some tech companies, especially those whose business practices most merit oversight, will avoid committing those practices to writing. (“Assume every document will become public[,]” reads internal Google guidance.) Further, tech executives themselves may not fully understand their own systems. Reading internal documents might reveal intentional malpractice, but the worst of tech is known only to a rumor mill of in-house data scientists. They know the results from database queries that never made it into a memo, and the questions not to ask of their own data.

This is a problem for regulating the tech sector: The scraps of data that tech companies let fall from the table are not enough to govern with. Instead, government regulators need expanded access to the datasets and code that power today’s technology companies. 

What regulatory changes need to take place for this to happen? It may mean expanding the authority of administrative subpoenas for corporate datasets, which would allow federal agencies to access corporate data under certain circumstances, such as a credible suspicion of illegal activity. It would also be necessary to build data infrastructure and hire analysts so that regulatory agencies can securely store and analyze massive datasets. For this, there need to be mechanisms in place to ensure, to the extent possible, the anonymity of personal data (one such mechanism is sketched below). There will also need to be clear firewalls between these oversight agencies and law enforcement agencies, much as there are for the U.S. Census Bureau.
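To make the anonymity requirement concrete, here is a minimal sketch, in Python, of one common mechanism: keyed pseudonymization, in which direct identifiers are replaced with stable pseudonyms before analysts ever see the data. Everything specific here (the key, the pseudonymize function, the sample record) is hypothetical, not a description of any agency's actual system.

```python
import hashlib
import hmac

# A minimal sketch of keyed pseudonymization. The key is held only by the
# receiving agency, so pseudonyms are stable (records can still be linked
# across tables) but cannot be reversed or recomputed by outsiders.
# All names and values below are hypothetical.

AGENCY_KEY = b"rotate-this-key-per-investigation"  # assumption: key management is out of scope

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible pseudonym."""
    return hmac.new(AGENCY_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "ads_clicked": 17}
record["email"] = pseudonymize(record["email"])
print(record)  # {'email': '<64 hex characters>', 'ads_clicked': 17}
```

Pseudonymization alone is not anonymity, which is why the prose above hedges with "to the extent possible": the remaining fields can still enable linkage attacks, so agencies would layer on aggregation rules, access logging and the firewalls described above.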

This data-scientific investigative capacity will be necessary for technology regulators, such as the Federal Communications Commission, the Federal Trade Commission, or any agency created for consumer data protection. But this capacity isn’t just necessary for creating and enforcing new legal restrictions: Many agencies need to be able to access and analyze corporate data just to enforce the laws already on the books. Software has eaten the world, and most industries would be better understood through their data.

Analyzing data from the respective industries they regulate could help the Equal Employment Opportunity Commission to enforce fair employment practices, the Office of the Comptroller of the Currency to investigate financial services, and the Department of Housing and Urban Development to fight housing discrimination. With this data, government researchers and outside scholars could also develop a more nuanced understanding of how technology affects markets and society, enabling a more robust national conversation on what behaviors to allow and what to proscribe.

Companies that already have ethical algorithmic practices should welcome this expanded oversight, for perhaps counterintuitive reasons. As it stands now, the market advantage tends to lie with companies that pay little attention to designing ethical digital products. It is expensive to design ethical algorithms, and without regulation, there is not much payoff. Building diverse and balanced datasets to develop models, testing exhaustively for robustness, and auditing models for biased outcomes (a minimal example of such an audit appears below) are all time-consuming tasks that can be done well only by experienced data scientists. Other than the satisfaction of being upright citizens, these companies don’t get much out of this work—sometimes a better product, but that can be hard to prove to clients and customers. And other times, fairness reduces profitability. An ethical technology company can publish a blog post on its “ethical artificial intelligence framework,” but so can literally every other company—an example of what AI Now Institute’s Meredith Whittaker has called “ethics theatre.”
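As an illustration of what auditing a model for biased outcomes can mean in practice, here is a minimal Python sketch of one standard check: comparing a model's selection rates across demographic groups against the EEOC's informal four-fifths rule. The decision records and group labels are invented for the example; a real audit would involve far more than this single ratio.

```python
from collections import defaultdict

# Hypothetical audit data: (demographic group, model's hire/no-hire decision).
# In a real audit, these would come from the system's logged decisions.
decisions = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]

totals = defaultdict(int)
selected = defaultdict(int)
for group, hired in decisions:
    totals[group] += 1
    selected[group] += hired

# Selection rate per group, then the ratio of the lowest to the highest rate.
rates = {group: selected[group] / totals[group] for group in totals}
impact_ratio = min(rates.values()) / max(rates.values())

print(rates)                                          # {'group_a': 0.75, 'group_b': 0.25}
print(f"disparate impact ratio: {impact_ratio:.2f}")  # 0.33

# Under the four-fifths rule, a ratio below 0.8 flags the system for scrutiny.
if impact_ratio < 0.8:
    print("flag: selection rates differ enough to warrant review")
```

Even this toy check takes careful work in practice (defining the groups, choosing the outcome measure, collecting decision logs), which is the point: without a regulator who might run such checks, few companies will pay for them.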

But this dynamic flips in a market with competent regulatory investigations. There would be real consequences for deploying unfair and illegal systems if executive branch regulators could more easily discover and police bad behavior. The more upstanding tech companies will be rewarded for their investments as their less scrupulous competitors are investigated, prosecuted and fined. Over time, the worst companies can be driven out. 

None of this can happen without resolving the information gap between the public and tech companies. Without data, existing regulatory policies risk becoming increasingly unenforceable as corporations digitize products and services. Any new oversight legislation will face the same challenge. After decades of largely unregulated technology services, it is clear that some companies are poisoning the digital well, and it is past time to find out which ones.
