31 August 2022

Should Uncle Sam Worry About ‘Foreign’ Open-Source Software? Geographic Known Unknowns and Open-Source Software Security

Dan Geer, John Speed Meyers, Jacqueline Kazil, Tom Pike

Nationalism has come to software. While downloading TikTok or WeChat onto your cell phone isn’t quite tantamount to installing Huawei equipment in your local cell tower, all indications suggest that a software geopolitical divide has arrived and won’t be going anywhere. This divide already informs whether the U.S. federal government consumes open-source software, arguably reducing the digital productivity of the U.S. government in the here-and-now in order to avoid potential compromise. What is the range of coherent policy choices?

In our professional careers working with U.S. government organizations, we have observed that government officials and staff sometimes choose not to use open-source software components developed by foreign (often Chinese and/or Russian) software developers. Is this a meaningful, silent drag on the digital productivity of U.S. government agencies? Do government staff have to rewrite code from scratch, use second-best components, or abandon their original aims? Existing statutory language exempts open-source software from the typical foreign ownership, control, or influence concerns associated with federal procurement. But is that exemption good enough, much less wise?

We’ve heard anecdotes from a defense contractor that the web server NGINX (a popular piece of software for storing and delivering web pages) has been banned from some government networks because one of the developers associated with the project is Russian. Separately, a former intelligence officer related to us a story of being forced to remove an open-source software component from the software application he developed because the component’s developers lived in Russia. The security team ordered this replacement despite (to the best of our informant’s knowledge) having no evidence of security problems, malicious or otherwise. By contrast, we have heard no stories of government staff worrying about open-source components for which developer information is simply missing.

Computer science scholar Olav Lysne has shown that software from untrusted vendors cannot be verified and that an untrusted vendor cannot build trust into software artifacts. In other words, the use of open-source libraries begins with trust. As such, when considering the use of open-source software, the nationality of open-source contributors raises the question: Whom do you trust? Is trust transitive? Assuredly not. Is mistrust transitive?

Nationalism is prevalent in the use of open-source software. One might assert that nationalist thinking simply applies “zero trust” to software. But it’s perhaps more accurate to say that this techno-social melange serves as a reminder that existing computer security frameworks don’t have much to offer on whether nationality is a proxy for trustworthiness (though there is certainly dogma aplenty that will color many observers’ perspectives). For instance, our past research on software supply chain compromises found no cases where knowing the geographic location of an open-source developer would have prevented an open-source software compromise. Is that good enough for government work?

Since there are questions about the political geography of widely used open-source components, it seemed timely to investigate the geographic provenance of widely used open-source software components and ask questions about how the federal government should make decisions about its ingestion of open-source software packages. To answer these questions, we measured the self-reported location of open-source software developers for popular open-source packages, hoping to inform federal information technology risk management. Our measurement casts some doubt on the utility of geographic provenance and calls for approaches that emphasize software integrity.

What is the geography of popular open-source software packages?

Open-source software packages are publicly available software components with licenses that allow software developers to inspect, use, modify, and enhance the source code. They are the opposite of proprietary code. Currently, practicing software developers can turn to millions of already created, free-to-use software packages to do their jobs more quickly and easily. According to one estimate, modern software applications often contain over a hundred open-source software components.

These components are typically stored in the open-source equivalent of Apple’s app store. They are essentially registries of published software packages. Most open-source software lives in these registries. To conduct our analysis, we selected two popular open-source registries: the Python Package Index (PyPI) and npm. The Python Package Index is home to many Python packages, where Python is an open-source programming language particularly popular among data science and machine learning practitioners. npm is a registry for JavaScript packages and has been especially valuable to many web developers and anybody using web- and browser-based technologies. We selected as our study population the top 100 most downloaded PyPI packages and the top 70 most widely depended-upon npm packages. (The npm data source provided only the top 70 most depended-upon packages.)

We then used an open-source tool called GitGeo to analyze the contributors to packages and to predict, when possible, the country in which each developer resides. The GitGeo tool makes this prediction by using these developers’ GitHub profiles, a page similar to a Facebook profile where developers can optionally provide their location information, inter alia. We first looked at the top 100 contributors to each package and then repeated our analysis using only the top 10. “Contributors” to a package are those users who make changes to the package (that is, adding or removing code). The more changes an open-source software developer makes to a package, the more central, arguably, that developer is to the continued maintenance, health, and security of that package. Figure 1 displays the four graphs that resulted from this analysis. Each column of each graph represents one open-source software package, and the colors of each stacked bar represent the different locations of developers associated with that package.
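The aggregation described above can be illustrated with a minimal Python sketch. This is not GitGeo's code or API: the lookup table `LOCATION_HINTS`, the function names, and the sample data are all hypothetical, and a real tool would need a far larger gazetteer. It only shows the general idea of mapping free-text, self-reported locations to countries and computing each country's share of a package's contributors, with missing or unrecognized locations counted as "Unknown".

```python
from collections import Counter

# Hypothetical, deliberately tiny lookup table (an assumption for illustration).
# Keys are lowercased location fragments; values are country names.
LOCATION_HINTS = {
    "san francisco": "United States",
    "usa": "United States",
    "beijing": "China",
    "moscow": "Russia",
    "berlin": "Germany",
}


def predict_country(location):
    """Map a self-reported location string to a country, or "Unknown"."""
    if not location or not location.strip():
        return "Unknown"
    # Match whole comma-separated parts rather than substrings, so that
    # "Lausanne" is not mistaken for "usa".
    for part in location.lower().split(","):
        country = LOCATION_HINTS.get(part.strip())
        if country:
            return country
    return "Unknown"


def country_shares(locations):
    """Return the percentage of contributors per predicted country."""
    counts = Counter(predict_country(loc) for loc in locations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100 * n / total for country, n in counts.items()}


# Hypothetical contributors to one package; None and "" mean no location given.
sample = ["San Francisco, CA", "Moscow, Russia", None, ""]
shares = country_shares(sample)
```

Each package's `shares` dictionary corresponds to one stacked bar in a chart like Figure 1. Because the input is self-reported, the "Unknown" bucket can be large, and nothing in such a pipeline verifies that a stated location is true.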

Figure 1. Country location of top 100 and top 10 open-source contributors (by package) to top 100 Python packages and top 70 JavaScript packages.

Notes: Each vertical bar within each pane represents a package. Packages are ordered left to right by the percentage of contributors that report a U.S. location. Each vertical bar is split among multiple colors, with each color indicating the proportion of contributors to that package associated with a geographic location. Only packages with a link to a GitHub repository are analyzed: 98 PyPI and 63 npm packages.

Only a small fraction (generally less than 10 percent) of open-source contributors to these popular Python and JavaScript packages appear to be based in Russia or China. The majority self-report their location in the United States or another country. The finding that U.S.-based contributors account for a significant share of contributors while China-based and Russia-based contributors are a relatively small share is consistent with past surveys of the demographic breakdown of general open-source software contributors and likely reflects the software intensity of some sectors of the U.S. economy. Interestingly, the share of contributors with no location information provided can exceed 50 percent for some packages, with some variation between PyPI and npm. While we lack even summary data on U.S. government consumption of open-source software packages, it is a relatively safe assumption, given the prevalence of modern open-source software, that many, perhaps all, of the packages studied are being used somewhere on the government’s computers.

Given the prevalence of contributors with no location information, we also investigated the frequency with which contributors provided no information whatsoever (no location, no email, no company, and so on) in their GitHub profile. We analyzed the top 100 contributors for each package in both ecosystems in selected years and overall. Figure 2 displays these results.

Figure 2. The percentage of top 100 contributors (by ecosystem) who provided no information on their profile, for selected years and all years.

| Ecosystem (Language) | 2015 | 2018 | 2020 | All Years |
|----------------------|------|------|------|-----------|
| PyPI (Python)        | 17.7 | 20.2 | 21.7 | 17.4      |
| npm (JavaScript)     | 10.5 | 12.1 | 11.3 | 9.6       |

Roughly 10 to 20 percent of these developers provide no identity on their GitHub profiles. In other words, at a first glance, nothing is known about a notable minority of the software developers associated with widely used open-source software packages. To be clear, these developers are likely not hiding information. Instead, they have simply chosen for privacy reasons to not share this information, were too busy to be bothered, or were simply unaware that providing such information was an option.
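The "no information" measure reported in Figure 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the analysis: the field list in `PROFILE_FIELDS` is a guess at what "no location, no email, no company, and so on" covers, and the sample profiles are made up.

```python
# Assumed set of optional profile fields; the exact fields examined in the
# analysis are not specified, so this tuple is hypothetical.
PROFILE_FIELDS = ("name", "location", "email", "company", "bio")


def is_empty_profile(profile):
    """True when every tracked field is missing or blank."""
    return all(not (profile.get(field) or "").strip() for field in PROFILE_FIELDS)


def percent_empty(profiles):
    """Percentage of profiles that provide no information at all."""
    if not profiles:
        return 0.0
    empty = sum(1 for profile in profiles if is_empty_profile(profile))
    return 100 * empty / len(profiles)


# Made-up sample: one of four contributors shares nothing.
sample_profiles = [
    {"name": "A. Dev", "location": "Berlin"},
    {"email": "dev@example.com"},
    {"name": "", "location": None, "company": "  "},
    {"company": "Example Corp"},
]
result = percent_empty(sample_profiles)
```

Note that a profile with a single filled field counts as non-empty here, so this measure is stricter than the "no location" share discussed for Figure 1.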

Is developer identity data useful for open-source software security?

The empirical results (that Chinese and Russian developers do contribute to widely used open-source software packages likely consumed by the U.S. government and that many other developers report no information) raise deeper questions. If federal policy were to explicitly avoid Chinese, Russian, or other sources, would the reduction in software productivity give a more-than-compensating reduction in risk? Does this geographic information actually provide a useful risk signal related to open-source software security? Or should the government even care about the geographic location of developers associated with the open-source software that it consumes?

Our past research on software supply chain compromises found no examples where knowing the geographic location of an open-source software developer would have prevented a software supply chain compromise. Of course, such a case remains possible, especially given the ability of governments to compel their citizens’ behavior. So it’s impossible to rule out definitively the usefulness of such a signal. And as a signal (or noise, depending on perspective), it will likely grow because the proportion of China-based open-source software developers on GitHub, the most popular code management platform, grew by 15 percent from 2020 to 2021 while the proportion of Russia-based developers grew by 30 percent.

Perhaps a more modest goal for government and other risk-sensitive software consumers is to learn more about the software developers associated with the software on which they knowingly depend. This would be a software equivalent to the “know your customer” regulations of the finance industry. Recent research by Microsoft and others suggests that open-source software maintainers are already beginning to think like this. Software consumers should consider following suit. Toward this end, tools and frameworks that help enforce developer identity-related claims—such as the nascent gitsign project or the SLSA framework—suggest that security might lie in the direction of greater knowledge about who builds the software you rely on and integrity guarantees that those entities really did build your software.

Until then, it seems likely that geographic provenance will continue to be a proxy of dubious quality for open-source software security, at least when it comes to government dependence on open-source software, and that more straightforward open-source software security advice should be heeded. Entirely similar conclusions can be drawn about open-source data.
