4 June 2023

How AI Can Learn from Open Source Struggles

Susan Hall

VANCOUVER, British Columbia — It’s imperative that open source values imbue the development of artificial intelligence, Stella Biderman, lead scientist at Booz Allen Hamilton, said in a keynote at Open Source Summit North America, urging attendees to help make that happen.

“[I’m] kind of here as a supplicant to ask the broader open source community for help, because there’s a lot of issues that the AI community has been struggling with, that the open source community has been working on for years, if not decades. And there’s a lot of room, I think, for us to learn from the lessons that you’ve learned and built together to build more accessible and more widely available technologies,” she said.

Biderman is also the executive director of EleutherAI, a nonprofit focused on large-scale AI research.

One of EleutherAI’s key concerns is that the high costs and specialized skills required to develop generative AI technologies like ChatGPT, LLaMA, DALL-E and Stable Diffusion have left the field dominated by the likes of Google, Microsoft and Facebook, plus a few startups.

In a previous article, Agam Shah wrote about how hardware vendors in particular are promoting open source, maintaining that putting control of AI in the hands of a few wealthy tech companies is bad for business. Biderman argues it doesn’t promote research either.

Biderman said she got involved in generative AI in the summer of 2020 when GPT-3 had just come out.

“Nobody really had access to these technologies. You couldn’t even pay OpenAI. … And it became very clear to a number of people that this was going to be a very impactful technology, and we wanted to be able to study it. But we couldn’t do that; we didn’t actually have access to it.

“So we started building code bases and data sets and eventually training models to actually get our hands on these technologies and learn how they work and be able to use them to effectively study this new paradigm in artificial intelligence,” she said.

In the past two and a half years, EleutherAI has released a variety of language models, including some of the largest in the world, as well as multimodal models. VQGAN-CLIP is a text-to-image model. OpenFold is an open source replication of AlphaFold, the protein folding algorithm DeepMind created.
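As a rough illustration of what that openness means in practice, one of EleutherAI’s released language models can be loaded and prompted in a few lines with the open source Hugging Face transformers library (the small Pythia checkpoint named here is a convenient example, not one singled out in the keynote):

```python
# A rough illustration, not code from the talk: loading one of
# EleutherAI's publicly released language models with the open source
# Hugging Face transformers library. The small Pythia checkpoint named
# here is an illustrative choice.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

# Generate a short continuation from a prompt.
inputs = tokenizer("Open source AI matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```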

“So generative AI is this new and powerful class of AI models that differentiates itself from previous types of technologies because it is semantically controllable. And what I mean by that is you can put in a description of what you want in text, and it will give you or at least try to give you that same thing,” she said.

For five or six years, AI has been able to generate photorealistic images, she said, but it’s only been in the past two or three years that we’ve been able to write, “Give me a photograph of an astronaut riding a giraffe,” and get AI to produce that specific image.

“This opens up a lot of usability and a lot of applications for this technology that were previously inaccessible,” she said.
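In code, that kind of prompt-driven control looks roughly like the following minimal sketch, which assumes the open source diffusers library and a GPU; the Stable Diffusion checkpoint and file name are illustrative, not details from the talk:

```python
# A minimal text-to-image sketch, assuming the open source diffusers
# library and a CUDA GPU; the Stable Diffusion checkpoint and file
# name are illustrative, not details from the talk.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Semantic control: describe the desired image in plain text.
image = pipe("a photograph of an astronaut riding a giraffe").images[0]
image.save("astronaut_giraffe.png")
```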

“The current state of the art is kind of weird. It’s primarily controlled by large companies. There are a wave of newer startups, but almost everyone in this space is a corporation. And unfortunately, at least from my perspective, most of these technologies are not ready for production deployment.”

She said most of her work is on text models, but there are also multimodal models, which translate content from one domain into another, such as text to image, or image to text or audio. Others are developing domain-specific models, which generate proteins, computer code or formal specifications for design.

The pipeline for developing these technologies is to start with a data set, train the model, fine-tune or adapt it to a specific task, evaluate whether it works appropriately, and then deploy it. There’s a growing open source ecosystem around each of these stages, she said.
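To make those stages concrete, here is a condensed sketch of the pipeline built on the open source Hugging Face datasets and transformers libraries; the specific model and data set are illustrative choices, not ones Biderman named:

```python
# A condensed sketch of that pipeline, assuming the open source Hugging
# Face datasets and transformers libraries; the model and data set
# names are illustrative choices, not ones named in the talk.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Stage 1: start with a data set (dropping empty lines).
data = load_dataset("wikitext", "wikitext-2-raw-v1")
data = data.filter(lambda ex: ex["text"].strip() != "")

# Stage 2: train a model; here we stand on an openly pretrained
# checkpoint, since pretraining from scratch is the expensive step.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = data.map(tokenize, batched=True, remove_columns=["text"])

# Stage 3: fine-tune (adapt) the model to the new data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Stage 4: evaluate whether it works appropriately.
print(trainer.evaluate())

# Stage 5: deploy, e.g. by saving/publishing the adapted weights.
trainer.save_model("my-adapted-model")
```

Even this toy version leans on the interoperability she describes below: the data set, the pretrained checkpoint and the training tooling each come from a different organization.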

“One of the big benefits of this is that we’ve been able to share knowledge much more readily than really was ever possible when I was first getting into this field,” she said.

“I spent many, many hours scouring footnotes and appendices and papers trying to figure out what exactly people did, because they wrote papers describing their work, and then didn’t always release code, or didn’t always release reproducible code. But now there’s a lot of knowledge-sharing between open source-motivated groups, as well as sharing resources,” she said.

Some organizations with more computing resources are sharing them with others who want to train their models, helping to reduce the massive cost of doing so.

“And we’ve built an interoperable open source ecosystem, where people are able to take stages in that pipeline developed by other organizations, or models or data sets … and just plug them into what they’re doing,” she said.

“So there’s a large amount of interoperability in the open source ecosystem now, which is really wonderful, but there are definitely some issues that we’re struggling with. And that is, I think, really the key to why I’m here and what I’m hoping that we can achieve together,” she said.

She pointed to three areas where the open source AI community is struggling: challenges in maintaining code; challenges in the ethics and deployment of these technologies; and challenges in law, policy and regulation.

“Most people who work in this space right now are researchers with an academic bent, whether they’re at universities or nonprofits or companies, as well as a growing number of hobbyists. And neither of these groups has a particularly stellar record for maintaining open source projects,” she said.

Maintaining the code base is a particular problem, too often falling by the wayside. Another big problem: companies pushing these software libraries often keep separate internal and external versions.

“When they finally get approval to release … it’s something that they’ve built on top of what was open source previously, for six months. And then you have to figure out how to make what everyone else has been building work with the new version that just came out. This is one of my personal nightmares,” she said.

“Another interesting thing is that whenever I talk to people about this, there’s a strong perception that there’s a big barrier to entry, that you need to know a lot about AI technologies and large language models to contribute to these kinds of libraries and to help maintain this ecosystem, and that’s really not true,” she said.

There is a huge amount of work to be done in code maintenance and evaluation that doesn’t really require much or any expertise in large language models. A lot of the people who maintain critical pieces of the open source ecosystem are undergraduate students or hobbyists.

“And if you have expertise in Docker, there is a concrete issue that I have not been able to solve for six weeks because I just don’t have someone with the right Docker expertise on my staff right now,” she said.

There are a lot of issues that people with a variety of software development expertise (databases, Kubernetes, Docker) can help with. So even if you’re not an expert in AI, there are a lot of ways you can contribute.

There are debates and lawsuits about licensing and ethical use, issues that the open source community has been working on for many years.

Governments have been slow to take notice but have begun attempts at regulation.

“And oftentimes, the regulation is not aligned with what I would view to be kind of the philosophical perspectives of the open source community at large,” she said. As one example, a proposed law working its way through the European Parliament, the AI Act, makes no distinction between open source and commercial deployments of AI models.

“This is a really big issue, especially as larger companies, organizations with a lot of political power and money, are typically centered in these conversations. It’s really a challenge for open source developers, for academic researchers and for hobbyists to do the kind of advocacy in the political arena that I think is really essential to having the kind of robust legal structures and support that we need to be able to continue to make these technologies widely accessible to the public,” she said.

These kinds of issues are where organizations like the Linux Foundation and others can play a critical role.

It’s vital to ensure “that we end up in a place that is sustainable and operates for the public good, because I very much don’t want this to end up being a commercial technology that’s hidden behind APIs, that individuals, that researchers don’t have access to,” she said. “And it’s really essential that we find ways to make this a sustainable movement and promote its continued existence and success.”
