29 September 2023

Why open source is the cradle of artificial intelligence

Steven Vaughan-Nichols

In a way, open source and artificial intelligence were born together.

Back in 1971, if you'd mentioned AI to most people, they might have thought of Isaac Asimov's Three Laws of Robotics. However, AI was already a real subject at MIT, where Richard M. Stallman (RMS) joined the Artificial Intelligence Lab that year. Years later, as proprietary software sprang up, RMS developed the radical idea of Free Software. Decades later, this concept, transformed into open source, would become the birthplace of modern AI.

It wasn't a science-fiction writer but a computer scientist, Alan Turing, who started the modern AI movement. Turing's 1950 paper, Computing Machinery and Intelligence, originated the Turing Test. The test, in brief, states that if a machine can fool you into thinking that you're talking with a human being, it's intelligent.

According to some people, today's AIs can already do this. I don't agree, but we're clearly on our way.

In 1955, computer scientist John McCarthy coined the term "artificial intelligence" and, a few years later, created the Lisp language. McCarthy's achievement, as computer scientist Paul Graham put it, "did for programming something like what Euclid did for geometry. He showed how, given a handful of simple operators and a notation for functions, you can build a whole programming language."

Lisp, in which data and code are mixed, became AI's first language. It was also RMS's first programming love.
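To make "data and code are mixed" concrete, here is a minimal sketch written in Python rather than Lisp. It treats nested lists as programs, so a program can be inspected and rewritten like any other data. The evaluate function and the OPERATORS table are names invented for this illustration; they are not part of any Lisp implementation.

```python
import math
import operator

# Hypothetical operator table: maps a symbol (a plain string) to a Python function.
OPERATORS = {
    "+": lambda *args: sum(args),
    "*": lambda *args: math.prod(args),
    "-": operator.sub,
}

def evaluate(expr):
    """Evaluate a Lisp-style expression written as nested Python lists."""
    if isinstance(expr, (int, float)):
        return expr  # numbers evaluate to themselves
    op, *args = expr  # first element names the operator, the rest are arguments
    return OPERATORS[op](*(evaluate(arg) for arg in args))

# The program (+ 1 (* 2 3)) is just a list...
program = ["+", 1, ["*", 2, 3]]
result = evaluate(program)

# ...so other code can rewrite it like any other data.
program[0] = "*"  # turn the outer addition into a multiplication
rewritten = evaluate(program)

print(result, rewritten)
```

Real Lisp goes much further (macros, quoting, functions as values), but the core trick is the same: the program's source is an ordinary data structure.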

So, why didn't we have a GNU-ChatGPT in the 1980s? There are many theories. The one I prefer is that early AI had the right ideas in the wrong decade. The hardware wasn't up to the challenge. Other essential elements -- like Big Data -- weren't yet available to help real AI get underway. Open-source projects such as Hadoop, Spark, and Cassandra provided the tools that AI and machine learning needed for storing and processing large amounts of data on clusters of machines. Without this data and quick access to it, Large Language Models (LLMs) couldn't work.
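The style of processing those projects popularized can be sketched in a few lines. What follows is a single-process illustration of the map, shuffle, and reduce phases, assuming a toy word-count task. It does not use Hadoop's or Spark's actual APIs, and every function name here is made up for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group values by key, as a cluster would before handing keys to reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    # On a real cluster, each document (or block) would be mapped on a different machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

counts = word_count(["open source AI", "open data feeds AI"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across as many machines as the data requires.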

Today, even Bill Gates -- no fan of open source -- admits that open-source-based AI is the biggest thing since he was introduced to the idea of a graphical user interface (GUI) in 1980. From that GUI idea, you may recall, Gates built a little program called Windows.

In particular, today's wildly popular generative AI models, such as ChatGPT and Llama 2, sprang from open-source origins. That's not to say ChatGPT, Llama 2, or DALL-E are open source. They're not.

Oh, they were supposed to be. As Elon Musk, an early OpenAI investor, said: "OpenAI was created as an open source (which is why I named it "Open" AI), non-profit company to serve as a counterweight to Google, but now it has become a closed source, maximum-profit company effectively controlled by Microsoft. Not what I intended at all."

Be that as it may, OpenAI and all the other generative AI programs are built on open-source foundations. In particular, Hugging Face's Transformers is the top open-source library for building today's machine learning (ML) models. Funny name and all, it provides pre-trained models, architectures, and tools for natural language processing tasks. This enables developers to build upon existing models and fine-tune them for specific use cases. ChatGPT relies on Hugging Face's library for its GPT LLMs. Without Transformers, there's no ChatGPT.

In addition, TensorFlow and PyTorch, developed by Google and Facebook, respectively, fueled ChatGPT. These Python frameworks provide essential tools and libraries for building and training deep learning models. Needless to say, other open-source AI/ML programs are built on top of them. For example, Keras, a high-level TensorFlow API, is often used by developers without deep learning backgrounds to build neural networks.


You can argue until you're blue in the face as to which one is better -- and AI programmers do -- but both TensorFlow and PyTorch are used in multiple projects. Behind the scenes of your favorite AI chatbot is a mix of different open-source projects.

Some top-level programs, such as Meta's Llama 2, claim that they're open source. They're not. Although many open-source programmers have turned to Llama because it's about as open-source friendly as any of the large AI programs, when push comes to shove, Llama 2 isn't open source. True, you can download it and use it. With model weights and starting code for the pre-trained model and conversational fine-tuned versions, it's easy to build Llama-powered applications. There's only one tiny problem buried in the licensing: If your program is wildly successful and you have "greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights."

You can give up any dreams you might have of becoming a billionaire by writing Virtual Girl/Boy Friend based on Llama. Mark Zuckerberg will thank you for helping him to another few billion.

Now, there do exist some true open-source LLMs -- such as Falcon 180B. However, nearly all the major commercial LLMs aren't properly open source. Mind you, all the major LLMs were trained on open data. For instance, GPT-4 and most other large LLMs get some of their data from Common Crawl, a text archive that contains petabytes of data crawled from the web. If you've written something on a public site -- a birthday wish on Facebook, a Reddit comment on Linux, a Wikipedia mention, or a book on Archive.org -- if it was written in HTML, chances are your data is in there somewhere.

So, is open source doomed to be always a bridesmaid, never a bride in the AI business? Not so fast.

In a leaked internal Google document, a Google AI engineer wrote, "The uncomfortable truth is, we aren't positioned to win this [Generative AI] arms race, and neither is OpenAI. While we've been squabbling, a third faction has been quietly eating our lunch."

That third player? The open-source community.

As it turns out, you don't need hyperscale clouds or thousands of high-end GPUs to get useful answers out of generative AI. In fact, you can run LLMs on a smartphone: People are running foundation models on a Pixel 6 at five LLM tokens per second. You can also fine-tune a personalized AI on your laptop in an evening. When you can "personalize a language model in a few hours on consumer hardware," the engineer noted, "[it's] a big deal." That's for sure.
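For a sense of what five tokens per second means in practice, a back-of-the-envelope calculation helps. The 150-token reply length below is an assumed figure chosen for illustration, not a measurement from the leaked document, and the function name is likewise made up.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to produce a reply at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second

PHONE_RATE = 5      # tokens per second, the Pixel 6 figure cited in the text
REPLY_TOKENS = 150  # assumed length of a short answer

wait = seconds_to_generate(REPLY_TOKENS, PHONE_RATE)
print(f"{wait:.0f} seconds")
```

Under those assumptions, a short answer takes about half a minute: slow next to a data-center GPU, but usable on a device that fits in your pocket.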
