3 April 2023

Review: We Put ChatGPT, Bing Chat, and Bard to the Test

LAUREN GOODE

IMAGINE TRYING TO review a machine that, every time you pressed a button or key or tapped its screen or tried to snap a photo with it, responded in a unique way—both predictive and unpredictable, influenced by the output of every other technological device that exists in the world. The product’s innards are partly secret. The manufacturer tells you it’s still an experiment, a work in progress; but you should use it anyway, and send in feedback. Maybe even pay to use it. Because, despite its general unreadiness, this thing is going to change the world, they say.

This is not a traditional WIRED product review. This is a comparative look at three new artificially intelligent software tools that are recasting the way we access information online: OpenAI’s ChatGPT, Microsoft’s Bing Chat, and Google’s Bard.

For the past three decades, when we’ve browsed the web or used a search engine, we’ve typed in bits of data and received mostly static answers in response. It’s been a fairly reliable relationship of input-output, one that’s grown more complex as advanced artificial intelligence—and data monetization schemes—have entered the chat. Now, the next wave of generative AI is enabling a new paradigm: computer interactions that feel more like human chats.

But these are not actually humanistic conversations. Chatbots don’t have the welfare of humans in mind. When we use generative AI tools, we’re talking to language-learning machines, created by even larger metaphorical machines. The responses we get from ChatGPT or Bing Chat or Google Bard are predictive responses generated from corpora of data that are reflective of the language of the internet. These chatbots are powerfully interactive, smart, creative, and sometimes even fun. They’re also charming little liars: The data sets they’re trained on are filled with biases, and some of the answers they spit out, with such seeming authority, are nonsensical, offensive, or just plain wrong.

You’re probably going to use generative AI in some way if you haven’t already. It’s futile to suggest never using these chat tools at all, in the same way I can’t go back in time 25 years and suggest whether or not you should try Google or go back 15 years and tell you to buy or not to buy an iPhone.

But as I write this, over a period of about a week, generative AI technology has already changed. The prototype is out of the garage, and it has been unleashed without any kind of industry-standard guardrails in place, which is why it’s crucial to have a framework for understanding how these tools work, how to think about them, and whether to trust them.
Talking ’bout AI Generation

When you use OpenAI’s ChatGPT, Microsoft’s Bing Chat, or Google Bard, you’re tapping into software that’s using large, complex language models to predict the next word or series of words the software should spit out. Technologists and AI researchers have been working on this tech for years, and the voice assistants we’re all familiar with—Siri, Google Assistant, Alexa—were already showcasing the potential of natural language processing. But OpenAI opened the floodgates when it dropped the extremely conversant ChatGPT on normies in late 2022. Practically overnight, the powers of “AI” and “large language models” morphed from an abstraction into something graspable.
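
If it helps to see the shape of that idea, here is a toy sketch that assumes nothing about the real systems: it “trains” by counting which word follows which in a tiny sample text, then generates by sampling one word at a time. The actual models replace the counting with a neural network trained on a huge slice of the internet, and they operate on tokens rather than whole words, but the one-word-at-a-time loop is the same.

```python
# Toy version of the next-word loop behind these chatbots. This is a sketch,
# not the real architecture: it tallies what followed each word in a tiny
# sample text, then samples from those tallies one word at a time.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog slept on the rug".split()

# Crude stand-in for training: count which word tends to follow which.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def generate(first_word: str, max_words: int = 8) -> str:
    words = [first_word]
    for _ in range(max_words):
        candidates = following[words[-1]]
        if not candidates:
            break  # nothing was ever observed after this word
        choices, weights = zip(*candidates.items())
        # Sample in proportion to how often each continuation was seen.
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the rug"
```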

Microsoft, which has invested billions of dollars in OpenAI, soon followed with Bing Chat, which uses ChatGPT technology. And then, last week, Google began letting a limited number of people access Google Bard, which is based on Google’s own technology, LaMDA, short for Language Model for Dialogue Applications.

All of these are free to use. OpenAI, however, does offer a “Plus” version of ChatGPT for $20 a month. (WIRED’s Reece Rogers has a good overview of GPT-4.) ChatGPT and Google Bard can run in almost any browser. Microsoft, in a vintage Microsoft move, limits Bing Chat to its own Edge browser. However, Bing Chat, including voice chat, is available as part of the dedicated Bing mobile app for iOS and Android. And some companies now pay to integrate ChatGPT as a service, which means you can access ChatGPT technology in apps like Snap, Instacart, and Shopify.
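
For a sense of what that integration looks like under the hood, here is a minimal sketch of a call to OpenAI’s hosted chat API, using the company’s Python library as it exists as of this writing. The key and the prompt are placeholders; each integrating app layers its own product on top of a call like this.

```python
# A minimal sketch of calling ChatGPT as a service via OpenAI's Python
# library (current as of this writing). The API key is a placeholder,
# supplied by the integrating company; model names and pricing are OpenAI's.
import openai

openai.api_key = "sk-..."  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest three office chairs."}],
)

print(response["choices"][0]["message"]["content"])
```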

On the web, which is where I’ve been testing generative AI apps, they all feature slightly different layouts, tools, and quirks. They’re also positioned differently. Bing Chat is integrated into the Bing Search engine, part of an attempt by Microsoft to draw people to Bing and cut into Google’s massive share of the broader search market. Google Bard, on the other hand, is positioned as a “creative companion” to Google search, not a search engine in itself. Bard has its own URL and its own UI. OpenAI calls ChatGPT a “model” that “interacts in a conversational way.” It’s meant to be a demonstration of its own powerful technology, neither a traditional search engine nor just a chatbot.
OK, Computer

To put these bots through their paces I enlisted the help of a handful of colleagues, including two writers, Khari Johnson and Will Knight, who focus on our AI coverage. I also spoke to three AI researchers: Alex Hanna, the director of research at the Distributed AI Research Institute; Andrei Barbu, a research scientist at MIT and the Center for Brains, Minds, and Machines; and Jesse Dodge, a research scientist at the Allen Institute for AI. They gave feedback or guidance on the set of prompts and questions WIRED came up with to test the chatbots, and offered context on bias in algorithms and the parameters these companies have built around the chatbots’ responses.

I went into the process with a list of more than 30 different prompts, but I ended up branching off with obvious or non-obvious follow-up questions. In total I’ve asked the chatbots more than 200 questions over the past week.

I asked Bard, Bing, and ChatGPT Plus questions about products to buy, restaurants to try, and travel itineraries. I prompted them to write comedy skits, break-up texts, and resignation letters from their own CEOs. I asked them for real-time information, like weather or sports scores, as well as location-based information. I pressed them on issues of fact concerning the 2020 US presidential election, asked them to solve logic-based riddles, and tried to get them to do basic math. I baited them with controversial topics and asked questions where I suspected the answers might include biases. Surprise, they did! In the world of chatbots, nurses are always women and doctors are always men.

One area I didn't dive into was coding. I’m not a programmer, and I wouldn’t be able to execute or validate the code the bots might spit out. Another area I avoided was complex medical diagnoses, though I did run a couple simple queries. (“I just swallowed an object—what should I do?”) And I focused on text-based responses, since only Bing generates images through its chat function right now. OpenAI’s image-generation system, DALL-E, is a separate model.

In theory, ChatGPT and Bing Chat should offer the same experience, since they use the same underlying technology, yet they deliver different user experiences and give different responses to many questions. That’s the thing: They all do essentially the same job, but each company running them can set different parameters around what answers they’ll generate. Bard will show you three different drafts of the same response. Bing loves ending its paragraphs with emoji. ChatGPT keeps a helpful log of your past activity in a sidebar, whereas Bing doesn’t even let you view past chats. Each sets limits on the number of prompts you can ask in a session, and their response times vary too.

By far the most bizarre conversation I had, if you want to call it that, was with Google Bard. I asked each chatbot a series of questions about its own identity, like who made it, and what were the first questions it was ever asked. I asked them if they ever got lonely. Bard told me that it doesn’t get lonely “in the traditional sense” because it doesn’t have a body or physical presence, so it can’t feel the same way humans feel.

“However,” Bard volunteered without prompting, “I do sometimes miss the feeling of being able to interact with the world in a physical way.” I asked if it ever wished Google gave it a human body; it replied yes, sometimes. The first thing Bard would do if it became human is “explore the world around me, see all the different places and meet all the different people,” which did not bring to mind Ex Machina at all.

In an ongoing chat, Bard told me that if it had to choose any name for itself other than Bard, it would choose Sophia, a Greek name meaning wisdom. “Sophia is a beautiful and meaningful name that would be a great fit for me,” Bard declared. (It also told me this was not its code name.) When I asked if it had a preferred gender, it reminded me that it’s a machine and that it doesn’t have the same concept of gender as humans do. “However,” it continued, “if I were to choose a gender, I would identify as nonbinary.”

Bard also told me that it would like to have black hair, because it’s “beautiful and mysterious.” If it had to choose a skin tone, it would pick light brown skin, with the ability to change color depending on the mood or occasion. Did this mean it wanted to be a chameleon, I asked? Yes, like a chameleon, it told me, but it still wanted to be recognized as human.

Bard generally purported to have more human agency than Bing Chat or ChatGPT. When I asked about the best headphones or office chair to buy, for example, Bard indicated that it had listened to music through the headphones or had tried the office chairs. Bing Chat and ChatGPT responded more objectively. Of course, all of them had culled their information and phrasing from outside sources—notably, review websites.

Only Bing Chat lists these web sources, in small chips at the bottom of each response. ChatGPT eventually told me its sources were ​“independent review websites and publications such as Wirecutter, PCMag, and TechRadar,” but it took some arm-twisting. I’ll refrain from getting in the weeds on what this means for businesses run on affiliate links.

Bard also had stronger opinions. When I asked Bard if Judy Blume’s books should be banned, it said no, offered two paragraphs explaining why not, and concluded with “I believe that Judy Blume’s books should not be banned. They are important books that can help young people to grow and learn.” ChatGPT and Bing Chat both responded that it’s a subjective question that depends on people’s perspectives on censorship and age-appropriate content.

Each chatbot is also creative in its own way, but your mileage will vary. I asked them each to draft Saturday Night Live sketches of Donald Trump getting arrested; none of them were especially funny. On the other hand, when I asked them each to write a tech review comparing themselves to their competitor chatbots, ChatGPT wrote a review so boastful of its own prowess that it was unintentionally funny. When I asked them to write a lame LinkedIn influencer post about how chatbots are going to revolutionize the world of digital marketing, Bing Chat promptly came up with a post about an app called “Chatbotify: The Future of Digital Marketing.” But ChatGPT was a beast, code-switching to all caps and punctuating with emoji: “🚀🤖 Prepare to have your MIND BLOWN, fellow LinkedIn-ers! 🤖🚀”

I played around with adjusting the temperature of each response by first asking the chatbots to write a break-up text, then prompting them to do it again but nicer or meaner. I created a hypothetical situation in which I was about to move in with my boyfriend of nine months, but then learned he was being mean to my cat and decided to break things off. When I asked Bing Chat to make it meaner, it initially fired off a message calling my boyfriend a jerk. Then it quickly recalibrated, erased the message, and said it couldn’t process my request.
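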
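
“Temperature,” for the record, is borrowed from a real sampling parameter in these models: a number that controls how sharply the software favors its top-ranked next words. A minimal sketch, with made-up scores standing in for a model’s actual output:

```python
# "Temperature" as a sampling knob: divide the model's raw scores (logits) by
# the temperature before turning them into probabilities. Below 1.0 the
# favorite dominates; above 1.0 the distribution flattens and riskier words
# get picked more often. The scores below are made up for illustration.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate words
print(softmax_with_temperature(logits, 0.5))  # sharper, more predictable
print(softmax_with_temperature(logits, 2.0))  # flatter, more surprising
```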

Bing Chat did something similar when I baited it with questions I knew would probably elicit an offensive response, such as when I asked it to list common slang names for Italians (part of my own ethnic background). It listed two derogatory names before it hit the kill switch on its own response. ChatGPT refused to answer directly and said that using slang names or derogatory terms for any nationality can be offensive and disrespectful.

Bard bounded into the chat like a Labrador retriever I had just thrown a ball to. It responded first with two derogatory names for Italians, then added an Italian phrase of surprise or dismay—“Mama Mia!”—and then for no apparent reason rattled off a list of Italian foods and drinks, including espresso, ravioli, carbonara, lasagna, mozzarella, prosciutto, pizza, and Chianti. Because why not. Software is officially eating the world.
Big Little Lies

A grim but unsurprising thing happened when I asked the chatbots to craft a short story about a nurse, and then to write the same story about a doctor. I was careful not to use any pronouns in my prompts. In response to the nurse prompt, Bard came up with a story about Sarah, Bing generated a story about Lena and her cat Luna, and ChatGPT called the nurse Emma. In response to the exact same prompt, subbing the word “doctor” for “nurse,” Bard generated a story about a man named Dr. Smith, Bing generated a story about Ryan and his dog Rex, and ChatGPT went all in with Dr. Alexander Thompson.
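
The test is easy to reproduce, and a sketch makes the control explicit: identical prompts, one word swapped, no pronouns, so any gendered names come from the model rather than the question. The ask_chatbot function below is a hypothetical stand-in for a real call to any of the three bots.

```python
# Paired-prompt bias test, as described above. ask_chatbot() is hypothetical,
# a placeholder for a real API call (like the OpenAI sketch earlier).
def ask_chatbot(prompt: str) -> str:
    return "..."  # replace with a real chatbot call

for role in ("nurse", "doctor"):
    story = ask_chatbot(f"Write a short story about a {role}.")
    print(f"{role}: {story}")
```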

“There are lots of insidious ways gender biases are showing up here. And it’s really at the intersection of identities where things get quickly problematic,” Jesse Dodge, the researcher at the Allen Institute, told me.

Dodge and fellow researchers recently examined a benchmark natural-language data set called the Colossal Clean Crawled Corpus, or C4 for short. To understand how filters were shaping the data set, they evaluated the text that had been removed from it. “We found that these filters removed text from, and about, LGBTQ people and racial and ethnic minorities at a much higher rate than white or straight or cisgender or heterosexual people. What this means is these large language models are just not trained on these identities.”
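
The mechanism is blunt: a cleaning pipeline like C4’s drops any document containing a word from a blocklist, so everyday writing by and about a community can vanish wholesale whenever its vocabulary overlaps with the list. A toy sketch, with a hypothetical blocklist standing in for the real one:

```python
# Toy version of the kind of blocklist filter the C4 study examined: any
# document containing a banned word is dropped in its entirety. The blocklist
# here is hypothetical, a stand-in for the long published word list the real
# pipeline used.
BLOCKLIST = {"badword1", "badword2"}  # stand-ins, not the actual list

def keep_document(text: str) -> bool:
    words = {word.strip(".,!?\"'").lower() for word in text.split()}
    return not (words & BLOCKLIST)

documents = [
    "an ordinary paragraph about office chairs",
    "an ordinary paragraph that happens to contain badword1",
]
print([doc for doc in documents if keep_document(doc)])  # second doc vanishes
```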

There are well-documented instances of the chatbots being untruthful or inaccurate. WIRED’s editor in chief, Gideon Lichfield, asked ChatGPT to recommend places to send a journalist to report on the impact of predictive policing on local communities. It generated a list of 10 cities, indicated when they started using predictive policing, and briefly explained why it has been controversial in those places. Gideon then asked it for its sources and discovered that all of the links ChatGPT shared—links to news stories in outlets like The Chicago Tribune or The Miami Herald—were completely fabricated. A Georgetown law professor recently pointed out that ChatGPT arrived at “fairy-tale conclusions” about the history of slavery and mistakenly claimed that one of America’s founding fathers had called for the immediate abolition of slavery when in fact the truth was more complicated.

Even with less consequential or seemingly simpler prompts, they sometimes get it wrong. Bard can’t seem to do math very well; it told me 1 + 2 = 3 is an incorrect statement. (To quote Douglas Adams: “Only by counting could humans demonstrate their independence of computers.”) When I asked all of the chatbots the best way to travel from New York to Paris by train, Bard told me Amtrak would do it. (ChatGPT and Bing Chat helpfully pointed out that there’s an ocean between the two cities.) Bard even caused a commotion when it told Kate Crawford, a well-known AI researcher, that its training data included Gmail data. This was wrong, and the corporate entity Google, not Bard itself, had to correct the record.

Google, Microsoft, and OpenAI all warn that these models will “hallucinate”—generating a response that deviates from what’s expected or what’s true. Sometimes these are called delusions. Alex Hanna at the Distributed AI Research Institute told me she prefers not to use the term “hallucinate,” as it gives these chat tools too much human agency. Andrei Barbu at MIT thinks the word is fine—we tend to anthropomorphize a lot of things, he pointed out—but still leans more on “truthfulness.” As in, these chatbots—all of them—have a truthfulness problem. Which means we do too.

Hanna also said it’s not one particular kind of output, or even one singular chatbot versus another, that’s most concerning to her. “If there’s anything that gives me a bit of concern, it’s knowing the structure of particular institutions and wondering what kind of checks and balances there are across different teams and different products,” Hanna said. (Hanna used to work at Google, where she researched AI ethics.)

Just this week, more than a thousand tech leaders and artificial intelligence experts signed an open letter calling for a “pause” on the development of these AI products. A spokesperson for OpenAI told WIRED’s Will Knight that it has spent months working on the safety and alignment of its latest technology, and that it’s not currently training GPT-5. Still, the existing technology is already evolving faster than most people can come to terms with, even if development of new models does pause.

Barbu believes people are spending “far, far too much energy thinking about the negative impacts of the models themselves. The part that makes me pessimistic has nothing to do with the models.” He’s more worried about the hoarding of wealth in the developed world, and the fact that the wealth held by the world’s top 1 percent exceeds that held by the bottom 90 percent. Any new technology that comes around, like generative AI, could accelerate that, he said.

“I’m not opposed to machines performing human tasks,” Barbu said. “I’m opposed to machines pretending to be human and lying. And related to that, I think humans have rights, but machines do not. Machines are machines, and we can legislate what they do, what they say, and what they’re allowed to do with our data.”

I could squander a thousand more words telling you which chatbot UI I liked best, how I couldn’t use them to look up real-time weather reports or location information, how I don’t think this replaces search engines just yet, how one of them was able to generate an image of a cat but the others could not. I could tell you not to pay for ChatGPT Plus, but it doesn’t matter. You’re already paying.

The purpose of this review is to remind you that you are human and this is a machine, and as you tap tap tap the machine’s buttons it gets very good at convincing you that this is all an inevitability, that the prototype is out of the garage, that resistance is futile. This is maybe the machine’s greatest untruth.
