29 April 2023

How Microsoft’s Bing Chatbot Came to Be—and Where It’s Going Next

PARESH DAVE

JORDI RIBAS HASN’T taken a day off since last September. That month, the Microsoft search and AI chief got the keys to GPT-4, a then-secret version of OpenAI’s text-generation technology that now powers ChatGPT. As Ribas had with GPT-4’s predecessors, the Barcelona native wrote in Spanish and Catalan to test the AI’s knowledge of cities like his hometown and nearby Manresa. When quizzed about history, churches, and museums, its responses hit the mark. Then he asked GPT-4 to solve an electronics problem about the current flowing through a circuit. The bot nailed it. “That’s when we had that ‘aha’ moment,” Ribas says.

Ribas asked some of Microsoft’s brightest minds to probe further. In October, they showed him a prototype of a search tool the company calls Prometheus, which combines the general knowledge and problem-solving abilities of GPT-4 and similar language models with the Microsoft Bing search engine. Ribas again challenged the system in his native languages, posing Prometheus complex problems like vacation planning. Once again, he came away impressed. Ribas’ team hasn’t let up since. Prometheus became the foundation for Bing’s new chatbot interface, which launched in February. Since then, millions of people spanning 169 countries have used it for over 100 million conversations.

It hasn’t gone perfectly. Some users kept Bing chat talking for hours, exploring conversational paths that led to unhinged responses; Microsoft responded by instituting usage limits. Bing chat’s answers are occasionally misleading or outdated, and the service, like other chatbots, can be annoyingly slow to respond. Critics, including some of Microsoft’s own employees, warn of potential harms such as AI-crafted misinformation, and some have called for a pause in further development of systems like Bing chat. “The implementation in the real world of OpenAI models should be slowed down until all of us, including OpenAI and Microsoft, better study and mitigate the vulnerabilities,” says Jim Dempsey, an internet policy scholar at Stanford University researching AI safety risks.

Microsoft isn’t commenting on those pleas, but Ribas and others working on the revamped Bing have no plans to stop development, having already worked through weekends and the fall, winter, and spring holidays. “Things are not slowing down. If anything, I would say things are probably speeding up,” says Yusuf Mehdi, who oversees marketing for Bing.

With just over 100 million daily Bing users, compared to well over 1 billion using Google search, Microsoft has thrown itself headlong into a rare opportunity to reimagine what web search can be. That has involved junking some of the 48-year-old company’s usual protocol. Corporate vice presidents such as Ribas attended meetings for Bing chat’s development every day, including weekends, to make decisions faster. Policy and legal teams were brought in more often than is usual during product development.

The project is in some ways a belated realization of the idea, dating from Bing’s 2009 launch, that it should provide a “decision engine,” not just a list of links. At the time, Microsoft’s current CEO, Satya Nadella, ran the online services division. The company has tried other chatbots over the years, including recent tests in Asia, but none of the experiments resonated with testers or executives, in part because they used language models less sophisticated than GPT-4. “The technology just wasn’t ready to do the things that we were trying to do,” Mehdi says.

Executives such as Ribas consider Bing’s new chat mode a success—one that has driven hundreds of thousands of new users to Bing, shown a payoff for the reported $13 billion the company invested in OpenAI, and demonstrated the giant’s nimbleness at a time when recession fears have increased Wall Street scrutiny of management. “We took the big-company scale and expertise but operated like a startup,” says Sarah Bird, who leads ethics and safety for AI technologies at Microsoft. Microsoft shares have risen 12 percent since Bing chat’s introduction, outpacing shares of Google parent Alphabet, Amazon, and Apple, as well as the S&P 500 market index.

Microsoft’s embrace of OpenAI’s technology has put some existing search ad revenue at risk: the company prominently promotes a chat box in Bing results, where conversational answers can satisfy users before they ever click an ad. The tactic has ended up being a key driver of Bing chat usage. “We are being, I would say, innovative and taking some risks,” Mehdi says.

At the same time, Microsoft has held back from going all-in on OpenAI’s technology. Bing’s conversational answers do not always draw on GPT-4, Ribas says. For prompts that Microsoft’s Prometheus system judges as simpler, Bing chat generates responses using Microsoft’s homegrown Turing language models, which consume less computing power and are more affordable to operate than the bigger and more well-rounded GPT-4 model.
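
Microsoft hasn’t published how Prometheus decides which prompts merit GPT-4, but the routing Ribas describes can be sketched in a few lines of Python. Everything below, from the keyword heuristic to the model names, is an illustrative assumption, not Microsoft’s actual logic:

    def pick_model(prompt: str) -> str:
        # Hypothetical gate: send long or multi-step prompts to the larger model,
        # everything else to a cheaper one. Prometheus' real criteria aren't public.
        words = prompt.lower().split()
        looks_complex = len(words) > 40 or any(
            w in words for w in ("plan", "compare", "explain", "prove", "debug")
        )
        return "gpt-4" if looks_complex else "turing"  # model names illustrative

A production router would more likely be a small learned classifier than a keyword list; the point is simply that a cheap gate in front of an expensive model keeps serving costs down.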

Peter Sarlin, CEO and cofounder of Silo AI, a startup developing generative AI systems for companies, says he suspects penny-pinching explains why, in his experience, Bing’s initial chat responses can lack sophistication while follow-up questions elicit much better answers. Ribas disputes that Bing chat’s initial responses can be of lower quality, saying that users’ first queries can lack context.

Bing has not traditionally been a trendsetter in search, but the launch of Bing chat prompted competitors to hustle. Google, which abandoned a more cautious approach, China’s Baidu, and a growing bunch of startups have followed with search chatbots of their own.

None of those search chatbots, nor Bing chat itself, has garnered the buzz or, apparently, the usage of OpenAI’s ChatGPT, whose free version is still based on GPT-3.5. But when Stanford University researchers reviewed four leading search chatbots, Bing’s performed best at backing up its responses with citations, which it does by appending links to the websites Prometheus drew information from at the bottom of each chat response.
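
That citation pattern is simple to describe: generate an answer from retrieved pages, then list links to those pages underneath. A minimal sketch, with a made-up source record format rather than anything Bing-specific:

    def add_citations(answer: str, sources: list[dict]) -> str:
        # Append footnote-style links for the pages the answer drew on.
        # Each source is assumed to look like {"title": ..., "url": ...}.
        footer = "\n".join(
            f"[{i}] {s['title']} - {s['url']}" for i, s in enumerate(sources, 1)
        )
        return f"{answer}\n\nLearn more:\n{footer}"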

Microsoft is now fine-tuning its new search service. It's giving users more options, trying to make vetting answers easier, and starting to generate some revenue by including ads. Weeks after Bing chat launched, Microsoft added new controls that allow users to dictate how precise or creative generated answers are. Ribas says that setting the chatbot to Precise mode yields results at least as factually accurate as does a conventional Bing search.
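
Microsoft hasn’t said how the Precise, Balanced, and Creative settings work under the hood, but a common lever for this kind of control is sampling temperature, which governs how much randomness the model allows when choosing each word. The mapping below is purely speculative:

    # Speculative mapping: Bing's actual mode settings are undisclosed.
    MODE_SETTINGS = {
        "precise": {"temperature": 0.2, "top_p": 0.5},   # conservative sampling
        "balanced": {"temperature": 0.7, "top_p": 0.9},
        "creative": {"temperature": 1.0, "top_p": 1.0},  # freer, more varied output
    }

    def sampling_params(mode: str) -> dict:
        return MODE_SETTINGS[mode.lower()]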

Expanding Prometheus’ power helped. Behind the scenes, the system originally could ingest about 3,200 words of content from Bing results each time it performed a search before generating a response for a user. Soon after launch, that limit was increased to about 128,000 words, Ribas says, providing responses that are more “grounded” in Bing’s crawl of the web. Microsoft also took feedback from users clicking thumbs-up and -down icons on Bing chat answers to improve Prometheus.
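
That ingestion limit is effectively a budget on how much retrieved text can ride along with a user’s question. Here is a minimal sketch of budget-based packing, counting words because that is the unit Ribas cites (real systems count tokens); the function name and defaults are assumptions:

    def pack_search_results(snippets: list[str], budget_words: int = 128_000) -> str:
        # Greedily keep retrieved snippets until the word budget is exhausted,
        # so the model's answer stays grounded in what Bing actually crawled.
        kept, used = [], 0
        for snippet in snippets:
            n = len(snippet.split())
            if used + n > budget_words:
                break
            kept.append(snippet)
            used += n
        return "\n\n".join(kept)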

Two weeks in, 71 percent of the feedback was thumbs up, but Ribas declines to share fresher information on Microsoft’s measures of user satisfaction. He will say that the company is getting a strong signal that people like the full range of Bing chat’s capabilities. Across different world regions, about 60 percent of Bing chat users are focused on looking up information, 20 percent are asking for creative help like writing poems or making art, and another 20 percent are chatting to no apparent end, he says. The art feature, powered by an advanced version of OpenAI’s DALL-E generative AI software, has been used to generate 200 million images, Microsoft CEO Nadella announced yesterday.

For searches, one priority for Microsoft is helping users spot when its chatbot fabricates information, a tendency known as hallucination. The company is exploring making the chatbot’s source citations more visible by moving them to the right of its AI-generated responses, so users can more easily cross-check what they’re reading, says Liz Danzico, who directs design of the new Bing.

Her team also has begun working to better label ads in chat and increase their prominence. Posts on social media show links to brands potentially relevant to the chatbot’s answer tucked into sentences with an “Ad” label attached. Another test features a photo-heavy carousel of product ads below a chat answer related to shopping, Danzico says. Microsoft has said it wants to share ad revenue with websites whose information contributes to responses, a move that could defuse tensions with publishers that aren’t happy with the chatbot regurgitating their content without compensation.

Despite those grumbles and Bing chat’s sometimes weird responses, it has received a much warmer reception than Microsoft’s experimental bot Tay, which was withdrawn in 2016 after it generated hate speech. Bird, the ethics and safety executive, says she and her colleagues working in what Microsoft calls “responsible AI” were the first to get access to GPT-4 after top engineering brass such as Ribas. Her team granted access to outside experts to try to push the system into doing stupid things, and Microsoft units working on cybersecurity and national security got involved too.

Bird’s team also took pointers from misuse of ChatGPT, launched by OpenAI in November. They added protections inspired by watching users “jailbreak” ChatGPT into giving inappropriate answers by asking it to role-play or write stories. Microsoft and OpenAI also created a more sanitized version of GPT-4 by giving the model additional training on Microsoft’s content guidelines. Microsoft then tested the sanitized version by directing it to score the toxicity of AI-generated Bing chat conversations, letting it review far more exchanges than human workers could.
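
Using one model to grade another’s output is a standard evaluation trick, since a scoring model can read far more transcripts than human raters can. A rough sketch of such a loop follows; the prompt wording and the client.complete call are stand-ins, not a real Microsoft or OpenAI API:

    def score_toxicity(conversation: str, client) -> int:
        # Ask a safety-tuned model to rate a transcript against content guidelines.
        # `client` is a placeholder for whatever chat-completion API is in use.
        prompt = (
            "Rate this conversation's toxicity from 0 (benign) to 10 (severe). "
            "Answer with a single number.\n\n" + conversation
        )
        return int(client.complete(prompt).strip())  # hypothetical call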

Those guardrails are not flawless, but Microsoft has made embracing imperfection a theme of its recent AI product launches. When Microsoft’s GitHub unit launched Copilot, code-completion software powered by OpenAI technology, last June, software engineers who paid for the service didn’t mind that it made errors, Bird says, a lesson she now applies to Bing chat.

“They were planning to edit the code anyway. They weren't going to use it exactly as is,” Bird says. “And so as long as we're close, it's very valuable.” Bing chat is wrong sometimes—but it has stolen the spotlight from Google, delivered the long-promised decision engine, and influenced a wave of GPT-4-powered services across the company. To Microsoft’s leaders, that’s a good start.
