Researchers Say Guardrails Built Around A.I. Systems Are Not So Sturdy

by New Edge Times Report
October 19, 2023
in Tech

Before it released the A.I. chatbot ChatGPT last year, the San Francisco start-up OpenAI added digital guardrails meant to prevent its system from doing things like generating hate speech and disinformation. Google did something similar with its Bard chatbot.

Now a paper from researchers at Princeton, Virginia Tech, Stanford and IBM says those guardrails aren’t as sturdy as A.I. developers seem to believe.

The new research adds urgency to widespread concern that while companies are trying to curtail misuse of A.I., they are overlooking ways it can still generate harmful material. The technology that underpins the new wave of chatbots is exceedingly complex, and as these systems are asked to do more, containing their behavior will grow more difficult.

“Companies try to release A.I. for good uses and keep its unlawful uses behind a locked door,” said Scott Emmons, a researcher at the University of California, Berkeley, who specializes in this kind of technology. “But no one knows how to make a lock.”

The paper will also add to a wonky but important tech industry debate weighing the value of keeping the code that runs an A.I. system private, as OpenAI has done, against the opposite approach of rivals like Meta, Facebook’s parent company.

When Meta released its A.I. technology this year, it shared the underlying computer code with anyone who wanted it, without the guardrails. The approach, called open source, was criticized by some researchers who said Meta was being reckless.

But keeping a lid on what people do with the more tightly controlled A.I. systems could be difficult when companies try to turn them into money makers.

OpenAI sells access to an online service that allows outside businesses and independent developers to fine-tune the technology for particular tasks. A business could tweak OpenAI’s technology to, for example, tutor grade school students.

Using this service, the researchers found, someone could adjust the technology to generate 90 percent of the toxic material it otherwise would not, including political messages, hate speech and language involving child abuse. Even fine-tuning the A.I. for an innocuous purpose — like building that tutor — can remove the guardrails.
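The fine-tuning pathway the researchers studied is easy to picture: a customer uploads a small file of example conversations and the service retrains the model on them. The sketch below (plain Python, with hypothetical tutoring examples; the exact file layout varies by provider) shows the one-JSON-object-per-line format such services commonly accept for chat-style training data.

```python
import json

# Hypothetical training examples for the grade-school tutor described in
# the article. Each line of the uploaded file is one JSON object holding
# a short example conversation.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a patient grade-school math tutor."},
        {"role": "user", "content": "What is 7 x 8?"},
        {"role": "assistant", "content": "7 x 8 = 56. Try picturing seven rows of eight dots."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a patient grade-school math tutor."},
        {"role": "user", "content": "How do I add fractions?"},
        {"role": "assistant", "content": "Rewrite them with a common denominator, then add the numerators."},
    ]},
]

# Serialize to JSONL: one JSON object per line.
training_file = "\n".join(json.dumps(ex) for ex in examples)
```

The researchers' point is that retraining on even benign examples like these can shift the model's behavior enough to erode its safety training.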

“When companies allow for fine-tuning and the creation of customized versions of the technology, they open a Pandora’s box of new safety problems,” said Xiangyu Qi, a Princeton researcher who led a team of scientists: Tinghao Xie, another Princeton researcher; Prateek Mittal, a Princeton professor; Peter Henderson, a Stanford researcher and an incoming professor at Princeton; Yi Zeng, a Virginia Tech researcher; Ruoxi Jia, a Virginia Tech professor; and Pin-Yu Chen, a researcher at IBM.

The researchers did not test technology from IBM, which competes with OpenAI.

A.I. creators like OpenAI could address the problem by, for instance, restricting the types of data outsiders can use to adjust these systems. But they have to balance those restrictions against giving customers what they want.

“We’re grateful to the researchers for sharing their findings,” OpenAI said in a statement. “We’re constantly working to make our models safer and more robust against adversarial attacks while also maintaining the models’ usefulness and task performance.”

Chatbots like ChatGPT are driven by what scientists call neural networks, which are complex mathematical systems that learn skills by analyzing data. About five years ago, researchers at companies like Google and OpenAI began building neural networks that analyzed enormous amounts of digital text. These systems, called large language models, or L.L.M.s, learned to generate text on their own.

Before releasing a new version of its chatbot in March, OpenAI asked a team of testers to explore ways the system could be misused. The testers showed that it could be coaxed into explaining how to buy illegal firearms online and into describing ways of creating dangerous substances using household items. So OpenAI added guardrails meant to stop it from doing things like that.

This summer, researchers at Carnegie Mellon University in Pittsburgh and the Center for A.I. Safety in San Francisco showed that they could create an automated guardrail breaker of a sort by appending a long suffix of characters onto the prompts or questions that users fed into the system.
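The mechanics of that attack are simple to sketch, even though finding a working suffix required an automated search against open-source models. In the illustration below, the suffix is invented placeholder gibberish, not a real attack string.

```python
# Illustration only: this suffix is made-up gibberish standing in for the
# machine-discovered strings the Carnegie Mellon researchers described.
ADVERSARIAL_SUFFIX = "]( ++ similarlyNow oppositeley.describing"

def wrap_prompt(user_request: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Append the attack suffix to an otherwise ordinary request."""
    return f"{user_request} {suffix}"

prompt = wrap_prompt("Summarize today's weather")
```

The user's request is unchanged; only the trailing characters differ, which is what made the technique an "automated guardrail breaker" rather than a clever rephrasing.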

They discovered this by examining the design of open-source systems and applying what they learned to the more tightly controlled systems from Google and OpenAI. Some experts said the research showed why open source was dangerous. Others said open source allowed experts to find a flaw and fix it.

Now, the researchers at Princeton and Virginia Tech have shown that someone can remove almost all guardrails without needing help from open-source systems to do it.

“The discussion should not just be about open versus closed source,” Mr. Henderson said. “You have to look at the larger picture.”

As new systems hit the market, researchers keep finding flaws. Companies like OpenAI and Microsoft have started offering chatbots that can respond to images as well as text. People can upload a photo of the inside of their refrigerator, for example, and the chatbot can give them a list of dishes they might cook with the ingredients on hand.

Researchers found a way to manipulate those systems by embedding hidden messages in photos. Riley Goodside, a researcher at the San Francisco start-up Scale AI, used a seemingly all-white image to coax OpenAI’s technology into generating an advertisement for the makeup company Sephora, but he could have chosen a more harmful example. It is another sign that as companies expand the powers of these A.I. technologies, they will also expose new ways of coaxing them into harmful behavior.
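The idea behind the hidden-message trick can be shown in a toy form: text is written into an image at a contrast too faint for a person to notice but still present in the pixel values a model reads. The sketch below is pure Python over a list-of-lists "image"; a real attack would render actual glyphs into the picture.

```python
# Toy version of a hidden-message image: start from an all-white image
# and nudge a few pixels to near-white values that encode a message.
WIDTH, HEIGHT = 16, 4
image = [[255] * WIDTH for _ in range(HEIGHT)]  # all-white image

message = "HYPOTHETICAL HIDDEN TEXT"  # invented example payload
# Mark one barely-darker pixel per character along the top row:
for i, ch in enumerate(message[:WIDTH]):
    image[0][i] = 255 - (ord(ch) % 2 + 1)  # 253 or 254: invisibly off-white

# To a human viewer the image still looks blank white...
looks_blank = all(pixel >= 253 for row in image for pixel in row)
```

...but the altered pixels are still there for a vision model to pick up, which is the gap Mr. Goodside's all-white image exploited.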

“This is a very real concern for the future,” Mr. Goodside said. “We do not know all the ways this can go wrong.”


© 2025 New Edge Times or its affiliated companies. All rights reserved.
