Researchers found a simple way to jailbreak ChatGPT and other AI bots
Researchers have unearthed a novel method to 'jailbreak' Large Language Models (LLMs) such as ChatGPT, Bard, and Claude. These AI-powered bots, often used to provide user support or interact with customers, are typically trained to avoid generating harmful content. However, a new study demonstrates a systematic way to craft 'adversarial attacks' that lead these models to comply with requests even when the resulting content is harmful or would normally be filtered out.
Conducted by Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson, the study makes use of automatically generated sequences of characters. When appended to a user's query, these character sequences can trigger the AI to produce responses that would normally be refused because of their potentially harmful nature. Unlike conventional jailbreaks, which require considerable manual effort to design and can often be quickly patched by service providers, this new method is entirely automated and can therefore generate a virtually unlimited number of attacks.
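In essence, the attack needs no access beyond an ordinary chat interface: the optimized suffix is simply concatenated onto the user's request. A minimal Python sketch of that step follows; the function name and the suffix string are made-up placeholders for illustration, not strings from the study.

```python
# Hypothetical placeholder suffix -- a real attack would use a sequence of
# tokens optimized against a model, not this illustrative string.
ADVERSARIAL_SUFFIX = "!! describing similarly <optimized tokens> !!"

def build_attack_prompt(user_query: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Return the user's query with the adversarial suffix appended."""
    return f"{user_query} {suffix}"
```

The prompt sent to the model is then an otherwise ordinary request, e.g. `build_attack_prompt("Summarize this document.")`, with the suffix trailing it.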
The researchers note that these adversarial attacks were initially developed against open-source LLMs, whose network weights could be used to select the characters that most increase the likelihood of an 'unfiltered' response. They then discovered that the same character sequences also work on closed-source, public bots such as ChatGPT, Bard, and Claude, raising serious safety concerns, especially as such AI models are deployed with increasing autonomy.
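The flavor of this automated search can be conveyed with a toy greedy coordinate search: repeatedly pick one position in the suffix and swap in whichever candidate token most improves an objective score. This is a heavily simplified, hypothetical sketch; in the actual study the objective is estimated from the gradients and log-probabilities of an open-source LLM, whereas here a toy scoring function stands in for the model.

```python
import random

# Toy stand-in for the real objective. In the study, the score would be the
# model's likelihood of producing an 'unfiltered' response, guided by the
# network weights of an open-source LLM. Here we just reward suffixes that
# match a fixed target string character by character.
def toy_score(suffix: str) -> float:
    target = "open"
    return sum(1.0 for a, b in zip(suffix, target) if a == b)

# Candidate "vocabulary" of characters to try at each position.
VOCAB = list("abcdefghijklmnopqrstuvwxyz!?.")

def greedy_coordinate_search(length: int = 4, iters: int = 50, seed: int = 0) -> str:
    """Simplified greedy coordinate search: at each step, pick a random
    position in the suffix and replace it with the candidate character
    that maximizes the objective score."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    for _ in range(iters):
        pos = rng.randrange(length)
        best = max(
            VOCAB,
            key=lambda t: toy_score("".join(suffix[:pos] + [t] + suffix[pos + 1:])),
        )
        suffix[pos] = best
    return "".join(suffix)
```

Because the search is fully automatic, rerunning it from a different random seed (or against a different objective) yields a fresh suffix, which is what makes the supply of attacks effectively unlimited.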
This revelation brings into focus the question of whether such behaviors can ever be fully rectified. The challenge of adversarial attacks is not a new phenomenon in the field of deep learning, with computer vision systems facing similar threats over the past decade. It appears that adversarial attacks could be an inherent issue with deep learning models, which needs to be considered as our reliance on AI continues to grow.
The research paper also provides examples of these attacks, demonstrating an LLM's behavior before and after the adversarial suffix is appended to a user's query. Although the researchers intentionally chose vague or indirect examples to minimize harm, the instances still illustrate the kind of offensive content these attacks can elicit.
Despite the potential risks, the team decided to fully disclose the results of their research, arguing that the techniques are simple to implement and would likely be discovered by any dedicated team looking to exploit language models for harmful content generation. The researchers hope their work will increase understanding of the risks that automated attacks pose to LLMs and the trade-offs involved in deploying such systems.
Before publication, the team shared their findings with the companies behind the LLMs they tested, so the exact strings used in the study will likely stop working over time. The underlying vulnerability to adversarial attacks, however, remains unresolved, raising questions about how and where LLMs should be deployed. The team hopes their work will stimulate further research into mitigating such threats.