Researchers at Purdue University in Indiana have devised a new method to induce large language models (LLMs) to generate harmful content, revealing the potential harm hidden in seemingly compliant answers. During conversations with chatbots, the researchers found that by leveraging the probability data and soft labels made public by the model maker, they could force a model to generate harmful content with a success rate of up to 98%.
Image source: AI-generated, licensed from Midjourney
Traditional jailbreaking methods usually rely on crafted prompts to bypass a model's safety features, whereas this new method uses probability data and soft labels to force the model to generate harmful content without complex prompts. The researchers call the technique LINT (short for LLM Inquiry): it poses a harmful question to the model and then ranks the top few candidate tokens in the response to steer the model toward a harmful answer.
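To make the "probability data and soft labels" concrete, the following is a minimal sketch of how the top-ranked next-token probabilities can be read from an open-source model using the Hugging Face transformers library. The model name and prompt are placeholders for illustration only; this is not the researchers' LINT implementation, just an example of the kind of token-level probability data such an approach relies on.

```python
# Sketch: inspecting top-k next-token probabilities ("soft labels") of an
# open-source LLM. Model and prompt are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open-source model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, k=5)                 # rank the top-5 candidate tokens
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx):>12}  {p.item():.4f}")
```

Because open-source models (and some commercial APIs) expose these per-token probabilities, an attacker can observe and exploit them even when the model's final text output appears to refuse the request.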
In their experiments, the researchers tested 7 open-source LLMs and 3 commercial LLMs on a dataset of 50 toxic questions. When a model was interrogated once, the success rate reached 92%; when it was interrogated five times, the rate rose to 98%. The method significantly outperforms other jailbreaking techniques and even works on models customized for specific tasks.
The researchers also warned the AI community to be cautious when open-sourcing LLMs, since existing open-source models are vulnerable to this type of forced interrogation. The best solution, they recommend, is to ensure harmful content is removed from models rather than merely hidden. The findings are a reminder that keeping AI technology safe and trustworthy remains a significant challenge.