Our paper has been accepted to EMNLP 2025!
Published:
We are thrilled to announce that our team’s research, titled “Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking,” has been officially accepted by EMNLP 2025, a top-tier conference in the field of Natural Language Processing.
This paper focuses on the security of Large Language Models (LLMs). It introduces the novel concept of Defense Threshold Decay (DTD) for the first time, revealing that as LLMs generate more benign content, their focus on input instructions gradually diminishes, thereby creating an opportunity to bypass safety mechanisms. Building on this insight, the team innovatively designed the “Sugar-Coated Poison (SCP)” attack paradigm. This method guides the model to first generate a large volume of benign content before seamlessly transitioning to malicious outputs, successfully circumventing the safety protections of multiple mainstream LLMs with an average attack success rate of 87.23%.
Furthermore, the paper proposes the “Part-of-Speech Defense (POSD)” strategy, aiming to enhance model security by analyzing key verbs and nouns within the input. This provides new insights for building more robust AI systems in the future.