12. Transformer-based end-to-end attack on text CAPTCHAs with triplet deep attention

Published in Computers & Security, 2024

Completely Automated Public Turing Test to Tell Computers and Humans Apart (CAPTCHA) (Von Ahn et al., 2003) is a technique for distinguishing human users from automated software programs. It determines whether a user is a real human by showing the user an image and asking them to enter the correct information. Different types of captchas often appear on websites during registration, login, voting and other steps that require confirmation (Xu et al., 2020). Among them, text-based captcha images are widely used because of their low cost (Nian et al., 2022). Text-based captchas rely on the user’s ability to recognize text and accurately enter the corresponding characters to complete the website’s human-machine verification. To resist automated programs, researchers typically increase the difficulty of recognition by adding elements such as noise, background clutter and distorted characters to text images, with the aim of maximizing the discrimination between humans and programs. According to the framework for text-based captcha attack methods (Wang et al., 2023a), attacks can be categorized into traditional multi-stage attacks and one-stage attacks. Traditional multi-stage attacks involve a three-step process: preprocessing, segmentation and recognition. In these attacks, the target characters are determined by performing specific preprocessing operations tailored to the characteristics of the images. However, a disadvantage of traditional multi-stage attacks is that the feature extraction method must be developed manually, which can be limited by human understanding and experience. With the development of deep learning, one-stage attacks have become more important. These attacks use a well-trained deep learning model to directly recognize all characters in an image without the need for additional operations. One-stage attacks have become the standard method because they offer higher recognition accuracy.
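To make the three-step process of a traditional multi-stage attack concrete, the following is a minimal toy sketch rather than any method from the paper. It assumes fixed-threshold binarization for preprocessing, splitting at empty columns for segmentation, and nearest-template matching for recognition. The glyph templates and the helper names (`preprocess`, `segment`, `recognize`, `attack`, `render`) are hypothetical and chosen only for illustration.

```python
# Toy three-stage attack: preprocessing -> segmentation -> recognition.
# Templates, threshold and helper names are illustrative assumptions.

# Hypothetical 3x5 glyph templates (1 = ink, 0 = background).
TEMPLATES = {
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
}


def preprocess(gray, threshold=128):
    """Binarize a grayscale image: dark pixels become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment(binary):
    """Split the image into character pieces at columns without ink."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    pieces, start = [], None
    # The trailing False closes a piece that touches the right border.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            pieces.append([row[start:x] for row in binary])
            start = None
    return pieces


def recognize(piece):
    """Return the template label with the fewest mismatching pixels."""
    def distance(label):
        tpl = TEMPLATES[label]
        if len(tpl) != len(piece) or len(tpl[0]) != len(piece[0]):
            return float("inf")  # shape mismatch: never the best match
        return sum(
            int(t) != p
            for trow, prow in zip(tpl, piece)
            for t, p in zip(trow, prow)
        )
    return min(TEMPLATES, key=distance)


def attack(gray):
    """Run all three stages and return the recognized string."""
    return "".join(recognize(p) for p in segment(preprocess(gray)))


def render(text, ink=30, paper=220):
    """Build a synthetic grayscale image, one blank column after each glyph."""
    rows = [[] for _ in range(5)]
    for ch in text:
        for y in range(5):
            rows[y].extend(ink if c == "1" else paper for c in TEMPLATES[ch][y])
            rows[y].append(paper)
    return rows
```

For example, `attack(render("7L1"))` returns `"7L1"`. Each stage here is hand-designed for this particular toy image style, which illustrates the limitation noted above: once the characters overlap or the background becomes cluttered, the manually chosen threshold and column-based segmentation no longer apply.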
Because the amount of available data is limited, researchers have turned to transfer learning to recognize text-based captcha images when sufficient data is not available. The primary concept behind transfer learning is to reduce the number of real samples required. This is achieved by first training a base model with a synthetic dataset and then fine-tuning it with a small amount of real data. This approach addresses the challenges posed by the limited amount of data and helps to improve the model’s performance in recognizing text-based captcha images.
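The two-phase schedule described above (train a base model on synthetic data, then fine-tune on a few real samples) can be sketched with a deliberately tiny stand-in. This is not the paper's model: a logistic-regression classifier on toy 2-D points replaces the deep network, and the "real" data is simulated by shifting the class boundary. The function names, learning rates and sample counts are all assumptions made for illustration.

```python
import math
import random


def make_data(n, shift, rng):
    """Toy 2-D binary data; `shift` moves the class boundary to mimic
    the gap between synthetic and real samples (an assumption)."""
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((x1, x2), 1 if x1 + x2 > shift else 0))
    return data


def predict(w, x):
    """Probability of class 1 under a logistic model with bias w[2]."""
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, data):
    """Mean binary cross-entropy (eps guards against log(0))."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(w, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(w, data, lr, epochs):
    """Full-batch gradient descent starting from the given weights."""
    w = list(w)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


rng = random.Random(0)
synthetic = make_data(500, shift=0.0, rng=rng)  # plentiful generated data
real = make_data(20, shift=0.4, rng=rng)        # scarce "real" data

# Phase 1: train the base model on synthetic data only.
base = train([0.0, 0.0, 0.0], synthetic, lr=0.5, epochs=200)
# Phase 2: fine-tune from the base weights with a smaller learning rate.
tuned = train(base, real, lr=0.1, epochs=100)
```

Comparing `loss(base, real)` with `loss(tuned, real)` shows how far the short fine-tuning phase moves the base model toward the real distribution. The key design choice is that phase 2 starts from the pretrained weights instead of from scratch, so the few real samples only need to correct the mismatch between the two distributions.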

Recommended citation:

Transformer-based end-to-end attack on text CAPTCHAs with triplet deep attention, B. Zhang, Y.-J. Xiong*, C.-M. Xia and Y.-B. Gao, Computers & Security, 2024, 146: 104058


Xiong Yu-Jie