Preview
CLIP is a multi-modal vision-and-language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like transformer to extract visual features and a causal language model to extract text features. Both the text and visual features are then projected into a latent space of identical dimension, and the dot product between the projected image and text features is used as a similarity score.
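To make the idea concrete, here is a minimal sketch of zero-shot classification with CLIP via the Hugging Face transformers library. The checkpoint name, image path, and prompts below are illustrative assumptions, not the ones used by hcaptcha-challenger.

```python
# Minimal zero-shot image classification with CLIP (Hugging Face transformers).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("challenge_tile.png")  # hypothetical example image
prompts = ["a photo of a bicycle", "a photo that does not contain a bicycle"]

# Encode both modalities; the features are projected into the shared latent
# space, and the scaled dot product becomes the image-text similarity score.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # shape: (1, len(prompts))
print(dict(zip(prompts, probs[0].tolist())))
```

The prompt with the highest probability is taken as the predicted label, which is exactly the kind of binary "does this image match the challenge prompt" decision a CAPTCHA requires.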
Milestone
We merged feature #858 into the main branch of hcaptcha-challenger on October 22, 2023, to handle CAPTCHA challenges via the CLIP image-text cross-modal model.
Previously, we trained and used a ResNet model to handle the image classification challenge. The network has so few parameters that the exported ResNet ONNX model is only 294 KB, yet it still achieves over 80% accuracy on the binary classification task. This is more than enough for a CAPTCHA challenge with only nine images.
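As a rough sketch of how such a small exported model is invoked at inference time with onnxruntime; the model filename, the 64x64 input size, and the normalization are assumptions for illustration and may differ from the project's actual export:

```python
# Sketch: binary classification with a small exported ResNet ONNX model.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("resnet_binary.onnx")  # hypothetical path
input_name = session.get_inputs()[0].name

img = Image.open("challenge_tile.png").convert("RGB").resize((64, 64))
x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, scaled to [0, 1]
x = np.transpose(x, (2, 0, 1))[None, ...]       # NCHW batch of one

logits = session.run(None, {input_name: x})[0]
print("match" if logits[0].argmax() == 1 else "no match")
```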
But by 2023, there have been so many key breakthroughs in computer vision that we can easily lift accuracy to 98%+ on such simple CAPTCHA tasks 😮.
Thus, we also designed a factory workflow around this idea: using the same network architecture, we train separate models for different prompt scenarios on different batches of image data, as sketched below.
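A hypothetical outline of that factory loop, with scenario names, directory layout, and the training helper all made up for illustration; the real training code lives in the project:

```python
# Hypothetical "factory" workflow: one shared architecture, one training run
# per prompt scenario, each on its own batch of labeled images.
from pathlib import Path

SCENARIOS = {  # prompt scenario -> dataset directory (illustrative names)
    "bicycle": Path("datasets/bicycle"),
    "seaplane": Path("datasets/seaplane"),
    "off_road_vehicle": Path("datasets/off_road_vehicle"),
}

def train_binary_classifier(data_dir: Path, out_path: Path) -> None:
    """Train the shared ResNet design on one scenario and export it to ONNX.
    Placeholder body; the actual training loop is project-specific."""
    ...

for prompt, data_dir in SCENARIOS.items():
    out = Path("models") / f"{prompt}.onnx"
    train_binary_classifier(data_dir, out)
    print(f"trained model for prompt '{prompt}' -> {out}")
```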
Although these small models can only handle binary classification with a single target, the trade-off is extreme iteration speed: we can go from training to releasing a new model version in just a few minutes.