#10 LiT: Zero-Shot Transfer with Locked-image text Tuning
Sometimes you can get a free lunch! It turns out you can build better CLIP-style vision-language models by doing less work (sort of). Instead of training both the image and text towers from scratch with contrastive learning, you take an existing pretrained image model, freeze it, and train only the text model to match its embeddings.
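The core of the recipe fits in a few lines. Below is a minimal PyTorch sketch of the idea: the encoders here are hypothetical `nn.Linear` stand-ins (in the paper the image tower is a pretrained ViT and the text tower is a Transformer), and the loss is the standard CLIP-style symmetric contrastive objective. The key LiT-specific step is locking the image tower so gradients only flow into the text tower.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the two towers; in practice the image
# encoder would be a pretrained ViT and the text encoder a Transformer.
image_encoder = nn.Linear(64, 32)   # pretrained, to be locked
text_encoder = nn.Linear(128, 32)   # trained to match the image tower

# LiT: lock the image tower -- no gradients, no parameter updates.
for p in image_encoder.parameters():
    p.requires_grad = False

# Only the text tower's parameters go to the optimizer.
optimizer = torch.optim.Adam(text_encoder.parameters(), lr=1e-3)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # CLIP-style symmetric InfoNCE over a batch of matched (image, text)
    # pairs: the i-th image should score highest against the i-th text.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# One toy training step on random "features" standing in for a batch.
images = torch.randn(8, 64)
texts = torch.randn(8, 128)

loss = contrastive_loss(image_encoder(images), text_encoder(texts))
loss.backward()
optimizer.step()
```

Because the image tower is frozen, its embeddings stay fixed throughout training; the text tower learns to align with an already-good visual representation instead of the two towers chasing each other.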