#14 PaLI-3 Vision Language Models: Smaller, Faster, Stronger
A Daft Punk reference in a paper title must mean it’s a good paper, and this one is no exception. The authors show that you can get really well-performing vision-language models at roughly 10x smaller size by using a contrastively pretrained (as opposed to classification-pretrained) vision encoder together with a better training data mix.
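To make the distinction concrete: classification pretraining optimizes a fixed label set, while contrastive pretraining (CLIP/SigLIP-style, which is the approach behind PaLI-3's vision encoder) learns to align image and text embeddings directly. Below is a minimal, illustrative sketch of a sigmoid-style pairwise contrastive loss; the function name, temperature/bias values, and shapes are my own assumptions, not from the paper.

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss over a batch of image/text embeddings.

    Matching image-text pairs sit on the diagonal of the similarity matrix
    and get label +1; all other pairs get label -1. Illustrative sketch only.
    """
    # Normalize embeddings to unit length so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Scaled pairwise similarities; t and b are a learnable temperature and bias
    logits = img @ txt.T * t + b
    n = img.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal (matches), -1 elsewhere
    # -log sigmoid(label * logit), averaged over every pair in the batch
    return np.mean(np.log1p(np.exp(-labels * logits)))

# Toy check: aligned pairs should yield a lower loss than mismatched pairs
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_matched = sigmoid_contrastive_loss(img, img)
loss_mismatched = sigmoid_contrastive_loss(img, np.roll(img, 1, axis=0))
```

The appeal for a VLM backbone is that this objective trains the encoder on free-form image-text pairs rather than a closed label vocabulary, which plausibly transfers better to downstream vision-language tasks.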
This is also the first time I did a live recording of myself reading the paper. Let me know if you like this!