#14 PaLI-3 Vision Language Models: Smaller, Faster, Stronger

A Daft Punk reference in a paper title must mean it’s a good paper, and this is no exception. They show that you can get strongly performing vision-language models at 10x smaller size by using a contrastively pretrained (as opposed to classification-pretrained) vision encoder and a better training-data mix.
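To make the "contrastively pretrained" part concrete: the idea is to train the vision encoder so that embeddings of matching image-text pairs score high and mismatched pairs score low, rather than predicting fixed class labels. Below is a minimal NumPy sketch of a SigLIP-style pairwise sigmoid loss; the constants (`t`, `b`) and shapes are illustrative assumptions, not the paper's exact training setup.

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    img_emb, txt_emb: (n, d) L2-normalized embeddings; row i of each
    is a matching pair. Matching pairs get label +1, all others -1.
    t, b are a temperature and bias (learnable in real training;
    fixed constants here for illustration).
    """
    logits = img_emb @ txt_emb.T * t + b          # (n, n) similarity logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 on diagonal, -1 off
    # -log sigmoid(label * logit), averaged over every pair in the batch
    return np.mean(np.logaddexp(0.0, -labels * logits))

# Tiny demo with random embeddings (no pretrained model involved)
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
loss = sigmoid_contrastive_loss(img, txt)
```

Unlike a softmax over the whole batch, each pair contributes an independent binary term, which is part of what makes this objective simple to scale. The model then minimizes this loss so that aligned image-text pairs pull together in embedding space.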

This is also the first time I did a live recording of myself reading the paper. Let me know if you like this!