#13 PaLI: A Jointly-Scaled Multilingual Language-Image Model
This work shows that, when building vision-language models, jointly scaling the vision and the language components improves performance. The authors trained the largest vision transformer to date, the 4B-parameter ViT-e, and showed that pairing it with a large mT5 language model yields state-of-the-art results across a broad set of vision-language benchmarks. They also proposed a “universal interface”, where the input is an (image + text) pair and the output is always text. This lets a single model handle classification, captioning, VQA, and a bunch of other tasks just by changing the prompt, which is cool.
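To make the “universal interface” idea concrete, here is a minimal Python sketch of how one image+text → text model can cover several tasks. `PaLIModel`, its `generate` method, the dummy return value, and the exact prompt strings are all illustrative assumptions (PaLI was not released with this API); the prompt style only loosely follows the task templates described in the paper.

```python
from PIL import Image

class PaLIModel:
    """Stand-in for PaLI: a jointly scaled ViT-e image encoder feeding
    visual tokens into an mT5 encoder-decoder that emits text."""

    def generate(self, image: Image.Image, prompt: str) -> str:
        # In the real model: ViT-e encodes the image into visual tokens,
        # which are concatenated with the embedded prompt tokens and fed
        # to the mT5 encoder-decoder, which decodes text autoregressively.
        return "<generated text>"  # dummy output so this sketch runs

model = PaLIModel()
img = Image.new("RGB", (224, 224))  # placeholder image

# Every task is just a different prompt; the output is always text.
caption = model.generate(img, "Generate the caption in EN:")              # captioning
answer = model.generate(img, "Answer in EN: What color is the object?")   # VQA
label = model.generate(img, "What is the main object in the image?")      # classification as generation
```

The nice design consequence is that there are no task-specific heads: framing even classification as open-ended text generation means adding a new task is just adding a new prompt template.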