#7 Learning Transferable Visual Models From Natural Language Supervision
Very impactful 2021 paper from OpenAI (the CLIP model) showing how to train a joint multimodal (text + image) embedding space. Learns image and text representations that allow predicting which text captions match which images.
It turns out that by doing this at sufficiently large scale, you end up with a model that transfers zero-shot to many downstream tasks - e.g. classifying an image by scoring it against candidate captions like "a photo of a dog" - so it evidently learns useful representations of both texts and images.
Big, detailed paper, so I can’t do it justice in 5 minutes - this overview mostly covers the context around it and only the very core of the idea behind the model.
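That core idea fits in a few lines, so here is a minimal sketch of the contrastive objective, loosely adapted from the numpy-style pseudocode in the paper. Assumes the batch's image and text embeddings have already been computed by some encoders; the function name and the temperature value are illustrative placeholders, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities for a batch of
    N matching (image, text) embedding pairs, each of shape [N, d]."""
    # L2-normalize so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # [N, N] similarity matrix; entry (i, j) scores image i vs. caption j.
    logits = img_emb @ txt_emb.t() / temperature

    # The true pairs lie on the diagonal: image i matches caption i.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Classify the right caption for each image, and vice versa.
    loss_img = F.cross_entropy(logits, targets)      # image -> text
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_img + loss_txt) / 2
```

The symmetric loss pushes each image toward its own caption and away from the other N-1 captions in the batch (and vice versa), which is why large batches help: they supply more negatives per step.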