#8 Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
This paper is super similar to CLIP, which we discussed yesterday. It came out of Google instead of OpenAI, in June 2021, so about 4 months later. The idea is basically the same, with the major difference being a larger and noisier dataset.
They also use different models for embedding text and images (an EfficientNet image encoder and a BERT text encoder, versus CLIP's ResNet/ViT and Transformer), but overall it's the same broad approach: two separate encoders trained jointly with a contrastive loss on image-text pairs.
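To make the "same broad approach" concrete, here is a minimal sketch of the dual-encoder contrastive objective both papers share, assuming CLIP-style L2-normalized embeddings and a symmetric InfoNCE loss; the random tensors stand in for the outputs of whatever image and text encoders you plug in, and the temperature value is just a placeholder.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two separate encoders.
    Matching pairs share the same batch index, so the diagonal of the
    similarity matrix holds the positives.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # image-to-text and text-to-image cross-entropy, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random "embeddings" standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt))
```

The point of the sketch is that nothing in the loss cares which encoders produced the embeddings, which is why swapping in different models and a much noisier dataset still works.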