I read one ML-related paper every weekday and post a 5-minute video summary to my YouTube channel. This page collects all of those together with short text descriptions.

#15 Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

It’s great to do pioneering work, but sometimes optimizing existing models can be just as impactful, if not more - because it makes things more practical. This paper presents such an optimization: instead of blindly increasing your network’s parameter count and giving all your lunch money to Nvidia, you can scale your model in a specific way to get the same performance with much less compute.

Specifically, they present a recipe for how to scale your Vision Transformer to get optimal performance at a given compute budget. In short, the MLP dimension should be scaled up the fastest, then depth, then width.
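
To make the recipe concrete, here is a toy sketch - the exponents below are made-up placeholders, not the paper’s fitted scaling laws; only the ordering (MLP dimension fastest, then depth, then width) is the point.

```python
# Illustrative sketch only: the exponents are hypothetical, not the paper's
# fitted values. Given a larger compute budget, grow the MLP dimension
# fastest, then depth, then width.
def scale_vit_shape(compute_multiplier, width=384, depth=12, mlp_dim=1536):
    return {
        "width":   int(width   * compute_multiplier ** 0.15),  # scaled slowest
        "depth":   int(depth   * compute_multiplier ** 0.25),
        "mlp_dim": int(mlp_dim * compute_multiplier ** 0.40),  # scaled fastest
    }

print(scale_vit_shape(8.0))  # e.g. a budget 8x larger than the base shape
```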

I also did a live reading of this paper.

Finally, in this video Lucas Beyer, an author on many of the papers we’ve been reading recently, gives a great general overview of Google’s vision-language model stream of work.

#14 PaLI-3 Vision Language Models: Smaller, Faster, Stronger

A Daft Punk reference in a paper title must mean it’s a good paper, and this is no exception. They show that you can get really well-performing vision-language models at 10x smaller size by using a contrastively pretrained (as opposed to classification-pretrained) vision model and a better training data mix.

This is also the first time I did a live recording of myself reading the paper. Let me know if you like this!

#13 PaLI: A Jointly-Scaled Multilingual Language-Image Model

This work shows that, when building vision-language models, scaling the vision and the language components together increases performance. They created the biggest-to-date vision transformer (called ViT-e) and showed how combining it with large language models gives really good results on a collection of different tasks. They also proposed a “universal interface”, where the input is a (text + image) pair and the output is just text. This allows them to do classification, captioning, VQA and a bunch of other tasks using a single model, which is cool.
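
Here is a rough sketch of what that universal interface looks like in practice - the `run_task` function, the `model.generate` call and the prompt strings are all hypothetical stand-ins, not PaLI’s actual API or templates.

```python
# Minimal sketch of the "universal interface" idea: every task is phrased as
# (image, prompt text) -> output text, so one model covers all of them.
def run_task(model, image, task, question=None):
    prompts = {
        "classify": "Answer with the class name:",        # hypothetical prompts
        "caption": "Describe the image in English:",
        "vqa": f"Answer in English: {question}",
    }
    return model.generate(image=image, text=prompts[task])  # always returns text
```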

#12 PaLM: Scaling Language Modeling with Pathways

This important 2022 paper described how to train a huge-at-the-time 540-billion-parameter language model. It used the Pathways system, which for the first time allowed Google to run training across multiple TPUv4 pods - necessary to scale up that far.

#11 SigLIP: Sigmoid Loss for Language Image Pre-Training

In contrastive image-text training, you can replace the more common softmax-based loss with a simple pairwise sigmoid loss. This removes the need for batch-wide normalization, lets you scale up batch sizes significantly (though it turns out this has diminishing returns) and gives you overall efficiency gains.
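
Here is a simplified NumPy sketch of what such a pairwise sigmoid loss looks like - the initialization of `t` and `b` follows the paper only loosely.

```python
import numpy as np

# Pairwise sigmoid contrastive loss in the spirit of SigLIP.
def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    # img_emb, txt_emb: (N, d) L2-normalized embeddings; row i of each is a matching pair
    logits = t * img_emb @ txt_emb.T + b            # (N, N) pairwise similarities
    labels = 2 * np.eye(len(img_emb)) - 1           # +1 on the diagonal, -1 elsewhere
    log_lik = -np.logaddexp(0.0, -labels * logits)  # = log(sigmoid(labels * logits))
    return -log_lik.sum(axis=1).mean()              # sum over pairs, average over batch

# Unlike a softmax loss, every (image, text) pair contributes an independent
# binary term, so there is no batch-wide normalization to compute.
```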

#10 LiT: Zero-Shot Transfer with Locked-image text Tuning

Sometimes you can get a free lunch! It turns out that you can build better CLIP-style vision-language models by doing less work (sort of). Instead of training both the image and the text components from scratch using contrastive learning, you can take an existing image model, freeze it and only train the text model to match it.
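
Here is a minimal PyTorch sketch of that recipe, with simple placeholder layers standing in for the real towers.

```python
import torch

# LiT-style setup: freeze the pretrained image tower, train only the text
# tower (plus the temperature) with the usual contrastive objective.
image_tower = torch.nn.Linear(2048, 512)   # stand-in for a pretrained image model
text_tower = torch.nn.Linear(768, 512)     # stand-in for the text model

for p in image_tower.parameters():         # "locked" image tower: no gradient updates
    p.requires_grad_(False)

temperature = torch.nn.Parameter(torch.tensor(0.07))
optimizer = torch.optim.Adam(
    list(text_tower.parameters()) + [temperature], lr=1e-4
)
```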

#9 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

It turns out it might be possible to understand how frontier LLMs work: the paper trains sparse autoencoders on activations from a middle layer of Claude 3 Sonnet and extracts millions of human-interpretable features.
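
Here is a toy sketch of the sparse-autoencoder idea on fake activations - the sizes and training details are made up and do not match Anthropic’s actual setup.

```python
import numpy as np

# Learn an overcomplete dictionary so each activation vector is reconstructed
# from a small number of active "features".
rng = np.random.default_rng(0)
d_model, d_features = 64, 512                  # hypothetical sizes
W_enc = rng.normal(size=(d_model, d_features)) * 0.02
W_dec = rng.normal(size=(d_features, d_model)) * 0.02
b_enc = np.zeros(d_features)

def encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse feature activations

def decode(f):
    return f @ W_dec                           # reconstruct the original activation

x = rng.normal(size=(d_model,))                # a fake middle-layer activation
f = encode(x)
loss = np.sum((decode(f) - x) ** 2) + 1e-3 * np.sum(np.abs(f))  # reconstruction + L1 sparsity
```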

#8 Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

This paper is super similar to CLIP, which we discussed yesterday. It came out of Google rather than OpenAI, in June 2021, about 4 months later. The idea is basically the same, with the major difference being a larger and noisier training dataset.

They also use different encoder architectures for the text and image towers than CLIP did, but overall it’s the same broad approach.

#7 Learning Transferable Visual Models From Natural Language Supervision

A very impactful 2021 paper from OpenAI showing how to train a multimodal (text + image) embedding space. It learns image and text representations that allow predicting which text captions match which images.

It turns out that by doing this at sufficiently large scale, you end up training a model that does pretty well zero-shot at other tasks (like image classification), so it probably learns useful representations of both texts and images.

It’s a big, detailed paper, so I can’t do it justice in 5 minutes - this overview covers mostly the context around it and only the very core of the idea behind the model.
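
Here is a simplified NumPy sketch of that core idea - score every (image, text) pair in the batch and apply a symmetric softmax loss so that matching pairs end up close in the shared embedding space.

```python
import numpy as np

# CLIP-style symmetric contrastive loss over a batch of matched pairs.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, d) L2-normalized embeddings; row i of each is a match
    logits = img_emb @ txt_emb.T / temperature             # (N, N) similarity matrix
    logits = logits - logits.max()                         # global shift for stability
    labels = np.arange(len(img_emb))                       # the diagonal pairs are correct
    log_probs_i = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_probs_t = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i = -log_probs_i[labels, labels].mean()           # image -> text direction
    loss_t = -log_probs_t[labels, labels].mean()           # text -> image direction
    return (loss_i + loss_t) / 2
```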

#6 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

As we’ve seen before, LLM-based visual agents are pretty good at planning what to do when completing high-level tasks, but pretty bad at “grounding”, i.e. turning the plan into an executable action.

Set-of-Mark prompting is a proposed technique to make grounding easier: by overlaying segmentation masks and numbered labels on the input image, we let the model refer to specific regions by their marks, which helps it ground its answers much better.
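
Here is a rough sketch of the annotation step - the segmentation masks are assumed to come from some off-the-shelf segmenter, and the drawing details are heavily simplified.

```python
import numpy as np
from PIL import Image, ImageDraw

# Overlay a numeric label at each region's centroid so the model can refer
# to "region 3" instead of raw pixel coordinates.
def draw_marks(image: Image.Image, masks: list[np.ndarray]) -> Image.Image:
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks, start=1):     # mask: boolean (H, W) array
        ys, xs = np.nonzero(mask)
        cx, cy = xs.mean(), ys.mean()               # centroid of the region
        draw.text((cx, cy), str(idx), fill="red")   # the "mark" for this region
    return marked
```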