I read one ML-related paper every weekday and post a 5-minute video summary to my YouTube channel. This page collects all of those together with short text descriptions.

#9 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [paper]

It turns out it might be possible to understand how frontier LLMs work by inspecting middle model layers: the authors train sparse autoencoders on the activations of a middle layer of Claude 3 Sonnet and extract millions of human-interpretable features.
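For intuition, here's a toy version of the sparse autoencoder setup - a minimal sketch, not the paper's actual architecture: the dimensions, hyperparameters, and training details are made up, and the real SAEs have millions of features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations. Dimensions are
    illustrative; the paper trains SAEs with millions of features on
    activations from a middle layer of Claude 3 Sonnet."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; combined with the
        # L1 penalty below, most features end up exactly zero per input.
        features = F.relu(self.encoder(x))
        return self.decoder(features), features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    return F.mse_loss(reconstruction, x) + l1_coeff * features.abs().mean()
```

Each learned feature direction can then be inspected by looking at the inputs that activate it most strongly.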

#8 Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [paper]

This paper is super similar to CLIP, which we discussed yesterday. It came out of Google rather than OpenAI, in June 2021, about 4 months after CLIP. The idea is basically the same, with the major difference being a larger and noisier dataset.

They also use different encoder architectures for text and images (BERT for text, EfficientNet for images), but overall it's the same broad approach.

#7 Learning Transferable Visual Models From Natural Language Supervision [paper]

Very impactful 2021 paper from OpenAI showing how to train a joint multimodal (text + image) embedding space. The model learns image and text representations by predicting which text captions in a batch match which images.

It turns out that by doing this at sufficiently large scale, you end up training a model that does pretty well at other tasks, so it probably learns useful representations of both texts and images.

Big, detailed paper, so I can't do it justice in 5 minutes - this overview mostly covers the context around it and only the very core of the idea behind the model.
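To make that core idea concrete, here's a minimal sketch of CLIP's symmetric contrastive objective. It assumes the image and text embeddings have already been computed; in the real model the two encoders are trained jointly end to end and the temperature is a learned parameter, not a fixed constant.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matching (image, text) pairs.

    image_emb, text_emb: (batch, dim) embeddings, where row i of each
    tensor comes from the same image-caption pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: entry [i, j] scores image i vs. caption j.
    logits = image_emb @ text_emb.T / temperature

    # The correct caption for image i is caption i, and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, targets)    # image -> caption
    loss_texts = F.cross_entropy(logits.T, targets)   # caption -> image
    return (loss_images + loss_texts) / 2
```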

#6 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [paper]

As we've seen before, LLM-based visual agents are pretty good at planning what to do when completing high-level tasks, but pretty bad at "grounding", i.e. turning the plan into an executable action.

Set-of-Mark prompting is a proposed technique to make grounding easier - it turns out that annotating image inputs with segmentation masks and numbered labels gives the LLM concrete regions to refer to, which helps it ground tasks much better.
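Here's a rough sketch of the annotation step. The paper gets its masks from off-the-shelf segmentation models (e.g. SAM); here the mask format, the bounding-box stand-in, and the drawing details are all simplified assumptions.

```python
from PIL import Image, ImageDraw

def draw_marks(image: Image.Image, masks: list[dict]) -> Image.Image:
    """Overlay numbered marks on an image, one per segmented region.

    masks: list of {"bbox": (x0, y0, x1, y1)} dicts - a stand-in for
    the output of a real segmentation model such as SAM.
    """
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for i, mask in enumerate(masks, start=1):
        x0, y0, x1, y1 = mask["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        # Numbered label in the corner of the region; the LLM can then
        # answer with "mark 3" instead of raw pixel coordinates.
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")
    return annotated
```

The annotated image then goes to GPT-4V together with the task prompt; the model answers with a mark number, which maps deterministically back to a region on screen.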

#5 Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models [paper]

If you aspire to become an LLM Sommelier, you should definitely read this paper and use the dataset to help you. As new multimodal models are released, evaluating them and understanding their relative strengths and weaknesses is hard. LMSys helps, but only gives us an overall ranking, not a granular understanding.

This paper contributes a dataset that makes this evaluation process easier, plus a method for the evaluation (I love the dataset part; I'm a little skeptical about the evaluation part).
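The evaluation part is model-based grading: a strong LLM judge scores each candidate response against a human-written reference answer. A rough sketch of that loop follows, with the caveat that the prompt wording, the 1-5 scale, and the judge_model callable are my stand-ins, not the paper's exact protocol.

```python
def judge_prompt(question: str, reference: str, candidate: str) -> str:
    # Hypothetical prompt template; the paper's actual rubric differs.
    return (
        "You are grading a model's answer to a question about an image.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Score the candidate from 1 (wrong) to 5 (matches the reference). "
        "Reply with just the number."
    )

def score_responses(examples, judge_model) -> float:
    """examples: iterable of (question, reference, candidate) triples.
    judge_model: any callable mapping a prompt string to a completion
    string - a stand-in for the strong judge model the paper uses."""
    scores = []
    for question, reference, candidate in examples:
        reply = judge_model(judge_prompt(question, reference, candidate))
        scores.append(int(reply.strip()))
    return sum(scores) / len(scores)
```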

#4 GPT-4V(ision) is a Generalist Web Agent, if Grounded [paper]

#3 Mind2Web: Towards a Generalist Agent for the Web [paper]

#2 Don't Generate, Discriminate [paper]

#1 More Agents Is All You Need [paper]