#6 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

2024-05-14

As we’ve seen before, LLM-based visual agents are pretty good at planning how to complete a high-level task, but pretty bad at “grounding”, i.e., turning that plan into an executable action such as clicking a specific point on the screen.

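Set-of-Mark prompting is a proposed technique to make grounding easier. It turns out that by overlaying masks and labels (the “marks”) on the regions of an input image, we can help multimodal LLMs such as GPT-4V ground tasks better: the model can refer to a region by its label instead of having to produce raw coordinates.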
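To make the idea concrete, here is a minimal sketch of the bookkeeping around a set of marks. It is not the paper's implementation: it assumes the regions have already been found by some segmentation step, reduces each region to a bounding box, and skips drawing the marks onto the image. The names `Region`, `assign_marks`, `build_prompt`, and `ground_reply`, as well as the prompt wording, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """A segmented region, simplified here to a pixel bounding box (assumption)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def center(self) -> tuple:
        # Integer midpoint of the box, used as the point to act on.
        return ((self.x0 + self.x1) // 2, (self.y0 + self.y1) // 2)


def assign_marks(regions: list) -> dict:
    """Give each region a numeric mark, starting at 1."""
    return {i: region for i, region in enumerate(regions, start=1)}


def build_prompt(task: str, marks: dict) -> str:
    """Build a text prompt that tells the model which marks are on the image."""
    ids = ", ".join(str(i) for i in marks)
    return (
        f"The image is annotated with numbered marks: {ids}. "
        f"Task: {task} Answer with the mark number, e.g. 'Mark 2'."
    )


def ground_reply(reply: str, marks: dict) -> Optional[tuple]:
    """Map a reply such as 'Mark 2' back to a pixel coordinate, or None."""
    match = re.search(r"\bmark\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    region = marks.get(int(match.group(1)))
    return region.center() if region is not None else None


# Two made-up regions standing in for the output of a segmentation step.
marks = assign_marks([Region(10, 10, 110, 50), Region(200, 40, 260, 100)])
prompt = build_prompt("Click the search button.", marks)
point = ground_reply("I would click Mark 2.", marks)  # (230, 70)
```

The point of the sketch is the last step: the model only has to name a mark, and the agent looks up the pixel coordinates itself, so the model never has to guess them.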