#15 Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
It’s great to do pioneering work, but sometimes optimizing existing models can be just as impactful, if not more so, because it makes things more practical. This paper presents one such optimization: instead of blindly increasing your network's parameter count and giving all your lunch money to Nvidia, you can scale your model in a specific way to get the same performance with much less compute.
Specifically, they present a recipe for how to scale your Vision Transformer to get optimal performance at a given compute budget. In short, the MLP dimension should be scaled up the fastest, then depth, then width.
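To make the "scale at different rates" idea concrete, here is a rough sketch that assumes each shape dimension grows as a power of the compute budget. The exponents below are hypothetical placeholders picked only to respect the ordering above (MLP fastest, then depth, then width); they are not the paper's fitted values, and `scale_shape` is just a name I made up for illustration. The starting shape is the usual ViT-B one.

```python
# Hypothetical exponents, chosen only so that MLP > depth > width.
# They are NOT the fitted values from the paper.
EXPONENTS = {"mlp_dim": 0.6, "depth": 0.4, "width": 0.2}


def scale_shape(base_shape, compute_ratio, exponents=EXPONENTS):
    """Grow each dimension as base * compute_ratio ** exponent, rounded to an int."""
    return {
        name: max(1, round(size * compute_ratio ** exponents[name]))
        for name, size in base_shape.items()
    }


# ViT-B shape as the starting point.
base = {"mlp_dim": 3072, "depth": 12, "width": 768}

# Pretend we have 10x more compute to spend.
scaled = scale_shape(base, compute_ratio=10.0)
growth = {name: scaled[name] / base[name] for name in base}

print(scaled)
print(growth)
```

With any exponents ordered this way, a bigger compute budget pushes the MLP dimension up the most and the width the least, which is the shape of the recipe. A real implementation would also round to hardware-friendly sizes and keep the width divisible by the number of attention heads.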
I also did a live reading of this paper.
Finally, in this video Lucas Beyer, one of the authors of all the papers we’ve been reading recently, gives a great general overview of Google’s Vision Language Model stream of work.