Better robotics performance by first predicting a generic language description of the motion (“rotate arm right”), then predicting the specific action for the task (“open jar”).
Dataset of millions of figure-caption pairs from 572,000 arXiv papers, plus a question-answering dataset generated by GPT-4 from those figure-caption pairs.