On Friday, Google DeepMind unveiled Robotic Transformer 2 (RT-2), a vision-language-action (VLA) model aimed at creating general-purpose robots that can navigate human environments. RT-2 builds on a large language model, similar to the technology behind ChatGPT, trained on text and images sourced from the internet. This approach gives the robots a capacity for “generalization,” meaning they can perform tasks they were not explicitly trained to do.
“The aim is to establish robots that comprehend and act in our world as naturally as characters like WALL-E or C-3PO,” said a spokesperson for Google DeepMind. Thanks to generalization, an RT-2 robot can identify and dispose of trash even when the items are ambiguous, such as discarded food packaging or banana peels. Its understanding of what trash is and how it is typically handled guides its actions.
Moreover, RT-2 is significant for its ability to adapt to changing scenarios in the real world, something that is hard to achieve through explicit programming alone. For instance, when instructed to “Pick up the extinct animal,” the RT-2 robot was able to identify and select a dinosaur figurine from among several objects.
RT-2 builds upon Google’s earlier AI projects, including the Pathways Language and Image model (PaLI-X) and the Pathways Language model Embodied (PaLM-E). It is also co-trained on robot data from its predecessor model (RT-1), gathered over 17 months by 13 robots in an office kitchen environment. The result is a VLA model that takes robot camera images as input and predicts the actions the robot should take.
“To enhance robot control, we adopted a strategy of representing actions as tokens, similar to language tokens,” Google explained. Because each action is written out as a string of tokens, RT-2 can learn new skills using the same kind of learning models that are applied to web data.
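To make the idea of actions as tokens more concrete, here is a minimal Python sketch of one way a continuous robot action could be turned into a string of integer tokens and back. It is an illustration only, not Google's implementation: the function names, the choice of 256 bins per dimension, and the normalized range of -1 to 1 are all assumptions made for the example.

```python
from typing import List, Sequence

# Assumed number of discrete bins per action dimension (illustrative value).
NUM_BINS = 256


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action value to an integer bin and join them as text."""
    tokens = []
    for value in action:
        # Clip to the assumed valid range so every value lands in a bin.
        clipped = min(max(value, low), high)
        # Scale to [0, NUM_BINS - 1] and round to the nearest bin index.
        bin_index = round((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Turn a string of integer tokens back into approximate continuous values."""
    return [low + int(tok) / (NUM_BINS - 1) * (high - low) for tok in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, e.g. a small arm displacement.
    action_string = encode_action([0.3, -0.7, 1.0])
    print(action_string)
    print(decode_action(action_string))
```

Once an action is a short string of integers like this, a language model can emit it the same way it emits words, and the decoded values can be passed to the robot controller. The round trip loses a little precision, at most half a bin width per dimension.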
RT-2 further exhibits its advanced capabilities with chain-of-thought reasoning, enabling complex, multi-stage decision-making. For instance, it can choose an alternate tool or decide the best beverage for a tired individual.
In over 6,000 trials, RT-2 performed as effectively as RT-1 on known tasks. However, in unseen scenarios, RT-2 nearly doubled its predecessor’s performance, achieving a success rate of 62%.
Despite these advancements, Google concedes that RT-2 has limitations. While web data improves the model’s ability to generalize, it does not give the robot new physical skills: RT-2 cannot perform motions beyond those it learned from RT-1’s training data.
First published at ArsTechnica.com.
Jack McPherrin is a managing editor of StoppingSocialism.com, research editor for The Heartland Institute, and a research fellow for Heartland's Socialism Research Center. He holds an MA in International Affairs from Loyola University-Chicago, and a dual BA in Economics and History from Boston College.