Physical Intelligence, the two-year-old, San Francisco-based robotics startup that has quietly become one of the most closely watched AI companies in the Bay Area, published new research Thursday showing that its latest model can direct robots to perform tasks they were never explicitly trained on — a capability the company’s own researchers say caught them off guard.

The new model, called π0.7, represents what the company describes as an early but meaningful step toward the long-sought goal of a general-purpose robot brain: One that can be pointed at an unfamiliar task, coached through it in plain language, and actually pull it off. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching an inflection point similar to what the field saw with large language models — where capabilities begin compounding in ways that outpace what the underlying data would seem to predict.

But first: The core claim in the paper is compositional generalization — the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to robot training has been essentially rote memorization — collect data on a specific task, train a specialist model on that data, then repeat for every new task. π0.7, Physical Intelligence says, breaks that pattern.

“Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways,” says Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor focused on AI for robotics, “the capabilities are going up more than linearly with the amount of data. That much more favorable scaling property is something we’ve seen in other domains, like language and vision.”

The paper’s most striking demonstration involves an air fryer the model had essentially never seen in training. When the research team investigated, they found only two relevant episodes in the entire training dataset: One where a different robot merely pushed the air fryer closed, and one from an open-source dataset where yet another robot placed a plastic bottle inside one on someone’s instructions. The model had somehow synthesized those fragments, plus broader web-based pretraining data, into a functional understanding of how the appliance works.

“It’s very hard to track down where the knowledge is coming from, or where it will succeed or fail,” says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Still, with zero coaching, the model made a passable attempt at using the appliance to cook a sweet potato. With step-by-step verbal instructions — essentially, a human walking the robot through the task the way you might explain something to a new employee — it performed successfully.

That coaching capability matters because it suggests robots could be deployed in new environments and improved in real time without additional data collection or model retraining.

So what does it all mean? The researchers aren’t shy about the model’s limitations and are careful not to get ahead of themselves. In at least one case, they point the finger squarely at their own team.

“Sometimes the failure mode is not on the robot or on the model,” Balakrishna says. “It’s on us. Not being good at prompt engineering.” He describes an early air fryer experiment that produced a 5% success rate. After spending about half an hour refining how the task was explained to the model, it jumped to 95%, he says.

Image Credits:Physical Intelligence

The model also isn’t yet capable of executing complex multi-step tasks autonomously from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast’,” Levine says. “But if you walk it through — ‘for the toaster, open this part, push that button, do this’ — then it actually tends to work pretty well.”

The team also acknowledged that standardized benchmarks for robotics don’t really exist, which makes external validation of their claims difficult. Instead, the company measured π0.7 against its own previous specialist models — purpose-built systems trained on individual tasks — and found that the generalist model matched their performance across a range of complex work including making coffee, folding laundry, and assembling boxes.

What may be most notable about the research — if you take the researchers at their word — is not any single demo but the degree to which the results surprised them, people whose job it is to know exactly what is in the training data and therefore what the model should and shouldn’t be able to do.

“My experience has always been that when I deeply know what’s in the data, I can kind of just guess what the model will be able to d

Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught