Give a man a fish, the old saying goes, and you feed him for a day—teach a man to fish, and you feed him for a lifetime. Same goes for robots, with the exception that robots feed exclusively on electricity. The problem is figuring out the best way to teach them. Typically, robots get fairly detailed coded instructions on how to manipulate a particular object. But hand one a different kind of object and you’ll blow its mind, because the machines aren’t yet great at applying what they’ve learned to things they’ve never seen before.

New research out of MIT is helping change that. Engineers have developed a way for a robot arm to visually study just a handful of different shoes, craning itself back and forth like a snake to get a good look at all the angles. Then when the researchers drop a different, unfamiliar kind of shoe in front of the robot and ask it to pick it up by the tongue, the machine can identify the tongue and give it a lift—without any human guidance. They’ve taught the robot to fish for, well, boots, like in the cartoons. And that could be big news for robots that are still struggling to get a grip on the complicated world of humans.

Video by Pete Florence and Tom Buehler/MIT CSAIL

Typically, to train a robot you have to do a lot of hand-holding. One way is to literally joystick the machine around to show it how to manipulate objects, an approach known as imitation learning. Or you can use reinforcement learning, in which you let the robot try over and over to, say, get a square peg in a square hole: it makes random movements and is rewarded with points as it gets closer to the goal. That, of course, takes a lot of time. Or you can do the same sort of thing in simulation, though the knowledge a virtual robot learns doesn’t easily port into a real-world machine.

This new system is unique in that it is almost entirely hands-off. For the most part, the researchers just place shoes in front of the machine. “It can build up—entirely by itself, with no human help—a very detailed visual model of these objects,” says Pete Florence, a roboticist at the MIT Computer Science and Artificial Intelligence Laboratory and lead author on a new paper describing the system. You can see it at work in the GIF above.

Think of this visual model as a coordinate system, a collection of addresses on a shoe. Or on several shoes, in this case, which the robot banks as its concept of how shoes are structured. So when the researchers finish training the robot and give it a shoe it’s never seen before, it has context to work with.

Video by Pete Florence and Tom Buehler/MIT CSAIL

“If we have pointed to the tongue of a shoe on a different image,” says Florence, “then the robot is basically looking at the new shoe, and it’s saying, ‘Hmmm, which one of these points looks the most similar to the tongue of the other shoe?’ And it’s able to identify that.” The machine reaches down and wraps its fingers around the tongue and lifts the shoe.
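For the technically curious, that “which point looks most similar” step amounts to a nearest-neighbor search in the learned descriptor space: every pixel gets a vector, and the robot looks for the pixel on the new shoe whose vector sits closest to the one for the tongue it was shown. Here is a minimal sketch of that lookup in Python; the array names and shapes are assumptions for illustration, not the lab’s actual code.

```python
import numpy as np

def find_corresponding_pixel(ref_descriptor, query_descriptors):
    """Find the pixel in a new image whose descriptor best matches a reference point.

    ref_descriptor:    (D,) descriptor of the clicked point (e.g. the tongue) in the reference image
    query_descriptors: (H, W, D) per-pixel descriptors of the new, unseen shoe
    Returns the (row, col) of the closest-matching pixel.
    """
    H, W, D = query_descriptors.shape
    # Distance from the reference descriptor to every pixel's descriptor
    diffs = query_descriptors.reshape(-1, D) - ref_descriptor
    dists = np.linalg.norm(diffs, axis=1)
    # The most similar pixel is the one with the smallest distance
    best = np.argmin(dists)
    return np.unravel_index(best, (H, W))
```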

When the robot moves its camera around, taking in the shoes at different angles, it’s collecting the data it needs to build rich internal descriptions of the meaning of particular pixels. By comparing across images, it figures out what’s a lace, a tongue, or a sole. It then uses that information to make sense of new shoes after its brief training period. “At the end of it, what pops out—and to be honest it’s a little bit magical—is that we have a consistent visual description that applies both to the shoes it was trained on but also to lots of new shoes,” says Florence. Essentially, it’s learned shoeness.
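The training behind that bit of magic is essentially a matching game played across camera views. Because the robot knows how its camera moved between shots, it can work out which pixels in two views show the same physical point on a shoe, then pull those pixels’ descriptors together while pushing non-matching pixels apart. The snippet below is a rough sketch of that kind of pixelwise contrastive objective, with assumed tensor shapes and names; it is not the team’s actual training code.

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b, margin=0.5):
    """Pull descriptors of corresponding pixels together; push non-corresponding ones apart.

    desc_a, desc_b: (N, D) descriptors sampled from two views of the same scene
    matches_*:      indices of pixel pairs known to be the same surface point
    non_matches_*:  indices of pixel pairs known to be different points
    """
    # Matching pixels should end up with nearly identical descriptors
    match_dist = (desc_a[matches_a] - desc_b[matches_b]).norm(dim=1)
    match_loss = (match_dist ** 2).mean()

    # Non-matching pixels should sit at least `margin` apart in descriptor space
    non_match_dist = (desc_a[non_matches_a] - desc_b[non_matches_b]).norm(dim=1)
    non_match_loss = (F.relu(margin - non_match_dist) ** 2).mean()

    return match_loss + non_match_loss
```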

Contrast this with how machine vision usually works, with humans labeling (or “annotating”), say, pedestrians and stop signs so a self-driving car can learn to recognize such things. “This is all about letting the robot supervise itself, rather than humans going in and doing annotations,” says coauthor Lucas Manuelli, also of MIT CSAIL.

“I can see how this is very useful in industrial applications where the hard part is finding a good point to grasp,” says Matthias Plappert, an engineer at OpenAI who has developed a system for a robot hand to teach itself how to manipulate, but who wasn’t involved in this work. Executing a grasp here is all the easier due to the simplicity of the robot’s hand, Plappert adds. It’s a two-pronged “end effector,” as it’s known in the biz, as opposed to a wildly complicated hand that mimics a human’s.

Video by Pete Florence and Tom Buehler/MIT CSAIL

That sort of general understanding is exactly what robots need if they’re going to navigate our world without infuriating us. For a home robot, you want it to understand not just what an object is, but what it’s made up of. Say you ask your robot to help you lift a table, but the legs seem a little loose, so you tell the robot to grip only the tabletop. Right now, you’d first have to instruct it on what a tabletop is. For every subsequent table, you’d have to tell it all over again; the robot wouldn’t be able to generalize from a single example, as a human likely would.

Complicating matters is the fact that lifting a shoe by the tongue or a table by its top may not be the best way to grip it in the robot’s mind. Fine manipulation remains a big problem in modern robotics, but the machines are getting better. A computer program developed at UC Berkeley called Dex-Net, for instance, is trying to help robots get a grip by calculating the best places for them to grasp various objects. For example, it’s finding that a robot with only two fingers might have better luck gripping the bulbous base of a spray bottle rather than the neck, which is shaped for human hands.

So roboticists might be able to combine this new MIT system with Dex-Net: the former could identify the general area you’d want the robot to grasp, and Dex-Net could pinpoint the best grip within that area.
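A hypothetical sketch of that division of labor: the descriptor system supplies a mask of the requested part, a grasp planner like Dex-Net supplies scored grasp candidates, and a few lines of glue pick the best grasp that lands inside the part. The helper below is illustrative only; neither system exposes exactly this interface.

```python
def pick_grasp(part_mask, grasp_candidates):
    """Combine a part mask (from the descriptor system) with scored grasps (e.g. from a planner like Dex-Net).

    part_mask:        (H, W) boolean array marking the region the user asked the robot to grab
    grasp_candidates: list of (row, col, quality_score) tuples from the grasp planner
    Returns the highest-quality grasp inside the requested part, or None if there isn't one.
    """
    in_part = [(r, c, q) for r, c, q in grasp_candidates if part_mask[r, c]]
    if not in_part:
        return None
    return max(in_part, key=lambda g: g[2])
```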

Let’s say you wanted your home robot to put a mug back on the shelf. For that, the machine would have to identify the different components of the mug. “You need to know what the bottom of the mug is so you can actually put it down the right way,” says Manuelli. “Our system can provide that sort of understanding of where’s the top, the bottom, the handle, and then you can use Dex-Net to grab it in the best way, let’s say by the rim.”

Teach a robot to fish, and it’s less likely it’ll destroy your kitchen.

