On July 4, swarms of sugar-happy children, sparkler-laden teens, and buzzed, burger-full adults crowd the streets in cities across the US. They stagger in packs, waving flags and wearing flags, acting, and let’s not mince words here, like very patriotic fools. Why not? Many have the day off, and maybe the day after off, too. Add fireworks, and you have the makings of a very good time.

Unless you’re a self-driving car developer, in which case you have the makings of a nightmare. See, on most days of the year, walkers act fairly predictably. They wait at crosswalks for their light. Or they don’t, and make little dashes across the street. Normal, everyday stuff—that’s what the systems that run autonomous vehicles are trained to handle. But if people act erratically, wandering about the lanes in ways they usually don’t, the cars can get confused.

So if you are that self-driving car developer, encountering the Fourth of July for the very first time, you might pay for the services of a new breed of data labeling company, like Scale API. Scale’s automated systems, helped along by somewhere north of 10,000 contract workers, examine and label the data collected by autonomous vehicles as they run tests on American roads. Those labels, in turn, help the car’s software train to recognize particular situations next time they occur.

Here, the fine distinctions matter. If a data labeler were to consistently label cars as people, an autonomous vehicle’s software might get very, very confused, swerving or braking when it shouldn’t. Or: If data is labeled perfectly and accurately, every single time, then those systems just might learn how to safely maneuver through the wide, weird world. Put another way, the tedious task of data labeling is essential to building safe self-driving cars.

Here’s the silly thing, though. When a Scale customer—like self-driving car developers Cruise, Zoox, Lyft, Nutonomy, Nuro, Pony.ai, or Voyage, or self-driving truck builders Embark and Starsky Robotics—sends data to be labeled, that data doesn’t get shared with other Scale clients. This is too bad, because autonomous driving systems could always use more data to train on, more images of the real world that help them refine their robobrains. It’s doubly too bad when it comes to edge cases, the unusual but dangerous happenings that all cars should be prepared to handle.

Animation by Scale API

Sure, it makes a lot of sense for companies to want to keep these bits of data to themselves. The developers spend a lot of time and money collecting that information, after all. “I don’t know how you get competitors to share their most valuable information,” says Oscar Beijbom, who heads up the machine learning team at Nutonomy.1 “In a way, these corner cases are very precious.”

But it’s also kind of dumb for the companies to be so possessive. “Right now each company is so in its own lane and secretive,” says Alexandr Wang, Scale’s 21-year-old founder and CEO. “In reality, these edge cases, these are things that should probably be shared or standardized across the industry at some point.” Wouldn’t it be great—and much, much safer—if everyone had a hand in creating the training data that helps autonomous cars understand when to swerve, or when to hit the brakes?

Reading the Labels

Scale API is one of several companies that offer so-called data-labeling services. Mighty AI, Appen, Amazon Mechanical Turk, Samasource, and Cloud Factory all offer clients ways to connect with contract labelers who can do this work. Scale, which has 35 employees, is zooming in on the autonomous vehicle market, and the particular blend of sensor data those cars churn out. (It specializes in labeling quickly; companies that use the startup’s products often have their own contractors or in-house teams who do the most quality-sensitive labeling work.)

The car systems in question generate lots and lots of data, from cameras, radar, and lidar. The data covers lots of frequent driving situations, like what it looks like when a car is tailgating, or takes a left turn across traffic, or when a cyclist is sharing the road. And less frequent situations, like if the truck driving ahead suddenly unleashes its cargo of logs onto the road. (True story: This has happened to one of Scale’s clients.) Engineers train their self-driving vehicle perception systems on tens of millions of examples of this kind of info, until the systems themselves can quickly recognize them, interpret them, and learn how to take evasive action.

A company like Scale, then, provides the foundational infrastructure for self-driving car tech. “Scale is basically providing the ground truth for our perception systems,” says Anantha Kancherla, who oversees the development of self-driving software at Lyft. “It’s a very, very critical piece for us to develop.”

The startup, which today announces an $18 million Series B funding round led by Index Ventures, officially began in 2016. Wang, now 21, dropped out of MIT’s computer science program to launch the company at Y Combinator. Two years in, Scale recently moved into an open-floor office in San Francisco’s techified SoMa neighborhood, the kind of place whose mismatched mugs, cheery, young, casually dressed employees, and fully stocked bar scream summer camp for sort-of grown ups.

These folks aren’t actually doing the labeling, though. That happens at home computers and in call center–like offices, mostly in Asia and Europe. Those workers mouse around camera images and 3-D lidar-generated maps collected by the car sensors. They draw boxes around cars, walkers, and cyclists. They ID certain pixels as road, not tire, or flesh, not steel. Or they double-check that Scale’s automated system has done all this properly by itself.
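The product of all that box-drawing is structured annotation data that perception systems train against. As a rough illustrative sketch only — these field names are hypothetical, not Scale’s actual schema — a single labeled bounding box on one camera frame might look something like this:

```python
# Hypothetical sketch of a 2-D bounding-box annotation, the kind of
# record a labeler produces for one object in one camera frame.
# Field names are illustrative, not any vendor's real format.
from dataclasses import dataclass

@dataclass
class BoxLabel:
    frame_id: str   # which camera frame the box belongs to
    category: str   # e.g. "car", "pedestrian", "cyclist"
    x: int          # left edge of the box, in pixels
    y: int          # top edge of the box, in pixels
    width: int      # box width, in pixels
    height: int     # box height, in pixels

def area(label: BoxLabel) -> int:
    """Pixel area of the labeled region, a common sanity check."""
    return label.width * label.height

# Two labels a worker might draw on the same frame.
labels = [
    BoxLabel("frame_0001", "pedestrian", x=412, y=230, width=40, height=110),
    BoxLabel("frame_0001", "car", x=120, y=260, width=220, height=140),
]
```

A perception team would aggregate millions of records like these — which is why mislabeling a car as a person at scale can genuinely confuse the downstream software.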

Here’s the silly part again: Sometimes, those contractors label the same sorts of data, over and over, for different Scale clients. When those workers see a particularly interesting corner case—a July 4th celebrant who’s had too much to drink, or an e-scooter, or the logs tumbling off the back of the truck—they don’t alert other clients about what their technology saw. That means self-driving car companies are spending hours and hours of work, and lots and lots of money, collecting and annotating what might be mostly identical road data.

Yeah, this lack of sharing amounts to a bad system, and the companies working on building AVs acknowledge as much. “It’s a little bit ridiculous that the same companies do almost the exact same annotation work,” says Beijbom, from Nutonomy. “It does feel very wasteful and suboptimal.”

It could also prove dangerous. If only one company’s cars are prepared for the falling logs, what happens when another company’s cars encounter them? “If you’re worried about your system missing edge cases, the ‘unknown unknowns’, then the more examples you have, and the more conditions the car encounters, the more opportunities you have to train the system to do a better job,” says Michael Wagner, co-founder and CEO of Edge Case Research, which helps robotics companies build more robust software.

Animation by Scale API

Scale might be a great platform to share these edge cases. Or another company might be. But only if autonomous vehicle companies can get over their paranoia about sharing data with competitors. Yes, experts say, it’s possible that a competitor could divine something about your particular, proprietary technology based on the data you collect about weird situations on the road. Still, Wang thinks that if autonomous vehicle companies get better about sharing the load on the relatively easy task of collecting and labeling, then they can start to compete on the very tough stuff: building a car that can use that data to safely drive itself anywhere.

It’s an ongoing project. “Even when these things are on the streets and they don’t have a driver inside, there’s going to be this constant effort to make them better and better and better,” Wang says. Which means, of course, that Scale is never out of a job.

1Correction appended, 8/7/18, 2:35 PM EDT: A previous version of this story misspelled Oscar Beijbom’s surname.
