THIS WEEK IN AI
Candice Bryant Consulting
Strategic Intelligence & Public Affairs
THE DATA LAYER
Everyone knows AI models are trained on data. It's the invisible layer of Nvidia CEO Jensen Huang's five-layer cake.
But most people don't realize that a lot of that data is labeled by hand. Someone drew a box around every car in a photo. Someone decided what counted as a stop sign and what didn't. Someone flagged harmful and abusive content. The industry complements this human review with AI itself, using more advanced models to train less advanced ones.
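For the technically curious, here is a minimal, illustrative sketch (in Python) of what a single hand-labeled training record might look like. The field names are hypothetical, loosely modeled on common object-detection formats, not any one company's pipeline:

    # Illustrative only: one hand-labeled image record. A labeler drew
    # each box (x, y, width, height in pixels) and chose each class.
    record = {
        "image": "street_0481.jpg",
        "annotations": [
            {"bbox": [220, 145, 96, 54], "label": "car"},
            {"bbox": [402, 60, 28, 70], "label": "traffic_light"},
        ],
        # Was this checked by a person, or generated by a stronger model?
        "reviewed_by_human": True,
    }

Multiply that record by millions, and you have a training dataset.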
This week, I'm tracking the Pentagon's plans to train military AI on classified data, and what AI training actually looks like under the hood.
For those already across the headlines, skip to "What I'm Watching" for insights, including how all of us directly participate in the data labeling process.
PENTAGON MOVES TO TRAIN AI ON CLASSIFIED DATA — According to the MIT Technology Review, the Pentagon is discussing plans to set up secure environments where AI companies can train military-specific models on classified data. AI models currently used inside classified environments can answer questions, but they don't learn from what they see. CSIS's Aalok Mehta, who previously led AI policy at Google and OpenAI, noted this shift could allow models to identify subtle clues in an image the way an analyst does, or connect new information with historical context. He added that the biggest risk is an internal data spill within the department itself, but that "if you set this up right, you will have very little risk of that data being surfaced on the general internet or back to OpenAI." The Pentagon reportedly plans to first test how models perform when trained on unclassified data, like commercially available satellite imagery, before moving to classified material.
WHAT I'M WATCHING
You know those reCAPTCHA grid challenges that ask you to click on every crosswalk or traffic light?
It started as a way to prove you're a human and not a bot. But the researchers behind it realized that if millions of people were already looking at images and identifying what was in them, that information could be put to a second use. For years, when you clicked on the squares with traffic lights, crosswalks, or buses, you were also teaching a machine what a traffic light looks like. Most sites have moved on to invisible verification, but at its peak, people were solving 200 million reCAPTCHAs a day.
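Under the hood, turning millions of those clicks into training labels is mostly an aggregation problem: show the same tile to many people and keep the consensus answer. Here is a simplified sketch of that idea in Python, with made-up votes:

    from collections import Counter

    # Hypothetical answers from five people shown the same image tile.
    votes = ["traffic_light", "traffic_light", "traffic_light",
             "crosswalk", "traffic_light"]

    # Majority vote: the consensus answer becomes the training label.
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)

    print(label, agreement)  # traffic_light 0.8
    # Low-agreement tiles can be sent to more reviewers or discarded.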
Globally, hundreds of thousands of people work as data labelers. In 2006, computer scientist Fei-Fei Li saw what her colleagues missed: the bottleneck in AI wasn't better algorithms, it was better data. She built ImageNet, a database of more than 14 million images labeled by hand, and it became the dataset that kicked off the modern AI revolution. A decade later, Alexandr Wang dropped out of MIT and founded Scale AI on the same bet.
Now the gig economy is a data labeling workforce. Some companies use the Uber model, allowing workers to pick up labeling shifts whenever they're available. Uber itself launched a pilot in India letting its drivers label data between rides. Last week, DoorDash launched Tasks, a standalone app that enables its 8 million couriers to earn extra money by filming themselves doing household chores like washing dishes—to help robots understand the physical world. China, facing a data labeling talent gap of nearly 30 million workers, issued a national plan last year in an effort to become the world leader in data labeling by 2027.
Data is the invisible layer of the AI stack. From collection to refinement, it is core to AI development; by some estimates, preparing data accounts for 80 percent of the work in a machine learning project.
The best-known data quality problems involve the raw data itself: because much of it comes from the internet, it carries gaps in representation and biases that mirror those found in society. Less appreciated is that the labels layered on top of that data can be wrong too.
In 2021, researchers from MIT and AWS examined 10 of the most widely used datasets in machine learning—datasets the field had been building on for over a decade—and found that, on average, 3.3 percent of the labels were wrong. Frogs labeled as cats. Negative reviews marked as positive.
The datasets that had more human curation had fewer errors. The ones that relied more on automation had more.
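How do researchers surface these errors at scale? One common approach, in rough outline: train a model, then flag the examples where the model confidently disagrees with the shipped label and route those to human reviewers. A simplified Python sketch with made-up numbers:

    import numpy as np

    # Hypothetical: a model's predicted probabilities for 4 images
    # (columns: cat, frog), next to the labels the dataset shipped with.
    pred_probs = np.array([
        [0.95, 0.05],   # model is confident this is a cat
        [0.10, 0.90],   # confident frog
        [0.97, 0.03],   # confident cat...
        [0.55, 0.45],   # unsure, so we leave it alone
    ])
    given_labels = np.array([0, 1, 1, 0])  # 0 = cat, 1 = frog

    # Flag images where the model is confident AND disagrees with the
    # shipped label: candidates for human re-review, not auto-fixes.
    confident = pred_probs.max(axis=1) > 0.9
    disagrees = pred_probs.argmax(axis=1) != given_labels
    suspects = np.where(confident & disagrees)[0]

    print(suspects)  # [2] -> a "frog" the model is sure is a cat

Notice the last step: a person, not the model, makes the final call. That design choice is exactly the curation-versus-automation tradeoff the MIT and AWS findings point to.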
As the Pentagon begins scaling AI into classified environments, understanding this invisible layer is key. The reliability of these systems will ultimately come down to two things: clean, verified inputs, and a human in the loop on the other side.
— Candice
I hope you found this briefing useful. Please keep forwarding it to anyone else who might benefit. They can sign up here.
Note: I’ll be taking two weeks off for Spring Break—wishing everyone a great couple of weeks! Back April 7.