Inside the company that gathers ‘human data’ for every major AI firm
The Scoop
In early 2022, Jonathan Siddharth, CEO of staffing firm Turing, drove up from Palo Alto for a meeting at OpenAI’s San Francisco office.
Siddharth thought he was there to pitch OpenAI on Turing’s specialty: finding high-quality software engineers to work as contractors. Instead, he found himself in a room full of AI researchers who wanted something entirely different: data.
OpenAI had found that adding bits of computer code into the datasets used to train its large language model, then GPT-3, helped increase the model’s reasoning ability.
That’s why its AI researchers wanted as much computer code as they could get, and they wanted it to be good. They asked Siddharth if he could raise an army of computer coders who would complete specific software engineering tasks so their work could be ingested into OpenAI’s next project: GPT-4.
“What I remember distinctly is the scale of their ambition in terms of how big they were thinking,” Siddharth told Semafor in an interview. “Their demand from us was like a crazy high spike in terms of how much data they wanted in how little time.”
Turing’s work played an important role in helping OpenAI make a massive leap in performance, one that blew away the competition and shocked the world when ChatGPT was released in November 2022, according to a former OpenAI employee with knowledge of the matter.
It also changed Turing’s business model. In the nearly three years since it began working with OpenAI, Turing has launched a suite of AI consulting services, and counts as clients nearly every major foundation model provider, as well as large companies that want to train their own AI models. The $1 billion startup has also moved beyond coding and is providing specialized data from a wide array of industries.
“It’s very impressive, the transition they pulled off,” said Quora CEO Adam D’Angelo, who is an investor in Turing and a board member at OpenAI (he wasn’t involved or aware of the first pitch meeting with Turing). “It’s almost like the first business just got them these capabilities internally around managing large numbers of contractors and engineers. And then that was just the perfect way to build that up to then be able to address this new market, which I think is ultimately going to be much bigger.”
Know More
It took a breakthrough in AI research to fully take advantage of the sheer volume of data on the internet. In what’s called “pre-training,” AI models like the ones that power ChatGPT can process massive datasets such as Common Crawl, which contains petabytes of information gathered from the web.
Once the models are pre-trained, they can reliably complete sentences in a way that sounds human simply by predicting the next word. But they aren’t very useful. In “post-training” and “fine-tuning,” the AI models are taught how to answer questions, rather than just complete sentences, and how to avoid inappropriate responses.
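To make “predicting the next word” concrete, here is a toy Python sketch. It is nothing like a real large language model, which learns statistical patterns across billions of documents rather than a simple word-pair table, but the core move is the same: learn what tends to come next, then generate text by predicting it over and over.

```python
from collections import Counter, defaultdict

# Toy illustration (not OpenAI's code) of the idea behind pre-training:
# learn which word tends to follow which, then generate text by
# repeatedly predicting the most likely next word.

corpus = "the model predicts the next word and the next word after that".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def complete(prompt: str, max_words: int = 5) -> str:
    """Extend a prompt one word at a time with the likeliest next word."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:
            break  # no continuation was ever observed for this word
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(complete("the"))  # -> "the next word and the next"
```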
In a method called supervised fine-tuning, models can then learn new skills by ingesting specialized data like the coding examples that Turing collected for OpenAI.
The idea is not to get the model to memorize the new data. Rather, the trick, and the challenge, is to get the model to “generalize,” so that it learns the underlying principles behind the data.
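Neither OpenAI nor Turing has published its training data schema, but supervised fine-tuning examples generically take the shape of labeled input and output pairs. Here is a minimal sketch of what one coding example might look like, stored in the common one-JSON-object-per-line (JSONL) convention; the field names are an assumption, not a documented format.

```python
import json

# Hedged sketch of a single supervised fine-tuning example. The schema
# is illustrative; the real pipelines are not public.
example = {
    "input": "Write a Python function that reverses a string.",
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}

# Fine-tuning sets typically hold many thousands of varied pairs like
# this, one JSON object per line, so the model picks up the underlying
# pattern rather than memorizing any single answer.
print(json.dumps(example))
```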
Siddharth believes this is the next frontier in improving AI models. But to get there, he says, the models need data that can’t be found anywhere on the internet.
When a client wants an AI model to get smarter in a specific area, Turing hires hundreds of subject-matter experts, who are asked to create “input and output pairs,” which read like inner monologues of questions and answers.
For instance, a chemistry expert might start by asking about a specific molecule, write an answer, then follow up with another question, and so on. The exchange might go back and forth ten or so times. The subject-matter experts Turing hires range from PhDs in neuroscience to sales professionals who can analyze forecasts.
The company might end up with thousands of these monologues, known in the AI industry as “multi-turn data.” Turing says such data is one of the keys to getting AI models to reason about and understand specific concepts.
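Turing hasn’t published its format, but a multi-turn record along the lines of the chemistry example above might look roughly like this; the fields and the wording of each turn are illustrative.

```python
# Illustrative sketch of one "multi-turn" training record. The
# role/content layout is a common convention for conversation data;
# this is not Turing's actual schema.
multi_turn_example = {
    "domain": "chemistry",
    "turns": [
        {"role": "user", "content": "What is the molecular formula of caffeine?"},
        {"role": "assistant", "content": "Caffeine is C8H10N4O2."},
        {"role": "user", "content": "How does its solubility in water change with temperature?"},
        {"role": "assistant", "content": "It rises sharply: caffeine is far more soluble in hot water than in cold."},
        # ...a real record might continue for ten or so turns
    ],
}
```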
Siddharth believes this process is key to AI models becoming agents that can carry out complex, multi-step tasks on their own.
This so-called “agentic” era of AI is not yet here, but Siddharth paints a picture of an AI model that can draw on all of its specialized knowledge. If it’s asked to analyze the top venture capital firms, for instance, it might draw on the knowledge of finance professionals to know what kind of data to look for. It might then use its coding knowledge to write a script that can go out and access the relevant data and convert it into the proper format.
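As a rough sketch of the loop behind such an agent, here is a deliberately simplified Python version. The model call and the tools are scripted stand-ins rather than any real API; the point is only the plan, act, observe cycle Siddharth describes.

```python
# Minimal sketch of an "agentic" loop: the model plans a step, calls a
# tool, and folds the result back into its context. Everything here is
# a hypothetical stand-in (a scripted fake model and toy tools).

SCRIPTED_PLAN = [
    {"tool": "fetch_financial_data", "args": "top VC fund returns"},
    {"tool": "run_python", "args": "rank funds by net returns"},
    {"tool": "finish", "args": "Ranked list of top venture firms"},
]

def call_model(context: str, step: int) -> dict:
    # Stand-in for an LLM call that decides the next action.
    return SCRIPTED_PLAN[step]

TOOLS = {
    "fetch_financial_data": lambda args: f"dataset for '{args}'",
    "run_python": lambda args: f"result of '{args}'",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    context = task
    for step in range(max_steps):
        action = call_model(context, step)
        if action["tool"] == "finish":  # the model decides it is done
            return action["args"]
        observation = TOOLS[action["tool"]](action["args"])
        context += f"\n[{action['tool']}: {observation}]"  # feed result back
    return context

print(run_agent("Analyze the top venture capital firms"))
```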
Reed’s view
Turing’s work with OpenAI and other companies helps demystify how these models are built.
We often look at these companies through a simplistic lens. AI companies hoover up the internet, run an algorithm and out pops ChatGPT.
There’s a big debate about the merits of copyright lawsuits and the wisdom of the licensing deals under which AI companies pay content providers like Reddit and The Financial Times for training data.
An entire ecosystem of AI infrastructure is emerging that will help any company train and fine-tune AI models for extremely specialized purposes.
As those models become more reliable, they may begin to work together to create the “agentic” future that so many people in AI are excited about. When AI models become useful agents, it will mark a turning point: they will go from flashy new technology to helpful tools for businesses and consumers.
We won’t know the outcome until new generations of models start to roll out to the public. But as AI models are infused with more of this specialized knowledge, it’s possible that new capabilities will emerge or reasoning will improve.