My new book THE LITTLE BOOK OF DATA will be published by HarperCollins Leadership on June 3. You can pre-order the book or audiobook now.
I recently met someone who’d tried to compete with Google at a pretty big scale. (I’ll skip the proper nouns to protect the innocent.) After a lot of probing about this man’s experience--which ended in failure--what he admitted was, “We didn’t realize how well Google knew how to run data centers.”
I was taken aback. Data centers! Those big buildings with stacks of servers in long rows. In the movies, they’re low-lit places (deep blue or sickly green, to make them seem less boring), and the spy always needs to sneak inside them to steal secrets inevitably stashed in a thingy that looks like a car radio. In real life, they occupy vast industrial buildings and cost a fortune in electricity, cooling, and protective gear for fire and flood.
“Data centers?” I managed. “That’s kind of a commodity business, ain’t it?”
He looked at me like I was an idiot.
“When you’re running a massive business, with billions of events, there are going to be errors. Thousands of errors. With the hardware, with software bugs. And if, every time that phone rings at 3AM with a problem, your team bumbles, doesn’t know how to handle it? They’re not skilled in dealing with the failure quickly and efficiently?” He looked at me hard. “You’re running up huge costs. You’re disappointing customers. You’re just… not competing. Data centers, in that business? It’s everything.”
To demonstrate the depth of his folly in underestimating data centers, he opened his phone and pulled up a single web page containing the sum total of his one win against Google.
It was—I say without judgement—pretty pathetic.
“Yeah,” he concluded wistfully.
DATA CENTER AS KITCHEN
In the first three months of 2025, OpenAI and Anthropic together raised $41 billion--a triumph for the application layer of the AI ecosystem. But a host of other companies you haven’t heard of--data center companies, also known as hyperscalers because they scale up according to client demand, among them STACK Infrastructure, Bridge Data Centres, Khazna, NetApp, Rowan Digital, Digital Edge, and others--raised over $15 billion, according to TechCrunch. In 2024, the total data center and data infrastructure space raised $90 billion--more than all AI application companies combined.
Which tells you: while the application front of the AI restaurant is a madhouse of customers hungry for AI applications, in the back, the data centers are frenzied kitchens cooking up tokens for consumption. The activity of the AI applications and the activity of the data processing--like the front and back of house at a restaurant--rise and fall in proportion to each other.
To grasp how important processing is to the AI boom, consider that in the five years from 2018 to 2023, data centers’ share of total US energy consumption doubled (to 4%), and it is expected to double again in the five years from 2023 to 2028. If global data center energy use were a country, it would now consume only slightly less electricity than France.
But what about the data?
WHY HAVE WE IGNORED DATA?
Oddly, after ten years of Big Data hogging the nerd spotlight, we now ignore data—treat it as a given. Why?
Perhaps because Big AI simply steals its data—OpenAI scrapes data for free, or buys copyrighted web content in bulk (not from the copyright holder). Meta goes even further and steals over 100,000 books—including a gothic novel about a blood-vomiting ghost I published at HarperCollins in my younger days.
Alarmed, the United States Copyright Office just published the third in a series of extensive reports on AI and copyright, in which it weighs the extent to which AI’s use of copyrighted material is “transformative”: does the AI-generated material compete with the original, or is it an unrecognizable, and hence (according to some) harmless, use?
So does data--now reduced to the role of “training data”--still matter?
A PLACE IN THE LOVE TRIANGLE
I believe we are at a turning point for the role data plays in the ecosystem. Data is finding its place in the love triangle of AI, data centers, and data itself.
If AI is the restaurant dining room, and the data center is the kitchen, then data is--oh yeah--the food.
So why is data so late to the AI spotlight?
The challenge with large language models--LLMs have been the heroes of our AI stories--has been to train them to use language, and to use it perfectly. A massive, wondrous achievement.
For this purpose, bulk web pages, news, and books are perfect, just as they would be for training a person to read and speak. We have trained our models through the high school level, even university.
But our next phase of AI is pointed towards specialist applications. Grad school. If the Internet era represented a boom in distribution, AI provides the equivalent for something arguably a bit less sexy: namely, tasks. Workflow.
FIRST DAY OF A NEW JOB
There is Evariste, which uses AI to discover oncology targets; KappaSignal, which performs stock analysis; Trunk Tools, which automates construction workflows; and my favorite, SwineTech, which uses sensors to speed the work of pig farms (and yes, their product is called PigFlow; I mean, buy those guys a drink). Plus thousands of others.
Gothic novels with clever plot twists and vivid descriptions won’t be much help to the founders of these enterprises. Each is attacking the inefficiencies of a specific workflow in a specific sector. And there is nothing more intimate, and specialized, than a workflow. One sympathizes with the robots: for them, it is always the first day on the job, the new employee’s lost and gormless feeling of learning all the acronyms, what form goes where, the department names and functions, which markers are dry erase and which will stain the wall and make you that guy.
In this world, the deeper, richer, and more complex the data, the better, because it gets you to the hard part of a task. The new employee’s nightmare is the AI’s dream: using databases that are useful to only a few thousand people in the world; training a model to perform the narrowest of tasks. Each task may be excruciatingly complex, or require a decade of expertise.
And that’s the point. Humans will be spared the agony. The robots will do the excruciating parts.
DATA HUNTERS AND GATHERERS
In such a world, hunger for specialized databases will only rise.
We can understand this by following the tracks of entrepreneur-turned-investor Travis May. Adtech veterans remember May as the former CEO of LiveRamp, a data hub for digital advertising. He went on to found Datavant, a data hub for healthcare.
Both businesses had a similar structure: set up a platform for sharing and selling data. Create standard tools. Give buyers a deep menu to choose from; help sellers set prices, invoice clients, get paid. Above all, make the privacy and the technology integrations simple.
A farmer’s market for data. Bring together the suppliers and the buyers.
And what is Travis May doing now? Investing in companies that supply specialist data for AI. Protege (led by a former May exec from Datavant) supplies AI modelers with imaging data from CT and MRI scans, genomic sequencing data, and claims data.
Another pair of May company alumni (from LiveRamp) founded Above Data, which takes a different approach: creating an automated translation layer to clean and harmonize any datasets an organization uses, allowing those datasets to talk to each other--and to AI models.
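To make the idea of a translation layer concrete, here is a minimal, hypothetical sketch in Python (using pandas): two sources describe the same kind of record with different column names, date formats, and units, and a small mapping step rewrites both into one canonical schema. The column names and conversions are invented for illustration; this is not Above Data’s actual product or API.

```python
import pandas as pd

# Two hypothetical sources describing the same kind of record,
# with different column names, date formats, and units.
source_a = pd.DataFrame({
    "PatientName": ["Ada Lovelace", "Alan Turing"],
    "DOB": ["1815-12-10", "1912-06-23"],         # ISO dates
    "Weight_lbs": [130.0, 170.0],                # pounds
})
source_b = pd.DataFrame({
    "full_name": ["ada lovelace", "Grace Hopper"],
    "birth_date": ["10/12/1815", "09/12/1906"],  # day-first dates
    "weight_kg": [59.0, 62.0],                   # kilograms
})

def harmonize(df, mapping, dayfirst=False, to_kg=1.0):
    """Translate one source's schema into a shared canonical schema."""
    out = pd.DataFrame()
    out["name"] = df[mapping["name"]].str.strip().str.title()
    out["dob"] = pd.to_datetime(df[mapping["dob"]], dayfirst=dayfirst)
    out["weight_kg"] = df[mapping["weight"]] * to_kg
    return out

canonical = pd.concat([
    harmonize(source_a,
              {"name": "PatientName", "dob": "DOB", "weight": "Weight_lbs"},
              to_kg=0.4536),
    harmonize(source_b,
              {"name": "full_name", "dob": "birth_date", "weight": "weight_kg"},
              dayfirst=True),
], ignore_index=True)

print(canonical)  # one consistent table that analysts -- or models -- can use
```

The point is not the code; it is that once the mapping exists, every downstream consumer, human or model, gets the same clean table.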
PUBLIC DATA
Then there’s the public data. In biology, an organization called the Arc Institute just published a model based on a dataset of 9.3 trillion DNA base pairs drawn from across virtually every living species. And some extinct ones.
Over decades, the Protein Data Bank has gathered 1 terabyte of “Structure Data for Proteins, DNA, and RNA,” and has given rise to AlphaFold and a Nobel Prize.
The NASA Exoplanet Archive is an online service for data on all the known planets beyond our solar system (about 5,000 of them, by the way).
The world is hungry for data, greedy to ask it questions. Our storage and compute costs are now so low that there may be no dataset so complex (thousands of images of diseased organs) or so wide (genomic data for all of life) that we cannot put it to work.
THE NEW PHASE
Certain characteristics of this new phase may be instructive as we enter it and look for data opportunities:
Collaboration. An organization called TRACK-TBI gathers data on traumatic brain injury to help doctors diagnose and treat TBI. Before, TBI patient data was siloed among separate groups treating sports, military, and civilian injuries. Now the groups collaborate, and have created a single large database that benefits all. These collaborations can be called “data co-ops” and are not a new idea. But the speed, scale, and security of data sharing today can make previously unimaginable collaborations possible. What data, gathered across different stakeholders, might improve your organization’s ability to succeed?
Privacy and Security. The main trick of Datavant has been to “tokenize” and anonymize healthcare data. If data can be unbreakably anonymized and encrypted, and queried using “data clean rooms” without the data even leaving its home environment, then in theory there is nothing that can’t be shared and queried for its signal (a minimal sketch of the tokenization idea follows this list). What questions would you want to ask a dataset that you previously thought was too difficult to access?
Middlemen. In adtech, LiveRamp was the quintessential middleman, taking a nickel every time an advertiser wanted to use a dataset for ad targeting, or any time one party wanted to share data with another. We all used to complain about it: it was a “tax,” we said; a rent, in the language of economics. I withdraw my complaints. I believe we want middlemen to have an incentive to hunt down and gather datasets from far-flung sources, and make them convenient and accessible to training models. What opportunities do you see to gather up valuable, related data?
Make Money on Intellectual Property. If we create efficient markets for intellectual property (IP), then we will also discourage the piracy-at-scale we see underway now. Not only will middlemen get their nickel every time the IP is used; the original contributor will make a dollar. This, I hope, will create a structure for, call it, IP micropayments--where I am perfectly happy for Llama to learn from my gothic novel, as long as I get a fair price that I set. A hundred thousand novelists and illustrators and artists and scientists and researchers are unlikely to rise up and create a tech platform to sell their wares; but one hopes a middleman will smell an opportunity for a billion nickels. How can you license data for training your models?
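To make the “tokenize and anonymize” idea from the Privacy and Security item above concrete, here is a minimal sketch, assuming the common approach of replacing direct identifiers with keyed (salted) hashes. Two parties holding the same secret key can link records about the same person without ever exchanging names or birthdates. This illustrates the general technique only; it is not Datavant’s actual method.

```python
import hashlib
import hmac

# A shared secret key (in practice, carefully managed and never published).
SECRET_KEY = b"example-shared-secret"  # hypothetical value, for illustration only

def tokenize(name: str, dob: str) -> str:
    """Replace direct identifiers with an irreversible, keyed token.

    The same (name, dob) always yields the same token, so two datasets
    tokenized with the same key can be joined -- without either side
    revealing the underlying identifiers.
    """
    normalized = f"{name.strip().lower()}|{dob}".encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

# A hospital and an insurer each tokenize their own records locally...
record_a = {"token": tokenize("Ada Lovelace", "1815-12-10"), "diagnosis": "migraine"}
record_b = {"token": tokenize("ada lovelace ", "1815-12-10"), "claim_usd": 420}

# ...and can later join on the token inside a clean-room environment,
# with no raw names or birthdates ever leaving their home systems.
assert record_a["token"] == record_b["token"]
print("Records linked on token:", record_a["token"][:12], "...")
```

A real deployment adds much more (key management, fuzzy matching of misspelled names, clean-room query controls), but the core move is the same: the identifying data never leaves home; only the token travels.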
To hammer home the ultimate importance of data: in this new world, there will be an increasing incentive for organizations to create and keep their own data. Private data stores will be the gold reserve, the core, of organizations--used to train models locally, on applications solely for the use of the organization itself, and of course its clients and stakeholders.
Yes, doing so will keep their data out of the general flow--out of the data economy. And it will require companies to re-create, within their walls, the skills and infrastructure of data curation. The skills and services to store data, and even to use AI models, however, will be acquirable, outsourceable. The data is not. The data is what makes the organization’s application unique--it will represent the organization’s knowledge, what that organization has to say to the world--and it will be a critical source of competitive advantage.
Disclosure: I am an angel investor in Above Data.