The Encyclopedia Project, or How to Know in the Age of AI

In this miniseries, Seeing Past the Tech, social scientist Janet Vertesi un-blackboxes the systems we call “artificial intelligence.”
In an age when AI regurgitates the blather of meaningless content, seeking its audience in the attention marketplace, it's small wonder that it is hard to tell what is really real anymore.

It all started with Kung Fu Panda. As the final line of credits crawled up the screen, the kids bounced excitedly on the couch. With their heads full of the impossibilities of talking pandas and dragons common to today’s cartoons, my youngest asked a fair question: “Is kung fu really real?”

“I wanna know the history of kung fu,” the eldest chimed sagely. This was music to the ears of my husband, who had been waiting for an entrée to introduce them to martial arts. “Yes, kung fu is real,” he explained. “And it has a long history. Let’s look it up.”

And that’s when our family movie night took a turn toward dystopia.

Our kids aren’t allowed to just look things up online themselves. Like many parents, we guide them through trying to find something trustworthy, so as to avoid online sludge. We have an especially draconian approach to YouTube. We search through multiple layers of proxies, private browsers, and peer sites, so that Google can’t infer who we are or what we like. So when my husband typed in a few search terms and scrubbed through several clips before settling on one, we had some confidence in our due diligence. It certainly looked like a comprehensive introduction to kung fu.

At first, a male voice droned over a flurry of images. Thirty seconds in, my husband whispered, “I think this text is AI generated.” Fifteen seconds later, I whispered back, puzzled, “I think the voice is AI generated too.” Then, in the foreground, we spotted six fingers on a character. As we rushed to the controls, a picture filled the screen; it looked for all the world like a painting of a temple (just like the one in Kung Fu Panda). But this temple was going up in flames, while monastic fighters paraded unperturbed before the inferno. “Why is the building on fire?” our youngest panicked.

“Because,” my husband said calmly as he turned the machine off, “it’s not a real building. It’s something a computer made up, what it thought we wanted to see.”

I shook my head. This was a deep game. There was, as yet, no single button to AI-generate an entire video. We had hoped the length and variation of our selection might offer some assurance. Instead, somebody had assembled the pieces, generating each in turn and stitching them together by hand, aware of their ruse as they chased clicks. I drew a long breath.

“That’s it,” I announced. “We’re getting an encyclopedia.”


Since the release of automated “generative-AI” services—like ChatGPT, Gemini, DALL-E, Stable Diffusion, and Midjourney—the information superhighway has been buried under a deluge of machine-produced debris. Photos feature humans and animals with impossible anatomy, disjointed words reminiscent of text, illogical architecture. Stochastic parrots peck at whatever word or pixel is most likely to come next. Our feeds are replete with fool’s gold, which even expert users have difficulty discerning from the real thing.
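To see how little truth figures in the mechanism, consider a deliberately tiny sketch (mine, not any vendor’s actual model): a bigram table over an invented corpus, from which a “parrot” samples whichever word most often came next. Nothing in the loop checks whether the output is real; statistical plausibility is the only criterion.

```python
import random

# A toy "stochastic parrot." The table and its probabilities are
# invented for illustration; real models learn billions of such weights.
bigrams = {
    "kung":  {"fu": 0.9, "pao": 0.1},
    "fu":    {"panda": 0.6, "master": 0.3, "fighter": 0.1},
    "panda": {"eats": 0.5, "fights": 0.5},
}

def next_word(word):
    """Sample the next word in proportion to how often it followed `word`."""
    candidates = bigrams.get(word)
    if not candidates:
        return None  # the parrot has nothing left to say
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return random.choices(words, weights=weights)[0]

def parrot(seed, length=4):
    """String together likely words, one peck at a time."""
    out = [seed]
    while len(out) < length:
        word = next_word(out[-1])
        if word is None:
            break
        out.append(word)
    return " ".join(out)

print(parrot("kung"))  # e.g., "kung fu panda eats": plausible, never checked for truth
```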

Google searches, such as they now are, turn up chaotic and irrational answers, pleasantly presented as if for our edification. Famously, a search for “African country starting with K” brought this unintelligible response (drawing from Google’s ingestion of an AI-summarizing website, Emergent Mind, itself relying on user posts at Hacker News): “While there are 54 recognized countries in Africa, none of them begin with the letter ‘K.’ The closest is Kenya, which starts with a ‘K’ sound, but is actually spelled with a ‘K’ sound. It’s always interesting to learn new trivia facts like this.” Upon ingesting satirical posts from The Onion, McSweeney’s, and Reddit, Google search exhorted people to thicken tomato sauce with glue and eat their daily serving of rocks. The company that made its fortune helping us wade through the internet is now drowning in its own generated nonsense.

It’s not just Google. Large language models (LLMs) have invaded scientific papers, too. Succumbing to the need to “publish or perish,” authors and editors are doing us the favor of showing just how much of their papers was generated by ChatGPT, sometimes through their simple neglect to proofread before publication. A few even make up scientific images and figures. Meanwhile, AI-generated books are flooding Amazon. If you happen to buy a book on mushroom hunting written by a machine, you’ll only know the statistical likelihood of the words in the description, not whether what you’re munching will actually kill you.

Against this backdrop, my child’s innocent question—“Is it really real?”—rings stronger than ever. No, darling, it is not. None of it is. At least, I can’t guarantee that it is real or true. More importantly, almost no human or machine can anymore, either.


And therein lies the problem with the internet, a problem that we have thus far mischaracterized as one of “ethics,” “misinformation,” or “search,” all efforts to treat the symptoms and not the disease. The problem of the internet is the problem of epistemology.

We are familiar with many questions of the “ethics” of AI, thanks to billionaires like Elon Musk and science fiction movies like The Terminator. Ethics is the branch of philosophy that asks, “What is good?” Ethicists ask questions like, “What is ethical action? Which of these competing actions should we choose? And which choice ensures the greatest good for the most people?” Such questions, we hope, help us uphold ethical principles as we build new technologies.

Epistemology, meanwhile, is a sister branch of philosophy concerned with knowledge. Epistemologists ask questions like, “What is truth? What do we know? And how do we know what we know is true?” Or, in colloquial terms, “Is it really real?”

There are many ways to approach this problem of knowledge. Plato famously explained that everything we thought we knew was a form of shadow play on a cave wall—a mere trace of what was really real, located just beyond our consciousness. In the 20th century, philosophers tried other explanations. They latched onto the idea of knowledge as “justified true belief” and trumpeted the logic of the scientific method to distinguish “real” science from Nazi science as fascism marched across Europe. These approaches often assume a single right answer that we can know, if only we adopt the right methods or tools.

Another way to approach this problem is to admit multiple experiences and paths to knowledge. Battle-worn from the endless deathly trenches of the Great War, Ludwig Wittgenstein famously employed optical illusions like the duck-rabbit to show how rational people could come to hold irreconcilably different views of the same reality. Later, standpoint epistemologists built on this idea: because people in different social positions have different experiences of reality, objective knowledge arises from mapping out these multiple justified positions, looking beyond the shadows to ascertain what is really going on. Sociologists of knowledge, like me, examine how different social groups come to legitimate different ways of understanding, often to the exclusion of others.

What all these philosophies have in common, across so many centuries of inquiry, is a central preoccupation with what we know and how we know it. We think about how different groups construct truth claims; how we might tell the difference between illusion and reality; how different social facts circulate, clash, or compete; and why, in contests over truths, those who hold more power often gain the spoils. Moreover, we think about knowledge and our responsibility to uphold it, wherever it is found.

This was not the preoccupation of those who built the internet, or the generative AI systems that feed upon it.


As initially envisioned under Defense Advanced Research Projects Agency (DARPA) funding, the internet treated all “information”—indeed, everything that passed through its faceless, decentralized nodes—as equal. The original intention was not to give any node or “packet” more weight in the system than any other: a weighted node would make an easier target for adversarial attack, and the internet was designed to keep sharing information even in wartime. Any packet was just like any other, merely finding a locally optimal path from A to B, across equally possible pathways.
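To make that design concrete, here is a toy sketch (mine, not actual ARPANET code) of the equal-treatment principle: breadth-first routing over an invented five-node network, in which a packet’s content plays no role whatsoever in how it travels.

```python
from collections import deque

# A toy decentralized network: each node knows only its neighbors.
# The topology is invented for illustration.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def route(packet, src, dst):
    """Find a shortest path from src to dst by breadth-first search.
    `packet` is deliberately unused: the network cannot weigh what a
    packet says, only where it is going."""
    frontier = deque([[src]])
    seen = {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for hop in network[path[-1]]:
            if hop not in seen:
                seen.add(hop)
                frontier.append(path + [hop])
    return None  # unreachable

print(route("an encyclopedia entry", "A", "E"))  # ['A', 'B', 'D', 'E']
print(route("six-fingered nonsense", "A", "E"))  # the very same route
```

Truth, falsehood, and gibberish all ride the same rails at the same priority; that neutrality was the point.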

With the commercialization of the internet, all information packets remained equal. Yet to this wartime technology of optimizing packet switching was added a new metaphor: the marketplace of ideas. Information, as Silicon Valley guru Stewart Brand famously argued, wants to be free. In this novel metaphorical mechanism, each of these free ideas would compete equally for our attention, our time, and our money. Like the mythical free market—it was naively assumed—the best ideas would win. Such an epistemic regime was perhaps the most accurate reflection of the beliefs and worldview of contemporary America: a system of knowledge governed by an invisible hand, tended by curators just enough to induce a profitable margin along the way.

Yet knowledge is not a market commodity. Moreover, “justified true belief” does not result from an optimization function. Knowledge may be refined through questioning or falsification, but it does not improve from competition with purposeful nonknowledge. If anything, in the face of nonknowledge, knowledge loses.

Nor can we say that beliefs earning the most praise and attention from online communities are either justified or true. From Pizzagate to Redditors’ misguided attempts to find the Boston bombers, collective information work amid a free marketplace of ideas can go horribly wrong. Interrogation of the world through organized, methodical, meaningful mechanisms—the scientific method or otherwise—falls prey to shadow ways of knowing and methodological imposters. When all information is flat—technologically and epistemologically—there is no way to interrogate its depth, its contours, or its utter lack thereof.

Instead of being organized around information, then, the contemporary internet is organized around content: exchangeable packets, unweighted by the truthfulness of their substance. Unlike knowledge, all content is flat. None is more or less justified in ascertaining true belief. None of it, at its core, is information.

As a result, our lives are consumed with the consumption of content, but we no longer know the truth when we see it. And when we don’t know how to weigh different truths, or to coordinate among different real-world experiences to look behind the veil, there is either cacophony or a single victor: the loudest voice wins.

Certainly, there are online knowledge projects that have garnered acclaim, such as Wikipedia. Like any community-driven knowledge project, Wikipedia’s knowledge map skews heavily toward some topics and not others. Wikipedians have also developed their own local strategies for discerning truths from falsehoods, for ejecting malfeasance and backing up their online claims. Yet Wikipedia is at least clear about what it is and what it is not.

Unlike Wikipedia, the rest of the web subsequently fell prey to search engine optimization (SEO), ranking technologies, and algorithmic amplification, all of which merely promoted the most promotable, the most profitable, and the most outrageous. None of these superlatives is a synonym for knowledge.


In the midst of global wars and propaganda campaigns—when it is more important than ever to be informed—the systems that bring us our “information” can’t measure or optimize what is true. They only care what we click on.

The nail in the coffin is what is currently sold to us as “Artificial Intelligence.” This is neither intelligent nor entirely artificial, yet it’s pumping automated content into the internet more quickly than you can fire an editorial office. No system predicated on these assumptions can hope to discern “misinformation” from “information”: both are reduced to equally weighted packets of content, merely seeking an optimization function in a free marketplace of ideas. And both are equally ingested into a great statistical machinery, which weighs only our inability to discern.

The result is a great torrent of tales “told by an idiot, full of sound and fury, / Signifying nothing.” The role of humans in this turbulence is not to attend with wisdom, but merely to contribute, and to consume.


It took two weeks for the World Book Encyclopedia to arrive at our doorstep in two thick cardboard boxes. The spines of the full 24-volume set are decorated with a futuristic spacescape: an inviting swirl of purple, turquoise, and pink that beckons us to ask, open, and learn. “Look it up,” I now explain, means “look through these pages.”

Online searches are banished from the dinner table and school projects alike. Instead, my children can go to the encyclopedia for any question they can think of. And I promise to read them any entry they choose. They are fast learners, soon navigating the obscure alphabetical sorting of knowledge and the index volume, hopping easily from one topic to another. One bleary Saturday morning I find several volumes cracked open on the couch, my eldest nestled among them. “I’m just looking up Dubai,” comes the explanation.

My goal is not to introduce them to an antiquated or elitist form of knowledge. Alongside thinking about what is just, I want my children to learn how to think about what we think is true, and to investigate where knowledge comes from.

To do so, we will have to talk about what knowledge is and the many forms it takes. We will have to talk about how to tell truth apart from bluster, as well as the role of trust in judging truth claims. And we will have to talk about how we work together to try to figure out what is real and what is right.

In this house, the encyclopedia is our first resource. When we visit Wikipedia, we look at the Talk and Edit pages to peel back the invisible labor, documentation, and social work that undergirds each entry. We talk about what is made up—by people and by machines—and how hard it is to tell the difference between what looks real and what is really real.

In the meantime, I’m reading the encyclopedia. And I’m enjoying the astonishing novelty of perusing otherwise hotly divisive topics absent the familiar sense of being inflamed with rage. If not for Kung Fu Panda, I would never have realized—after 30 years of surfing the information superhighway—that it was possible to feel so, well, informed.

Like the impossibilities of talking animals in our cartoons, the internet is now full of impossibilities too: the result of talking machines, whose babble was trained on a flat content space, in which knowledge was never a factor and from which knowledge can therefore never rightfully emerge. This is not justified true belief; it is not even multiple sides of the same problem. Instead, generative AI regurgitates the blather of meaningless content, seeking its audience in the attention marketplace. Small wonder that it is hard to tell what is really real anymore.

Unfortunately, just as the internet sinks under the weight of this content, our libraries, too, are sinking, underfunded and under siege. Whose truths we encounter in these libraries is so contested we have chosen to limit our students’ access to knowledge in its many forms, denying them the ability to stand among truth claims and look past the shadows. Consequently, students today often have no choice but to turn to the internet for their “research.” Once there, they lack the skills or nuance to tell right from wrong.

That’s why teaching the next generation of citizens and software engineers to tell right from wrong is not merely an ethical project. It is also an epistemological project, and it has been missing from Silicon Valley since day one. This absence has twisted our search engines and our database architectures, our communities and our politics. Without an information topology, we are adrift in content, attempting in vain to navigate a cascade of absurdities without a compass. Entranced by our devices, we swipe past one playful shadow after another with no means to ascend beyond, seeing kangaroos in the duck-rabbit, believing the impossible and the unjustifiable. Beset by misinformation and mechanical hallucinations, we—and our machines alike—are increasingly unable to tell the difference.

And yet, there are ways to tell a right from a wrong, to tell forms of knowledge apart from their imposters, to hold multiple truths at the same time with confidence, to know with some certainty what is really real. But not all content is really real, and no matter how many forums our large language models ingest or artists’ images Stability AI steals, what will emerge can never be knowledge.

It is high time we returned to these methods and these questions, to the thousands of years of information management and knowledge exchange that transmitted not merely facts or content, but an appreciation for what it takes to surface truths. This does not need to be a colonial or reductionist project. Today’s knowledges are plural, distributed, emerging from many locations and peoples, each with unique methods and grounding forces. This also does not mean that anything goes. The challenge is to listen to each other and to integrate among conflicting perspectives with grace and care, not to shout louder. As our lives are increasingly infected with blundering, plundering AI systems and their hallucinatory streams, we must learn to assess not accept, synthesize not summarize, appreciate not appropriate, consider not consume.

When we build new technologies, ethical questions—like “what is good?”—are always essential. Yet these are not the only questions at stake, nor do they point to the sole cause of today’s online turmoil. Arguably, our contemporary tech landscape urgently demands that we return afresh to another of the oldest questions of all: “What is really real?”

This article was commissioned by Mona Sloane.

Featured image: Encyclopedias at the New York Public Library by Clay Banks / Unsplash (Unsplash License).