How to Predict a Bestseller

Literary theory is not a field that creates many bestsellers. Biographies of Shakespeare will always have a market, and now and then a work like Camille Paglia’s Sexual Personae rides a wave of controversy. But literary research is hard to popularize: Malcolm Gladwell doesn’t write page-turners about narrative theory.

The Bestseller Code tries to be an exception to that rule: a popular work about recent literary theory, sold to the trade rather than the classroom. Admittedly, it describes a weird kind of theory, one that has more in common with Moneyball than with Jacques Derrida. The authors, Jodie Archer and Matthew L. Jockers, have worked in publishing as well as the academy, and they have developed an algorithmic model that claims to predict which novels will make the New York Times bestseller list. When tested on books from the past thirty years, their model is right 80 percent of the time.

Playing on readers’ curiosity about that startling accuracy, The Bestseller Code lures them into a broader conversation about the meaning of literary success. What do readers really want from fiction? What plots, characters, or themes give stories the widest appeal?

The authors discover, for instance, that sex doesn’t sell as well as one might think. Fifty Shades of Grey was an exception; and even Fifty Shades may owe its popularity less to explicit sex than to its emphasis on “human closeness,” and what the authors call “an emotional roller coaster” of a plot. No single theme or genre guarantees sales. But novels are more likely to succeed if they concentrate on only a few themes, use a conversational style, and alternate regularly between intense experiences of happiness and despair. It also doesn’t hurt to include female characters who forcefully express their needs.

All of this is intriguing. Archer and Jockers don’t claim to be able to teach people how to write bestsellers, but aspiring novelists who thumb through the volume will find plenty to think about. I doubt The Bestseller Code will be quite as popular as its namesake The Da Vinci Code. But I could be wrong: no one has created a model yet to predict the sales of nonfiction.

I can confidently predict, however, that my fellow literary scholars will hate the book. Popularizations of academic research are rarely popular with academics, and this one has the additional handicap of seeming to embody an external pressure toward empiricism that the discipline especially dislikes. Literary scholars are not yet convinced that numbers really illuminate our subject. We tend to be frustrated when researchers from other fields make sweeping quantitative arguments about literature.

Archer and Jockers both have doctorates in English literature, and they have woven a lot of textual and biographical detail into their story. But no amount of nuance would make this argument less controversial. Opinions are already polarized. The book’s central premise—that literary success can be explained with numbers—feels like the kind of simplification that outsiders crave, and that English professors are sworn to resist.

The polarization is unfortunate, because this book actually represents an enormous opportunity for literary research. Why readers like what they like is a question that critics have long tried to answer. And the authors of The Bestseller Code are not pulling a fast one: it is absolutely possible to create statistical models that predict which books will succeed with a given audience. Archer and Jockers were the first people to show that this could be done with bestsellers, but they’re not the only people to have tested the method. Jordan Sellers and I have shown that it’s just as easy for a model to identify the volumes of nineteenth-century poetry that elite magazines considered important enough to review.

Prediction is therefore not limited to the sort of mass-market works we might expect to be formulaic and predictable. The Bestseller Code uses John Grisham and Danielle Steel to dramatize methods that can also help us understand other forms of literary success, and compare forms of success to each other. What does The New Yorker want that the mass market doesn’t? How did readers’ preferences change from 1920 to 1980? Which books published in 1950 might have done better if they had appeared a few decades earlier or later? Questions like these open up new horizons for literary history.

The book is vulnerable to a range of objections—some based on misunderstanding, and some valid. Archer and Jockers have addressed the circularity that may look like an obvious flaw (“how can you ‘predict’ things that already happened?”) by testing their model on evidence it hasn’t seen. The model is never shown examples of Danielle Steel’s success before making predictions about her novels, for instance, and it has successfully predicted bestsellers that were published after its creation.

It is true, on the other hand, that quantitative methods are better at prediction than explanation. A model may tell you that a book is likely to sell, because it resembles other bestsellers in using the verb need and avoiding exclamation points. It can be harder to explain why those details make a difference. In The Bestseller Code, for instance, verbs are treated as signs of character. Since the characters who need things are often women, need gets linked to Lisbeth Salander—of The Girl with the Dragon Tattoo—and to recent enthusiasm for troubled female protagonists. For all I know, this is right. But explanatory categories are slippery. Are verbs the same thing as character? Are alternations of mood the same thing as plot? Causality is even slipperier. Do stories without exclamation points show up on the bestseller list because readers like a “deadpan tone,” or because editors at influential firms remove exclamatory dialogue from manuscripts they intend to promote? Strictly speaking, we don’t know.
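To make the distinction between prediction and explanation concrete, here is a toy sketch of the kind of reasoning the review describes. This is emphatically not Archer and Jockers’s model: the two features (rate of the verb “need,” rate of exclamation points) are borrowed from the paragraph above, but the scoring rule and the example sentences are invented for illustration. Note that the rule can “predict” on text it has never seen without explaining *why* the features matter.

```python
# Toy illustration (not the authors' actual model): score a text on two
# surface features the review mentions, then apply an invented rule.

def features(text):
    """Return (rate of the word 'need', rate of exclamation points)."""
    words = text.lower().split()
    need_rate = sum(w.strip(".,!?\"'") == "need" for w in words) / len(words)
    bang_rate = text.count("!") / len(words)
    return need_rate, bang_rate

def predicts_bestseller(text):
    # Hypothetical rule: 'need' is a positive signal, exclamation
    # points a negative one. The rule predicts without explaining.
    need_rate, bang_rate = features(text)
    return need_rate > bang_rate

# Held-out examples the rule was never tuned on:
print(predicts_bestseller("She knew what she would need, and she said so plainly."))
print(predicts_bestseller("What a twist! Amazing! Unbelievable!"))
```

Even in this cartoon version, the gap the review identifies is visible: the rule classifies unseen sentences, but nothing in it tells us whether readers favor “need” because of troubled protagonists, deadpan tone, or editorial habits at big publishing houses.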

Memorial Edition of Harriet Beecher Stowe’s bestselling Uncle Tom’s Cabin, 1897. Photograph courtesy of the Schomburg Center for Research in Black Culture / New York Public Library

To write a lively trade book, Archer and Jockers have had to make the task of explanation sound easier than it is. So skeptical readers will find plenty of questions to raise, and should raise them. The alternative explanations glossed over in a popular work should be spelled out in scholarly debate, where all evidence is open to inspection.

But ten years from now, we won’t remember whether The Bestseller Code was wrong about exclamation points. We’ll remember whether it sparked a new kind of inquiry about literature. In fact, the patterns revealed in this book could transform literary history even if all the authors’ explanations for those patterns turned out to be wrong. This lag between discovery and explanation is one reason why algorithms are unlikely to displace traditional critical methods. Scholars tend to worry that numbers will promise simple answers to old interpretive puzzles. But that’s rarely what they provide. Instead, quantitative methods turn up evidence that prompts new questions. The evidence may be framed on a larger scale than we’re used to—encompassing thousands of books and millions of readers. But patterns still have to be interpreted, and the preferences of a million readers can be just as puzzling as the meaning of a single book.

Featured image: Release party for Harry Potter and the Deathly Hallows in Sunnyvale, California, 2007. Photograph by Zack Sheppard / Wikimedia Commons