Authorship attribution is helpful if you suspect fraud: for instance, if you believe that Shakespeare wasn’t educated enough to write the plays, or that Charlotte Brontë’s Jane Eyre was really written by her brother, Branwell. It’s also helpful if authorship is unknown and you want to assign credit, say, for the Neapolitan Novels of “Elena Ferrante”—or if you want to know whom to blame. In the example I discuss below, the Romantic poet-philosopher-critic Samuel Taylor Coleridge wanted very much to know who blasted his 1816 volume, Christabel; Kubla Khan, a Vision; The Pains of Sleep, in the Edinburgh Review. Though Coleridge had opinions about who wrote the review, no one in the last two hundred years has been able to identify that person with certainty.
The basic question of authorship attribution—who wrote what—is a question we’re increasingly able to answer, using computers that employ stylistic analysis. But this process also creates new questions about the role of the author: not about whether it’s the author or reader who makes meaning from a text, but what it means to write something at all.
In some naive algorithmic approaches that know nothing of the author—or, indeed, of history, ideology, or critical traditions—authorship emerges powerfully from the sea of texts as a set of shared patterns. That is, an artificial intelligence, or AI, successfully “recognizes” an author not as a person, but instead as the likeness of features that characterize a body of work. In order to find patterns across texts, the algorithmic “reader” uses a collection of textual traits—like frequently used words or punctuation—to draw conclusions about who wrote what.
Here’s how it works. In a relatively coherent set of texts by a single author, a writer’s idiosyncratic linguistic choices leave a mark analogous to a fingerprint. In order to recognize that fingerprint, you need a reasonably large, reasonably similar set of texts to compare with the one you’re curious about, and the one you’re curious about should be long enough to exhibit the fingerprint patterns.1 Methods that use style to discover who authored an anonymous piece of writing—or to validate who really wrote it—rely on text alone to draw probable conclusions.2
The “style” in “stylistic analysis” is less like the cut of someone’s coat than the body inside it. Such style is often about idiosyncratic and personal habits that mean little in themselves, reflecting region and education as much as conscious aesthetic choice. A 2007 study published in the Journal of Machine Learning Research, for instance, recognizes Oscar Wilde not by the carnation in his buttonhole but by “the kinds of features that might be used consistently by a single author over a variety of writings”; this includes the frequencies of function words, the syntactic structures, the sentence length and word length, and the syntactic and orthographic idiosyncrasies.3 In other words, magic—but magic grounded in linguistics and statistics.
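To make the idea of a measurable fingerprint concrete, here is a minimal sketch in Python of two of the features named above: function-word frequencies and sentence length. The word list, the sample sentence, and the function names are all hypothetical choices made for illustration; none of them is taken from the 2007 study, which works with far richer feature sets.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words. Real studies
# typically use the most frequent words of the whole corpus.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "by"]


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_profile(text, vocabulary=FUNCTION_WORDS):
    """Return each vocabulary word's share of all tokens in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


def mean_sentence_length(text):
    """Average number of word tokens per sentence (split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


# Invented sample text, used only to exercise the functions.
sample = ("The cut of the coat matters less than the body inside it. "
          "It is the body that we measure.")
profile = style_profile(sample)
average_length = mean_sentence_length(sample)
```

On a two-sentence sample these numbers mean very little, which is the point made earlier about needing long texts: the profile only stabilizes into something like a fingerprint over thousands of words.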
I’ll take up the concept of authorship from my perspective as a reader and literary critic—and highlight an example of how computers identify authorship to show how it works—in order to ask what stylistic analysis can teach us about the relationship between author and text.
From Slate to Silicon?
Most readers, reasonably, understand the author to be the person who wrote a text. When Harper Lee’s Go Set a Watchman came out, I wanted to read it for the same reason as many other people: it was by Harper Lee. The controversy that surrounded the publication centered on the figure of Lee: on her ability to give consent in her later years, and on her extensive revision of the manuscript that became To Kill a Mockingbird.
The provenance of the Go Set a Watchman manuscript was critical to the story of its publication. Tonja B. Carter’s Wall Street Journal account of finding Lee’s text (though later disputed) welcomed attempts to authenticate it. “In the coming months,” Carter wrote, “experts, at Nelle’s direction, will be invited to examine and authenticate all the documents in the safe-deposit box. Any uncertainty about the ‘Mockingbird’ manuscript removed from the mailing envelope and the mysterious pages of text in the Lord & Taylor box will be addressed.”4 The fact that the book was by Lee mattered tremendously.
Carter’s Sherlock-style clues offer to resolve questions about authorial identity like those addressed by the technical authorship attribution methods discussed above. But such attribution, as a statistical approach, offers relative probabilities rather than perfect certainty in response to the question: “Did this person write that document?”—or, better—“Which of these people was most likely to have written that document?”
Both formulations of the central problem of authorship attribution are by Patrick Juola, of Duquesne University’s Department of Mathematics and Computer Science, and the protocol for establishing authorship he offers is very different from examining the typewriter keys, the paper, the paleography, the acidity of recurrent coffee stains, or other clues a literary detective might try to use to place the body of the author at the site of the text’s production.5 Rather, the fingerprint is evident in the text itself.
Disclosure: I didn’t read Go Set a Watchman, because I couldn’t shake the conviction that it wasn’t legit. I was startled to find myself approaching a novel as if its main significance lay in its power of revealing its author’s intention, by being either an ideal expression of what she wanted to write (an unedited manuscript) or what she wanted me to read (an edited book).
Simone Murray’s work on the digital literary sphere suggests that my desire to feel close to the author is not a throwback to simpler times but is in step with current notions of literary authorship. Murray points out that creative work is now only a “kernel” of a wider range of “author-reader communications.” She argues that the succession of digital platforms that have seemed to disrupt literary culture—email, websites, blogging, Facebook, Twitter, Instagram—are actually part of a longer history of readerly desire for “intimate mind-to-mind communion between author and reader.”6 My sudden, strong suspicion that someone else had finished writing Lee’s first novel, or had made substantive, unauthorized changes to the manuscript, marked an investment in the idea that it was her work alone I wanted to read.
This feeling stands in sharp contrast to my habitual professional interest in texts. For literary critics, the author is often best understood as a critical concept, or a way of asking important questions about how a text came to be written, circulated, and read. And, indeed, authorship is a role that has changed over time, as texts became increasingly associated with a specific creator who could be prosecuted or paid for the work.
In “The Elenic Question,” published on the blog of the Los Angeles Review of Books in 2016, Merve Emre and Len Gutkin argue that the public debate around the mystery of the novelist Elena Ferrante’s identity was critically generative: the parallel “Homeric question” about Homer’s identity raises questions about “oral-formulaic composition (the process by which oral poets improvised poetry) and textual reception (how that improvised poetry was circulated through writing).”
Emre argued in the New York Times Magazine in October 2018 that Ferrante’s practice of anonymity is an “expressive strategy” that can “multiply and muddle the distinct egos of the author: Elena as the writer of the Neapolitan novels; Elena as their first-person narrator; Elena as a commentator on the novels she has written.” Here, the author is a set of critical questions as well as an actual person. If you train a machine to recognize the textual traces of an “actual person,” what becomes of the critical questions about authorship?
The harsh review of Coleridge’s Christabel in the Edinburgh Review offers some interesting answers to this question—though the authorship of the review itself remains uncertain. In 2015, Francesca Benatti (of Digital Humanities at the Open University) and Justin Tonra (of the National University of Ireland, Galway’s English Department) set out to ascertain whether it was written by Thomas Moore, an important Irish writer. To give a sense of the academic significance of the true identity of the author of the review, let me note that the authors refer to eight previous “notable entries in the debate” at the outset of the article.7
The authors observe that much of the evidence offered to prove authorship has been external; by contrast, their approach focuses on “internal linguistic evidence from the text and from other texts by the authors that scholarship has identified as the most likely candidates for authorship.” Rather than reaching a certain conclusion, they ultimately offer a set of probabilities, emphasizing along the way how those results were determined by a series of specific decisions during the process of analysis.
They compare essays written by other known contributors of literary reviews to the Edinburgh Review around the same time, but only by those authors who had written what the researchers deemed to be a sufficient amount of text for a reliable analysis; the required quantity was “10,000 words, divided into at least two articles,” for each author. The authors that fit the bill were John Allen, Henry Brougham, William Hazlitt, Francis Jeffrey, Sir James Mackintosh, Thomas Moore, and Sir Francis Palgrave.
The essays used in Benatti and Tonra’s study were found in Google Books and in archival copies of the original Edinburgh Review. First, they had to make sure the images could be recognized as letters. Then they had to remove all of the long quotations of Coleridge. Since substantial extracts are a central feature of the 19th-century literary review, long chunks of Coleridge’s poetry would scramble the prose signal of the secret reviewer (if stylistic analysis attempts to see past superficial differences of topic or flair—the cut of a coat—in favor of ingrained patterns consistent across texts by a single author—the body inside it—then having two people inside the coat can confuse it8).
Then Benatti and Tonra tried five different methods for sorting authors.9 Two of the methods were unsupervised: they looked for any patterns that distinguish texts from one another, without being told who wrote what. In this version of the analysis, the software (stylo, an R package for stylometry) looks for likeness and difference in feature sets across texts.
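A rough sketch can show what looking for likeness and difference means in practice. The article does not say here which unsupervised methods were used, so the code below swaps in Burrows's Delta, a standard stylometric distance, purely as an illustration. The four toy "essays", their labels, and the four-word vocabulary are invented for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(text, vocabulary):
    """Share of all tokens taken up by each vocabulary word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


def z_scores(profiles):
    """Standardize each feature (column) across all texts."""
    columns = list(zip(*profiles.values()))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # constant column: z = 0
    return {
        name: [(v - m) / s for v, m, s in zip(vector, means, stdevs)]
        for name, vector in profiles.items()
    }


def delta(a, b):
    """Burrows's Delta: mean absolute difference of z-scored frequencies."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_neighbours(texts, vocabulary):
    """For each text, name the other text whose profile is closest."""
    profiles = {n: relative_frequencies(t, vocabulary) for n, t in texts.items()}
    z = z_scores(profiles)
    return {
        name: min((o for o in z if o != name), key=lambda o: delta(z[name], z[o]))
        for name in z
    }


# Invented toy corpus; no author labels are supplied anywhere.
texts = {
    "essay_a": "the the the of of and",
    "essay_b": "the the the of of and the the the of of and",
    "essay_c": "and and and of to to",
    "essay_d": "and and and and of to to to",
}
neighbours = nearest_neighbours(texts, ["the", "of", "and", "to"])
```

No text is ever labeled with an author. The grouping of essay_a with essay_b, and essay_c with essay_d, comes only from how similar their word-frequency profiles are.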
Three methods were supervised: the authors trained a classifier on a set of known texts and then asked it to guess who wrote a set of unknown texts. To make sure the results weren’t too capricious, texts were swapped randomly in and out of the training set, over and over. The most effective of these methods10 ultimately tested each article a hundred times; it picked the author of the Christabel review as Francis Jeffrey 63 percent of the time, Moore 28 percent, Henry Brougham 8 percent, and Mackintosh 1 percent. This result is inconclusive, but suggestive.
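The resampling step is what turns a single guess into a set of percentages, and it can be sketched in a few lines. The example below is not the study's method: it substitutes a simple nearest-centroid classifier for the Support Vector Machine mentioned in the notes, and the author labels, feature vectors, and parameter names (`rounds`, `holdout`, `seed`) are hypothetical.

```python
import random
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance(a, b):
    """Manhattan distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def attribution_shares(known, unknown, rounds=100, holdout=1, seed=0):
    """Classify `unknown` once per round, each time leaving `holdout`
    randomly chosen texts per author out of training; return vote shares."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    votes = Counter()
    for _ in range(rounds):
        centroids = {}
        for author, vectors in known.items():
            kept = rng.sample(vectors, len(vectors) - holdout)
            centroids[author] = centroid(kept)
        guess = min(centroids, key=lambda a: distance(centroids[a], unknown))
        votes[guess] += 1
    return {author: votes[author] / rounds for author in known}


# Invented feature vectors (for example, two function-word frequencies).
known = {
    "author_x": [[0.50, 0.10], [0.52, 0.12], [0.48, 0.09]],
    "author_y": [[0.20, 0.30], [0.22, 0.28], [0.18, 0.33]],
}
unknown = [0.49, 0.11]
shares = attribution_shares(known, unknown)
```

With data this cleanly separated, every round votes the same way. A split like the one reported for the Christabel review arises when different training samples pull the classifier toward different candidates, which is why the shares read as relative probabilities rather than a verdict.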
The authors note that the algorithm’s guesses about the Christabel review (the proportion of times it assigned the essay to different authors) are most like its guesses about the authorship of a different review by Hazlitt, in which “editorial intervention by Jeffrey is likely to have been extensive.” That is: perhaps it was befuddled here by editorial practice. We know stylometric analysis is well suited to identifying the style of a particular individual. It might be time, for instance, to challenge the attribution of one particular review, which the algorithm repeatedly assigned to Jeffrey rather than the previously accepted author, Brougham.
But perhaps Benatti and Tonra’s stylistic analysis also reveals a kind of inflection point where an editor’s work turns into authorship. In reflecting on the possibility that Jeffrey heavily edited the Christabel review, they note that “Jeffrey is known to have applied numerous ‘retrenchments and verbal alterations’ to Hazlitt’s articles on at least two other occasions, and to have extended this practice to all Edinburgh contributors. Depending on the extent of his participation, it could be argued that any or all of the reviews in the Edinburgh have actually two authors.”
An AI perspective on authorship demands that we account for the full history of how a text takes shape and the reality that more than one hand is often at work. Earlier, I suggested that literary critics often find it more productive to think of an author as a concept or notion (like Foucault’s “author-function”), rather than as a person. Yet an algorithmic focus on the person of the author also generates critical questions precisely because it asks us to think about how texts get created—generated, written, rewritten, and edited—by actual people.
Perhaps there is a mystery author who wrote the review of Christabel. Or perhaps the review offers itself as evidence for the way collaboration or editing is a process written into much of what we read, including the best of what we read.
The manuscript of Go Set a Watchman seems like an artifact—and it is—but it’s also text. Jan Rybicki and Maciej Eder of the Computational Stylistics Group ran the full novel against a set of novels by selected authors of the American South.11 They found “very strong” stylometric evidence that Harper Lee authored both To Kill a Mockingbird and Go Set a Watchman—with the exception of a few passages from a key scene, attributed by the software to Capote.
The authors suggest that this result is likely to indicate not that Capote wrote that part but that “we are dealing here with a mixture of different stylometric signals, including extensive copy-editing and/or inspiration.” The shift to what we might think of as the ur-impersonal mode of literary analysis—feeding a text into a computer—forces us once again to confront the fact that literature is always a social production.
Rachel Donadio, writing in The Atlantic, reflects on the possibility of the “collaborative origins” of Elena Ferrante. The idea of a pseudonym built for two raises a slightly different question than critical considerations of authorship often pose. A tool developed to decipher the signal of an individual creator, to answer what might seem to be the fundamental question of authorship—Did that person write that document?—can also reveal how often that question is unanswerable.
This article was commissioned by Richard Jean So.
1. First takeaway from this essay: if you are a person who has brought thousands of words of argumentative prose into the world and you decide to denounce someone powerful anonymously, you should maybe ask your cousin the dentist to write the denunciation for you. Efforts are also underway to produce a computational recipe for blurring one’s stylistic fingerprint (like this), but I’d go with the dentist, keeping one eye on the paper trail you leave.
2. Maciej Eder, Jan Rybicki, and Mike Kestemont designed stylo, an R package for computational text analysis that is freely available and designed to be easy to use. Their introduction to it here gives a helpful overview of the history and breadth of the field and a more detailed overview of how the computational analysis of style works. They point out that “computational authorship attribution increasingly attracts attention in science, because of its valuable real-world applications, for instance, related to forensics topics such as plagiarism detection, unmasking the author of harassment messages or even determining the provenance of bomb letters in counter-terrorism research.”
3. Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow, “Measuring Differentiability: Unmasking Pseudonymous Authors,” Journal of Machine Learning Research, vol. 8 (June 2007). It is possible for a “signature accessory” or standout feature to signal only one text consistently, which can obscure elements of an author’s style shared more widely across texts. Their learning algorithm thus actually “unmasks” the author of a text from a set of unknown authors by iteratively removing the most distinctive features.
4. Tonja B. Carter, “How I Found the Harper Lee Manuscript,” Wall Street Journal, July 12, 2015.
5. Patrick Juola, “The Rowling Case: A Proposed Standard Analytic Protocol for Authorship Questions,” Digital Scholarship in the Humanities, vol. 30, supp. 1, 2015.
6. Simone Murray, The Digital Literary Sphere: Reading, Writing, and Selling Books in the Internet Era (Johns Hopkins University Press, 2018).
7. Francesca Benatti and Justin Tonra, “English Bards and Unknown Reviewers: A Stylometric Analysis of Thomas Moore and the Christabel Review,” Breac, October 7, 2015.
8. Stylo actually also offers a function (“rolling delta”) designed for coauthored works that “tries to determine the authorship of fragments extracted from them” or, in the terms of the metaphor here, to notice if people are taking turns wearing the coat.
9. The authors used stylo for this analysis.
10. “Of the supervised methods employed, Support Vector Machines (SVM) yielded the highest overall percentage of correct attributions of the test set at 74% correct attributions over 1,600 tests on individual articles.”
11. Maciej Eder and Jan Rybicki, “Go Set a Watchman while We Kill the Mockingbird in Cold Blood,” Computational Stylistics Group.