This post was co-authored by Garrett Heath.
As OSCON, a global conference on open source software, got underway in Portland this week, the timing of the recent J.K. Rowling unmasking couldn’t have been better. As my colleague and co-author, Garrett Heath, tweeted from the conference, “Accio Open Source!” For the three people left on the planet who haven’t read a Harry Potter book, that’s a common summoning charm used among Rowling’s fictional wizards.
So now we know that first-time author Robert Galbraith’s mystery novel The Cuckoo’s Calling didn’t become an “instant” bestseller because the critics loved it, which they did. For the first few months after it came out in April, it sold fewer than 1,500 copies, a common fate for debut novels. When the UK’s Sunday Times cracked the case and prodded Ms. Rowling into a confession, we all watched and read about the uptick in sales. It is now the number one selling book on Amazon. This is a testament not only to the power of brand marketing (J.K. Rowling is the Coca-Cola of fiction, after all) but also to the rising prevalence and power of open source software.
The events that led to the revelation of Rowling as Galbraith could have been ripped straight from the pages of a modern spy novel. A journalist at the Sunday Times gets an anonymous tip on Twitter claiming that Rowling is the real author of The Cuckoo’s Calling, but before anyone can verify the claim, the tipper’s Twitter account is deleted. The newspaper then calls on two academics to act as literary sleuths: Peter Millican, who teaches philosophy and computing at Oxford University, and Patrick Juola, a computer science professor at Duquesne University in Pittsburgh. The newspaper provides the men with machine-readable texts of The Cuckoo’s Calling along with Rowling’s previous novel, The Casual Vacancy. It also provides them with a few crime novels by other British women writers, to be used as textual control groups.
The software that Juola and Millican used, the Java Graphical Authorship Attribution Program (JGAAP), is open source and freely available for download on GitHub. The academics studied the machine-readable text of The Cuckoo’s Calling and compared it to Rowling’s previous novel. In the course of doing so, they discovered a number of linguistic signatures that pointed to the author of Harry Potter. The software analyzes syntax, style and punctuation, but just as importantly the distinctive use of prepositions and articles. It turns out writers can change sentence length and rhythm and can cater to a new audience, but they’re unlikely to change how they use “around” and “at” and “on.”
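To make the idea of a function-word signature concrete, here is a minimal sketch in Python. It is not how the software used in the Rowling case works internally; the short word list, the cosine distance measure and the function names are all assumptions chosen for illustration. It builds a frequency profile of common prepositions and articles for each text and picks the candidate whose profile sits closest to the unknown text.

```python
import math
import re
from collections import Counter

# Illustrative list only (an assumption); real studies use far longer lists.
FUNCTION_WORDS = ["the", "a", "an", "of", "at", "on", "in", "around", "to", "with"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_distance(p, q):
    """Return 0.0 for profiles pointing the same way, 1.0 for no overlap."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0


def closest_author(unknown_text, candidates):
    """Pick the candidate name whose sample text has the nearest profile.

    `candidates` maps an author name to a sample of that author's writing.
    """
    profile = function_word_profile(unknown_text)
    return min(
        candidates,
        key=lambda name: cosine_distance(
            profile, function_word_profile(candidates[name])
        ),
    )
```

In practice the samples would be whole novels, and the control texts would play the role of the other candidates. With only a few sentences per author, the profiles are far too noisy to mean anything.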
A word as simple and as “marked” as “whilst” can narrow down the field of possible authors. In the early 1960s, researchers studied the Federalist Papers, co-written by Alexander Hamilton, James Madison and John Jay during the creation of the U.S. Constitution. It turned out that Madison favored the more British “whilst” and preferred “on” to “upon” in his essays. Hamilton, meanwhile, tended to use “while” and “upon.” These linguistic markers allowed researchers to tell which essays were primarily written by Hamilton and which by Madison. They didn’t have the benefit of open source software, but it’s worth noting that their methodical techniques laid the groundwork for future open source literary hackers.
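The marker-word approach is simple enough to sketch in a few lines of Python. This is an illustration of the general idea, not the researchers' actual procedure; the function names and the per-1,000-words normalization are assumptions. It counts how often each marker appears per thousand words, so that essays of different lengths can be compared.

```python
import re


def rate_per_thousand(text, marker):
    """Return occurrences of `marker` per 1,000 words of `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    # Whole-word matching: "upon" is not counted as "on".
    return 1000.0 * words.count(marker) / len(words)


def marker_rates(text, markers=("while", "whilst", "on", "upon")):
    """Return a dict mapping each marker word to its rate in the text."""
    return {m: rate_per_thousand(text, m) for m in markers}
```

An essay with a high “whilst” rate and almost no “upon” would lean toward one author, and the reverse pattern toward the other. A single marker is weak evidence on its own, so several are usually weighed together.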
Another case of literary unmasking with open source software occurred with Agatha Christie’s canon. In 2010, Ian Lancashire, an English professor at the University of Toronto, took 16 Agatha Christie novels, written over a 50-year period, and fed the text into a computer program. (Incidentally, the software, called TACT, is freely available for download and comes with a manual published by the Modern Language Association of America.) He wasn’t looking for the true identity of a pseudonymous author, however. He was just looking for notable trends across the course of a literary career. Did a master of suspense change her style or syntax across half a century?
But what he found had startling implications: in Christie’s 73rd novel, Elephants Can Remember, the incidence of “indefinite words” like “anything,” “thing” and “nothing” suddenly spiked. Meanwhile, the variety of words Christie used dropped by 20 percent. When Lancashire finally published his paper about his findings, he noted that the data supported a view that Agatha Christie had developed Alzheimer’s by the time she wrote the book. In effect, she had already lost a fifth of her vocabulary.
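Both of those measurements can be approximated with very little code. The Python sketch below is an assumption-laden illustration, not a reconstruction of TACT or of Lancashire’s method: vocabulary variety is estimated as the share of distinct words in a fixed-size sample (the `sample_size` default is an arbitrary choice), and the indefinite-word list contains only the three words quoted above.

```python
import re

# Only the three examples quoted in the text; a real study would use more.
INDEFINITE_WORDS = {"thing", "anything", "nothing"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_richness(text, sample_size=50000):
    """Return distinct words divided by total words in the first sample.

    A fixed sample size matters because this ratio shrinks as texts get
    longer, so novels of different lengths are not directly comparable.
    """
    words = tokenize(text)[:sample_size]
    return len(set(words)) / len(words) if words else 0.0


def indefinite_rate(text):
    """Return the share of all words that are indefinite words."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(1 for w in words if w in INDEFINITE_WORDS) / len(words)
```

Running both functions over each novel in publication order would give two series of numbers. A late drop in the first alongside a rise in the second is the kind of pattern described above.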
While these textual sleuthing examples come from the world of academia, open source software promises to democratize the flow and release of information. That was the promise three years ago when Rackspace co-founded OpenStack, an open source cloud computing platform. As OSCON says on its website, “Once considered a radical upstart, open source has moved from disruption to default.” But it’s good to know that it’s still doing its share of disruption as well.