

October 1991
The Ghost's Vocabulary
How the computer listens
for Shakespeare's "voiceprint"
by Edward Dolnick
In 1842 literature and science met with a thud. Alfred Tennyson had just
published his poem "The Vision of Sin." Among the appreciative letters he
received was one from Charles Babbage, the mathematician and inventor who is
known today as the father of the computer. Babbage wrote to suggest a
correction to Tennyson's "otherwise beautiful" poem--in particular to the lines
"Every moment dies a man, /Every moment one is born."
"It must be manifest," Babbage pointed out, "that, were this true, the
population of the world would be at a standstill." Since the population was in
fact growing slightly, Babbage continued, "I would suggest that in the next
edition of your poem you have it read: 'Every moment dies a man,/Every moment
1-1/16 is born.'" Even this was not strictly correct, Babbage conceded, "but I
believe 1-1/16 will be sufficiently accurate for poetry."
Today computers are standard tools for amateur and professional literary
investigators alike. Shakespeare is both the most celebrated object of this
effort and the most common. At Claremont McKenna College, in California, for
example, two highly regarded faculty members have devoted years of their lives
to a computer-based attempt to find out whether Shakespeare, rather than
Francis Bacon or the Earl of Oxford or any of a myriad of others, wrote the
plays and poems we associate with his name.
As Babbage's venture into criticism foreshadowed, the marriage of computers and
literature has been an uneasy one. At the mention of computers or statistics,
many Shakespeareans and others in the literary establishment wrinkle their
noses in distaste. To approach the glories of literature in this plodding way
is misguided, they say, and misses the point in the same way as does the
oft-cited remark that the human body is worth just a few dollars--the market
value of the various chemicals of which it is composed. "This is just madness,"
says Ricardo Quinones, the chairman of the literature department at Claremont
McKenna. "Why don't they simply read the plays?"
Rather than read, these literary sleuths prefer to count. Their strategy is
straightforward. Most are in search of a statistical fingerprint, a reliable
and objective mark of identity unique to a given author. Every writer will
sooner or later reveal himself, they contend, by quirks of style that may be
too subtle for the eye to note but are well within the computer's power to
identify.
For a University of Chicago statistician named Ronald Thisted, the call to
enter this quasi-literary enterprise came on a Sunday morning in December of
1985. Thisted had settled down with The New York Times Book Review, and an
article by Gary Taylor, a Shakespeare scholar, caught his eye. Taylor claimed
that he had found a new poem by Shakespeare at Oxford's Bodleian Library. Among
the many reasons Taylor advanced for believing in the authenticity of the poem,
called "Shall I Die?," Thisted focused on one. "One of his arguments," Thisted
says, "was that several words in the poem don't appear previously in
Shakespeare. And that was evidence that Shakespeare wrote it. One's first
reaction is, that's dumb. If Shakespeare didn't use these words, why would that
be evidence that he wrote the poem?" But Taylor's article went on to explain
that in practically everything he wrote, Shakespeare used words he hadn't used
elsewhere. Thisted conceded the point in his own mind, but raised another
objection. "If ALL the words in there were ones that Shakespeare had never
used," he thought, "if it were in Sanskrit or something, you'd say, 'No way
Shakespeare could have written this.' So there had to be about the right number
of new words." That question--how many new words an authentic Shakespeare text
should contain--was similar to one that Thisted himself had taken on a decade
before. Together with the Stanford statistician Bradley Efron, then his
graduate adviser, Thisted had published a paper that proposed a precise answer
to the question "How many words did Shakespeare know but never use?" The
question sounds ludicrous, like "How many New Year's resolutions have I not yet
made?" Nonetheless, Efron and Thisted managed to answer it. They found the
crucial insight in a generation-old story, perhaps apocryphal, about an
encounter between a mathematician and a butterfly collector.
R. A. Fisher, the statistical guru of his day, had been consulted by a
butterfly hunter newly back from Malaysia. The naturalist had caught members of
some species once or twice, other species several times, and some species time
and time again. Was it worth the expense, the butterfly collector asked, to go
back to Malaysia for another season's trapping? Fisher recast the question as a
mathematical problem. The collector knew how many species he had seen exactly
once, exactly twice, and so on. Now, how many species were out there that he
had yet to see? If the collector had many butterflies from each species he had
seen, Fisher reasoned, then quite likely he had sampled all the species that
were out there. Another hunting trip would be superfluous. But if he had only
one or two representatives of most species, then there might be many species
yet to find. It would be worth returning to Malaysia. Fisher devised a
mathematical way to make that rough idea precise (and reportedly suggested
another collecting trip). Efron and Thisted's question was essentially the
same.
Where the naturalist had tramped through the rain forest in search of exotic
butterflies, the mathematicians could scan Shakespeare in search of unusual
words. By counting how many words he used exactly once, exactly twice, and so
on, they would attempt to calculate how many words he knew but had yet to
use.
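
The counting step can be sketched in a few lines of Python. This is only an illustration: the article does not give Fisher's or Efron and Thisted's formula, so the alternating series below (often credited to Good and Toulmin) is an assumption about the kind of calculation involved, and the function names and the toy sentence are invented.

```python
from collections import Counter


def frequency_of_frequencies(tokens):
    """Map x to the number of distinct words used exactly x times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def expected_new_words(freq_of_freq, t):
    """Estimate how many never-seen words a new sample would contain.

    t is the size of the new sample as a fraction of the original
    (0 <= t <= 1). Assumed formula: n1*t - n2*t**2 + n3*t**3 - ...
    where n_x is the number of words seen exactly x times.
    """
    return sum((-1) ** (x + 1) * n_x * t ** x for x, n_x in freq_of_freq.items())


# Toy text, invented for illustration only (not Shakespeare's counts).
tokens = "the cat saw the dog and the dog saw a bird".split()
table = frequency_of_frequencies(tokens)   # 4 words once, 2 twice, 1 three times
estimate = expected_new_words(table, 0.5)  # new sample half as long as the original
print(dict(table), estimate)
```

The more words a writer has used only once, the larger the estimate, which is the butterfly collector's intuition in numerical form.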
Neither Efron nor Thisted had imagined that their statistical sleight of hand
could ever be put to a live test. No new work of Shakespeare's had been
unearthed for decades. Now Taylor had given them their chance. A new
Shakespeare poem, like a new butterfly-collecting trip to the jungle, should
yield a certain number of new words, a certain number that Shakespeare had used
once before, and so on. If Shakespeare did write "Shall I Die?," which has 429
words, according to the mathematicians' calculations it should have about seven
words he never used elsewhere; it has nine. To Efron and Thisted's surprise,
the number of words in the poem which Shakespeare had used once before also
came close to matching their predictions, as did the number of twice-used
words, all the way through to words he had used ninety-nine times before. The
poem, which sounds nothing like Shakespeare, fit Shakespeare like a glove.
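
A hedged sketch of how such predictions might be computed for each category (never used, used once before, used twice before, and so on). The series below follows from a simple model in which each word turns up at its own steady rate; whether it matches the mathematicians' exact procedure is an assumption, and the table of counts is invented rather than taken from Shakespeare.

```python
from math import comb


def expected_words_seen_x_times(freq_of_freq, x, t):
    """Expected number of distinct words in a new text that the author
    had previously used exactly x times.

    freq_of_freq maps a count to the number of words used that many times;
    t is the new text's length as a fraction of the existing corpus.
    Assumed series: sum over k >= 1 of (-1)**(k+1) * C(x+k, k) * t**k * n_(x+k).
    """
    max_count = max(freq_of_freq)
    total = 0.0
    for k in range(1, max_count - x + 1):
        n = freq_of_freq.get(x + k, 0)
        total += (-1) ** (k + 1) * comb(x + k, k) * t ** k * n
    return total


# Invented frequency table: 400 words used once, 150 twice, and so on.
toy_table = {1: 400, 2: 150, 3: 80, 4: 50}
t = 0.01  # new text is 1 percent of the corpus length
predictions = [expected_words_seen_x_times(toy_table, x, t) for x in range(3)]
print(predictions)
```

Each prediction would then be compared with the count actually observed in the disputed text. For a short text the first term, (x + 1) * t * n_(x+1), dominates.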
This is work that can suck up lives. One Defoe scholar, trying to pick out true
Defoe from a slew of anonymous and pseudonymous works, has pursued his quarry
for twenty years, with no end in sight. A team trying to determine if the Book
of Mormon was composed by ancient authors or by the nineteenth-century American
Joseph Smith took 10,000 hours to produce a single essay. (The largely Mormon
team of researchers concluded that Smith had not written the Book of Mormon.
Confirmed samples of Smith's prose, the researchers argued, showed patterns of
word usage different from those in the Book of Mormon.) Paper after paper
begins with a trumpet fanfare and ends with a plaintive bleat. One writer, for
instance, decided to determine whether Jonathan Swift or one of his
contemporaries had written a particular article, by pigeonholing his words
according to what part of speech they were. "The only positive conclusion from
over a year of effort and the coding of over 40,000 words," she lamented, "is
that a great deal of further study will be needed." (Swift himself had
satirized, in Gulliver's Travels, a professor who had "employed all his
Thoughts from his Youth" in making "the strictest Computation of the general
Proportion there is in Books between the Numbers of Particles, Nouns, and
Verbs, and other Parts of Speech.")
Despite the shortage of triumphs the field is growing, because more and more of
the work can be assigned to electronic drudges. Scholars once had to count
words by hand. Later they had the option of typing entire books into a
computer, so that the machine could do the counting. Today computers are
everywhere, and whole libraries of machine-readable texts are available.
Software to do deluxe slicing and dicing is easy to obtain.
As a result, everything imaginable is being counted somewhere. Someone at this
moment is tallying up commas or meticulously computing adjective-to-adverb
ratios. But sophisticated tools don't automatically produce good work. A future
Academy of Statistics and Style might take as its motto the warning that the
Nobel laureate P. B. Medawar issued to his fellow scientists: "An experiment
not worth doing is not worth doing well."
Among those least likely to be fazed by such pronouncements is a professor of
political science at Claremont McKenna College named Ward Elliott. Elliott is
an authority on voting rights, a cheerful eccentric, and, like his father
before him, inclined to view the Earl of Oxford as the true author of
Shakespeare's works. Four years ago Elliott recruited Robert Valenza, an expert
programmer also on the Claremont McKenna faculty, and the two set to work on
the authorship question.
This time the model would be not butterfly hunting but radar. Valenza had spent
considerable time devising mathematical procedures to find the patterns
obscured by noisy and jumbled electronic signals. Adapted to Shakespeare, the
idea was to go beyond counting various words, as many others had done, and see
whether consistent patterns could be found in the way certain key words were
used together. Two writers might use the words "blue" and "green" equally often
throughout a text, for example, but the writers could be distinguished if one
always used them on the same page while the other never used them together.
This pattern-finding mathematics is widely used in physics and engineering, in
deciphering television and radar signals, for example. Given a long list of
words--not simply the "blue" and "green" of the example, but dozens more--the
computer can quickly tell how Shakespeare typically balanced those words. "You
might have a pattern," Valenza says, "with a lot of 'love,' very little 'hate,'
and a good deal of 'woe.'" A different writer might use the same words, and
even use them at the same rates as Shakespeare, but the configurations might be
different. The result is that a given list of words produces a kind of
voiceprint for each author.
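
The "blue" and "green" example can be made concrete with a small sketch. The article does not spell out Valenza's mathematics, so the block-by-block correlation below is a stand-in chosen for illustration, not his method; the function names and both toy "writers" are invented.

```python
def block_counts(tokens, keywords, block_size):
    """Count each keyword within consecutive blocks ("pages") of the text."""
    rows = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        rows.append([block.count(word) for word in keywords])
    return rows


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def pair_pattern(tokens, block_size=4):
    """Correlation between per-block counts of "blue" and "green"."""
    rows = block_counts(tokens, ["blue", "green"], block_size)
    return correlation([r[0] for r in rows], [r[1] for r in rows])


# Both invented writers use each word twice in sixteen words.
writer_a = "blue green x x  x x x x  blue green x x  x x x x".split()
writer_b = "blue x x x  green x x x  blue x x x  green x x x".split()
pattern_a = pair_pattern(writer_a)  # words always share a block
pattern_b = pair_pattern(writer_b)  # words never share a block
print(pattern_a, pattern_b)
```

The rates of use are identical, yet the two patterns have opposite signs. With dozens of words instead of two, the full table of such pairwise figures would play the role of the voiceprint.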
Valenza and Elliott examined "common but not too common" words that Shakespeare
used. To examine rare words, Valenza had reasoned, would be like trying to
identify a voice from a whisper, and to examine common words would be to let
the writer shout into your ear. The final, fifty-two-word list--with such
miscellaneous entries as "about," "death," "desire," "secret," and "set"--was
assembled by trial and error. It consisted of words with two key properties. In
various works of Shakespeare's those words are used in patterns that yield the
same voiceprint each time. And when other writers are tested, the same words
yield voiceprints that are different from Shakespeare's.
The machinery in place, Valenza and Elliott began by testing Shakespeare's
poetry against that of thirty other writers. Exciting results came quickly: The
disputed "Shall I Die?" poem seemed not to be Shakespeare's after all. Three of
the leading claimants to Shakespeare's work--Francis Bacon, Christopher
Marlowe, and Sir Edward Dyer--were decisively ruled out. To Elliott's
good-humored consternation, the test dealt just as harshly with the claims put
forward on behalf of the Earl of Oxford. Worse was to follow. For even as this
first round of tests ruled out the best-known Shakespeare candidates, it left a
few surprising contenders. One possibility for the "real" Shakespeare: Queen
Elizabeth I. "That did it for our chance of appearing in Science," Elliott
laments. "But it vastly increased our chance of getting into the National
Enquirer." (To his dismay, Elliott did find himself in Science, not as the
co-author of a weighty research paper but as the subject of a skeptical news
brief with the headline "Did Queen Write Shakespeare's Sonnets?")
Valenza and Elliott have since conducted more-extensive tests that have ruled
out Queen Elizabeth. But the mishap highlights a risk that is shared by all the
number-crunching methods. "If the glass slipper doesn't fit, it's pretty good
evidence that you're not Cinderella," Elliott points out. "But if it does fit,
that doesn't prove that you are."
The risk of being fooled is least for someone who combines a deep knowledge of
literature with some statistical insight. Donald Foster, a professor of English
at Vassar College, fits that bill. Foster's scholarship is highly regarded.
Soon after "Shall I Die?" was presented to the world, for example, he wrote a
long debunking essay that persuaded many readers that the poem was not
Shakespeare's. In a more recent essay he consigned whole libraries of research
to the scrap heap. Hundreds or thousands of articles have been written to
explain the epigraph to Shake-speare's Sonnets, which begins, "To the onlie
begetter of these insuing sonnets, Master W.H." Who was W.H.? Foster's solution
to the mystery, which won him the Modern Language Association's Parker Prize,
is that W.H. was...a typo. The publisher, who wrote the epigraph as a bit of
flowery praise to honor Shakespeare, had intended to print "W.SH."
Those essays had nothing to do with statistics, but Foster has done some
statistical sleuthing of his own, and he is well aware of the hazards. One
scholar compared Shakespeare's plays with someone else's poems, for example,
and concluded that Shakespeare used the present tense more than other writers
do. Another compared Shakespeare with later writers and concluded that he used
many four-letter words, whereas other writers used shorter words--forgetting
that archaic words like "thou" and "hath" drive Shakespeare's average up.
"There are strong and compelling reasons for avoiding this kind of research,"
Foster says, "because it's so difficult to anticipate all the pitfalls." But
Foster himself has often given way to temptation. Like many Shakespeareans, he
steers clear of the "authorship question," but he has looked into a pertinent
mystery.
Shakespeare acted in his plays. But with two exceptions, we don't know what
roles he took. Foster believes he has found a statistical way to gather that
long-vanished knowledge. "It occurred to me," he says, "that Shakespeare may
have been influenced in his writing by the parts he had memorized for
performances and was reciting on a more or less daily basis." Last year Foster
figured out a way to test that hunch. "The results," he says, "have been
absolutely stunning."
"We started by using a concordance to type in all the words that Shakespeare
used ten times or fewer," Foster says. These aren't exotic words, necessarily,
just ones that don't crop up often in Shakespeare. Scholars have known for some
time that these "rare" words tend to be clustered chronologically. Foster found
that if two plays shared a considerable number of rare words, in the later play
those words were scattered randomly among all the characters. In the earlier
play, the shared words were not scattered. "In one role," Foster says, "there
would be two to six times the expected number of rare words." There stood
Shakespeare: the words that Shakespeare the writer had at the tip of his pen
were the ones he had been reciting as Shakespeare the actor.
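
A rough sketch of the comparison Foster describes, under stated assumptions: the "expected number" for a role is taken to be the role's share of the play's words multiplied by the total count of shared rare words, which the article does not confirm. The function name, role sizes, and word lists are invented.

```python
def rare_word_ratios(roles, shared_rare_words):
    """For each role, divide the observed count of shared rare words by the
    count expected if those words were spread evenly by role length."""
    sizes = {role: len(tokens) for role, tokens in roles.items()}
    hits = {
        role: sum(1 for word in tokens if word in shared_rare_words)
        for role, tokens in roles.items()
    }
    total_size = sum(sizes.values())
    total_hits = sum(hits.values())
    ratios = {}
    for role in roles:
        expected = total_hits * sizes[role] / total_size
        ratios[role] = hits[role] / expected if expected else 0.0
    return ratios


# Invented earlier play: a short role dense in rare words, a long role sparse in them.
shared = {"rare1", "rare2", "rare3", "rare4"}
earlier_play = {
    "Ghost": ["rare1", "rare2", "rare3", "rare4"] + ["filler"] * 6,
    "Hamlet": ["rare1", "rare2"] + ["filler"] * 28,
}
ratios = rare_word_ratios(earlier_play, shared)
print(ratios)
```

In this toy case the short role carries between two and three times its expected share, the kind of excess Foster reports for a single role in the earlier play.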
If Foster is right, Shakespeare played Theseus in A Midsummer Night's Dream and
"Chorus" in Henry V and Romeo and Juliet. In play after play the first
character to come on stage and speak is the one that Foster's test identifies
as Shakespeare: John Gower in Pericles, Bedford in Henry VI, Part I, Suffolk in
Henry VI, Part II, and Warwick in Henry VI, Part III. And Foster's test picks
out as Shakespeare's the two roles that we have seventeenth-century evidence he
played: the ghost in Hamlet and Adam in As You Like It.
The theory can be tested in other ways. It never assigns to Shakespeare a role
we know another actor took. The roles it does label as Shakespeare's all seem
plausible--male characters rather than women or children. The test never runs
in the wrong direction, with the unusual words scattered randomly in an early
play and clustered in one role in a later play. On those occasions when
Foster's test indicates that Shakespeare played TWO roles in a given
play--Gaunt and a gardener in Richard II, for example--the characters are never
onstage together. Foster's theory passes another test. When Foster looks at the
rare words that Hamlet shares with Macbeth, written a few years later, those
words point to the ghost in Hamlet as Shakespeare's role. And if Foster looks
at rare words that Hamlet shares with a different play also written a few years
later--King Lear, for example--those shared words also pick out the ghost as
Shakespeare's role.
Additional evidence has been uncovered. After Hamlet, the ghost's vocabulary
exerted a strong influence on Shakespeare's writing and then tapered off. But
Shakespeare's plays went in and out of production. When Hamlet was revived
several years after its first staging, and Shakespeare was again playing the
ghost, he began again to recycle the ghost's vocabulary.
It is a strange image, a computer fingering a ghost. But it is a sign of things
to come. Eventually the prejudice against computers in literary studies will
give way. "The walls are sure to crumble," Ward Elliott says, "just as they did
in baseball and popular music....Some high-tech Jackie Robinson will score a
lot of runs, and thereafter all the teams in the league will pursue the newest
touch as ardently and piously as they now shrink from it."
Copyright ©1991 by Edward Dolnick. All rights reserved.
The Atlantic Monthly; October, 1991; "The Ghost's Vocabulary"; Volume 268, No. 4;
pages 82-86