Solving Real-Life Mysteries with Big Data and Apache Spark

Can using simple statistical techniques in combination with big data help solve the Tamam Shud mystery?

Everyone loves a good real-life mystery. That’s why the three most popular TV shows of the 80s and 90s were Jack Palance’s reboot of Ripley’s Believe It or Not!, Unsolved Mysteries with Robert Stack, and Beyond Belief: Fact or Fiction hosted by Commander Riker. (Well…they were in my house, anyway.) At Cloudera, the highly-skilled support team has gotten good at cracking actual stranger-than-fiction cases like, “Why doesn’t this Kerberos ticket renew?” or, “Who deleted that table?”

In this spirit, on a recent random walk through Wikipedia links, I found a fascinating 68-year-old unsolved mystery known as the Tamam Shud case (aka “Mystery of the Somerton Man”). If you enjoyed Serial, then the story itself is worth a read, and worth watching. However for anyone who touches data for a living, the most intriguing part of the story will undoubtedly be this:

Facts of the Case

A man is found dead on a beach near Adelaide, Australia, in December 1948. Well-dressed and in good shape, he seems to have died from poisoning. His mundane possessions include no identification, but do include a scrap of a paper with the words, “Tamám Shud”. This phrase turns out to be the closing words (in Persian) of the Rubaiyat of Omar Khayyam, meaning “finished.”

Soon after, the very book from which it was torn is located. Inside its cover is scrawled the unlisted phone number of a local woman, along with this mysterious text:

WRGOABABDMLIAOIWTBIMPANETPMLIABOAIAQCITTMTSAMSTGAB

To this day, nobody has conclusively explained its meaning, and the dead man has never been identified.

Several people have approached these letters as a cryptographic cipher. The odd circumstances of death do sound like something out of a John Le Carré spy novel. Some of the best attempts, however, fail to produce anything but truly convoluted parsings.

Another possibility may already have occurred to you: Are they the first letters of words in a sentence (an initialism)? Some suspect this death was a suicide, and that the message is merely some form of final note. With this morbid scenario in mind, it’s easy to imagine many phrases, like “My Life Is All But Over,” that fit the letters because indeed their frequency seems to match that of English text.

This lead has been picked up a few times. These writeups (example) present indications that the message is indeed an initialism. However, they don’t apply what is arguably the clear statistical tool for this job. And they don’t take advantage of big data. So, let’s do both.

The Chi-Squared Test

Does the frequency of letters in the Tamam Shud text resemble the frequency of first letters of English text? If so, that would be evidence that the text is an English initialism. But any given English sentence’s frequency of initials won’t exactly match English text as a whole. Rather, it will vary a bit depending on what sentences are chosen.

The well-known chi-squared test (χ2) can help. It takes as input some expected frequencies of discrete things occurring—like letters starting words—and actual observed counts of those things. It can then quantify the probability that a sample actually drawn from the expected frequencies would exhibit deviations from the expected frequencies as large or larger than that of the observed sample: a p-value. Although you may hear about degrees of freedom and the chi-squared statistic in this context, these are details that libraries will take care of in a context like this. This usage of the chi-squared test is known as a goodness-of-fit test.

This possibility that the observed counts come from the expected distribution is called the null hypothesis. In this hypothesis, the sample is not from a different distribution.

A low p-value means that the null hypothesis is unlikely. It’s evidence that the given initials did not come from a source whose frequencies match the expected frequencies. A high p-value is not quite evidence that the initials came from the same source; it just means that actual samples from the expected distribution would regularly show deviation from expected counts that are as large, or larger. Although it’s indicative that the initials came from the expected distribution, it’s not the probability that they did.

To get to work, fire up a Scala spark-shell. It’s easy to count the frequency of letters in the mystery text in Scala:

def textToCharCounts(text: String) = { // Start with 0 count for all letters val countMap = scala.collection.mutable.Map((‘A’ to ‘Z’).map(_ -> 0L):*) text.foreach(countMap() += 1) countMap.toSeq.sorted.map { case (_, count) => count }.toArray} val tamamShudText = “WRGOABABDMLIAOIWTBIMPANETPMLIABOAIAQCITTMTSAMSTGA”val tamamShudCounts = textToCharCounts(tamamShudText) …Array(9, 4, 1, 1, 1, 0, 2, 0, 6, 0, 0, 2, 5, 1, 3, 2, 1, 1, 2, 6, 0, 0, 2, 0, 0, 0)

// Start with 0 count for all letters

text.foreach(countMap(_) += 1)

}

val tamamShudText = “WRGOABABDMLIAOIWTBIMPANETPMLIABOAIAQCITTMTSAMSTGA”

Array(9, 4, 1, 1, 1, 0, 2, 0, 6, 0, 0, 2, 5, 1, 3, 2, 1, 1, 2, 6, 0, 0, 2, 0, 0, 0)

So, “A” occurs 9 times, and “Z” occurs 0 times. These are the observed frequencies.

Finding the overall expected frequency of initials in English text requires a lot of English text. Fortunately, it’s easy to access the entire text of English Wikipedia as JSON dump. Obtain the latest content dump, like enwiki-20160808-cirrussearch-content.json.gz. (But be warned: it’s over 22GB in size.) Then push it straight into HDFS, to an example directory like /user/ds, with:

curl https://dumps.wikimedia.org/other/cirrussearch/current/enwiki-20160808-cirrussearch-content.json.gz

gunzip -c

hdfs dfs -put - /user/ds/cirrus.json

Being JSON, the text can be extracted with a bit of Apache Spark code and Jackson, using the Dataset API. For example, this can be :paste-ed into spark-shell:

val wikipediaText = sqlContext .read.text(“hdfs:///user/ds/cirrus.json”).as[String] .filter(.startsWith(“{"namespace":0,”)) .mapPartitions { jsons => val mapper = new com.fasterxml.jackson.databind.ObjectMapper() jsons.map(json => mapper.readValue(json, classOf[java.util.Map[,_]]).get(“text”).toString ) }

.read.text(“hdfs:///user/ds/cirrus.json”).as[String]

.mapPartitions { jsons =>

jsons.map(json =>

)

And some careful filtering and counting with Spark reveals the distribution of initials in the entire corpus:

12345678910111213141516171819

def textToInitials(text: String): Array[Char] = text.split(“\W+”).filter(.nonEmpty) .map(word => Character.toUpperCase(word.head)) .filter(c => c >= ‘A’ && c <= ‘Z’) val wikipediaCounts = wikipediaText // Split into words, map to initials .flatMap(text => textToInitials(text).map(.toString)) .toDF // Count characters .groupBy(“value”).count .as[(String,Long)].collect() .map { case (s, count) => (s.head, count) } // Sort by character and return counts .toSeq.sorted .map { case (_, count) => count }.toArray …Array(282341805, 114250689, 145476474, 84765259, …

text.split(“\W+”).filter(_.nonEmpty)

.filter(c => c >= ‘A’ && c <= ‘Z’)

val wikipediaCounts = wikipediaText

.flatMap(text => textToInitials(text).map(_.toString))

// Count characters

.as[(String,Long)].collect()

// Sort by character and return counts

.map { case (_, count) => count }.toArray

…

Now, it’s easy to use an implementation of the chi-squared test to compute a p-value:

def pValue(expected: Array[Long], observed: Array[Long]) = { // Disregard letters with 0 count in expected val (nzExpected, nzObserved) = expected.zip(observed).filter { case (e, ) => e > 0 }.unzip val cs = new org.apache.commons.math3.stat.inference.ChiSquareTest() cs.chiSquareTest(nzExpected.map(.toDouble).toArray, nzObserved.toArray)} pValue(wikipediaCounts, tamamShudCounts) …0.2191200150296464

// Disregard letters with 0 count in expected

expected.zip(observed).filter { case (e, _) => e > 0 }.unzip

cs.chiSquareTest(nzExpected.map(_.toDouble).toArray, nzObserved.toArray)

0.2191200150296464

Judging just by letter counts, if a sample of English text was taken repeatedly, then the observed counts would differ as much or more than this mystery text does from the expected frequency about 21.9% of the time. This is not unlikely, and thus not sufficient to rule out the null hypothesis: that the mystery letters are English initials.

Compare to the result when these counts are compared to a uniform distribution of initials, which is what one would expect if letters were chosen uniformly at random—and also how many ciphered texts would appear:

pValue(Array.fill(26)(1L), tamamShudCounts)… 1.6201309989138934E-6

…

That’s less than an 0.002% chance, and strong evidence the letters aren’t like randomly-chosen letters.

This famous passage, given as initials, appears in Wikipedia, and a similar analysis suggests it yields a similar p-value:

FSASYAOFBFOTCANNCILADTTPTAMACENWAEIAGCWTWTNOANSCASDCLE

val passageCounts = textToCharCounts( “FSASYAOFBFOTCANNCILADTTPTAMACENWAEIAGCWTWTNOANSCASDCLE”)pValue(wikipediaCounts, passageCounts) …0.2673565500306385

“FSASYAOFBFOTCANNCILADTTPTAMACENWAEIAGCWTWTNOANSCASDCLE”)

0.2673565500306385

To be honest, this work has been done before at the University of Adelaide, with the conclusion also being that the message seems to be an English initialism (if it’s an initialism in any language at all).

Can we add to the investigation with Spark?

Tracking the Source of the “Quote”

With some strong evidence that these are English initials, and the complete text of Wikipedia to hand, it’s natural to simply search for a passage whose initials match part of the mystery text. While it’s rather unlikely that the Somerton Man was quoting Wikipedia, it’s possible he had a famous passage in mind that may appear in Wikipedia.

It takes just a short passage of text and a few minutes on a cluster to find some answers. For example, looking for substrings like “MLIABOAIAQC” entails:

import org.apache.spark.sql._ val wikiTextInitials = wikipediaText .map(text => (text, new String(textToInitials(text)))) .rdd.toDF(“text”, “initials”).cache() wikiTextInitials .filter(col(“initials”).contains(“MLIABOAIAQC”)) .select(“text”).as[String].collect()

.map(text => (text, new String(textToInitials(text))))

.filter(col(“initials”).contains(“MLIABOAIAQC”))

The bad news is that most any search for significant strings from the mystery text yields no hits. Some shorter strings do, but they’re obviously not a match. It’s another dead end. But, if you didn’t recognize the quote above by its initials, you can find it now. Among other things, it occurs as part of the Gettysburg Address.

In case you’re wondering, no, the initials do not appear in the Rubaiyat of Omar Khayyam either, in the English translation from which the scrap was torn. Fortunately, thanks to Project Gutenberg, we actually have available the text of many well-known public-domain English texts and can search them, too. Although all texts can be downloaded from the site itself, that can take days—instead, gutenberg-tar.com provides a compressed archive that’s easier to obtain. Use p7zip to decompress it; if the entire archive is decompressed to one large text file (tip: 7za x -so gutenberg_txt.7z 2> /dev/null | hdfs dfs -put - /user/ds/gutenberg.txt), it can be searched like the Wikipedia JSON dump above, but no parsing of JSON is needed this time—it can just be read as a Dataset of Strings.

To cut to the chase: No, nothing emerges as a clear match to any significant substring. Another lead turns up nothing.

However, we can try to locate texts whose frequency of initials seems more likely to have given rise to the counts observed in the mystery sample, by applying the chi-squared test again. It’s problematic to interpret the p-value as a metric for ranking things. However, for merely hunting for clues about relevant texts, it might be sufficient. This might give some indication of what type of writing it resembles. (Poetry? Short fiction?)

To do this, decompress the archive as files and upload them to HDFS with hdfs dfs -put gutenberg /user/ds/. This will take a while, and create over 70,000 small files nested three to five levels deep. In this case, it will actually be helpful to use a method from the original RDD API, called wholeTextFiles, which reads (filename, contents) pairs. This approach is useful when the input consists of many small files. Search for files at all levels of nesting, splitting the larger set that’s nested deeper into more partitions, for balance:

val gutenbergBase = “hdfs:///user/ds/gutenberg”val allTextsRDD = sc.union( sc.wholeTextFiles(s”$gutenbergBase///.txt”, 10), sc.wholeTextFiles(s”$gutenbergBase////.txt”, 100), sc.wholeTextFiles(s”$gutenbergBase/////.txt”, 1000))

val allTextsRDD = sc.union(

sc.wholeTextFiles(s”$gutenbergBase////.txt”, 100),

The files have a large header and footer that shouldn’t be counted. This splits a text file into lines, and then reassembles the part after the header and before the footer into a string:

def stripHeaderFooter(text: String) = { val lines = text.split(“\n+”).map(.trim).filter(.nonEmpty) val headerEnd = lines.indexWhere(.matches( “\*\*\*.*START.+PROJECT GUTENBERG.*\*\*\*”)) val footerStart = lines.indexWhere(.matches( “\\\.END.+PROJECT GUTENBERG.\\\”), headerEnd) lines.slice(if (headerEnd < 0) 0 else headerEnd + 1, if (footerStart < 0) lines.size else footerStart) .mkString(“ “)}

val lines = text.split(“\n+”).map(.trim).filter(.nonEmpty)

“\\\.START.+PROJECT GUTENBERG.\\\”))

“\\\.END.+PROJECT GUTENBERG.\\\”), headerEnd)

if (footerStart < 0) lines.size else footerStart)

}

Finally, it’s possible to consider each text’s initials as a distribution and compare to the mystery text:

val topMatches = allTextsRDD .mapValues(stripHeaderFooter) .mapValues(text => new String(textToInitials(text))) .mapValues(textToCharCounts) .filter { case (, counts) => counts.size > 1 } .mapValues(pValue(, tamamShudCounts)) .sortBy { case (_, pValue) => -pValue } .take(5).foreach(println) …(…/gutenberg/3/6/7/3679/3679.txt,0.7984799078362972)(…/gutenberg/3/5/9/3597/3597.txt,0.7920832483908181)(…/gutenberg/2/6/3/2631/2631-8.txt,0.788166446663857)(…/gutenberg/2/6/3/2631/2631.txt,0.788166446663857)(…/gutenberg/9/0/7/9070/9070-8.txt,0.7873012087723055)

.mapValues(stripHeaderFooter)

.mapValues(textToCharCounts)

.mapValues(pValue(_, tamamShudCounts))

.take(5).foreach(println)

…

(…/gutenberg/3/5/9/3597/3597.txt,0.7920832483908181)

(…/gutenberg/2/6/3/2631/2631.txt,0.788166446663857)

These texts are:

Getting Gold, by J. C. F. Johnson
The Essays of Montaigne**, Volume 17, by Michel de Montaigne
Mr. Gladstone and Genesis, by Thomas Henry Huxley
(duplicate)
The Imaginary Invalid, by Moliére

There’s no obvious theme to these books. It’s also possible that, with a relatively short sample, most of these top results have high p-values due to chance. Unfortunately, there’s no “smoking gun” here.

Conclusion

There are more leads to follow. Why not compare bigrams (pairs of initials)? Their order of occurrence is certainly significant. The total number of bigrams in the English language is 26 x 26 = 676, and there are 48 bigrams observed in the mystery text, whereas most buckets would contain 0 or 1. The counts would be so sparse that the chi-squared test becomes difficult to apply.

Maybe some mysteries are unsolved for a reason: the message could represent a personal message, or may simply be nonsense. When those mysteries involve potential codes, though, statistical techniques (even one as simple as the well-known chi-squared test) can provide valuable supporting clues. And complementing such techniques with big data technologies like Spark make it easy to deploy plenty of raw computing resources to search high and low for patterns in vast amounts of business or scientific data.

*Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of *Advanced Analytics with Spark**