Last year, I reported George Lasry's solution of syllabic ciphers. Syllabic ciphers are ciphers with many symbols for syllables as well as those for single letters.
Details of his algorithm are now published in George Lasry, "Deciphering Historical Syllabic Ciphers" (HistoCrypt 2024).
One might be tempted to say extension of a homophonic solver to a syllabic solver only involves enlarging the search space for, say, 25 letters of the alphabet by adding variables for 25*25=625 syllables. However, since last year, I've been wondering how to design a scoring function for intermediate decipherments. In typical homophonic solvers, 4-grams or 5-grams are used in computing a score. For example, if an intermediate decipherment abounds in plausible 5-grams (e.g., "ISION", "EMENT", "ETLES" for the French language), it receives a high score. But when symbols represent syllables such as "SI", "ON", "EM" "EN", just random assignment of these syllables might result in a relatively high score.
The answer to my year-long question turned out to be very simple. The scoring function is based on 4-grams, composed of not only single letters but also syllables. Thus, a 4-gram may be "E-S-T-A" or "CO-N-TRO-L" (p.177).
While it is easy to say this, it is quite another to implement it, because this involves re-constructing the whole language model. A language model represents frequency characteristics of all possible 4-grams (or 5-grams etc.) in a large corpus of text in the language in question. For a language model for conventional homophonic solvers, creating a language model amounted to simply counting letter groups. For a syllabic solver, it is necessary to first break text in the corpus into syllables, but there can be many ways to split a word. For example, the word ESTABLISHED may be broken into E-S-TA-B-LI-S-HE-D (using only CV syllables), ES-TA-B-LI-S-H-ED (using CV and VC), E-STA-BLI-S-HE-D (allowing also CCV), or the like (p.174). Naturally, success of the algorithm depends on the choice of the set of syllables used and the decomposition scheme, which required "extensive trial-and-error to fine tune" (p.176, 179).
And of course, the more than ten-fold increase in the number of variables (which more than exponentially expands the search space) means "extensive computing power" is required. George used a 64-core Windows 10 Pro PC with 256 Gbytes of RAM memory (p.176), whereas one commercial PC I see on the web has 4-core and up to 3.40 GB of memory.
His groundbreaking new scheme has proven its merits by breaking ten ciphertexts (including five with known keys).
No comments:
Post a Comment