Speed: The program now starts out increasing by 1000 phrases
per second, consistently gets over 10,000 phrases in under 30 seconds,
and usually gets to 12,000 phrases in the first minute. Things slow
down from there, but in ten runs all but one were over 17,000 words,
which takes about an hour.
Version 3: 21,012 Word Palindrome
Created: | 6-10-2016
| Words: | 21,012
| Letters: | 90,439
| Phrases: | 16,111
| Program: | pal3.py or pal3.ipynb.
| palindrome: | A man, a plan, a caretaker,
... (more) ..., Komarek, a ter, a canal, Panama!
|
Commentary: When I saw a meme proclaiming the 6-10-2016 palindromic date,
I went back to an old idea: maybe I could find a longer palindrome by advancing letter-by-letter rather than phrase-by-phrase.
There usually will be a letter that advances both left and right, so I might not have to back up as much.
Even better, I could choose first the letter that leads to the most completions of words on both left and right.
It worked! I got a longer palindrome.
The resulting program is quite similar to the original version, but I get to use more modern Python data structures (like
a Counter, replacing the bisect class.) See the IPython notebook for more on this version.
Speed: This program is about the same speed, but it makes more steady progress.
The Algorithm, Versions 1 and 2
The algorithm I used for versions 1 and 2 is as follows:
(1) Start off with the following seed text, divided into a left and right half:
A man, a plan,
|
| a canal, Panama
|
(2) Find the bit that is not palindromic; that is, not matched by text
on the other side. Call that the remainder and color it red:
aca. Since the remainder is itself a
palindrome, the whole enchilada must be a palindrome. Record it as
such, but keep going to try and find a longer one.
A man, a plan,
|
| a canal, Panama
|
(3) Look in the dictionary for words or phrases
that begin with aca. Randomly choose one and add it
to the left. Remember the others for later. In my example,
the program chose a caddy. Identify the remainder
again, namely ddy:
A man, a plan, a caddy,
|
| a canal, Panama
|
(4) Now find in the dictionary a
word that ends with ydd (the reverse of ddy).
The program found Roydd, which is a male first name (although
far from a common one). Add it to the right, leaving the remainder Ro:
A man, a plan, a caddy,
|
| Roydd, a canal, Panama
|
(5) Now we need a word starting with or (the reverse of Ro) on the left. The program chose Ore. The remainder, e, is a palindrome, so again we've got a winner. Print it to a file and move on.
A man, a plan, a caddy, Ore,
|
| Roydd, a canal, Panama
|
(6) Keep going back and forth in this fashion, until you have 10
or 15 thousand words. However, at some point we may
need to backtrack. For
example, suppose the program next chose Belize:
A man, a plan, a caddy, Ore,
|
| Belize, Roydd, a canal, Panama
|
(7) We need a word on the left that starts with zileb, but there is no such word in my
dictionary. So we backtrack, erasing Belize and trying another
word that ends in e. Fortunately,
there are many of these. If there were not, we would
have to backtrack another step and erase Ore, replacing it with another word that starts with or.
There are a few more technical details. For example, whenever I choose
a word, I need to record it so that I don't use it again (but
un-record it when backtracking over it). For
details, see the program listing.
The Algorithm, Version 3
For version 3, I switched to a letter-by-letter approach, rather than phrase-by-phrase. I keep track, on both left and right, of the
phrases that have been completed, as well as a currently-being-worked-on phrase (one on the left and one on the right) that are not yet complete. These are the two middlemost phrases. So, it might go like this:
(1) Start off with the following seed text, divided into left and right halves, with incomplete phrases in the middle (in this case, the incomplete "aca" on the left, and no incomplete letters on the right). We always maintain the invariant that the concatenation of the letters on the left equals the
reverse of the concatenation of the letters on the right (including the letters in any incomplete phrase).
A man, a plan,
| aca
|
|
| a canal, Panama
|
(2) Choose a letter to add to both the left and right. I will choose the word that gives the highest number of completed prefixes on the left and suffixes on the right. It turns out this is the letter r:
A man, a plan,
| acar
|
| r
| a canal, Panama
|
(3) Again, add the letter that completes the most prefixes and suffixes; in this case e:
A man, a plan,
| acare
|
| er
| a canal, Panama
|
(4) At this point we could add another letter, or we could mark the phrase "acare" ("a care") as complete, moving it the completed part of the left-hand-side:
A man, a plan, a care
|
|
| er
| a canal, Panama
|
(5) Keep going like this, adding letters, and sometimes moving a current string of letters to the completed phrase list (on either left or right).
Just as in the previous version of the algorithm, at some point we may have to backtrack, undoing some moves we have made, and making an alternative choice instead.
Aside: The Problem of Language Induction
When I showed Hoey my palindrome, he said that he
was really thinking of sticking with noun phrases of the
form "<indefinite article> <noun>."
This is an example of the venerable
language induction problem: given the single sentence "A
man, a plan, a canal--Panama" as evidence, what language does it define?
It seems clear that it consists of a series of any number of noun phrases.
That is,
S => NP*
But from one example, you can't say much more. Hoey assumed that every noun phrase (except "Panama") must have an indefinite article:
NP => IndefArt Noun
IndefArt => "a" | "an"
while I was allowing proper nouns to appear anywhere, not just in the final "Panama":
NP => ProperNoun
NP => IndefArt Noun
IndefArt => "a" | "an"
Guy Steele suggested I should also try allowing other noun phrases,
such as "two Xs". Gold said the
language induction problem can't be solved in general, but others
(e.g. Horning, Mooney, Muggleton,
de
Marcken) have shown that it can be solved if you
use a "probably approximately correct" approach rather than a strictly logical formalism. With Hoey in
mind, I also generated a solution with all indefinite articles:
Version 2b: 2,211 Word Palindrome with only indefinite articles
Created: | 3:00-2003 (I forget what day, exactly)
| Words: | 2,211
| Letters: | 5,842
| Phrases: | 1,106
| Program: | pal2.py with a restricted dictionary
| Palindrome: | A man, a plan, a casa, a bait, a lag, a malt,
... (more) ..., a natl, a mag, a lati, a baas, a canal, Panama!
| Storyboard: |
| | | |  | | ...
| A man, | a plan, | a casa, | a bait, | a lag, | a malt,
|
|
Commentary: Hoey said he thought that a better word list and
a smarter program could get to ten times his 540-word palindrome,
using only noun phrases with indefinite articles. I'm pretty sure that
will never happen. The problem is a dirth of "a"s. According to
Hoey's rules, every phrase must start with the letter "a". That means
that either the rest of the word must be an exact reverse of another
word (and we know there are 1100 of these) or the phrase must have
another "a" in it somewhere, and it must be matched by two or more
other phrases. Phrases such as "a man", "a plan" and "a canal" work
well because they contain multiple "a"s. Now consider a phrase such
as "a biologist". If that appears in the palindrome, then somewhere else
the letters "tsigoloib" must appear. But note that those letters must
all appear in one word/phrase, because there is no "a", and we only
get word boundaries at "a"s. And of course, there is no single word that
contains those letters.
In general, take a word (such as "an asparagus" or "a biologist"),
split it into components around the "a"s (yielding
["n", "sp", "r", "gus"] and ["biologist"]). Collect the set of all
such segments, from all the phrases in the dictionary.
Now go back through the dictionary,
and for each word, see if the reverse of each of its components is
in this set. So "an asparagus" is good, because its reversed components
all appear in the set: "n" appears in many places (including "an asparagus"
itself), "ps" appears as a component in "a psalm", "r" appears in many
places (such as "a karat"), and "sug" appears in "a sugar". On
the other hand, "a biologist" is no good, because the component "tsigoloib"
does not appear.
When I first applied this test, I started with the 69,241-word anpdict.txt (containing only noun
phrases starting with "a" or "an"). I checked to see whether each
reversed component of a phrase appeared anywhere in any other phrase.
That eliminated "a biologist", but it let "a zoom" stick around,
because the reversed component "mooz" appears in "a schmooze". Doing
this level of reduction gets us down to a dictionary of 11,065
phrases. But on reflection I realized that not only does each
reversed component have to appear in some word, it must appear as a
component of some word. The fact that "mooz" appears in "a
schmooze" is not good enough. That would only work if "a zoom" could
be followed by the letters "hcsa", which of course it cannot; it must
be follwed by the letter "a". Using that stricter test, we get down
to only 4,528 noun phrases that are
actually usable. So to get a 5,400 word palindrome we would need to
start with a bigger dictionary.
Speed: The program consistently generates palindromes of over
2000 words in under 10 seconds using the 4,528 word dictionary. It doesn't
go much beyond that.
Version 3b: 2,473 Word Palindrome with only indefinite articles
Created: | 6-10-2016
| Words: | 2,473
| Letters: | 6,787
| Phrases: | 1,237
| Program: | pal3.py or pal3.ipynb.
| Palindrome: | A man, a plan, a caret, a rec,
... (more) ..., a banana, a nan, a macerater, a canal, Panama!
|
|