Finding good names
It happens all the time. I am about to start a new project and I can't find a good name. I could postpone the decision until publication. With my completion rate, that would save me a lot of thinking. Still, the name of a project, a programming one at least, is scattered all over the place from the start: you create a tree in your cvs/git/svn, you create a package/namespace/module in your programming environment and you write the project's name in the documentation (at least you intend to). You can change the name later but you save a lot of work if you start with the right one.
There was a time when you could pick a random word from the dictionary and be done with it. Today, it seems, all the real words are taken, even vellication. Here I use a widely accepted definition of "taken": the .com is owned by someone else. There are other top-level domains (.net, .org, .info) but the squatters bought most of the dictionary there too. If we don't limit ourselves to the dictionary, and why should we after all, combinatorics saves the day. There are just too many ways to put letters together. Squatters can buy all the three-letter acronyms but that's as far as they can get.
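To put a number on the combinatorics argument, here is a small sketch that counts the possible lowercase strings of a given length. It assumes a plain 26-letter alphabet and ignores digits and hyphens, which domain names also allow; the function name is only illustrative.

```python
ALPHABET_SIZE = 26  # assumption: lowercase a-z only


def count_strings(length: int) -> int:
    """Number of distinct lowercase strings of exactly `length` letters."""
    return ALPHABET_SIZE ** length


# Three letters give 17,576 combinations, a space small enough to buy out.
# Every extra letter multiplies the space by 26.
for n in range(3, 9):
    print(n, count_strings(n))
```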
My first idea was to use googleability. Googleability, you'll recall, is the quality of being searchable, and found, on Google. Words like "sun", "act" and "run" have low googleability. They are among the top 0.1% of three-letter words by number of Google hits. At the other end of the spectrum, you get words like "bxu" and "yzo" that no one seems to put on their website (though this self-referential sentence invalidates its claim). If your friend tells you about YZO's new BXU, you can google it and you will find the right website. Googleability works but I wasn't the first to think about it. Almost all pronounceable three- and four-letter words with high googleability have a registered .com. Exploring longer words is a challenge. Google doesn't like scripts. You have to fake a browser and limit the query rate. I didn't. I seriously considered finding a new Internet service provider when I noticed that Google had banned me.
What is pronounceable? Some people can pronounce "w00t" and "pwned" but most can't. If you limit yourself to alternating vowels and consonants you will get pronounceable words but you will miss really good ones. On the other hand, allowing a sequence of more than one consonant will yield words like "evzupricb". I was looking for a quality more subtle than just being pronounceable. Some words are warm, others are frigid. Anglo-Saxon and Norse words, for example, are more visceral than Latin ones. The battles against the Orcs are not fought in Polytanie, they are fought in Azeroth. How could I capture this warmth, this elusive quality with the taste of raw meat?
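For comparison, the naive approach of strictly alternating vowels and consonants can be sketched in a few lines. This is only an illustration of the baseline dismissed above, not part of Yould; the letter sets (with "y" counted as a consonant) and the function name are assumptions.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"  # assumption: "y" treated as a consonant


def alternating_word(length, rng=random):
    """Build a word by strictly alternating consonants and vowels."""
    start_with_vowel = rng.random() < 0.5
    letters = []
    for i in range(length):
        # Even positions follow the starting class, odd positions the other.
        use_vowel = (i % 2 == 0) == start_with_vowel
        letters.append(rng.choice(VOWELS if use_vowel else CONSONANTS))
    return "".join(letters)


print([alternating_word(6) for _ in range(5)])
```

Every letter is drawn uniformly, so the output is pronounceable but carries none of the flavour of a real text, and clusters like "th" or "pr" can never appear.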
Actually, that's quite simple. A lively text will contain more warm words than frigid ones. Extracting the warmth is just a matter of measuring the probability of going from one letter to the next, including the probability that a given letter will end the word. A probabilistic engine trained on the King James Bible will produce a lot of old English sounding words. One last change is required though: with single-letter transitions, the probability of going from "l" to "l" will be way too high and you'll end up with words like "alllerit". Conditioning on a two-letter prefix fixes that. The full transition table with a two-letter prefix could be large, but recording only the transitions that you actually see generates a table under 100k. And it makes sense to do so. In the King James Bible, the only thing that can follow "pr" is a vowel. Furthermore, "pr" will never be at the end of a word, unlike "ps", which ends the word 80% of the time.
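The idea above can be sketched as a small letter-level Markov chain with a two-letter prefix. This is a minimal illustration under stated assumptions, not Yould's actual code: the `^` and `$` padding symbols, the function names and the `max_len` cut-off are all made up for the example.

```python
import random
import re
from collections import Counter, defaultdict

START = "^"  # assumption: padding symbol for the beginning of a word
END = "$"    # assumption: symbol for the end of a word


def train(text):
    """Count, for each two-letter prefix seen in `text`, what follows it."""
    table = defaultdict(Counter)
    for word in re.findall(r"[a-z]+", text.lower()):
        padded = START * 2 + word + END
        for i in range(len(padded) - 2):
            table[padded[i:i + 2]][padded[i + 2]] += 1
    return table


def end_probability(table, prefix):
    """Fraction of the time `prefix` was the last two letters of a word."""
    followers = table.get(prefix, {})
    total = sum(followers.values())
    return followers.get(END, 0) / total if total else 0.0


def generate(table, rng=random, max_len=12):
    """Random-walk the table from the start prefix until END is drawn."""
    prefix, letters = START * 2, []
    while len(letters) < max_len:  # max_len is an arbitrary safety cut-off
        followers = table.get(prefix)
        if not followers:  # only happens with an empty training text
            break
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        prefix = prefix[1] + nxt
    return "".join(letters)


table = train("in the beginning god created the heaven and the earth")
print([generate(table) for _ in range(5)])
```

Only prefixes that actually occur in the training text get an entry, which is what keeps the table small. With a training text this short the output mostly repeats the input words; the interesting blends only show up with a book-sized corpus.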
Now that we have the rules to generate as many vivid words as we want, we just need to keep trying until we find one for which the .com is still available. Is it good? Judge for yourself: as I write this, these words are available in at least two of the four main top-level domains: abongs, abuted, affere, againg, amigeth, andreph, becomoses, becured, beffer, behosest, brisraell, challed, chout, egived, egyptayed, fatieford, fifies, flooses, gaing, humple, israndee, istion, joyether, judan, judgenst, kethich, morthers, moshall, offeruiled, prieven, reoph, rient, rought, sayints, shaarding, shaith, shalleak, sheaveth, speople, spild, tathild, therecaust, theried, thits, togypt, waying, whing, whings, whous, womenty. Nothing was hand picked, just random generation. We get names but we also get verbs, adverbs and adjectives. I wonder if this is how they create artificial languages.
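The filtering step described above could look roughly like the sketch below. The post does not say how availability was checked, so the check is passed in as a function. The default, `looks_unregistered`, is a crude assumption: a name that fails to resolve in DNS may still be registered, so a real tool would need a WHOIS lookup or a registrar's API instead. All names here are hypothetical.

```python
import socket

TLDS = ("com", "net", "org", "info")  # the four TLDs mentioned in the post


def looks_unregistered(domain):
    """Crude heuristic (assumption): no DNS record *may* mean the name is free."""
    try:
        socket.gethostbyname(domain)
    except socket.gaierror:
        return True
    return False


def find_candidates(words, is_free=looks_unregistered, min_free=2):
    """Yield (word, free_tlds) for words free in at least `min_free` TLDs."""
    for word in words:
        free = [tld for tld in TLDS if is_free(f"{word}.{tld}")]
        if len(free) >= min_free:
            yield word, free
```

Feeding it the output of a word generator and leaving it running overnight gives a shortlist to read in the morning. Keeping the availability check as a parameter also makes it easy to throttle or swap out.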
Changing the training set for the weekly top 100 on Project Gutenberg produces a contemporary lexicon: adays, amalls, anned, appeance, appene, bectinty, cassion, descess, droort, eliked, exterid, flonst, hered, honsir, humety, immang, intanch, jused, losed, maketter, mathere, matinswer, opecous, othersuld, pears, pened, plorn, preathilly, prowhich, reard, reentich, retcheing, scrove, shoul, soothred, soubjes, spilly, squoully, therigh, throppen, unlip, vidded, wanked, whallon, wherents, whicars, whide, wouldishon, woute, youthered. Not as exotic but still credible.
I still don't have a magic trick to pick a project name but at least I can now limit the search space. I let my computer check which words are taken during the night and I contemplate a few pages of candidates in the morning. My name generator even found its own name. Here is Yould. Now I need to find good texts for the training set.
Comments
Isn't this a pared-down implementation of a Markov Chain?
Yes, it's a Markov-Chain implementation.
Can you define what you mean by "diverge"? You say "egyptated" is not related to the King James Bible at all. Umm... Egypt is mentioned heavily all throughout the Old Testament, especially in Exodus. How does "egyptated" diverge, exactly?
When I say that Yould will diverge from the training text, I mean that no matter how well you train it, it will still produce a lot of junk. As you pointed out, "egyptated" is generated because the training saw the word "Egypt" many times and because it figured that adding "ed" to a prefix ending with "t" is a valid construction.
Nevertheless, "egyptated" is meaningless; its sound is close to some words from the KJB but its meaning, and it has basically none, is really far from anything in the KJB. When Arne suggests using Yould to find names related to a book, I just want to warn him that most words generated by Yould will just be a bunch of nonsense.
Reminds me of one of my old projects: http://lucumr.pocoo.org/projects/giveitaname/
Your generator is pretty good too. Can you tell us more on how it works?

Your name generator sounds great!
Now, the only thing missing for a real domain-finder would be an integrated domain-check:
Just imagine that for authors:
"Let it read your book, and it will tell you names which are still free as domain names and which fit your style."
Now add to it your experiment with checking googleability automatically (but throttled this time and only for select words), and you have an even greater tool for any writer.
Aside from fantasizing about the merchantability of your program, I'm thinking of using it in a Python program for a free roleplaying system.
Thanks a lot!
Best wishes, Arne PS: Yes, I experimented a bit with the markdown syntax :)
This sounds like a good idea and I encourage you to use the Yould engine for such a tool, but be aware that Yould will diverge a lot from the training text. As an example, the engine trained on the King James Bible will output "egyptated", which sounds nice and plausible but isn't related to the King James Bible at all. Your tool would need a little something to find words that are only related to the said book.
Also be aware that it is against Google's terms of service to use a tool to perform Google searches. You can do something with the SOAP API but you are limited to 1000 queries a day or something like that.