Marking DNA as spam
The idea behind bioinformatics, at least for some people, is that since the information encoded in DNA is sequential, you can parse it more or less the same way you parse an English text. You can apply a regex to DNA, and you can search DNA just as you can search text. But how far can you push it?
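As a small illustration of treating DNA as plain text, here is a minimal Python sketch that searches a sequence with a regex. The sequence is a made-up toy string, not real data; the pattern GAATTC is the recognition site of the EcoRI restriction enzyme.

```python
import re

# Hypothetical toy sequence, invented for this illustration only.
dna = "TTGAATTCAGGCTAGAATTCCA"

# GAATTC is the recognition site of the EcoRI restriction enzyme.
site = re.compile("GAATTC")

# Collect the 0-based start position of every match, exactly as one
# would when searching for a word in ordinary text.
positions = [m.start() for m in site.finditer(dna)]
print(positions)
```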
The part of information processing science formerly known as AI developed a whole range of "machine learning" techniques. Most never made it into the real world, but once in a while a new idea is ripe and you see it spread like a storm. Most techniques that tried to model how the brain works are miserable failures, but every so often someone understands how people use information and applies really simple patterns to turn worthless data into the most valuable repository of knowledge.
If you were to look for "Bayesian classification" in a stats textbook, you would probably find nothing, except maybe that it's a method you should avoid. Naive Bayesian classification is a way to combine independent pieces of evidence about some event to refine your knowledge about it. The catch is right there: "independent" is what they call a "strong assumption" in probability, and you almost never see truly independent evidence except for trivial stuff.
When Paul Graham solved the problem of spam, he did something that would drive any stats teacher crazy: he said that the words in an email are independent. Why did he do that? If you do, you can use a really simple formula; if you don't, the computation is so hard that no computer can undertake it for a 300-word email. While statisticians are still screaming, Bayesian filters are filtering the world's spam, and they work better than anything else.
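To make the "really simple formula" concrete, here is a minimal Python sketch of the usual naive Bayes combination rule: given a spam probability p for each word, the combined probability is prod(p) / (prod(p) + prod(1 - p)). The function name `combine`, the clipping constant `eps`, and the example probabilities are assumptions made for this illustration; a real filter also has to decide how to estimate each per-word probability and which words to keep, which is not shown here.

```python
import math

def combine(probs, eps=1e-6):
    """Combine per-word spam probabilities, assuming the words are independent.

    Equivalent to prod(p) / (prod(p) + prod(1 - p)), but computed through
    summed log-odds so that long messages do not underflow to zero.
    """
    # Keep every probability strictly inside (0, 1) so the logs are finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in clipped)
    # Numerically stable logistic function: turn log-odds back into a probability.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)

# Hypothetical per-word probabilities: two "spammy" words and one innocent one.
score = combine([0.99, 0.95, 0.20])
print(round(score, 4))
```

Working in log-odds is only a numerical convenience: multiplying 300 small probabilities directly would lose precision, while adding their logarithms does not.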
How about DNA? The original question was "given the few proteins that we know to be localized in the nucleolus, can we find a pattern that will help us find others?" If someone asks you that, the first logical reaction should be to grab something to drink. Now that you've taken care of that, you have three options: you can say that you don't know and that you'd prefer to work on something as boring as applying design patterns to bioinformatics, you can apply some techniques that should theoretically work, or you can dive in and try what works in the real world, even though it shouldn't.
What we did is shocking: we didn't implement a Bayesian classifier for DNA, we used the excellent spamoracle and fed it DNA sequences as if they were emails. I kind of hope that my stats teacher won't read this far, but what is even more shocking is that it works, to some extent.
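A sequence has no spaces, so a word-based filter needs some notion of "words" before it can count anything. One common way to get them, shown below as a sketch, is to cut the sequence into overlapping windows of k letters. This is an assumption made for illustration, not a description of how the sequences were actually prepared for spamoracle; the helper names `kmers` and `as_message_body`, the choice of k = 4, and the toy sequence are all hypothetical.

```python
def kmers(sequence, k=4):
    """Return the overlapping k-letter windows of a sequence, in order."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def as_message_body(sequence, k=4):
    """Join the k-mers with spaces so the result looks like ordinary text."""
    return " ".join(kmers(sequence, k))

# Hypothetical toy sequence, invented for this illustration only.
body = as_message_body("atggcgtac", k=4)
print(body)
```

The window size is a trade-off: short words occur everywhere and carry little signal, while long words are so rare that the filter sees most of them only once.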
We don't find all the nucleolar proteins, but we find a lot of them with very few false positives. Furthermore, spamoracle has really nice regex queries on the training set. That way we can explore what kinds of "words" lead to correct classifications. And that's the interesting part, because you can't do that with neural nets and other machine learning techniques. And what we find fits pretty well with what people find in wet labs.
Now, a word of warning is needed. BLAST does not work like Google, and there is a reason for that. You typically search for a few exact words on Google, and you typically do a fuzzy search on a really long DNA word with BLAST. Before you can replace all your wet lab staff with monkeys punching sequences into a Bayesian filter, a lot of work must be done on building better training sets and on biologically relevant ways to search for a word in correlation tables.
