One of the central principles in bioinformatics is the idea of Garbage In, Garbage Out. Essentially, no matter how robust an analysis program is or how much error correction is built into the algorithm, if you have crappy data as your input . . . you are going to get crappy results. We like to blame the program first, change the various settings second, and look for alternative programs when that doesn’t work. It is easier to keep looking until you find something that sort of works than to come to terms with repeating a costly experiment or sequencing run because the data just isn’t very good.
I thought about this today during class as I was reading through the 50-plus options in the standard BLAST alignment tool. For those of you who aren’t familiar with BLAST, the program is designed to compare sequences of letters (think strings of A, C, G and T) against other sequences of the same letters. The idea is to find the best matches between your sample and a database of reference sequences, that is, previously categorized and known sequences. A good match helps you determine where in a genome your sequence falls, what species it came from and a whole host of other things. The numerous settings help with the many situations that can come up: poorly generated data, few potential matches, high variability and so on. Regardless of the settings though, if you have junk data . . . you’re going to get junk results.
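
To make the point concrete, here is a rough sketch of the kind of sanity check you could run on a sequence before handing it to an aligner at all. It is plain Python and not part of BLAST. The function name and both thresholds (a minimum length of 50 and at most 5% ambiguous bases) are made up for illustration, so pick values that suit your own data.

```python
def sequence_quality_report(seq, min_length=50, max_ambiguous_fraction=0.05):
    """Flag a nucleotide sequence that is too short or too full of non-ACGT letters.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    seq = seq.upper()
    length = len(seq)
    # Anything other than A, C, G or T (for example N) counts as ambiguous here.
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    # Treat an empty sequence as fully ambiguous so that it always fails.
    fraction = ambiguous / length if length else 1.0
    return {
        "length": length,
        "ambiguous_fraction": fraction,
        "passes": length >= min_length and fraction <= max_ambiguous_fraction,
    }


if __name__ == "__main__":
    print(sequence_quality_report("ACGT" * 20))       # clean, long enough
    print(sequence_quality_report("ACGTNNNNNNNNNN"))  # short and mostly N
```

A check like this won’t rescue a bad sequencing run, but it does tell you that the problem is the input before you start blaming the settings.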