happiness thread (2015-06-15 16:56:13)

Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
message
That is definitely what I was afraid of. Unless, that is, you added scores for linguistically relevant transformations like metathesis, and ranked misalignments by how many steps of sound change it takes to get from one form to the other (weighting them by frequency). Did you do any of that, or is it basically just Levenshtein edit distance?
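To make the contrast concrete, here is a minimal sketch of an edit distance that goes one step beyond plain Levenshtein: adjacent swaps (metathesis) get their own cheap operation, and individual substitutions can be given lower costs. The function name, the cost values, and the `sub_cost` table are all hypothetical placeholders, not anything from the tool under discussion; real weights would have to come from attested sound-change frequencies.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel=1.0, swap=0.5):
    """Edit distance with weighted substitutions and a cheap adjacent swap.

    sub_cost: optional dict mapping (x, y) pairs to a substitution cost;
    unlisted pairs cost 1.0. All weights are illustrative assumptions.
    """
    sub_cost = sub_cost or {}

    def sub(x, y):
        if x == y:
            return 0.0
        # Treat the table as symmetric: look up (x, y), then (y, x).
        return sub_cost.get((x, y), sub_cost.get((y, x), 1.0))

    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                        # deletion
                d[i][j - 1] + indel,                        # insertion
                d[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # substitution or match
            )
            # Metathesis: two adjacent segments trade places.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[n][m]


print(weighted_edit_distance("ask", "aks"))            # swap counted as one cheap step
print(weighted_edit_distance("ask", "aks", swap=2.0))  # swap priced like two substitutions
print(weighted_edit_distance("pat", "bat", sub_cost={("p", "b"): 0.25}))
```

With `swap=2.0` and no `sub_cost` table this gives the same numbers as ordinary Levenshtein distance, which is the baseline the question above is asking about.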

"Quick-and-dirty" approaches are a major problem in bioinformatics, mostly outside of Europe. You end up with tools and pipelines that only work for a small portion of the data in practice, and it's usually not the interesting part. We live in the shadow of a seemingly immortal dictator called BLAST, which tries to speed up alignment by finding short n-mers (runs of several letters) common to two sequences and then extending them for as long as they keep matching. It's meant to be a first-pass lookup tool for people investigating a single gene, but in practice people will gladly run millions or even billions of short fragments (which it's not supposed to be used for) of diverse evolutionary heritage (another problem: alignment scoring schemes are usually distance-dependent) through the wretched thing.
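For anyone who hasn't looked under the hood, the seed-and-extend idea described above can be sketched in a few lines. This is a toy under loud assumptions: exact-match seeds and exact-match, ungapped extension only. It is not BLAST's actual algorithm, which scores extensions with a substitution matrix and does a good deal more. The function name and the parameter `k` are made up for illustration.

```python
def seed_and_extend(query, subject, k=3):
    """Toy seed-and-extend: find exact k-mers shared by both sequences,
    then grow each seed left and right while the letters keep matching.

    Returns (query_start, subject_start, length) tuples, longest first.
    """
    # Index the start positions of every k-mer in the subject.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)

    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            # Extend to the left.
            lo_q, lo_s = i, j
            while lo_q > 0 and lo_s > 0 and query[lo_q - 1] == subject[lo_s - 1]:
                lo_q -= 1
                lo_s -= 1
            # Extend to the right.
            hi_q, hi_s = i + k, j + k
            while (hi_q < len(query) and hi_s < len(subject)
                   and query[hi_q] == subject[hi_s]):
                hi_q += 1
                hi_s += 1
            hits.add((lo_q, lo_s, hi_q - lo_q))
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


print(seed_and_extend("GATTACA", "TTGATTACATT"))
print(seed_and_extend("GA", "GATTACA"))  # shorter than k: no seeds at all
```

The second call hints at why short fragments are awkward for this family of methods: a fragment with no seed long enough to index never gets extended in the first place.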

The result: it's accepted wisdom that there are far more different kinds of bacteria in any given environment than there actually are. (There's another story about falling in love with flawed marker genes that I'll tell you some other time, although the metagenomic analyses ostensibly corroborate them.)

If it helps any, think of each alignment score as the log transform of a giant joint probability. Then you'll be unable to go back to just thinking of it as an abstract 'score.'
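A small numerical sketch of that view, assuming an ungapped alignment whose columns are independent: if each column's score is the log of (probability of the pair under a "related" model) over (probability under a background model), then the summed score is exactly the log of the ratio of two joint probabilities. The two-letter alphabet and every number below are invented for illustration.

```python
import math

# Hypothetical background frequencies for a two-letter alphabet.
background = {"A": 0.5, "B": 0.5}

# Hypothetical joint probability of seeing the pair (x, y) in one column
# of an alignment between truly related sequences (sums to 1).
related = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def column_score(x, y):
    """Log-odds score for a single aligned pair."""
    return math.log(related[(x, y)] / (background[x] * background[y]))


def alignment_score(s, t):
    """The usual 'score': a sum of per-column values."""
    return sum(column_score(x, y) for x, y in zip(s, t))


def likelihood_ratio(s, t):
    """Joint probability under the related model over the background model."""
    p_related = math.prod(related[(x, y)] for x, y in zip(s, t))
    p_random = math.prod(background[x] * background[y] for x, y in zip(s, t))
    return p_related / p_random


s, t = "AABA", "AABB"
print(alignment_score(s, t))
print(math.log(likelihood_ratio(s, t)))  # same number, up to rounding
```

Seen this way, a scoring table is a statement about how often each pair is expected to show up, which is why one tuned for a particular evolutionary distance stops meaning what you think it means at another.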