happiness thread
Anthologica Universe Atlas / Forums / Miscellaneria / happiness thread

Morrígan Witch Queen of New York
posts: 303
Marquise
Everything basically. I need to prepare some more data...

http://dialawizard.tumblr.com/post/120528229284/dialawizard-related-to-the-other-data-this-is

also this, where the Indo-Iranian proves to be tricky with the current configuration
http://dialawizard.tumblr.com/post/120495109859/ive-made-some-important-progress-in-laying-the
Morrígan Witch Queen of New York
posts: 303
Marquise
So, the latest good news is that I implemented a few new gap penalty functions and so far it looks like the Indo-Iranian data is behaving better with a non-negative gap (using a convex gap function). Still, building more sample data will be a big help.
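(For anyone following along: a convex gap function penalizes long gaps sublinearly, so one long gap costs less than several scattered short ones. A toy sketch of the idea — the function names and cost values here are invented for illustration, not taken from the actual system:)

```python
import math

def affine_gap(length, open_cost=2.0, extend_cost=0.5):
    """Affine penalty: a fixed opening cost plus a linear extension cost."""
    return open_cost + extend_cost * (length - 1)

def convex_gap(length, open_cost=2.0, scale=1.5):
    """Convex (logarithmic) penalty: grows sublinearly with gap length,
    so one long gap is cheaper than several separate short ones."""
    return open_cost + scale * math.log(length)

# One 4-segment gap vs. four separate 1-segment gaps under the convex scheme:
assert convex_gap(4) < 4 * convex_gap(1)
```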
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
...I think I know what she's doing, and I'm not completely certain I like it.

Do you realize you'll need to build a whole new model for every language family?
Morrígan Witch Queen of New York
posts: 303
Marquise
There are concerns about that. But mostly I'm interested in using simulated data to examine how things like inherited vocabulary size and overlap between children affect information recoverability.

Though I have suspicions that when it comes to actual reconstruction, unsupervised machine learning might be viable.
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
Yyyeah, you should've started with that mentality, I think. Sequence alignment is my speciality, and the functions optimized by the standard approaches are grossly incorrect from a biological standpoint; they just have momentum because they're verifiable and objective. The thought of trying to bring that into a linguistics setting makes me squeamish.
Morrígan Witch Queen of New York
posts: 303
Marquise
Oh interesting.

Sequence alignment isn't the interesting problem here, though; for the most part that's probably going to be inferring proto-forms and rules from correspondences, which I'm able to get fairly handily. Without training on my Chechen-Ingush-Batsbi data, the system was able to pick out correspondences which I know to be correct. What will be interesting is seeing if it can use a (probably statistical) model to infer reasonable ancestor forms and identify conditioning environments.
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
So … you may have explained it before, but what are those charts depicting? I understand that historical linguistics has become increasingly influenced and informed by genetics, epidemiology and population biology over the past decades, but that still doesn't give me a clue what those charts you posted mean.
Morrígan Witch Queen of New York
posts: 303
Marquise
They depict the way that model parameters impact the performance of the algorithm vs some human-aligned data. Basically all of that involves sequence alignment stuff, and weight coefficients for my feature model.

I was thinking I could make a thread about this over in Terra Firma
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
Relatedly, how come neither the Linguistics nor (!) the Computational Linguistics BA program here felt the need to include statistics

in fact how come there are any university degrees at all that allow people to get away with not taking math classes

and most importantly why can't I get points for doing it anyway like a reasonable person
Morrígan Witch Queen of New York
posts: 303
Marquise
yeah, I don't get that. we DID have a quantitative methods course, but frankly it was terrible: the professor sucked (he ultimately failed to get tenure) and the book sucked. Keith Johnson's, I think, the orange one that does everything in R and explains nothing.
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
Yeah, same here. I mean, okay, it's vaguely more forgivable in traditional linguistics, but if I study CL and I don't have to take statistics what does that say about the value of my degree? (The regulations that prevent CS from being available in 90-point format are also terrible.)
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
That's obscene. You're both fired.
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
quoting Morrígan, Marquise:
They depict the way that model parameters impact the performance of the algorithm vs some human-aligned data. Basically all of that involves sequence alignment stuff, and weight coefficients for my feature model.

I was thinking I could make a thread about this over in Terra Firma

Yes, please do. I need to know what sequences you're aligning so I can rant drunkenly about how it's a complete farce. Except without alcohol.
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
FIRST AS A TRAGEDY
THEN AS A FARCE

THEN AS SEQUENCE ALIGNMENT

[image: vWA8lru.png]
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
karl marx, vladimir ulyanov and lev levenshtein
Morrígan Witch Queen of New York
posts: 303
Marquise
i'm afraid.


I'm just using sequence alignment as a first pass to get likely correspondences from cognate pairs (or more, but alignment in n dimensions might not yield any better results than doing all pairs and tree-building). So it seems pretty uncomplicated.
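(A pairwise first pass like this is usually some variant of global alignment à la Needleman-Wunsch. A toy version over plain segment strings, with identity scoring standing in for the real feature-based cost — names and costs are mine, not the actual system's:)

```python
def align(a, b, gap=1.0, sub=1.0):
    """Global (Needleman-Wunsch) alignment cost between two segment sequences.
    Identity scoring is a stand-in for a real feature-based substitution cost."""
    n, m = len(a), len(b)
    # dp[i][j] = cheapest cost to align a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitute
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]

# One substitution (k -> h) separates these two "cognates":
assert align("kat", "hat") == 1.0
```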
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
That is definitely what I was afraid of. Unless you added scores for linguistically-relevant transformations like metathesis and ranked misalignments according to how many steps of sound change are required to go from one to the other (and weighted them according to frequency). Did you do any of that, or is it basically just Levenshtein edit distance?

"Quick-and-dirty" approaches are a major problem in bioinformatics, mostly outside of Europe. You end up with tools and pipelines that work for only a small portion of the data in practice, and it's usually not the interesting parts. We live in the shadow of a seemingly-immortal dictator called BLAST, which tries to speed up alignments by finding n-mers of several letters common to two sequences and then extending them as long as they match; it's meant to be a first-pass lookup tool for people investigating and working with a single gene, but in practice people will gladly run millions or even billions of short fragments (which it's not supposed to be used for) with diverse evolutionary heritage (another problem: aligned scoring schemes are usually distance-dependent) through the wretched thing.

The result of this: it's accepted wisdom that there are far more different kinds of bacteria in any given environment than there actually are. (Although there's another story about falling in love with flawed marker genes that I'll tell you some other time, the metagenomic analyses ostensibly corroborate them.)

If it helps any, think of each alignment as the log transform of a giant joint probability. Then you'll be unable to go back to just thinking of it as an abstract 'score.'
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
[image: Tate-blast.jpg]

(This one was short-lived, though)
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
This, right? I actually had a boxed copy of it (okay—just the manuals and box) that I found at a junk shop. No idea where it is now.
Morrígan Witch Queen of New York
posts: 303
Marquise
Definitely NOT using Levenshtein. Each segment in a sequence is represented by a multivalued feature vector. Right now, this is just a multidimensional vector distance weighted by the strength of each feature (they use different scales), but I'd prefer it represented a probability (or -log thereof) that a pair is related, which is obviously more complicated to model.
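(A weighted feature distance of that general shape might look like this — the feature names, values, and weights below are invented for illustration, not the actual model's:)

```python
# Hypothetical segments as feature vectors; names and weights are illustrative.
FEATURES = ("voice", "place", "manner")
WEIGHTS = {"voice": 0.5, "place": 1.0, "manner": 1.5}

P = {"voice": 0.0, "place": 0.0, "manner": 0.0}  # stand-in for /p/
B = {"voice": 1.0, "place": 0.0, "manner": 0.0}  # stand-in for /b/
T = {"voice": 0.0, "place": 1.0, "manner": 0.0}  # stand-in for /t/

def segment_distance(x, y):
    """Euclidean distance between feature vectors, weighted per feature."""
    return sum(WEIGHTS[f] * (x[f] - y[f]) ** 2 for f in FEATURES) ** 0.5

# With these weights, a voicing difference (/p/ vs /b/) costs less than
# a place difference (/p/ vs /t/):
assert segment_distance(P, B) < segment_distance(P, T)
```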

I don't have metathesis explicitly implemented yet, but the algorithm already has a way of comparing short subsequences (length 1 to n, though anything above n=3 is absurd, and even that is questionable), so a 2-2 comparison would cover segments where one underwent metathesis. There are cases where this is probably not sufficient, though.

The ranking is an interesting problem, but conceivably that's the interesting problem. Given a set of correspondences and environments, I'll need to figure out a way to work backward, and possibly re-run the alignments with new information once it turns out that certain reconstructions, or the alignments we derived, are not viable.

The most important question is how this system performs when given garbage, viz. a set of chance resemblances between unrelated languages. I need to build an algorithm that can tell these apart, or at least determine that the relationship is a chance one.

I'll start a thread some time tonight if I'm able to get the time. I'm supposed to have dinner with my cousin, so who knows.