happiness thread
Anthologica Universe Atlas / Forums / Miscellaneria / happiness thread

Morrígan Witch Queen of New York
posts: 303
Marquise
Everything basically. I need to prepare some more data...

http://dialawizard.tumblr.com/post/120528229284/dialawizard-related-to-the-other-data-this-is

also this, where the Indo-Iranian proves to be tricky with the current configuration
http://dialawizard.tumblr.com/post/120495109859/ive-made-some-important-progress-in-laying-the
Morrígan Witch Queen of New York
posts: 303
Marquise
So, the latest good news is that I implemented a few new gap penalty functions and so far it looks like the Indo-Iranian data is behaving better with a non-negative gap (using a convex gap function). Still, building more sample data will be a big help.
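(For anyone following along: a convex gap function penalizes long gaps sublinearly, so one long gap costs less than several scattered short ones. A toy sketch of the idea — the function names and cost values here are invented for illustration, not taken from the actual system:)

```python
import math

def affine_gap(length, open_cost=2.0, extend_cost=0.5):
    """Affine penalty: a fixed opening cost plus a linear extension cost."""
    return open_cost + extend_cost * (length - 1)

def convex_gap(length, open_cost=2.0, scale=1.5):
    """Convex (logarithmic) penalty: grows sublinearly with gap length,
    so one long gap is cheaper than several separate short ones."""
    return open_cost + scale * math.log(length)

# One 4-segment gap vs. four separate 1-segment gaps under the convex scheme:
assert convex_gap(4) < 4 * convex_gap(1)
```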
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
...I think I know what she's doing, and I'm not completely certain I like it.

Do you realize you'll need to build a whole new model for every language family?
Morrígan Witch Queen of New York
posts: 303
Marquise
There are concerns about that. But mostly I'm interested in using simulated data to examine how things like inherited vocabulary size and overlap between children affect information recoverability.

Though I have suspicions that when it comes to actual reconstruction, unsupervised machine learning might be viable.
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
Yyyeah, you should've started with that mentality, I think. Sequence alignment is my speciality, and the functions optimized by the standard approaches are grossly incorrect from a biological standpoint; they just have momentum because they're verifiable and objective. The thought of trying to bring that into a linguistics setting makes me squeamish.
Morrígan Witch Queen of New York
posts: 303
Marquise
Oh interesting.

Sequence alignment isn't the interesting problem here, though; for the most part that's probably going to be inferring proto-forms and rules from correspondences, which I'm able to get fairly handily. Without training on my Chechen-Ingush-Batsbi data, the system was able to pick out correspondences which I know to be correct. What will be interesting is seeing if it can use a (probably statistical) model to infer reasonable ancestor forms and identify conditioning environments.
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
So … you may have explained it before, but what are those charts depicting? I understand that historical linguistics has become increasingly influenced and informed by genetics, epidemiology and population biology over the past decades, but that still doesn't give me a clue what those charts you posted mean.
Morrígan Witch Queen of New York
posts: 303
Marquise
They depict the way that model parameters impact the performance of the algorithm vs some human-aligned data. Basically all of that involves sequence alignment stuff, and weight coefficients for my feature model.

I was thinking I could make a thread about this over in Terra Firma
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
Relatedly, how come neither the Linguistics nor (!) the Computational Linguistics BA program here felt the need to include statistics

in fact how come there are any university degrees at all that allow people to get away with not taking math classes

and most importantly why can't I get points for doing it anyway like a reasonable person
Morrígan Witch Queen of New York
posts: 303
Marquise
yeah, I don't get that. we DID have a quantitative methods course, but frankly it was terrible: the professor sucked (he ultimately failed to get tenure) and the book sucked. Keith Johnson's, I think, the orange one that does everything in R and explains nothing.
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
Yeah, same here. I mean, okay, it's vaguely more forgivable in traditional linguistics, but if I study CL and I don't have to take statistics what does that say about the value of my degree? (The regulations that prevent CS from being available in 90-point format are also terrible.)
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
That's obscene. You're both fired.
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
quoting Morrígan, Marquise:
They depict the way that model parameters impact the performance of the algorithm vs some human-aligned data. Basically all of that involves sequence alignment stuff, and weight coefficients for my feature model.

I was thinking I could make a thread about this over in Terra Firma

Yes, please do. I need to know what sequences you're aligning so I can rant drunkenly about how it's a complete farce. Except without alcohol.
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
FIRST AS A TRAGEDY
THEN AS A FARCE

THEN AS SEQUENCE ALIGNMENT

[image: vWA8lru.png]
Hallow XIII Primordial Crab
posts: 539
Marquis of Basel
karl marx, vladimir ulyanov and lev levenshtein
Morrígan Witch Queen of New York
posts: 303
Marquise
i'm afraid.


I'm just using sequence alignment as a first pass to get likely correspondences from cognate pairs (or more, but alignment in n dimensions might not yield any better results than doing all pairs and tree-building). So it seems pretty uncomplicated.
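(A pairwise first pass like this is usually some variant of global alignment à la Needleman-Wunsch. A toy version over plain segment strings, with identity scoring standing in for the real feature-based cost — names and costs are mine, not the actual system's:)

```python
def align(a, b, gap=1.0, sub=1.0):
    """Global (Needleman-Wunsch) alignment cost between two segment sequences.
    Identity scoring is a stand-in for a real feature-based substitution cost."""
    n, m = len(a), len(b)
    # dp[i][j] = cheapest cost to align a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitute
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]

# One substitution (k -> h) separates these two "cognates":
assert align("kat", "hat") == 1.0
```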
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
That is definitely what I was afraid of. Unless you added scores for linguistically-relevant transformations like metathesis and ranked misalignments according to how many steps of sound change are required to go from one to the other (and weighted them according to frequency). Did you do any of that, or is it basically just Levenshtein edit distance?

"Quick-and-dirty" approaches are a major problem in bioinformatics, mostly outside of Europe. You end up with tools and pipelines that work for only a small portion of the data in practice, and it's usually not the interesting parts. We live in the shadow of a seemingly-immortal dictator called BLAST, which tries to speed up alignments by finding n-mers of several letters common to two sequences and then extending them as long as they match; it's meant to be a first-pass lookup tool for people investigating and working with a single gene, but in practice people will gladly run millions or even billions of short fragments (which it's not supposed to be used for) with diverse evolutionary heritage (another problem: aligned scoring schemes are usually distance-dependent) through the wretched thing.

The result of this: it's accepted wisdom that there are far more different kinds of bacteria in any given environment than there actually are. (Although there's another story about falling in love with flawed marker genes that I'll tell you some other time, the metagenomic analyses ostensibly corroborate them.)

If it helps any, think of each alignment as the log transform of a giant joint probability. Then you'll be unable to go back to just thinking of it as an abstract 'score.'
Jipí der saz ûf eime steine
posts: 291
Transition Metal, Marburg, Germany
[image: Tate-blast.jpg]

(This one was short-lived, though)
Rhetorica Your Writing System Sucks
posts: 1292
Kelatetía: Dis, Major Belt 1
This, right? I actually had a boxed copy of it (okay—just the manuals and box) that I found at a junk shop. No idea where it is now.
Morrígan Witch Queen of New York
posts: 303
Marquise
Definitely NOT using Levenshtein. Each segment in a sequence is represented by a multivalued feature vector. Right now, this is just a multidimensional vector distance weighted by the strength of each feature (they use different scales), but I'd prefer it represented a probability (or -log thereof) that a pair is related, which is obviously more complicated to model.
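(A weighted feature distance of that general shape might look like this — the feature names, values, and weights below are invented for illustration, not the actual model's:)

```python
# Hypothetical segments as feature vectors; names and weights are illustrative.
FEATURES = ("voice", "place", "manner")
WEIGHTS = {"voice": 0.5, "place": 1.0, "manner": 1.5}

P = {"voice": 0.0, "place": 0.0, "manner": 0.0}  # stand-in for /p/
B = {"voice": 1.0, "place": 0.0, "manner": 0.0}  # stand-in for /b/
T = {"voice": 0.0, "place": 1.0, "manner": 0.0}  # stand-in for /t/

def segment_distance(x, y):
    """Euclidean distance between feature vectors, weighted per feature."""
    return sum(WEIGHTS[f] * (x[f] - y[f]) ** 2 for f in FEATURES) ** 0.5

# With these weights, a voicing difference (/p/ vs /b/) costs less than
# a place difference (/p/ vs /t/):
assert segment_distance(P, B) < segment_distance(P, T)
```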

I don't have metathesis explicitly implemented yet, but the algorithm already has a way of comparing short subsequences (length 1 to n, though anything above n=3 is absurd, and even that is questionable), so a 2-2 comparison would cover segments where one underwent metathesis. There are cases where this is probably not sufficient, though.

The ranking is an interesting problem, but conceivably that's the interesting problem. Given a set of correspondences and environments, I'll need to figure out a way to work backward, and possibly re-run the alignments with new information once it turns out that certain reconstructions, or the alignments we derived, are not viable.

The most important question is how this system performs when given garbage, viz. a set of chance resemblances between unrelated languages. I need to build an algorithm that can tell these apart, or at least determine that the relationship is a chance one.

I'll start a thread some time tonight if I'm able to get the time. I'm supposed to have dinner with my cousin, so who knows.