Phonemic Overrepresentation and Underrepresentation in Certain Type of Languages (2016-02-10 15:42:17)

Rhetorica Your Writing System Sucks
posts: 1292, Kelatetía: Dis, Major Belt 1
...I just so happen to be studying natural language processing right now, so: there are more sophisticated alternatives to raw subtraction that tell you useful things like mutual information. The Jensen–Shannon divergence is probably the most appropriate. It's related to the Kullback–Leibler divergence, which tells you how many extra bits per symbol you'd need, on average, to encode something from one language using a code (think Huffman coding) optimised for the other. Unlike KL, it compares each distribution against the average of the two, so a phoneme that is missing from one group can't cause a division-by-zero error, and it returns the same value in either direction.
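
For what it's worth, here is a rough sketch of that calculation in plain Python. It assumes each group of languages has already been boiled down to a phoneme-to-probability table; the names `kl_divergence`, `js_divergence`, `group_a` and `group_b` are mine, and the frequencies are made up purely for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits. p and q map phoneme -> probability.

    Assumes q[x] > 0 wherever p[x] > 0; otherwise this divides by zero.
    """
    total = 0.0
    for symbol, p_x in p.items():
        if p_x > 0:  # terms with p_x == 0 contribute nothing
            total += p_x * math.log2(p_x / q[symbol])
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    symbols = set(p) | set(q)
    p_full = {s: p.get(s, 0.0) for s in symbols}
    q_full = {s: q.get(s, 0.0) for s in symbols}
    # The mixture m is nonzero wherever p or q is, so KL against m is safe.
    m = {s: 0.5 * (p_full[s] + q_full[s]) for s in symbols}
    return 0.5 * kl_divergence(p_full, m) + 0.5 * kl_divergence(q_full, m)

# Made-up phoneme frequencies for two hypothetical groups of languages.
group_a = {"p": 0.5, "t": 0.5}
group_b = {"t": 0.5, "k": 0.5}

divergence = js_divergence(group_a, group_b)
```

Swapping the arguments gives the same number, which is the symmetry mentioned above. If you'd rather not roll your own, SciPy has `scipy.spatial.distance.jensenshannon`, though as far as I know that one returns the square root of the divergence (the Jensen–Shannon distance), so square it before comparing.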

More minimally, you could also try Laplace smoothing if you wanted a Bayesian representation for better population modelling; right now you're only looking at the means for these two classes of languages, not at the variance.
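
A minimal sketch of both ideas, assuming you have raw phoneme counts per group and a per-language share for whichever phoneme you care about. The function names, the `inventory` list and all the numbers are hypothetical; `alpha=1.0` is the classic add-one version of Laplace smoothing.

```python
from statistics import mean, variance

def laplace_smooth(counts, inventory, alpha=1.0):
    """Add-alpha smoothing: every phoneme in the inventory gets a
    pseudo-count of alpha, so unseen phonemes never end up at exactly 0."""
    total = sum(counts.get(s, 0) for s in inventory) + alpha * len(inventory)
    return {s: (counts.get(s, 0) + alpha) / total for s in inventory}

def mean_and_variance(shares):
    """Mean and sample variance of per-language shares for one phoneme."""
    return mean(shares), variance(shares)

# Made-up counts: /k/ was never observed in this group.
smoothed = laplace_smooth({"p": 3, "t": 1}, ["p", "t", "k"])

# Made-up per-language shares of one phoneme within one class of languages.
class_mean, class_variance = mean_and_variance([0.10, 0.12, 0.14])
```

With the variance in hand you can ask whether the gap between the two class means is actually large compared with the spread inside each class, rather than just subtracting one mean from the other.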

SINCERELY, A MACHINE LEARNING NERD.