Phonemic Overrepresentation and Underrepresentation in Certain Types of Languages

Yaali Annar The Gote, posts: 94, Initiate Speaker
Before I start, bear in mind that even though Phoible has data for almost 2,000 languages, its coverage is rather uneven. IIRC, African languages are better documented in Phoible than those of other regions.

Anyway, I want to see which phonemes are more represented in certain types of languages. For example, here are the ten most common phonemes in languages with clicks, along with the number of click languages each one occurs in.

Phoneme Frequency
m	      18
i	      18
u	      18
a	      17
j	      16
w	      16
s	      16
k	      15
p	      15
h	      14


This data doesn't tell us much. All ten of those phonemes are common not only in click languages but in almost all languages. So what I do instead is compare each phoneme's frequency in the click languages (as a percentage of languages) to its frequency across all the languages registered in Phoible. If we take the difference between the two frequencies, we can find the most over-represented phonemes in click languages:

Phoneme Subset %   All %      Difference
kǃ      72.22      0.81       71.41
kǀ      66.67      0.75       65.92
pʰ      72.22      17.21      55.02
kʰ      72.22      17.77      54.45
t̠ʃʰ     61.11      6.86       54.25
tʰ      66.67      13.28      53.39
tsʰ     55.56      5.11       50.44
kǃʰ     44.44      0.50       43.95
ŋǃ      44.44      0.50       43.95
kǁ      44.44      0.50       43.95
kʼ      50.00      9.41       40.59
tsʼ     44.44      5.11       39.33
ɬ       44.44      5.99       38.46
kǀʰ     38.89      0.44       38.45
t̠ʃʼ     44.44      6.92       37.52
tʼ      44.44      7.23       37.21
ɡǃ      33.33      0.37       32.96
ɡǀ      33.33      0.37       32.96
d̠ʒ      66.67      34.23      32.44
pʼ      38.89      6.92       31.97


I use simple subtraction instead of division to avoid division-by-zero issues.
Anyway, we can see that in click languages, click consonants are over-represented, which is obvious. What is less obvious is that aspirated stops and ejectives are over-represented as well.
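
For anyone who wants to try the comparison outside the database, here is a minimal Python sketch of the subtraction approach described above. The inventories are made-up toy data, and `percent_with`, `representation_diff` and `has_click` are names invented for this illustration; none of it is the original query or anything from Phoible itself.

```python
from collections import Counter

CLICK = "\u01c3"  # the click letter ǃ, written as an escape to avoid look-alikes

# Hypothetical toy inventories (language -> set of phonemes); not Phoible data.
inventories = {
    "lang_a": {"m", "i", "a", "k", "k" + CLICK, "kʰ"},
    "lang_b": {"m", "i", "u", "k", "k" + CLICK, "ɡ"},
    "lang_c": {"m", "a", "u", "k", "ɡ"},
    "lang_d": {"m", "i", "a", "t", "ɡ"},
}

def percent_with(phoneme_counts, n_languages):
    """Turn per-phoneme language counts into percentages of languages."""
    return {p: 100.0 * c / n_languages for p, c in phoneme_counts.items()}

def representation_diff(inventories, in_subset):
    """Return {phoneme: subset % minus overall %} for every phoneme seen."""
    subset = [inv for inv in inventories.values() if in_subset(inv)]
    all_counts = Counter(p for inv in inventories.values() for p in inv)
    sub_counts = Counter(p for inv in subset for p in inv)
    all_pct = percent_with(all_counts, len(inventories))
    sub_pct = percent_with(sub_counts, len(subset))
    # A phoneme absent from the subset is treated as 0 %, not dropped.
    return {p: sub_pct.get(p, 0.0) - all_pct[p] for p in all_pct}

def has_click(inv):
    """Assumed subset test: any phoneme containing the click letter."""
    return any(CLICK in phoneme for phoneme in inv)

diff = representation_diff(inventories, has_click)
for phoneme, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phoneme}\t{d:+.2f}")
```

Sorting the same dictionary in ascending order gives the under-represented end of the list.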

How about the most under-represented phonemes? Well...

Phoneme Subset %   All %      Difference
ɡ       0.00       68.89      -68.89
ɔ       0.00       46.13      -46.13
˦       0.00       29.68      -29.68
ɾ       11.11      33.92      -22.80
oː      5.56       27.00      -21.44
eː      5.56       26.87      -21.31
kp      0.00       19.20      -19.20
ɡb      0.00       18.95      -18.95
iː      16.67      33.85      -17.19
ɨ       5.56       22.26      -16.70
t       66.67      83.23      -16.56
ə       11.11      27.24      -16.13
˧       0.00       15.71      -15.71
k       83.33      98.69      -15.36
o       61.11      75.81      -14.70
ɣ       0.00       14.65      -14.65
ɔː      0.00       14.15      -14.15
n       77.78      89.71      -11.94
aː      22.22      33.23      -11.01
ɔ̃       0.00       10.79      -10.79


I have no idea why /g/ appears there.

Edit: There seems to be something funky with how the Postgres JOIN behaves in my query. This might cause some values to be listed as 0 when in reality they are larger than 0.
Yaali Annar The Gote, posts: 94, Initiate Speaker
So, this makes the query result for under-represented phonemes less accurate, but the result for over-represented phonemes is still somewhat accurate. Here are the over-represented phonemes in languages with ejectives:
Phoneme Subset %   All %      Difference
kʼ      94.97      9.41       85.55
tʼ      72.96      7.23       65.72
pʼ      69.81      6.92       62.89
t̠ʃʼ     69.81      6.92       62.89
tsʼ     51.57      5.11       46.46
ʔ       85.53      45.95      39.59
ʃ       76.10      41.52      34.58
qʼ      32.70      3.24       29.46
kʷʼ     25.79      2.56       23.23
t̠ʃʰ     29.56      6.86       22.70
x       42.14      19.89      22.25
q       30.19      8.48       21.71
h       90.57      69.01      21.55
kʰ      38.99      17.77      21.23
ɬ       27.04      5.99       21.06
χ       25.79      6.48       19.30
pʰ      35.85      17.21      18.64
tʰ      30.82      13.28      17.54
ts      44.03      26.68      17.34
t̠ʃ      67.30      50.75      16.55


I think one thing that can be gleaned from this is that ejective languages tend to have /ʃ q ɬ/.
Yaali Annar The Gote, posts: 94, Initiate Speaker
And here's the result for implosive languages. Languages with implosives tend to be tonal and to have voiced fricatives, labiovelar stops, and prenasalized stops.

Phoneme Subset %   All %       Difference
ɓ       91.42      15.27       76.14
ɗ       77.61      12.97       64.64
˨       56.34      29.36       26.98
˦       56.34      29.68       26.67
z       60.82      35.10       25.72
v       57.09      32.79       24.30
mb      39.18      15.46       23.72
nd      37.31      14.40       22.91
f       75.75      53.18       22.57
kp      41.42      19.20       22.22
ɡb      40.67      18.95       21.72
ŋɡ      36.19      14.53       21.67
ɡ       89.18      68.89       20.29
ɔ       64.18      46.13       18.04
r       61.57      43.95       17.61
l       91.04      73.75       17.29
d       77.99      60.91       17.07
˧       32.46      15.71       16.75
ɛ       61.57      47.51       14.06
ŋmɡb    18.28      4.99        13.30

Yaali Annar The Gote, posts: 94, Initiate Speaker
I think I fixed the join issue, and now I can list the most under-represented phonemes in click languages more accurately.

Phoneme Subset %   All %      Difference
ɾ       11.11      33.92      -22.80
oː      5.56       27.00      -21.44
eː      5.56       26.87      -21.31
kp      0.00       19.20      -19.20
ɡb      0.00       18.95      -18.95
iː      16.67      33.85      -17.19
ɨ       5.56       22.26      -16.70
t       66.67      83.23      -16.56


Yeah... about 20% of the languages registered in Phoible have labiovelar stops.
Yaali Annar The Gote, posts: 94, Initiate Speaker
Ever wondered which phonemes are over-represented in languages with front rounded vowels?

Phoneme Subset %   All %       Difference
f       68.18      53.18       15.00
v       45.45      32.79       12.66
z       46.36      35.10       11.26
x       30.91      19.89       11.02
ʒ       25.45      15.46       9.99
ɣ       24.55      14.65       9.89
s̪       13.64      4.61        9.02
ts      35.45      26.68       8.77
l       81.82      73.75       8.07
tʰ      20.91      13.28       7.63


They are very European.

I also found some co-occurrences. For example, click languages are likely to have ejectives, but the reverse does not hold. There is, however, a two-way association between ejective languages and uvular languages. The following are the over-represented phonemes in languages with uvular phonemes, and we can see that ejectives make the list, just as /q/ makes the list of over-represented phonemes in languages with ejectives.

Also, oddly, /ʃ/ makes the list too.

Phoneme Subset %    All %      Difference
q       59.65       8.48       51.17
χ       45.61       6.48       39.13
x       44.30       19.89      24.41
ʃ       64.47       41.52      22.95
ʁ       23.25       3.30       19.94
qʼ      22.81       3.24       19.57
kʼ      27.63       9.41       18.22
tʼ      24.12       7.23       16.89
pʼ      23.68       6.92       16.76
ħ       19.30       2.74       16.56
t̠ʃʼ     21.93       6.92       15.01
t̠ʃ      64.04       50.75      13.29
tsʼ     17.98       5.11       12.87
ʕ       14.91       2.12       12.79
ʒ       28.07       15.46      12.61
χʷ      14.47       2.06       12.42
qʰ      14.47       2.06       12.42
ʔ       57.46       45.95      11.51
xʷ      14.91       3.43       11.48
qʷ      11.84       1.68       10.16
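
The one-way relationship between clicks and ejectives mentioned above can be checked by comparing the two conditional percentages directly: the share of click languages that have ejectives, and the share of ejective languages that have clicks. Below is a rough Python sketch of that check. The six inventories are invented toy data, and `has_mark` and `share_with` are hypothetical helper names, so the numbers it prints say nothing about Phoible.

```python
CLICK = "\u01c3"     # click letter ǃ
EJECTIVE = "\u02bc"  # ejective mark ʼ

# Hypothetical toy inventories, not Phoible data.
inventories = [
    {"k" + CLICK, "k" + EJECTIVE, "m"},
    {"k" + CLICK, "k", "m"},
    {"k" + EJECTIVE, "q", "m"},
    {"t" + EJECTIVE, "q", "s"},
    {"p" + EJECTIVE, "k", "s"},
    {"p", "t", "k"},
]

def has_mark(mark):
    """Build a test: does any phoneme in the inventory contain `mark`?"""
    return lambda inv: any(mark in phoneme for phoneme in inv)

def share_with(inventories, given, target):
    """Percent of languages passing `given` that also pass `target`."""
    base = [inv for inv in inventories if given(inv)]
    if not base:
        return float("nan")  # no languages of that kind: leave it undefined
    return 100.0 * sum(1 for inv in base if target(inv)) / len(base)

has_click = has_mark(CLICK)
has_ejective = has_mark(EJECTIVE)

click_to_ejective = share_with(inventories, has_click, has_ejective)
ejective_to_click = share_with(inventories, has_ejective, has_click)
print(f"click languages with ejectives: {click_to_ejective:.2f}%")
print(f"ejective languages with clicks: {ejective_to_click:.2f}%")
```

When the two percentages differ a lot, the association runs mostly in one direction; when both are well above the overall rates, it runs both ways.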
Rhetorica Your Writing System Sucks, posts: 1292, Kelatetía: Dis, Major Belt 1
Since I just so happen to be studying natural language processing right now: there are more sophisticated alternatives to raw subtraction that tell you useful things like mutual information. The Jensen–Shannon divergence is probably the most appropriate. It's related to the Kullback–Leibler divergence (which tells you how many extra bits are needed when data from one distribution is encoded with a code optimized for another), but it has corrections built in that prevent division-by-zero errors, and it returns the same value in either direction.
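
As a rough illustration, here is one way the Jensen–Shannon divergence could be applied to the tables in this thread. The setup is an assumption, not something established above: each phoneme is treated as a two-outcome distribution (present or absent), once for the subset and once for all languages. `kl_bits`, `js_bits` and `presence_js` are names made up for this sketch.

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric, between 0 and 1."""
    # The midpoint m is positive wherever p or q is, so kl_bits never
    # divides by zero here.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def presence_js(subset_pct, all_pct):
    """JS divergence between two present/absent distributions (in percent)."""
    p = [subset_pct / 100, 1 - subset_pct / 100]
    q = [all_pct / 100, 1 - all_pct / 100]
    return js_bits(p, q)

# Percentages copied from the click-language tables in this thread.
rows = {"kǃ": (72.22, 0.81), "kp": (0.00, 19.20), "t": (66.67, 83.23)}
for phoneme, (sub, overall) in rows.items():
    print(f"{phoneme}\t{presence_js(sub, overall):.4f} bits")
```

The divergence has no sign, so it says how different the two rates are but not which one is larger. The plain difference is still needed to tell over-representation from under-representation.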

More minimally, you could also look into Laplace smoothing if you want a Bayesian representation for better population modelling. Right now you're only looking at the means for these two classes of languages and not considering variance.
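
To make the Laplace smoothing suggestion concrete, here is a small sketch of add-one smoothing applied to a ratio instead of a difference. With alpha = 1 the estimate equals the posterior mean under a uniform prior on the rate. `smoothed_rate` and `smoothed_ratio` are hypothetical names, and the overall counts in the example are made-up round numbers, not Phoible figures.

```python
def smoothed_rate(count, total, alpha=1.0):
    """Add-alpha (Laplace) estimate of the share of languages with a phoneme."""
    # Two outcomes (has it / lacks it), hence 2 * alpha in the denominator.
    return (count + alpha) / (total + 2 * alpha)

def smoothed_ratio(sub_count, sub_total, all_count, all_total, alpha=1.0):
    """Ratio of the smoothed subset rate to the smoothed overall rate."""
    subset_rate = smoothed_rate(sub_count, sub_total, alpha)
    overall_rate = smoothed_rate(all_count, all_total, alpha)
    return subset_rate / overall_rate

# 13 of 18 click languages matches the 72.22 % rows in the tables above;
# the overall figures (13 of 1600, 0 of 1600) are invented for illustration.
print(smoothed_ratio(13, 18, 13, 1600))
print(smoothed_ratio(0, 18, 0, 1600))  # still defined with zero counts
```

Because neither smoothed rate can reach zero, the ratio never hits the division-by-zero case that motivated using subtraction in the first place.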

SINCERELY, A MACHINE LEARNING NERD.
Yaali Annar The Gote, posts: 94, Initiate Speaker
I've read those articles several times, but I still can't understand them.