Phonemic Overrepresentation and Underrepresentation in Certain Type of Languages

Yaali Annar The Gote
Before I start: bear in mind that even though Phoible has data for almost 2000 languages, its coverage is rather uneven. IIRC, African languages are better documented in Phoible than languages from other regions.

Anyway, I want to see which phonemes are more represented in certain types of languages. For example, here are the ten most common phonemes in languages with clicks, along with their frequency (the number of click languages that have each phoneme).

Phoneme Frequency
m	      18
i	      18
u	      18
a	      17
j	      16
w	      16
s	      16
k	      15
p	      15
h	      14


This data doesn't tell us much. All ten of those phonemes are common not only in click languages but in almost all languages. So what I do instead is compare each phoneme's frequency in the click languages with its frequency across all the languages registered in Phoible. If we take the difference between the two frequencies, we can find the most over-represented phonemes in click languages:

Phoneme In Subset  In All     Difference
        (%)        (%)        (pct. points)
kǃ      72.22      0.81       71.41
kǀ      66.67      0.75       65.92
pʰ      72.22      17.21      55.02
kʰ      72.22      17.77      54.45
t̠ʃʰ     61.11      6.86       54.25
tʰ      66.67      13.28      53.39
tsʰ     55.56      5.11       50.44
kǃʰ     44.44      0.50       43.95
ŋǃ      44.44      0.50       43.95
kǁ      44.44      0.50       43.95
kʼ      50.00      9.41       40.59
tsʼ     44.44      5.11       39.33
ɬ       44.44      5.99       38.46
kǀʰ     38.89      0.44       38.45
t̠ʃʼ     44.44      6.92       37.52
tʼ      44.44      7.23       37.21
ɡǃ      33.33      0.37       32.96
ɡǀ      33.33      0.37       32.96
d̠ʒ      66.67      34.23      32.44
pʼ      38.89      6.92       31.97


I use simple subtraction instead of division to avoid division-by-zero issues.
Anyway, we can see that click consonants are over-represented in click languages, which is obvious. What is less obvious is that aspirated stops are over-represented too, as are ejectives.
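
The comparison described above can be sketched in Python. This is only an illustration under stated assumptions, not the query that produced the tables in this post: the `inventories` mapping, the toy language names, the `CLICKS` marker set, and both function names are made up for the example.

```python
from collections import Counter

def phoneme_percentages(inventories):
    """Percentage of languages whose inventory contains each phoneme."""
    counts = Counter()
    for phonemes in inventories.values():
        counts.update(set(phonemes))  # count each phoneme once per language
    total = len(inventories)
    return {p: 100.0 * c / total for p, c in counts.items()}

def representation_diff(inventories, in_subset):
    """Rows of (phoneme, pct_subset, pct_all, diff), most over-represented first."""
    subset = {lang: inv for lang, inv in inventories.items() if in_subset(inv)}
    if not subset:
        raise ValueError("subset is empty")
    pct_all = phoneme_percentages(inventories)
    pct_sub = phoneme_percentages(subset)
    rows = [
        # A phoneme absent from every subset language is an explicit 0.0 here.
        (p, pct_sub.get(p, 0.0), pct_all[p], pct_sub.get(p, 0.0) - pct_all[p])
        for p in pct_all
    ]
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows

# Toy data (hypothetical, NOT taken from Phoible).
inventories = {
    "lang_a": {"m", "a", "k", "kǃ", "kʰ"},
    "lang_b": {"m", "a", "kǃ", "pʰ"},
    "lang_c": {"m", "a", "k", "ɡ"},
    "lang_d": {"m", "i", "k", "ɡ"},
}
CLICKS = {"kǃ"}  # hypothetical marker for "has a click"

rows = representation_diff(inventories, lambda inv: bool(inv & CLICKS))
for phoneme, sub, all_, diff in rows:
    print(f"{phoneme}\t{sub:.2f}\t{all_:.2f}\t{diff:+.2f}")
```

Reading the rows from the top gives the over-represented phonemes, and reading them from the bottom gives the under-represented ones. Subtraction keeps every row defined even when a phoneme never occurs in the subset.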

How about the most under-represented phonemes? Well...

Phoneme In Subset  In All     Difference
        (%)        (%)        (pct. points)
ɡ       0.00       68.89      -68.89
ɔ       0.00       46.13      -46.13
˦       0.00       29.68      -29.68
ɾ       11.11      33.92      -22.80
oː      5.56       27.00      -21.44
eː      5.56       26.87      -21.31
kp      0.00       19.20      -19.20
ɡb      0.00       18.95      -18.95
iː      16.67      33.85      -17.19
ɨ       5.56       22.26      -16.70
t       66.67      83.23      -16.56
ə       11.11      27.24      -16.13
˧       0.00       15.71      -15.71
k       83.33      98.69      -15.36
o       61.11      75.81      -14.70
ɣ       0.00       14.65      -14.65
ɔː      0.00       14.15      -14.15
n       77.78      89.71      -11.94
aː      22.22      33.23      -11.01
ɔ̃       0.00       10.79      -10.79


I have no idea why /g/ appears there.

Edit: There seems to be something funky with the Postgres JOIN. This might cause some values to be listed as 0 when in reality they are larger than 0.
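
One way to cross-check a suspicious 0 without going through any join is to count directly how many subset languages list the phoneme. The sketch below is not a diagnosis of the Postgres issue. The helper names and the toy inventories are hypothetical, and it assumes the inventories are available as plain sets of phoneme strings.

```python
def languages_with(phoneme, inventories):
    """Names of the languages whose inventory contains `phoneme`."""
    return sorted(lang for lang, inv in inventories.items() if phoneme in inv)

def check_reported(phoneme, subset_inventories, reported_pct):
    """Compare a reported subset percentage with a direct count."""
    hits = languages_with(phoneme, subset_inventories)
    actual_pct = 100.0 * len(hits) / len(subset_inventories)
    return {
        "reported": reported_pct,
        "actual": actual_pct,
        "languages": hits,
        # Reported values are rounded to two decimals, so allow a small gap.
        "consistent": abs(actual_pct - reported_pct) < 0.01,
    }

# Toy subset (hypothetical, NOT Phoible data): one language does have /ɡ/.
subset = {
    "click_a": {"kǃ", "ɡ", "m"},
    "click_b": {"kǃ", "m"},
}
result = check_reported("ɡ", subset, reported_pct=0.0)
print(result)
```

If `consistent` comes back false for a phoneme reported as 0, the languages listed in `languages` are the ones the joined query missed.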