Sound Change Appliers (2014-08-04 11:24:47)

Morrígan Witch Queen of New York
Haedus Toolbox SCA, Manual (DRAFT)

The following is provided only as an example and is subject to revision. The contents are incomplete, but I'm interested in seeing whether the information provided is clear or needs to be explained better, and whether the guided example is a good instructional model for showcasing supported functionality.

PART I - The SCA and its capabilities



Introduction


The Haedus Toolbox SCA (henceforth 'SCA') was designed to address one specific and pernicious problem with other sound-change applier programs in common use within the conlang community, namely that single sounds often need to be represented by two or more characters, whether that is 'ts', 'tʰ', or 'qʷʰ'. Treating such sequences as single segments relieves the user from having to artificially re-order rules because 'p > b' happens to also affect 'pʰ', even when the user intends these to be distinct segments.
The SCA can infer which sequences should be treated as unitary by attaching diacritics and modifier letters to the preceding base character. The user can also manually specify sequences that should be treated as atomic.
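
To make the idea concrete, here is a minimal sketch in Python of how this kind of segmentation could work. It is not the SCA's actual code: the function name segment and the reserved parameter are made up for illustration, and it assumes that "diacritics and modifier letters" corresponds to the Unicode categories Mn, Mc, Me, and Lm.

```python
import unicodedata

def segment(word, reserved=()):
    """Split a word into segments: reserved sequences stay whole, and
    combining marks / modifier letters attach to the preceding base."""
    # Try longer reserved sequences first; ignore empty strings.
    reserved = sorted((r for r in reserved if r), key=len, reverse=True)
    segments = []
    i = 0
    while i < len(word):
        match = next((r for r in reserved if word.startswith(r, i)), None)
        if match:
            segments.append(match)
            i += len(match)
            continue
        ch = word[i]
        # Mn/Mc/Me are combining marks; Lm covers modifier letters (ʰ, ʷ, ː)
        if segments and unicodedata.category(ch) in ("Mn", "Mc", "Me", "Lm"):
            segments[-1] += ch
        else:
            segments.append(ch)
        i += 1
    return segments

print(segment("pʰatsa"))                   # ['pʰ', 'a', 't', 's', 'a']
print(segment("pʰatsa", reserved=["ts"]))  # ['pʰ', 'a', 'ts', 'a']
print(segment("qʷʰa"))                     # ['qʷʰ', 'a']
```

With the word split this way, a rule that targets 'p' no longer sees the 'p' inside 'pʰ', because 'pʰ' is a different segment.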

Using The SCA


To use the SCA, the user must provide a lexicon file and a rules file. In stand-alone operation, it can be run using the command
java -jar sca.jar LEXICON RULES OUTPUT
with the user providing paths for the lexicon (word list), rules, and output files.

Scripts


Operation of the SCA is controlled through a script file which primarily contains rule definitions, but which also allows the user to define variables, reserve character sequences, and control segmentation and normalization.

The following characters have special meanings in the SCA script language and cannot be used elsewhere: >, /, _, #, %, *, ?, +, !, (, ), {, }, ., =

Script files may contain comments, starting with %, either at the beginning of a line or in-line. Apart from rules and variables, there are additional commands used to control segmentation and normalization, and to reserve sequences to be treated as atomic. Normalization and segmentation are each controlled by one of the following commands, plus one of the listed flags.

USE NORMALIZATION:
NFD Canonical decomposition (default)
NFC Canonical decomposition, followed by canonical composition
NFKD Compatibility decomposition
NFKC Compatibility decomposition, followed by canonical composition
NONE No normalization; input is not modified

USE SEGMENTATION:
TRUE Automatic segmentation is used (default)
FALSE Treats each input character as atomic, except where characters are reserved by the user
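
The effect of the normalization flags can be seen with a short Python sketch. It uses Python's standard unicodedata module as a stand-in; how the SCA itself performs normalization is not shown here. The point is that a stressed vowel such as 'á' may arrive as one code point or two, and the script and lexicon need to agree.

```python
import unicodedata

word = "\u00e1"  # 'á' as a single precomposed code point

nfd = unicodedata.normalize("NFD", word)  # base letter + combining acute
nfc = unicodedata.normalize("NFC", nfd)   # recomposed into one code point

print(len(word), len(nfd), len(nfc))      # 1 2 1
print([hex(ord(c)) for c in nfd])         # ['0x61', '0x301']

# The compatibility forms also fold characters such as the 'ﬁ' ligature (U+FB01).
print(unicodedata.normalize("NFKD", "\ufb01"))  # fi
```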

Variables


The SCA allows for the definition (and re-definition) of variables on-the-fly, anywhere in the script. Variable definitions consist of a label, the assignment operator =, and a space-separated list of values. For example:
TH = pʰ tʰ kʰ
T  = p  t  k
D  = b  d  g
W  = w  y  ɰ
N  = m  n
C  = TH T D W N r s

The values may contain other variable labels. There are no restrictions on variable naming; it is up to the user to avoid conflicts. However, when the SCA parses a rule or variable definition, it searches for variables by finding the longest matching label. If you have T, H, and TH defined as variables, a rule containing TH will always be understood to represent the variable TH, and not T followed by H.
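
The longest-match behaviour can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the names tokenize and expand are hypothetical, the variable H is invented for the example, and each value is assumed to be either a whole variable label or a literal.

```python
def tokenize(text, variables):
    """Split text into variable labels and literal characters,
    always preferring the longest label that matches at each position."""
    labels = sorted(variables, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        label = next((l for l in labels if text.startswith(l, i)), None)
        if label:
            tokens.append(label)
            i += len(label)
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def expand(label, variables):
    """Recursively replace nested variable labels with their values."""
    result = []
    for value in variables[label]:
        if value in variables:
            result.extend(expand(value, variables))
        else:
            result.append(value)
    return result

variables = {
    "TH": ["pʰ", "tʰ", "kʰ"],
    "T": ["p", "t", "k"],
    "H": ["h"],  # invented here only to show the T / H / TH conflict
    "C": ["TH", "T", "r", "s"],
}
print(tokenize("aTHa", variables))  # ['a', 'TH', 'a']
print(expand("C", variables))       # ['pʰ', 'tʰ', 'kʰ', 'p', 't', 'k', 'r', 's']
```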

If you wish to define longer variable names, you can use a non-reserved prefix such as $ or @, or wrap the name in square brackets, like [Obstruent].

The Rule Format


The syntax for rules is designed to be similar to that used to describe sound changes in linguistics generally, and to support pattern matching using regular expressions.

This SCA uses > as the transformation operator, and separates the transformation and condition using /. The condition is not required, and rules lacking a condition do not require the / symbol. When the / symbol is present, the precondition-postcondition separator _ must appear exactly once.

Some basic rules are:
pʰ tʰ kʰ > f  θ  x
p  t  k  > b  d  g  / N_
p  t  k  > pʰ tʰ kʰ / _V
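
As a rough model of what such rules do, the following Python sketch maps each source segment to the target at the same position in the list. It is not the SCA's implementation: apply_rule is a hypothetical helper, it handles only one-segment contexts, and it assumes contexts are checked against the original word rather than the partially changed one.

```python
def apply_rule(segments, sources, targets, before=None, after=None):
    """Replace each source segment with the target at the same index.
    `before` / `after` are optional sets of segments that must appear
    immediately before / after the match (single-segment contexts only)."""
    mapping = dict(zip(sources, targets))
    out = []
    for i, seg in enumerate(segments):
        prev_ok = before is None or (i > 0 and segments[i - 1] in before)
        next_ok = after is None or (i + 1 < len(segments) and segments[i + 1] in after)
        if seg in mapping and prev_ok and next_ok:
            out.append(mapping[seg])
        else:
            out.append(seg)
    return out

N = {"m", "n"}

# p t k > b d g / N_
print(apply_rule(["a", "n", "t", "a", "p", "a"],
                 ["p", "t", "k"], ["b", "d", "g"], before=N))
# ['a', 'n', 'd', 'a', 'p', 'a']

# pʰ tʰ kʰ > f θ x
print(apply_rule(["tʰ", "a", "kʰ"], ["pʰ", "tʰ", "kʰ"], ["f", "θ", "x"]))
# ['θ', 'a', 'x']
```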

Condition Format


Most of the power of the Toolbox condition format lies in its ability to use ad-hoc sets and regular expressions. The underscore character _ separates the precondition from the postcondition, so that the rule will be applied only when both sides of the condition match.

Regular Expression metacharacters
. matches any character
+ matches the previous expression one or more times
* matches the previous expression zero or more times
? matches the previous expression zero or one times
{} matches any of the list of expressions inside it
() used to group expressions
! matches anything that is NOT the following expression (NB: not implemented)

Sets, delimited by curly braces {}, contain a list of space-separated subexpressions. These can be single characters, variables, or other regular expressions: anything allowed elsewhere in the condition. It's not clear that nesting full expressions inside sets is of any real use, but the capability remains available if you happen to find a use for it.
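
One way to get a feel for the notation is to translate it into an ordinary regular expression. The Python sketch below does this for a small subset: sets, flat variables, literals, #, and the quantifiers ? * +. It is an approximation rather than the SCA's parser. The name to_regex is hypothetical, the variable V is assumed for the example, sets are assumed to use single spaces, and matching is done on raw strings rather than on segments.

```python
import re

def to_regex(condition, variables):
    """Translate one condition expression into a Python regex string."""
    labels = sorted(variables, key=len, reverse=True)
    out = []
    depth = 0  # how many sets we are currently inside
    i = 0
    while i < len(condition):
        label = next((l for l in labels if condition.startswith(l, i)), None)
        if label:
            # Longest values first, so 'aː' is tried before 'a'
            alts = sorted(variables[label], key=len, reverse=True)
            out.append("(?:" + "|".join(map(re.escape, alts)) + ")")
            i += len(label)
            continue
        ch = condition[i]
        if ch == "{":
            out.append("(?:")
            depth += 1
        elif ch == "}":
            out.append(")")
            depth -= 1
        elif ch == " ":
            # Inside a set a space separates alternatives; elsewhere it is dropped
            if depth > 0 and out and out[-1] not in ("|", "(?:"):
                out.append("|")
        elif ch == "#":
            out.append("(?:^|$)")  # word boundary at either edge
        elif ch in "()?*+.":
            out.append(ch)
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

variables = {
    "T": ["p", "t", "k"],
    "D": ["b", "d", "g"],
    "N": ["m", "n"],
    "V": ["a", "e", "i", "u", "aː", "eː", "iː", "uː"],  # assumed for this example
}
pattern = re.compile(to_regex("#{T Dr}V{N s}?#", variables))
for word in ["dran", "tas", "ra"]:
    print(word, bool(pattern.fullmatch(word)))
# dran True
# tas True
# ra False
```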

PART II - A Guided Example



This section will use a small example language to illustrate the functionalities provided by this SCA. The example language has the following inventory of consonants:
pʰ tʰ kʰ
p  t  k
b  d  g
w  y  ɰ
m  n
   r
   s

and a basic, square four-vowel system, with both long and short vowels:
i u    iː uː 
e a    eː aː

The syllable onset may consist of any consonant, or a cluster of a plosive and r. The coda may be empty, or contain a nasal or a voiceless plosive. Additionally, the onset may be absent if the syllable is the first in the word.

Stress is deterministic. In words with one syllable, open syllables with short vowels are unstressed, and others are stressed. In two- or three-syllable words, the first syllable is always stressed. Elsewhere, stress is penultimate.

Knowing this much, we can define the following variables, which we will need later.
TH = pʰ tʰ kʰ
T  = p  t  k
D  = b  d  g
W  = w  y  ɰ
N  = m  n

P  = TH T D    % Plosive
C  = P W N r s % Consonants

VSU = a  e  i  u
VSS = á  é  í  ú
VLU = aː eː iː uː
VLS = áː éː íː úː

VS = VSU VSS  % Short vowels
VL = VLU VLS  % Long vowels
V  = VS  VL   % All Vowels

U  = VSU VLU  % Unstressed
S  = VSS VLS  % Stressed

We might first wish to define rules which describe vowel stress, since it is rule-based and not lexical.

While this is not a rule, we might want to consider the expression for syllable structure: "The syllable onset may consist of any consonant, or a cluster of a plosive and r. The coda may be empty, or contain a nasal or a voiceless plosive."

The onset looks like {C Pr}, which we can read as "C or (P followed by r)"
The coda is {n T}?, read as "(n or T), one or zero times"
Because diphthongs are not allowed, the syllable must be ({C Pr}V{n T}?)
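
For readers more used to ordinary regular expressions, the same syllable shape can be written out in Python. This is a sketch only: the name is_wellformed is hypothetical, it works on unstressed phonemic forms, it tries the plosive + r cluster before a single consonant, and it assumes (from the description of the language) that only the first syllable may lack an onset.

```python
import re

C = "pʰ|tʰ|kʰ|p|t|k|b|d|g|w|y|ɰ|m|n|r|s"
P = "pʰ|tʰ|kʰ|p|t|k|b|d|g"
T = "p|t|k"
V = "aː|eː|iː|uː|a|e|i|u"

ONSET = rf"(?:(?:{P})r|{C})"   # {C Pr}
RIME = rf"(?:{V})(?:n|{T})?"   # V{n T}?

# First syllable may lack an onset; every later syllable needs one.
WORD = re.compile(rf"{ONSET}?{RIME}(?:{ONSET}{RIME})*")

def is_wellformed(word):
    """True if the whole word can be divided into legal syllables."""
    return WORD.fullmatch(word) is not None

for word in ["pranta", "akka", "pʰiːn", "kai", "tla", "katr"]:
    print(word, is_wellformed(word))
# pranta True
# akka True
# pʰiːn True
# kai False
# tla False
# katr False
```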

Our lexicon will contain only the phonemic forms without stress, so we need to write the rules for assigning stress. For monosyllables, we need the following rules:
VLU > VLS / #{C Pr}?_{n T}?# % Monosyllables with long vowels are always stressed
VSU > VSS / #{C Pr}?_{n T}#  % Closed monosyllables are always stressed

% This rule applies to short vowels only, because long vowels were handled by the previous rule

U > S / #{C Pr}?_{n T}?({C Pr}V{n T}?)({C Pr}V{n T}?)?# % Applies stress to bi- and tri-syllables
U > S / #({C Pr}?U{n T}?)+{C Pr}_{n T}?{C Pr}V{n T}?#   % Penultimate stress elsewhere. Note that the first syllable expression uses U to prevent it from affecting words changed by the previous rule.
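
As a cross-check on these rules, the stress description can also be written procedurally. The Python sketch below is an independent restatement of the prose, not a translation of how the SCA evaluates the rules. The name assign_stress is hypothetical, syllables are counted by counting vowels (safe here because diphthongs are not allowed), and the output is composed with NFC purely to make it easier to read.

```python
import re
import unicodedata

VOWEL = re.compile(r"[aeiu]ː?")
ACUTE = "\u0301"  # combining acute accent

def assign_stress(word):
    """Mark the stressed vowel according to the stress description."""
    vowels = list(VOWEL.finditer(word))
    count = len(vowels)
    if count == 0:
        return word
    if count == 1:
        v = vowels[0]
        is_long = v.group().endswith("ː")
        is_closed = v.end() < len(word)  # a coda follows the vowel
        if not (is_long or is_closed):
            return word  # open monosyllable with a short vowel: unstressed
        target = v
    elif count <= 3:
        target = vowels[0]   # two or three syllables: initial stress
    else:
        target = vowels[-2]  # elsewhere: penultimate stress
    pos = target.start() + 1  # right after the base vowel letter
    stressed = word[:pos] + ACUTE + word[pos:]
    return unicodedata.normalize("NFC", stressed)

for word in ["ta", "taː", "tan", "kata", "katana", "katanapa"]:
    print(word, "->", assign_stress(word))
# ta -> ta
# taː -> táː
# tan -> tán
# kata -> káta
# katana -> kátana
# katanapa -> katanápa
```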