Basic Vocabulary for Philologists

dhok posts: 235
, Alkali Metal, Norman, United States

I'm starting this topic to avoid cluttering up the Historical Linguistics thread with musings about it.

After a hiatus of a couple of weeks (caused mainly by apathy deriving from a bout of depression), I'm back working on this thing.

Currently, it's a bunch of spreadsheets. I have Classical Vocabulary for Philologists, which contains sheets for Latin, Greek and Sanskrit, and Romance Vocabulary for Philologists, which contains Italian, Spanish, Portuguese, Catalan, French and Romanian, plus Latin. (Now that I've discovered that an English-language etymological dictionary of Sardinian in fact exists, I may as well include that, too.) There's also some inkling of a Germanic Vocabulary for Philologists, which will probably contain at least Old Norse, Old English and modern German. I'd also like to put something together for Russian, since I'm taking it.

Ideally, I'd eventually like to put this on Anthologica somehow or other. The idea is that you would be able to call up any language that's included and browse the database, but be able to exclude cognates you have no use for. (For example, if I'm learning Sanskrit and know Greek and Latin, I want to be able to see Greek and Latin cognates, but I don't need to look at Tocharian or Old Irish ones.) There would be a better UI than just a giant, fugly Excel spreadsheet. Perhaps you could even have it construct you a one-of-a-kind Anki deck. I don't know.
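To make the "exclude cognates you have no use for" idea concrete, here's a rough Python sketch. The `ENTRIES` layout and the `filter_cognates` name are made up for illustration; the real spreadsheet isn't necessarily organized this way.

```python
# Hypothetical in-memory version of one sheet: each entry is a headword
# plus a mapping from language name to cognate form.
ENTRIES = [
    {
        "headword": "pitṛ́",
        "language": "Sanskrit",
        "gloss": "father",
        "cognates": {
            "Latin": "pater",
            "Greek": "πατήρ",
            "Old Irish": "athair",
            "Tocharian B": "pācer",
        },
    },
]


def filter_cognates(entries, wanted):
    """Return copies of the entries keeping only cognates in the wanted languages."""
    wanted = set(wanted)
    result = []
    for entry in entries:
        kept = {
            lang: form
            for lang, form in entry["cognates"].items()
            if lang in wanted
        }
        # Build a new dict so the original entry is left untouched.
        result.append({**entry, "cognates": kept})
    return result


# A learner who knows Latin and Greek hides everything else.
visible = filter_cognates(ENTRIES, ["Latin", "Greek"])
print(visible[0]["cognates"])
```

The filter copies rather than deletes, so the full data stays available if the reader later decides they do want the Old Irish column after all.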

I'm not sure how this will all play out, but I'm currently giving the spreadsheets a redo. I have a 1000-word frequency list for Latin and a 500-word list for Greek. Sanskrit frequency lists are hard to come by; the Heidelberg Corpus will let you construct frequency lists for any work it has, though. It doesn't have the Ṛgveda, which is a pity, since ideally you'd want to use Vedic Sanskrit for a project like this. Instead, I'm going to base the list on the frequency lists of the sort of Classical Sanskrit works that someone learning Sanskrit is most likely to read. I'll use the Mahābhārata as a starting point, which seems wise: it's so big, and was written over such a long period, that its frequency list should be fairly representative of Sanskrit as a whole. I'll then throw out proper names and add anything else that's in the frequency lists of the Rāmāyaṇa, the Hitopadeśa and the two texts in the corpus that are by Kālidāsa (the Meghadūta and Kumārasaṃbhava; he's pretty widely read, isn't he?).
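The list-building procedure described above could be sketched something like this. It's only a sketch: `build_core_list`, the `top_n` cutoff and the toy counts are all assumptions for illustration, and the numbers are invented rather than taken from any corpus.

```python
from collections import Counter


def build_core_list(base_counts, extra_counts, proper_names, top_n):
    """Start from the base text's top lemmas, drop proper names,
    then append any new lemmas from the top of each extra text."""

    def top(counts):
        ranked = [
            lemma
            for lemma, _ in Counter(counts).most_common()
            if lemma not in proper_names
        ]
        return ranked[:top_n]

    core = top(base_counts)
    seen = set(core)
    for counts in extra_counts:
        for lemma in top(counts):
            if lemma not in seen:
                core.append(lemma)
                seen.add(lemma)
    return core


# Invented counts, purely to show the shape of the data.
base = {"ca": 50, "arjuna": 40, "na": 30, "rājan": 20}
extras = [{"na": 10, "megha": 8, "rāma": 7}]
names = {"arjuna", "rāma"}

core = build_core_list(base, extras, names, top_n=3)
print(core)
```

Proper names are filtered before the cutoff is applied, so a frequent name doesn't use up one of the `top_n` slots.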

There's also the trouble of defining what a "cognate" is for the purposes of this exercise, especially when you have a word that's a root plus a preverb, or, worse, an alpha privative. (It's clear that Latin ignorare and Greek γιγνώσκω are both from *ǵneh₃, for example, but they have opposite meanings, because ignorare also includes the negative prefix *ṇ.) Do we just include anything that falls under the same root?

Classical Vocabulary for Philologists may be viewed here. Suggestions are encouraged. Right now there are a hell of a lot of columns. If and when the spreadsheet gets turned into something easier on the eyes, it will be much easier for everyone if the same sort of information can be found in the same columns. Eventually I'd like to be able to collapse some of the cognate columns (just have an Iranian column instead of Avestan and Persian, for example), but I'm not sure what the best way to do that is. Right now I'm ignoring cognates in the other Italic languages for basically this reason.
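One possible way to collapse columns, sketched in Python: keep a lookup from language to branch and group the cognates under the branch name at display time. The `BRANCHES` table and the `collapse` function are hypothetical names, not anything that exists in the spreadsheet.

```python
# Hypothetical lookup: which display column each language folds into.
# Languages not listed keep their own column.
BRANCHES = {"Avestan": "Iranian", "Persian": "Iranian"}


def collapse(cognates, branches):
    """Group (language, form) pairs under their branch column."""
    collapsed = {}
    for language, form in cognates.items():
        column = branches.get(language, language)
        collapsed.setdefault(column, []).append((language, form))
    return collapsed


# Cognates of Latin pater, as one row might store them.
COGNATES = {"Greek": "πατήρ", "Avestan": "pitar-", "Persian": "pedar"}

collapsed = collapse(COGNATES, BRANCHES)
print(collapsed)
```

Since each form keeps its language label inside the collapsed column, nothing is lost, and the same lookup could later fold the other Italic languages into a single column.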

edited 4 times, last update 9 years ago

dhok posts: 235
, Alkali Metal, Norman, United States

Etymologies are now divided into three parts: a root, a pre-root (which will take care of most prefixes) and a post-root (which will take care of most consonant add-ons). The question "what is a cognate?" remains, however. (It's especially pressing with true compounds, like Latin nōn, which comes from ne + oynom. Should it be listed under ne, or oynom?)
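As a data structure, the three-way split might look roughly like the following. The `Etymology` class and `group_by_root` helper are illustrative assumptions, and the sample rows fill in only what has been said in this thread (the post-root fields are left empty).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Etymology:
    word: str
    language: str
    root: str
    pre_root: Optional[str] = None   # prefixes, e.g. a privative
    post_root: Optional[str] = None  # consonant add-ons after the root


def group_by_root(etymologies):
    """Treat everything sharing a root as one cognate set."""
    groups = defaultdict(list)
    for etymology in etymologies:
        groups[etymology.root].append(etymology)
    return dict(groups)


SAMPLE = [
    Etymology("ignorare", "Latin", root="*ǵneh₃", pre_root="*ṇ"),
    Etymology("γιγνώσκω", "Greek", root="*ǵneh₃"),
]

groups = group_by_root(SAMPLE)
for root, members in groups.items():
    print(root, [m.word for m in members])
```

Grouping on the root alone puts ignorare and γιγνώσκω in the same set despite their opposite meanings; the `pre_root` field is what lets a reader see why they differ.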

edited 2 times, last update 9 years ago

Morrígan Witch Queen of New York
posts: 303
, Marquise

Cool. No idea what the appropriate thing to do is in that case; just go with the first element? That seems an insufficiently principled decision.

9 years ago

dhok posts: 235
, Alkali Metal, Norman, United States

quoting Morrígan, Countess:
Cool. No idea what the appropriate thing to do is in that case; just go with the first element? That seems an insufficiently principled decision.

For nōn, that's probably the best choice, since nōn syntactically and semantically patterns with Sanskrit na, OCS ne, Old Irish ní and the like, and not with words meaning "one". I agree, though, that this seems a very touch-and-go rule that won't apply to everything. (We could just decide to live with this; semantics are fuzzy, after all.)

If we're using a master list of roots to construct this, maybe it's better to break each word down into its constituent parts and turn such words into disambiguation pages of a sort? I know travisb recommended using a real database, not ad-libbing a spreadsheet. I don't know how to do databases, but I can certainly try to learn.
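A minimal sketch of the "disambiguation page" idea, assuming each word simply lists all of its constituents: the index then files a compound under every part instead of forcing a choice. `WORDS` and `build_root_index` are invented names for illustration.

```python
from collections import defaultdict

# Hypothetical breakdown: every word lists all of its constituents.
WORDS = {
    "nōn": ["ne", "oynom"],
    "ignorare": ["ṇ", "ǵneh₃"],
}


def build_root_index(words):
    """Map each constituent to every word that contains it."""
    index = defaultdict(list)
    for word, parts in words.items():
        for part in parts:
            index[part].append(word)
    return dict(index)


index = build_root_index(WORDS)
for part, members in index.items():
    print(part, members)
```

With this layout nōn shows up under both ne and oynom, and the question of which one is the "real" etymology becomes a display decision rather than a data-entry one.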

edited once, last update 9 years ago

Morrígan Witch Queen of New York
posts: 303
, Marquise

Well, databases are also a bit crap. If spreadsheets (and the contents of their cells, more importantly) are formatted in a consistent way, we can transform them to RDF and have a nice queryable data structure. Not that this part is easy, but it's doable and (I'd argue) doesn't need to be as rigorous as database design is.
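To give a feel for the spreadsheet-to-RDF step, here is a rough standard-library sketch that turns consistently formatted CSV rows into N-Triples lines. The `http://example.org/bvp/` namespace, the column names and the function names are all placeholders, not a proposed ontology; in practice a library such as rdflib would probably handle the serialization.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/bvp/"  # placeholder namespace, not a real one


def literal(text):
    """Quote a cell value as an N-Triples string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def rows_to_ntriples(csv_text):
    """Emit one triple per non-empty cell: (word, column, value)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode so the subject is a valid IRI.
        subject = f"<{BASE}word/{quote(row['language'])}/{quote(row['word'])}>"
        for column, value in row.items():
            if value:  # empty cells produce no triple
                predicate = f"<{BASE}{quote(column)}>"
                lines.append(f"{subject} {predicate} {literal(value)} .")
    return lines


SAMPLE = "word,language,root\nignorare,Latin,*ǵneh₃\nnōn,Latin,\n"

triples = rows_to_ntriples(SAMPLE)
print("\n".join(triples))
```

This only works if every sheet uses the same column headers and cell conventions, which is the consistency requirement mentioned above.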

9 years ago

dhok posts: 235
, Alkali Metal, Norman, United States

quoting Morrígan, Countess:
Well, databases are also a bit crap. If spreadsheets (and the contents of their cells, more importantly) are formatted in a consistent way, we can transform them to RDF and have a nice queryable data structure. Not that this part is easy, but it's doable and (I'd argue) doesn't need to be as rigorous as database design is.

Can you PM me your Gmail address so I can add you to the list of approved editors?

9 years ago

Morrígan Witch Queen of New York
posts: 303
, Marquise

I think you did already; I haven't been in there in a while, I've been so busy. I need to start developing an ontology to handle this stuff anyway, but that's gonna be a lot of work.

9 years ago

dhok posts: 235
, Alkali Metal, Norman, United States

Have you gotten my PM?

9 years ago

Morrígan Witch Queen of New York
posts: 303
, Marquise

And responded to it this morning. I think using a master index is probably the way to go, after more consideration. The only downside is if there are dupes in the old master lexicon (there are).

9 years ago

dhok posts: 235
, Alkali Metal, Norman, United States

[redacted]

edited 2 times, last update 9 years ago

Morrígan Witch Queen of New York
posts: 303
, Marquise

I have Beekes; it should be easy to use the index to build a new version of the spreadsheet. I used the Indo-Iranian dictionary too. Celtic might be based heavily on Matasović; I've not been on that site in a while. I have a PDF of that book, though.

9 years ago