Sanskrit corpus help needed.

dhok (posts: 235; Alkali Metal)
It's probably a long shot posting this here, but I'm asking on the off chance that somebody might have some pointers.

On my desk is a booklet entitled Basic Greek Vocabulary, which is an alphabetical list of the most common thousand or so Ancient Greek words. I would be delighted if a list like this existed for Sanskrit, but so far I can't find one. What I have found is the Digital Corpus of Sanskrit, but never having worked with corpora before, I don't know how one would get a list of the most common lemmas out of it. Does anyone have pointers?
Rhetorica (Your Writing System Sucks; posts: 1292; Kelatetía)
I think you may like this. The Digital Corpus of Sanskrit only lets you list words from a given text or one at a time, and their sole downloadable, the FrameNet XML, is pretty much useless outside of a very limited range of computational linguistic analyses. (If that's the Sanskrit equivalent of Perseus, the field is doooomed.) It's not impossible to get the information you want from them, but it would require scraping and aggregating the results from each text.
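For the aggregation half of that, here is a minimal sketch in Python. It assumes you have already obtained, by whatever means, one lemma-to-count table per text; the scraping itself is not shown, and the names `merge_lemma_counts`, `top_lemmas`, and the sample data are made up for illustration rather than taken from the Digital Corpus of Sanskrit.

```python
from collections import Counter


def merge_lemma_counts(per_text_counts):
    """Sum per-text lemma counts into one corpus-wide Counter.

    per_text_counts: dict mapping a text name to a dict of lemma -> count.
    (Hypothetical input shape; adapt it to whatever the scrape produces.)
    """
    total = Counter()
    for counts in per_text_counts.values():
        total.update(counts)  # Counter.update adds counts, it does not overwrite
    return total


def top_lemmas(per_text_counts, n=1000):
    """Return the n most frequent lemmas as (lemma, count) pairs."""
    return merge_lemma_counts(per_text_counts).most_common(n)


if __name__ == "__main__":
    # Invented sample data, only to show the shape of the input.
    sample = {
        "text_a": {"ca": 5, "gaja": 2},
        "text_b": {"ca": 3, "aśva": 4},
    }
    for lemma, count in top_lemmas(sample, n=2):
        print(lemma, count)
```

Sorting the result alphabetically instead of by frequency would give something closer to the layout of a printed vocabulary booklet.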
dhok (posts: 235; Alkali Metal)
quoting Rhetorica:
I think you may like this. The Digital Corpus of Sanskrit only lets you list words from a given text or one at a time, and their sole downloadable, the FrameNet XML, is pretty much useless outside of a very limited range of computational linguistic analyses. (If that's the Sanskrit equivalent of Perseus, the field is doooomed.) It's not impossible to get the information you want from them, but it would require scraping and aggregating the results from each text.

It's got potential. The other half of the problem is that all Sanskrit texts available are natural texts, which includes the language's infamous system of sandhi. I'll eat my hat if I can get a computer to work out that gajośvaścagrāmādāgacchathaḥ is really gajas aśvas ca grāmāt āgacchathas. There are some easy shortcuts you can take (-ṃ is usually a final -m, -ḥ is usually -s or -r), but combine the whole system with Sanskrit's equally infamous love of compounds, and you've got a recipe for disaster.
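To make the "easy shortcuts" concrete, here is a rough Python sketch of just those two rules: a word-final -ṃ is read as -m, and a word-final -ḥ yields two candidates, -s and -r. It assumes the text has already been split into words, which is exactly the hard part described above, so it does nothing for fused strings or compounds. The function name `candidate_forms` is invented for this example.

```python
import unicodedata


def candidate_forms(token):
    """Return possible pre-sandhi forms for one already-segmented word.

    Only two shortcuts are covered:
      final -ṃ  -> -m
      final -ḥ  -> -s or -r (both returned, since either is possible)
    Anything else is returned unchanged.
    """
    # NFC turns "m" + combining dot below into the single character "ṃ",
    # so the endswith checks work whichever way the input was encoded.
    token = unicodedata.normalize("NFC", token)
    if token.endswith("ṃ"):
        return [token[:-1] + "m"]
    if token.endswith("ḥ"):
        stem = token[:-1]
        return [stem + "s", stem + "r"]
    return [token]


if __name__ == "__main__":
    for word in ["grāmaṃ", "āgacchathaḥ", "ca"]:
        print(word, candidate_forms(word))
```

Returning a list rather than a single guess leaves the choice between -s and -r to a later step, such as a dictionary lookup.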