In my Chinese studies, the Lancaster Corpus of Mandarin Chinese (LCMC) has been a useful source of data: word and character frequencies, collocations, phrase usage, parts of speech, etc. The corpus is freely available for non-commercial and research use. However, its data is distributed as a set of XML files, which is not an easy format to work with. XML is also slow to read from, because every tag and the entire document structure must be parsed before the data can be used. A much better format for the data is an SQL database. Stored in a database, many kinds of queries and reports can be executed efficiently and, depending on the software, return results much faster than they would against the XML files.
I have made available a Perl script and some related tools to assist with extracting the LCMC files into a SQLite database. SQLite is a lightweight relational database management system designed for portability and ease of use. Because it runs without a separate server process (it is not client-server), it is easy to install and use. It’s more ubiquitous than you might think. It’s how the Firefox and Chrome browsers store their history, cookies, and preferences. It’s also used, for example, by the Anki program as the storage format for flashcard data, and by the Calibre e-reader program to store information on installed e-books.
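To give a feel for why the database form is convenient, here is a minimal sketch in Python (not the Perl script mentioned above). It loads a tiny hand-made XML sample into an in-memory SQLite table and runs a word-frequency query. The element and attribute names (`file`, `s`, `w`, `POS`), the table layout, and the function names are all assumptions made for illustration; check them against the actual LCMC files before relying on them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data. The tag and attribute names here are
# assumptions for illustration, not a confirmed copy of the LCMC schema.
SAMPLE_XML = """
<text ID="A">
  <file ID="A01">
    <s n="0001"><w POS="r">我</w><w POS="v">学习</w><w POS="n">中文</w><c POS="ew">。</c></s>
    <s n="0002"><w POS="r">他</w><w POS="v">学习</w><w POS="n">中文</w><c POS="ew">。</c></s>
  </file>
</text>
"""


def load_words(conn, xml_text):
    """Parse the XML once and store one row per <w> element."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(file_id TEXT, sentence TEXT, token TEXT, pos TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = []
    for file_el in root.iter("file"):
        for sent in file_el.iter("s"):
            for word in sent.iter("w"):  # punctuation (<c>) is skipped
                rows.append(
                    (file_el.get("ID"), sent.get("n"), word.text, word.get("POS"))
                )
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def word_frequencies(conn, limit=10):
    """Return (token, count) pairs, most frequent first."""
    cur = conn.execute(
        "SELECT token, COUNT(*) AS freq FROM words "
        "GROUP BY token ORDER BY freq DESC, token ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
inserted = load_words(conn, SAMPLE_XML)
top = word_frequencies(conn)
for token, freq in top:
    print(token, freq)
```

The XML is parsed only once, at load time. After that, each new report (frequency by part of speech, words per file, and so on) is just another `SELECT` against the same table.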
Transcriber, Audacity, and Anki are three programs, all free and open source, that are useful for language study. At some point in the future, I hope to write more on each of these. In the meantime, I wanted to announce two export plugins I created for Transcriber. One plugin creates a label file for Audacity, for splitting an audio file into individual clips, and the other creates an import file for Anki, associating the transcribed text with the audio segments. Below are step-by-step instructions for the six steps involved, starting from a raw audio file and finishing with a set of Anki flashcards.
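For readers curious what the two export files look like, here is a rough Python sketch (the plugins themselves are written for Transcriber and are not shown here). An Audacity label file is plain text with one tab-separated line per region: start time, end time, label. Anki can import tab-separated text and plays audio referenced with `[sound:filename]`. The segment data, the `clipNNN` naming scheme, and the function names below are assumptions for illustration only.

```python
import csv
import io

# Hypothetical transcript segments: (start_seconds, end_seconds, text).
SEGMENTS = [
    (0.0, 2.5, "你好"),
    (2.5, 6.25, "我在学习中文"),
]


def audacity_labels(segments):
    """Build Audacity label-track text: start<TAB>end<TAB>label per line."""
    lines = []
    for i, (start, end, _text) in enumerate(segments, start=1):
        # The label doubles as the clip name (assumed naming scheme).
        lines.append(f"{start:.6f}\t{end:.6f}\tclip{i:03d}")
    return "\n".join(lines) + "\n"


def anki_import(segments, ext="mp3"):
    """Build tab-separated Anki import text: transcript<TAB>[sound:...]."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for i, (_start, _end, text) in enumerate(segments, start=1):
        writer.writerow([text, f"[sound:clip{i:03d}.{ext}]"])
    return buf.getvalue()


labels_text = audacity_labels(SEGMENTS)
anki_text = anki_import(SEGMENTS)
print(labels_text)
print(anki_text)
```

Using the same `clipNNN` name in both files is what ties each flashcard to its audio clip. The exported clips still have to be copied into Anki's media folder for the `[sound:...]` references to play.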
I recently bought an Amazon Kindle, for the primary purpose of reading more Chinese. It has turned out to be a great investment, since I am no longer tied to my computer screen for reading things I find online. I had been collecting bookmarks to online book sites for a long time without making much use of them. Now that I am a bigger consumer of reading material, I’m starting to make use of them. In particular, I need sites that allow for downloading the raw text, so that I can convert it into a formatted book.
Being able to stream internet radio stations from all over the world is a great opportunity for language learners. Living in a country full of native speakers is the ideal environment for listening practice, but not everyone has the chance to travel or to spend significant time in the target country. For the rest of us, listening to live radio online gives us a touch of authentic culture, whether it’s a call-in show, traditional or pop music, or even commercial ads. In addition to streaming radio, podcasts are another way to listen to native speakers. Because podcasts are individual files that can be downloaded, they make suitable material for listening practice, with the ability to repeat or slow down sections that are unclear. But podcasts in Chinese are few in number. Being able to record streaming radio would give the learner a wealth of practice material with massive variety. Or you can simply record hours of music for your listening pleasure.