I am a faithful user of flashcards to study Chinese words, with Anki as my software of choice to take care of the spaced repetition rescheduling. Even though I try to keep my queue empty on a daily basis, there are still days when I feel like I’m swimming against the tide. If I look at my forecast of upcoming cards, the level of daily cards quickly drops to a low baseline after a week or so. Yet, I never seem to reach the level that Anki’s forecast graph promises me. Then there are other days where I get weary of the constant drilling and skip a few days. When I come back to study, I have a large queue of overdue cards waiting for me (as expected). However, once those cards are cleared, Anki’s forecast of future cards is surprisingly good—maybe better than if I hadn’t skipped those days. Am I being punished for my diligence? Is this just my perception of the flashcard experience, or am I encountering something tangible related to SRS scheduling?
A way to test various theories was to create a simulation of Anki’s SRS scheduling.
In my Chinese studies, the Lancaster Corpus of Mandarin Chinese (LCMC) has been a useful source of data: word and character frequencies, collocations, phrase usage, parts of speech, etc. The corpus is freely available for non-commercial and research use. However, its native form is a set of XML files, which is not an easy format to work with. The XML data is also slow to read, because all the tags and the entire data structure need to be parsed before anything can be looked up. A much better format for the data is an SQL database. Once the data is stored in a database, many kinds of queries and reports can be run efficiently and, depending on the software, return results much faster than the same lookups against the XML files.
I have made available a Perl script and some other related tools to assist with extracting the LCMC files into a SQLite database. SQLite is a lightweight relational database management system intended for portability and ease of use. Because it runs embedded in the program that uses it, with the whole database kept in a single file (not client-server), it is easy to install and use. It’s more ubiquitous than you might think. It’s how the Firefox and Chrome browsers store data such as browsing history and cookies. It’s also used, for example, by the Anki program as the storage format for flashcard data, and by the Calibre e-reader program to store information on installed e-books.
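To give a feel for why the database form is more convenient, here is a minimal Python sketch (not the Perl script mentioned above). It loads a tiny XML sample into an in-memory SQLite database and runs a word frequency query. The sample markup, the `tokens` table, and the function names `load_tokens` and `word_frequencies` are all assumptions made for illustration; the real LCMC files and the schema produced by the script may differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumed for
# illustration and may not match the real LCMC markup.
SAMPLE_XML = """
<text id="A">
  <s n="1"><w POS="r">我</w><w POS="v">学习</w><w POS="n">中文</w></s>
  <s n="2"><w POS="r">我</w><w POS="v">喜欢</w><w POS="n">中文</w></s>
</text>
"""


def load_tokens(conn, xml_text):
    """Parse the XML once and store one row per word token."""
    root = ET.fromstring(xml_text.strip())
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "text_id TEXT, sentence INTEGER, position INTEGER, "
        "word TEXT, pos TEXT)"
    )
    rows = []
    for s in root.iter("s"):
        for i, w in enumerate(s.iter("w"), start=1):
            rows.append((root.get("id"), int(s.get("n")), i, w.text, w.get("POS")))
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def word_frequencies(conn, limit=10):
    """Return (word, count) pairs, most frequent first."""
    return conn.execute(
        "SELECT word, COUNT(*) AS freq FROM tokens "
        "GROUP BY word ORDER BY freq DESC, word LIMIT ?",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")  # use a file path to keep the data on disk
load_tokens(conn, SAMPLE_XML)
for word, freq in word_frequencies(conn):
    print(word, freq)
```

The XML is parsed only once, at load time. After that, each new question about the data (frequencies by part of speech, words per sentence, and so on) is just another SQL query against the same table.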