Tag: Linguistics

Recently, I created a set of flashcards of single Chinese characters, to practice writing. The front of my Anki cards contained the pinyin, definition, and clozes for the most common words containing the character, while the back of the card was simply the character. I tagged the cards in groups of 200 by frequency rank, using tags of “1-200”, “201-400”, etc. I already know a number of characters, so I decided to start practicing with the more infrequent characters, the “1601-1800” tag.
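The tagging scheme is simple enough to sketch in a few lines. The function below is only an illustration of the grouping described above, not the code I actually used to build the deck; the name `frequency_tag` and the `group_size` parameter are made up for this example, and ranks are assumed to start at 1.

```python
def frequency_tag(rank: int, group_size: int = 200) -> str:
    """Return a tag such as '1601-1800' for a 1-based frequency rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # First rank of the group that contains this rank (1, 201, 401, ...).
    start = (rank - 1) // group_size * group_size + 1
    return f"{start}-{start + group_size - 1}"


# Print the tag for a few ranks on either side of a group boundary.
for rank in (1, 200, 201, 1601, 1800):
    print(rank, frequency_tag(rank))
```

Subtracting 1 before the integer division keeps rank 200 in the first group instead of spilling it into the second.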

There were some characters I was well familiar with. Other characters took more time to remember how to write, but weren’t too difficult, as I knew the characters on sight from extensive reading. But every once in a while I would be shown a card for a character I had never seen before in 6 years! Some, like 贼 (zéi, thief) or 鹏 (péng, a mythical bird), were surprising to see in the 1601-1800 range for frequency ranks, ranked as more frequent in the Lancaster Corpus than 垂 (chuí, to hang down) and 夹 (jiā, to squeeze). But however unusual they were, I still recall encountering them at some point (金色飞贼 is the Golden Snitch in Quidditch from Harry Potter, and 鹏 came up while reading about Chinese mythical animals). However, 琉 (liú, glazed tile) and 鲍 (bào, abalone) don’t look familiar at all, and I am fairly certain I have never seen the characters 萼 (è, calyx of a plant) and 懋 (mào, diligent) in over 6 years of study. Is it just strange chance that I haven’t encountered them, is it failing memory, or are they rarer than their frequency rank would suggest?

Hapax Legomena vs. the Brick Wall

The Brick Wall

With reading as my primary skill of focus in learning Chinese, a large part of my study is acquiring new words. Some vocabulary is from general word lists such as the HSK, while much of it is tied to a specific text I am reading, in order to increase my level of comprehension. While many approach the task of reading in a foreign language by looking up unknown words as they are encountered, I prefer to learn them ahead of time, to avoid the break in concentration while reading. With my bad habit of perfectionism, my main strategy in the past for learning these words has been the “Brick Wall Method”:

The Brick Wall Method – Learn every unknown word you encounter, no matter how difficult or rare it is

My theory has been — like being a brick wall against a tennis player — to not let any unknown word get past me, so that eventually I will run out of unknown words and thus will have learned the language. If a word is used in a text, it’s clearly important to some nominal degree, and if it’s used once, then it’s more likely to be seen again at some point, versus all the words that aren’t in the text.

A Mathematical Model for Chinese Word Knowledge

The Known Chinese Words Test has been running for a month now. During that time I’ve collected data from 170 trials, from learners with a wide range of levels. The results are encouraging enough that I can now give more details about what I have found. What I have been working on is a mathematical model for word knowledge, which can describe the probability that a particular person knows any given word, with just a few variables involved. The results from the collected trials validate that hypothetical model, and I’m elated.

By 2008, I had been studying Chinese off and on for around 3 years. As a self-learner, my study was rather eclectic: Pimsleur, Chinesepod, and random flash card lists were my main methods. I was far from fluent, still struggling to understand all but the simplest news articles, fiction, or blog posts. But I felt like I did know a lot of words; I just didn’t know how many. How much longer before this would start to get easy? So I undertook a self-examination to estimate how many Chinese words I actually knew.

In forums for foreign language learners,1 2 certain questions recur. Some are easy to answer (Do I need to learn tones? Yes). Others (Should I study words or characters?) may have more than one answer. Two common questions I am especially interested in are:

  • How many words do I know?
  • How many words do I need to know (to read a newspaper, book, etc.)?

The second question is not immediately answerable except under certain conditions: if you can’t yet read your target material, then the answer is: More! Of course, people ask the second question because they want to know how close they are to their target fluency level, and the answer to the first question can give a rough estimate of that.

Word Frequencies from Corpus Data

Knowing the frequency of the Chinese words you are studying is helpful in a few different ways. If an unknown word is relatively common, then it’s generally more important to learn that word, compared to a less common word. With that knowledge in hand, you can feel less guilty about removing the rare words from your flashcards, and persist in learning the ones that are common yet difficult. If a word has a low frequency in general, but happens to be used a lot in a particular text, that word may be of interest to study. In the early stages of learning, studying the top N (100, 200, 500, etc.) words as flashcards is an effective way to bootstrap one’s word knowledge before diving into authentic texts. But it’s not 100% effective; the long tail of infrequent words will keep you busy learning new vocabulary for years!
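As a rough sketch of that filtering idea, the snippet below keeps a word from a text if it is either common in a general corpus or used often in the text itself. Everything in it is hypothetical: the function names, the `max_rank` and `min_text_count` thresholds, and the toy counts, which are invented for illustration and not taken from any real corpus.

```python
from collections import Counter


def rank_words(corpus_counts):
    """Map each word to its 1-based frequency rank (1 = most common)."""
    # Sort by descending count; break ties by the word itself so ranks are stable.
    ordered = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}


def words_worth_studying(text_tokens, corpus_counts, max_rank, min_text_count):
    """Keep words that are common in general or frequent in this text."""
    ranks = rank_words(corpus_counts)
    text_counts = Counter(text_tokens)
    keep = []
    for word, count in text_counts.items():
        rank = ranks.get(word)  # None if the word is missing from the corpus
        common_in_general = rank is not None and rank <= max_rank
        common_in_text = count >= min_text_count
        if common_in_general or common_in_text:
            keep.append(word)
    return keep


# Invented counts, purely for illustration.
toy_corpus = {"的": 1000, "是": 800, "人": 500, "鲍": 3, "萼": 1}
toy_text = ["的", "鲍", "鲍", "鲍", "萼", "人"]
print(words_worth_studying(toy_text, toy_corpus, max_rank=2, min_text_count=3))
```

With these toy numbers, 的 is kept because it is common in general, and 鲍 is kept because it appears three times in the text despite its low corpus rank; 萼 and 人 are dropped. Where to set the two thresholds is a judgment call that depends on your level and on the text.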

