People study foreign languages in many different ways. Because my main goal is reading, my particular method for studying Chinese places a large emphasis on acquiring receptive vocabulary: knowing the pinyin and the definition of words from the written characters. This is done through either flashcard software (I use Stackz) or spaced repetition software (like Anki). If I have an electronic text available, I use home-grown scripts to segment the text into words, and then create a word list of all the unique words. If I only have a printed book or magazine, I pick out the unknown words by hand, although this can be overwhelming with a difficult text.

Studying long lists of words via flashcards may sound like boring drudgery to some, but personally I enjoy the memorization process, so it works for me. And I can’t imagine a more time-efficient way of acquiring a large quantity of words. My first foray into reading Chinese started around the time I had completed the first 60 lessons in Pimsleur, with a list of the most frequent 500 words. It was a quick leg up on recognizing words and reading simple texts, although a deeper understanding of things like why quilt (被) was such a common word would come later. After that, I just made flashcards of whatever text I was reading: Chinesepod, iMandarinpod, blog posts, news articles, etc. I hadn’t yet discovered spaced repetition software; however, SRS isn’t very good at introducing large quantities of totally unfamiliar words anyway.

Now, I have advanced to longer and more difficult texts. The word lists from these texts are large; while a set of 70-150 words is manageable for a single flashcard study session, the list generated from a book chapter could contain 200-500 words that are totally unfamiliar. It takes some tedious pruning to cut down the raw list to a digestible size. This raises the question: What makes a good vocabulary list? Is there a formula for picking the most interesting words, and can the process be automated?

Case study: 《哈利波特与魔法石》

In Chapter 1 of the Chinese version of Harry Potter and the Sorcerer’s Stone, there are 4,747 words (tokens). This corresponds to 1,479 unique Chinese words (types). This is a large list of words, and it was time-consuming just to go through each one and identify the ones worthy of study. I completed the task with the following result:

480 words in the (pre-2009) HSK levels 1-3, which were mostly known already, so I could exclude them out of hand
698 known, or easily guessable from the characters
301 identified as worth studying
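
The token and type counts above can be reproduced for any text once it has been segmented into words. The following is only a minimal sketch, not the home-grown scripts mentioned earlier: it assumes the segmentation has already been done by some other tool, and the function names and the toy sentence are made up for illustration.

```python
from collections import Counter

def is_chinese_word(token):
    """True if the token contains at least one CJK unified ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)

def count_tokens_and_types(tokens):
    """Return (token count, type count, per-word frequency Counter)."""
    # Drop punctuation and other non-Chinese tokens before counting.
    words = [t for t in tokens if is_chinese_word(t)]
    counts = Counter(words)
    return len(words), len(counts), counts

# Toy, already-segmented input (hypothetical; not taken from the novel).
segmented = ["猫头鹰", "飞", "过", "了", "窗户", "，",
             "猫头鹰", "又", "飞", "走", "了", "。"]
n_tokens, n_types, counts = count_tokens_and_types(segmented)
print(n_tokens, n_types)        # 10 tokens, 7 types
print(counts.most_common(3))
```

The Counter is the useful by-product here: its keys are the unique word list, and its values are the within-text frequencies used by some of the criteria discussed later.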

The final list of 301 words is quite a large flashcard set to be used with my usual method of study, the brick wall method, in which I don’t eliminate any word, no matter how unimportant or difficult to remember. Among the more frequent words are some specific to the novel — 猫头鹰 (owl), 长袍 (robe) — which need to be known in order to understand certain passages, even though their occurrence outside the novel is rare. Most of the other unfamiliar words, on the other hand, are somewhat useful to know in a general sense, but the gist of most passages can be figured out without fully knowing the words. However, in order to build my vocabulary, I chose to study these words as well. In order to make the study more manageable, I did separate the words into two lists, one with the 61 words that were used more than once in the chapter, and another with the remaining 240 words that were only used once. I mainly studied the first list, while studying the second one more casually and as time allowed.

Another characteristic of the Chinese version of the book is a higher-than-average use of onomatopoeia and transliterations. The onomatopoeia (for example, 咔哒, 咯咯笑, 喃喃, 噔, and many more) are simply due to the book being narrative fiction, plus being translated closely from English. There are also many transliterations from the original English words; almost all proper names are directly transliterated using standard syllable equivalents, with only a few exceptions. Chengyu also make their appearance in the text; there are around 15-20 chengyu in the first chapter. The ones used more than once in chapter 1 are 目不转睛 (unable to take one’s eyes off) and 一模一样 (exactly the same).

What criteria?

Whatever set of criteria determines the useful words to study, ultimately it will be about scoring the words with a usefulness value, and either cutting off the words below a certain value or splitting them into separate lists of primary and supplemental words. In scoring, there are many different ways of evaluating the words. Rather than using a continuous numeric range of scores, the conceptual categories below treat these criteria as different ways of partitioning the words into different lists, of higher priority and lower priority.

High frequency vs. Low frequency

If the goal is simple raw throughput (learning all the words in the language in the fastest possible time), then scoring the words by their frequency in some large corpus is the optimal method. Higher frequency words will always be more important than less frequent ones. This is because if one skips a high frequency word in a list in favor of a less frequent word, the skipped word will just be more likely to show up in subsequent texts. If it’s going to keep showing up, it’s best to learn it and get it out of the way.

This by itself is rather unsatisfying. Certain topic-specific words will be inherently important for understanding a text, even though they are uncommon words outside of the text. Ignoring these words is doing a disservice to comprehension of the text at hand. Another disadvantage I have found is that most of the words in my initial lists are already low-frequency — all less than around 100 occurrences per 1 million words — such that there is little difference between the words.
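
Ranking by corpus frequency, with a cutoff between primary and supplemental words, might be sketched as follows. This is an illustration under stated assumptions: the frequency table is a plain dictionary of occurrences per million words, the function names are hypothetical, and apart from the 31 per million quoted later for 看上去, the numbers are made up.

```python
def rank_by_corpus_frequency(words, corpus_freq):
    """Sort words from most to least frequent in a reference corpus.

    Words missing from the table are treated as frequency 0.
    """
    return sorted(words, key=lambda w: corpus_freq.get(w, 0.0), reverse=True)

def split_at_cutoff(ranked_words, corpus_freq, cutoff):
    """Split words into (primary, supplemental) at a frequency cutoff."""
    primary = [w for w in ranked_words if corpus_freq.get(w, 0.0) >= cutoff]
    supplemental = [w for w in ranked_words if corpus_freq.get(w, 0.0) < cutoff]
    return primary, supplemental

# Occurrences per million words; mostly made-up values for illustration.
corpus_freq = {"看上去": 31.0, "长袍": 4.0, "猫头鹰": 2.0}
words = ["猫头鹰", "咔哒", "看上去", "长袍"]

ranked = rank_by_corpus_frequency(words, corpus_freq)
primary, supplemental = split_at_cutoff(ranked, corpus_freq, cutoff=3.0)
print(primary)        # ['看上去', '长袍']
print(supplemental)   # ['猫头鹰', '咔哒']
```

In this toy run, 猫头鹰 falls below the cutoff even though it matters a great deal in the novel, which is exactly the weakness described above.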

An alternative is to differentiate between the word frequencies just within the text, without regard to the general frequency in a larger corpus. For example, one could pay more attention to the words used more than once, and regard the words used only once as supplemental information. Or, for a longer text, the same division can be made using counts from the whole text, with the lists then broken down by chapter or section. That way, the importance of words used just once in chapter 1 but more frequently in other chapters is not underestimated.
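
A small sketch of this within-text split, assuming the text is already segmented. The function name and the two toy "chapters" are hypothetical. The same function is run once with chapter-only counts and once with whole-book counts, to show how a word used once in chapter 1 can still be promoted.

```python
from collections import Counter

def split_by_repetition(study_words, tokens, min_count=2):
    """Split study_words into (repeated, used_once) by their counts in tokens."""
    counts = Counter(tokens)
    repeated = [w for w in study_words if counts[w] >= min_count]
    used_once = [w for w in study_words if counts[w] < min_count]
    return repeated, used_once

# Toy segmented chapters (hypothetical data).
chapter1 = ["猫头鹰", "长袍", "猫头鹰", "斗篷"]
chapter2 = ["斗篷", "斗篷"]
study_words = ["猫头鹰", "长袍", "斗篷"]

# Counting within chapter 1 only: 斗篷 looks like a one-off word.
print(split_by_repetition(study_words, chapter1))
# Counting over the whole book: 斗篷 moves to the repeated list.
print(split_by_repetition(study_words, chapter1 + chapter2))
```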

Broad vs. domain-specific occurrence

Like frequency, this criterion considers the word occurrences in a large balanced corpus, but here it weighs how evenly distributed the word is across a variety of different texts and broad categories. For example, while having nearly identical overall frequencies of 31 per million, the words 草案 (draft of a plan or law) and 优越性 (superiority; advantage) are found almost exclusively in press reporting, while 偶尔 (occasionally) and 惊讶 (amazed) are mainly found in fiction. In contrast, 减轻 (lighten; mitigate) and 看上去 (it seems …) are used in all kinds of texts. In linguistic terms, the latter have a higher value of dispersion in the corpus. This is a value that can be computed using a variety of related equations.
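
One of those equations is Juilland's D, shown here as a sketch rather than as the measure any particular corpus uses. It assumes the corpus is divided into n parts of equal size and takes a word's frequency in each part: D = 1 - V / sqrt(n - 1), where V is the standard deviation of the frequencies divided by their mean. D is 1 for a perfectly even spread and 0 when every occurrence falls in a single part. The per-genre counts below are invented.

```python
from math import sqrt
from statistics import mean, pstdev

def juilland_d(part_frequencies):
    """Juilland's D for one word, given its frequency in n equal-sized parts."""
    n = len(part_frequencies)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    avg = mean(part_frequencies)
    if avg == 0:
        return 0.0  # the word never occurs, so dispersion is not meaningful
    variation = pstdev(part_frequencies) / avg  # coefficient of variation
    return 1.0 - variation / sqrt(n - 1)

# Invented counts in four equal-sized parts: press, fiction, academic, other.
press_only = [30, 1, 0, 0]
fiction_heavy = [2, 25, 1, 3]
evenly_spread = [8, 7, 8, 8]

for name, freqs in [("press_only", press_only),
                    ("fiction_heavy", fiction_heavy),
                    ("evenly_spread", evenly_spread)]:
    print(name, round(juilland_d(freqs), 2))
```

Two words with the same total frequency can therefore get very different scores, which is the distinction drawn above between 草案 and 减轻.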

Anomalous frequency vs. expected frequency

This is similar to broad vs. domain-specific occurrence, but here it refers specifically to the word frequency in the current text under study relative to the reference corpus. Proper names would commonly fall into this category, but also unusual words that are a main subject of the text, or words favored by the author of the text. Perhaps one would want to keep these words in a separate list, studying them enough for short term remembering without worrying about the longer term.
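
A simple way to flag such words is to compare a word's rate in the text with its rate in the reference corpus. The sketch below uses a plain ratio of per-million rates, with a small smoothing constant so that words absent from the reference corpus do not cause a division by zero. The function names, the smoothing choice, and the example counts are assumptions; only the 4,747-token chapter size comes from the case study.

```python
def per_million(count, total):
    """Convert a raw count into occurrences per million words."""
    return count * 1_000_000 / total

def frequency_ratio(text_count, text_total, ref_count, ref_total, smoothing=1.0):
    """Ratio of a word's rate in the text to its rate in a reference corpus.

    `smoothing` is added to the reference count to avoid dividing by zero.
    """
    text_rate = per_million(text_count, text_total)
    ref_rate = per_million(ref_count + smoothing, ref_total)
    return text_rate / ref_rate

CHAPTER_TOKENS = 4_747          # chapter 1 size from the case study
REFERENCE_TOKENS = 1_000_000    # hypothetical reference corpus size

# Hypothetical counts: a novel-specific word vs. an ordinary common word.
anomalous = frequency_ratio(10, CHAPTER_TOKENS, 3, REFERENCE_TOKENS)
ordinary = frequency_ratio(5, CHAPTER_TOKENS, 1_052, REFERENCE_TOKENS)
print(round(anomalous), round(ordinary, 2))
```

Words with a very large ratio are candidates for the separate short-term list. A ratio near 1 means the word is about as common in the text as anywhere else.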

Contextually essential vs. ancillary vs. superfluous

This is a value judgment on how essential the word is to understand the meaning of the containing sentence; in other words, how well the sentence or paragraph can be understood if the word is removed. A word may not necessarily be used many times in the text, but certain words happen to be the key to comprehension. Other words, especially in narrative fiction, serve to add useful but non-essential information, while many words simply add color and atmosphere to a storyline, and aren’t essential to general comprehension. Determining the importance to comprehension of each word involves testing each word individually and cannot be automated. Thus, it is a time-consuming process.

Noun/verb vs. adjective/adverb vs. conjunction vs. miscellaneous

This is somewhat similar to the contextual criterion above. However, it may be more efficient because it doesn’t require reading every sentence and testing the effect of word deletion. Generally, nouns and verbs tend to carry the most important meaning. Adjectives and adverbs add description to the nouns and verbs, and are more likely to be non-essential. Conjunctions and other connecting or functional words tend to be important for putting all the parts of a sentence together; however, as abstract terms or grammatical function words, these utility words are among the more difficult to memorize from flashcards. Thus, it may be better to separate them from other words and use other methods to study them.
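
If the words have already been tagged with a part of speech, the partition itself is simple. The sketch below assumes (word, tag) pairs from some external tagger. The single-letter tags (n, v, a, d, c) and the group names are assumptions chosen for illustration, and the tags in the sample are hand-assigned.

```python
# Map part-of-speech tags to study groups; any other tag falls into "misc".
TAG_GROUPS = {
    "n": "noun_verb", "v": "noun_verb",   # nouns and verbs
    "a": "adj_adv", "d": "adj_adv",       # adjectives and adverbs
    "c": "function",                      # conjunctions
}

def partition_by_pos(tagged_words):
    """Group (word, tag) pairs into study lists keyed by group name."""
    lists = {"noun_verb": [], "adj_adv": [], "function": [], "misc": []}
    for word, tag in tagged_words:
        lists[TAG_GROUPS.get(tag, "misc")].append(word)
    return lists

# Hand-tagged toy sample; "o" (sound word) and "i" (idiom) are not mapped.
tagged = [("猫头鹰", "n"), ("减轻", "v"), ("惊讶", "a"), ("偶尔", "d"),
          ("然而", "c"), ("咔哒", "o"), ("一模一样", "i")]
lists = partition_by_pos(tagged)
for group, words in lists.items():
    print(group, words)
```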

In the miscellaneous words category there are a few different classes of words I would consider. First are the onomatopoeia, the sound words — click, thud, mumble — that add color to written works. I tend to treat these sound words as a separate category. The pronunciation is nearly always easy to guess, since they are so often just a 口 mouth component plus a phonetic component. I can make an educated guess as to the meaning: light ka and ta sounds are quiet, high-pitched, or tapping; gu and du sounds are rumbling, mumbling, or booming.

Another type of miscellaneous word is pure phonetic transliterations. In these words, as long as the pronunciations of the individual characters are known, the full word can be sounded out. Sometimes, the Chinese syllables have changed the sound enough that the original word can’t be guessed. But a one-time look at the correct translation is usually enough to keep it in memory. At any rate, these are mostly used in the names of people or places, and full knowledge of the word isn’t crucial, as long as you can recognize the same word when it’s used consistently in the text.

A third miscellaneous type I would consider is chengyu. These are certainly worthy of study; however, in my experience, there are two types of chengyu: the ones fairly obvious from their characters, and the ones indecipherable from their characters. This latter category includes the deceptive chengyu, which seem simple due to the use of simple characters, but in fact are not what they would appear to be. Personally, I find these misleading enough that they can be as difficult as the obscure ones.

In/not in standard word lists

One standardized word list that has been a great reference in my studies is the HSK word list (the old, pre-2009 one). The old HSK lists were a series of 4 word lists of increasing difficulty, each containing from around 1,000 to 3,500 words. Together, the 4 lists contained more than 8,000 words. Identifying words in a text that are in a standard word list like this is straightforward and can be completely automated.

The resulting intersection of the text word list and the HSK words can be used in two different ways. If one hasn’t formally studied the HSK lists, these words can be considered particularly important. However, if one knows the HSK words fairly well, these words can be excluded from the study as they can be considered already known. For example, I have studied the HSK words in lists 1-3, but not 4. In the Harry Potter case study above, eliminating the known HSK words cuts the original vocabulary list down by about a third (480 of 1,479 words).
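
Matching a text's words against a standard list comes down to two set operations: the words found in both, and the words found only in the text. A minimal sketch follows. The function name is hypothetical, and the tiny "reference list" is made-up data standing in for a real word list.

```python
def split_by_reference_list(text_words, reference_words):
    """Return (words also in the reference list, words not in it)."""
    text_words = set(text_words)
    reference_words = set(reference_words)
    in_list = text_words & reference_words       # shared with the list
    not_in_list = text_words - reference_words   # unique to the text
    return in_list, not_in_list

# Toy data: a stand-in reference list and the unique words of a text.
reference_list = {"偶尔", "减轻", "惊讶", "草案"}
text_words = ["猫头鹰", "偶尔", "长袍", "惊讶", "咔哒"]

in_list, not_in_list = split_by_reference_list(text_words, reference_list)
share = len(in_list) / len(set(text_words))
print(sorted(in_list), sorted(not_in_list), f"{share:.0%}")
```

Depending on how well the reference list is already known, either of the two returned sets can serve as the study list.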


I have just been exploring ideas, so I haven’t reached any conclusions. Ideally, I am looking for a solution that can be largely automated: input a digital text and receive an effective word list. In searching the literature on language teaching, it seems that word lists are hand-chosen by teachers, possibly with the assistance of frequency lists. In practice, the method that seems to work for me is to split the words into a few different lists: 1) transliterations and onomatopoeia; 2) words used more than once in the chapter (usually including chengyu); and 3) words only used once in the chapter.

Funny, though, this entire investigation may now be moot. As I progress through the chapters of 《哈利波特与魔法石》, my reading ability is slowly picking up speed. It still takes mental effort to understand it, but my encounters with seemingly important but unfamiliar words have decreased a lot. I have now reached the point where I can jot down the occasional unknown word or phrase by hand, or quickly look it up in a dictionary. But computer-generated vocabulary lists are still going to be one of my tools, whenever I’m in the mood to quickly bulk up my vocabulary knowledge.