The HSK is a well-known skill level test used by the PRC to assess language proficiency in Chinese. Even for those who have no interest in taking the HSK test, the lists of Chinese words associated with the test are a convenient source of material for learners to study vocabulary. I have used these word lists myself with great success; it was a quick and effective way to gain a huge amount of usable vocabulary.

In 2010, the HSK exam underwent a major reworking, changing the structure of its skill ranks, increasing emphasis on speaking and writing, and revising its vocabulary. Where the “old” pre-2010 word lists consisted of 8,000+ words across 4 levels, the “new” HSK has 5,000 words distributed into 6 levels. Below is a summary of the word counts in the old and new vocabulary lists, based on actual word lists obtained from various sources (see footnotes for details). Note that these include a small amount of double counting (less than 2%) due to words repeated at more than one level, because of either different pronunciation or meaning. Also note that these counts differ slightly from the official word counts reported by Hanban.

Word counts in the old and new HSK word lists

Level    Old HSK    New HSK
1           1007        153
2           2001        150
3           2189        300
4           3587        600
5              -       1300
6              -       2513
Total       8784       5016

Since I had invested so much time in studying the old lists (up to level 3), it was natural to wonder whether I should continue studying my existing flashcards or switch to the new HSK lists. How many of the words I had learned were dropped from the new HSK, and does that mean they are unimportant? If I did switch, at what level should I start studying?

Analysis 1: Frequency Profiles

One way of comparing the sets is to look at how frequent their words are in Chinese. Words that are more frequent in a language tend to be easier to learn, because the learner is exposed to them more often and in a variety of contexts. I based the analysis on each word’s frequency in the Lancaster Corpus of Mandarin Chinese (LCMC). For words that were not found in the corpus, I improved the matching slightly by removing the 儿 ending from erhua words when the corresponding base word was in the corpus. Most of the remaining unmatched words are truly rare, but a few are common. For example, 公共汽车 (gōnggòng qìchē, bus) and 不客气 (bù kèqi, you’re welcome) are not treated as single words by the LCMC. The new HSK also contains “words” that are deliberate combinations, such as 打篮球 (dǎ lánqiú, to play basketball) and 踢足球 (tī zúqiú, to play football).
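The erhua fallback described above could be sketched roughly as follows. This is a minimal illustration, not the script used for the analysis: the function name `lookup_frequency`, the dictionary `toy_freq`, and all the counts in it are made up for the example and are not real LCMC figures.

```python
def lookup_frequency(word, corpus_freq):
    """Return (matched_form, count) for word, or (None, 0) if it is unmatched."""
    if word in corpus_freq:
        return word, corpus_freq[word]
    # Erhua fallback: drop a trailing 儿 only if the base word is in the corpus.
    if len(word) > 1 and word.endswith("儿"):
        base = word[:-1]
        if base in corpus_freq:
            return base, corpus_freq[base]
    return None, 0


# Toy counts invented for illustration only (hypothetical, not LCMC data).
toy_freq = {"一点": 120, "篮球": 15, "打": 900, "儿子": 60}

matched = lookup_frequency("一点儿", toy_freq)    # falls back to the base word
unmatched = lookup_frequency("打篮球", toy_freq)  # a combination, not one corpus word
plain = lookup_frequency("儿子", toy_freq)        # direct hit, nothing stripped
```

Returning the matched form alongside the count makes it easy to review afterwards which words were only found through the fallback.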

Plot of occurrences vs. log word rank in the old HSK

Plot of occurrences vs. log word rank in the new HSK

The plots above show the number of words in narrow slices of log word rank. (Rank is inversely related to frequency: high-frequency words have low rank.) The graphs clearly show that in both the old and the new HSK, each level targets words of a different difficulty. However, while the lowest and highest level lists have little overlap, adjacent lists overlap significantly. Note that plotting on a log-rank scale yields a fairly symmetrical bell curve, suggesting a Gaussian distribution, even though the log scale exaggerates the slope of the left half of the curve and compresses the right half. I suspect this is just accidental, but the shape is close enough to a Gaussian that I can still estimate the peaks by fitting the curves.
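For readers who want to try something similar, here is a rough sketch of the two steps: binning words into slices of log rank, and estimating the peak. It is not the fitting procedure behind the numbers in this post. Instead of fitting a curve to the histogram, it swaps in a simpler moment estimate (the mean and standard deviation of the log ranks), which agrees with a curve fit only to the extent that the data really are Gaussian in log rank. The function names, the bin width, and the toy ranks are all assumptions made for the example.

```python
import math
from collections import Counter


def log_rank_histogram(ranks, bin_width=0.25):
    """Count words per slice of natural-log rank, keyed by the slice's lower edge."""
    counts = Counter()
    for rank in ranks:
        lower_edge = math.floor(math.log(rank) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def gaussian_peak_rank(ranks):
    """Moment estimate of a Gaussian in log rank: returns (peak_rank, sigma_of_log_rank)."""
    logs = [math.log(rank) for rank in ranks]
    mu = sum(logs) / len(logs)
    variance = sum((x - mu) ** 2 for x in logs) / len(logs)
    # The peak of a Gaussian sits at its mean, so the peak rank is exp(mean log rank).
    return math.exp(mu), math.sqrt(variance)


# Toy ranks invented for illustration; real input would be one rank per word in a level.
toy_ranks = [100, 1000, 10000]
hist = log_rank_histogram(toy_ranks, bin_width=1.0)
peak, sigma = gaussian_peak_rank(toy_ranks)
```

Ranks must be 1 or greater for the logarithm to be defined, so words that were not matched in the corpus have to be left out before calling either function.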

Peaks from Gaussian distribution fitting

Level    Old HSK peak rank    New HSK peak rank
1                      967                  492
2                     2756                  640
3                     5991                 1304
4                    10263                 1903
5                        -                 3367
6                        -                 8500

From this, we can estimate which old levels match to the new:

  • Old level 1 -> New level 1+2+3
  • Old level 2 -> New level 4+5
  • Old level 3 -> New level 5+6
  • Old level 4 -> New level 6 (but only the more difficult words of the new list)

Analysis 2: Word-for-Word Matching

A second way of looking at the differences between the lists is to follow the individual words themselves. Do the words in each old level mostly go to particular levels in the new HSK? Even before beginning the analysis, the large difference in list sizes indicates that many words in the old lists must have been discontinued. With roughly 8,000 words in the old lists and 5,000 in the new, it’s unavoidable that at least 37% of the old words will not be reused.
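The 37% figure is just a lower bound from the list sizes: even if every new word came from the old lists, the surplus old words would have nowhere to go. A small sketch of that arithmetic, using the rounded sizes from the text (the function name is made up for the example):

```python
def min_discontinued_fraction(old_size, new_size):
    """Smallest possible share of old words that cannot appear in the new list."""
    return max(0, old_size - new_size) / old_size


# Rounded list sizes as quoted in the text.
bound = min_discontinued_fraction(8000, 5000)  # 0.375, i.e. at least 37%
```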

                               New HSK levels
Old level           Words     1     2     3     4      5      6   Not in new (% discontinued)
1                     997   136   128   204   187     83      9    250 (25%)
2                    1959     4    12    67   318    680    171    707 (36%)
3                    2127     2     2    10    41    314    689   1069 (50%)
4                    3561     0     0     4    19    117   1290   2131 (60%)
(newly introduced)      -    11     8    13    30    105    338      -
Total                8644   153   150   298   595   1299   2497   4157 (48%)

(Note: When comparing the lists, to avoid double counting I have filtered out words that appear in more than one level with different readings, retaining the word only on the lowest level in which it appears.)
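A table like the one above can be built in two steps: collapse each list so that every word is kept only at its lowest level, then count (old level, new level) pairs. The sketch below is one way to do that, not the actual script behind the table. The function names are made up, and the level assignments in the toy data are invented for illustration.

```python
from collections import Counter


def lowest_level(entries):
    """Map each word to the lowest level at which it appears in (word, level) pairs."""
    levels = {}
    for word, level in entries:
        if word not in levels or level < levels[word]:
            levels[word] = level
    return levels


def cross_tabulate(old_entries, new_entries):
    """Count words per (old_level, new_level); None marks 'not in that list'."""
    old = lowest_level(old_entries)
    new = lowest_level(new_entries)
    table = Counter()
    for word, old_level in old.items():
        table[(old_level, new.get(word))] += 1  # new level is None if discontinued
    for word, new_level in new.items():
        if word not in old:
            table[(None, new_level)] += 1       # newly introduced word
    return table


# Toy entries; the level numbers here are invented, not taken from the real lists.
old_entries = [("好", 1), ("好", 2), ("大学", 1), ("东边", 1), ("电话", 2)]
new_entries = [("好", 1), ("电话", 1), ("手机", 2)]
table = cross_tabulate(old_entries, new_entries)
```

In this toy run, 好 is counted once at old level 1 despite appearing twice, 大学 and 东边 land in the "not in new" cell for old level 1, and 手机 is counted as newly introduced.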

The table above summarizes the fate of the old HSK words. As expected, a large share of them (48%) was not included in the new HSK. Besides the remaining 52% of the old words that were reused, 505 words (10% of the new HSK) are new additions. Among the old words that were retained, the mapping from old to new can be roughly summarized as follows:

  • Old level 1 -> New 1+2+3+4
  • Old level 2 -> New 4+5
  • Old level 3 -> New 5+6
  • Old level 4 -> New 6

The two schemes proposing mappings between the old and new HSK vocabulary agree closely. The main difference between them is that the analysis based on word reuse clearly shows that many of the words in the old level 1 end up in the new level 4, which the word frequency analysis does not detect. On the other hand, the graphs of word ranks in the old and new levels show more clearly that while the old level 4 roughly maps to the new level 6, it does so only for the rarer words of the new level 6.

For Chinese learners who have studied with the old word lists, here is a useful list of the words unique to either the old or new HSK.

As time permits, I hope to continue looking at these lists on a more qualitative level. For example, is there a common theme among the words that were discontinued, or among those that were added? Just glancing at the newly added words, some are internet- and computer-related (手机, 电子邮件, 上网, 笔记本), and others are useful verb-object compounds (打电话, 打篮球, 爬山, 刷牙 and many more). Some redundancy was removed from the old lists. For example, 北方 was retained, as were the single characters 东, 南, and 西, but all the other x+方 words, along with all the x+边, x+部, and x+面 words, were removed (and good riddance to the last three, which were not only obvious but rarely used in real life). On the other hand, many of the removed words seem quite important. 窗 (window), 春天 (spring), 大学 (university)?! These are words you would no longer be exposed to if you relied only on the new HSK word lists.


Notes:

  • My source for the old HSK word lists was a merging of HSK Flashcards, chinese-forums, and Wiktionary. The Wiktionary and chinese-forums lists were nearly identical, but there were significant differences between those two and the HSK Flashcards list.

  • The source for the new HSK word list was from Lingomi.

  • There is a chinese-forums thread also on this topic, although it concentrates more on the character lists instead of word lists.