Word Frequencies from Corpus Data

Knowing the frequency of the Chinese words you are studying is helpful in a few different ways. If an unknown word is relatively common, then it’s generally more important to learn that word than a less common one. With that knowledge in hand, you can feel less guilty about removing the rare words from your flashcards, and persist in learning the ones that are common yet difficult. If a word has a low frequency in general, but happens to be used a lot in a particular text, that word may still be worth studying. In the early stages of learning, studying the top N (100, 200, 500, etc.) words as flashcards is an effective way to bootstrap one’s word knowledge before diving into authentic texts. But it’s not 100% effective; the long tail of infrequent words will keep you busy learning new vocabulary for years!

So, how can we obtain word frequency data? With Chinese, it’s trickier than it sounds. Getting character frequency data is quite simple: first, get some text (the internet’s your oyster), filter out every non-Chinese character, then count the rest, character by character. To count Chinese words, on the other hand, we first need to know where each word begins and ends, in a language where all the words run together. Computers can do a fair job, with up to 98% accuracy (ref.: The Third International Chinese Language Processing Bakeoff). But getting to 100% accuracy requires human oversight. This is where a manually-segmented corpus comes in handy.
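The character-counting recipe above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sample sentence are made up for the example, and "Chinese character" is assumed to mean the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves out the rarer extension blocks.

```python
from collections import Counter


def is_chinese(ch: str) -> bool:
    """Assumption: 'Chinese' means the basic CJK block, U+4E00..U+9FFF."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(text: str) -> Counter:
    """Drop every non-Chinese character, then count what is left."""
    return Counter(ch for ch in text if is_chinese(ch))


# Hypothetical sample text; in practice this would be a large body of text.
sample = "我们学习中文,我们喜欢中文。Hello 123"
freqs = char_frequencies(sample)

# Print characters from most to least frequent.
for ch, count in freqs.most_common():
    print(ch, count)
```

Punctuation, Latin letters, and digits fall outside the range and are silently skipped, so only the Chinese characters are tallied.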

Corpus Basics

A corpus is a collection of texts that has been marked up and tagged in various ways, to suit a particular academic purpose. For example, each word may be tagged as a particular part of speech, and may make distinctions beyond simple parts of speech: whether it’s a proper noun, foreign word, etc. A more in-depth tagging system may indicate how groups of words make up noun phrases, verb phrases, and clauses, and then how those parts make up a full sentence, in a hierarchical tree. But what really matters for our purpose of getting word frequencies is that the researchers creating the corpus have done the work of splitting up the words. Now, all we need to do is count them!

Listings of Chinese corpora that are allegedly available can be found here and here. These lists link to the main page for each research project, and there is wide variation in the amount of information the sites provide. Some of them link to an online word or part-of-speech lookup, which uses the corpus on the back end to output a list of matches. However, the corpora themselves are hard to find. Tagging a corpus of a decent size (say, 1 million words) takes a significant amount of labor, so I can’t fault the creators for keeping their work under wraps. Another important reason is probably the tangled mess of copyright law: the corpus creator often doesn’t have authorization to redistribute from all the owners of all the original texts.

The Oxford Text Archive site is a clearinghouse for a number of corpora in many different languages, including three in Chinese. The PH Corpus is a collection of Xinhua News Agency articles from 1990-1991. The Sheffield Corpus of Chinese has source texts from ancient times up through 1911. The Lancaster Corpus of Mandarin Chinese (LCMC) consists of a varied collection of texts, both fiction and non-fiction, mainly from the years 1990-1993. Of these three, the LCMC is the only one that can be directly downloaded without requiring an application form. For that reason, the LCMC is the only corpus I’m familiar with up to this point.

Corpus-based Word Lists

While actual corpora are hard to come by, a few tabulated lists of words with frequencies, generated from actual corpus data, are available on the web.

From the University of Leeds, Some corpus-derived data: “A collection of Chinese corpora” has links to lists generated from the LCMC (listing the top 5000 words) and a home-grown web corpus (the top 50k words); and “Large Corpora used in CTS” links to a list from the Chinese Gigaword corpus (the top 25k words), among lists in many other languages.

Within the software distribution of A Corpus Worker’s Toolkit, there are two files. File ldc.dic has 44,000 unique words with frequencies, from a corpus of 4.9 million words. File wordlist.txt contains a list of 119,000 words, but without frequencies.

Jun Da’s Chinese text computing corpus is a web-derived corpus containing 258 million characters. Character frequencies are available on the site. The corpus isn’t a tagged corpus, so word frequencies cannot be extracted from it. But the site can generate frequency data from bigram analysis (bigrams are pairs of adjacent characters that occur together more often than random chance would predict; they aren’t necessarily actual words, but often are).
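To make the idea of bigrams concrete, here is a minimal Python sketch that counts adjacent character pairs. It is not the method used on Jun Da’s site: it only produces raw counts, with no statistical test for "more often than chance", and the function name, the sample sentence, and the restriction to the basic CJK block (U+4E00 to U+9FFF) are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "Chinese run" is an unbroken stretch of basic CJK characters.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def bigram_counts(text: str) -> Counter:
    """Count adjacent character pairs, never pairing across punctuation."""
    counts = Counter()
    for run in CHINESE_RUN.findall(text):
        counts.update(run[i:i + 2] for i in range(len(run) - 1))
    return counts


# Hypothetical sample text.
sample = "我们学习中文,我们喜欢中文。"
bigrams = bigram_counts(sample)

# Print pairs from most to least frequent.
for pair, count in bigrams.most_common():
    print(pair, count)
```

In this tiny sample the repeated pairs 我们 and 中文 happen to be real words, while one-off pairs such as 们学 are not, which hints at why frequent bigrams are a usable stand-in for words when no segmentation is available.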

Introduction to the LCMC

The LCMC home page is at http://www.lancs.ac.uk/fass/projects/corpus/LCMC/default.htm. The license states: “The LCMC Corpus is distributed free of charge for use in non-profit-making research.” That covers my intentions just fine. The corpus itself can be obtained from the Oxford Text Archive. From the OTA’s main page, the detail page for the LCMC (id #2474) contains the download link. After you enter an email address, an automated message is sent with details on how to download the distribution. The single zip file contains some documentation along with all the corpus text in XML format. The entire corpus is available as both a character version and a pinyin version. The corpus is divided into 15 files, one for each category of text. The LCMC strives to be a “balanced” corpus, meaning that it covers a broad range of text types, both fiction and non-fiction.

Table 1: Categories in the LCMC
id	type
A	Press reportage
B	Press editorial
C	Press review
D	Religion
E	Skills, trades and hobbies
F	Popular lore
G	Biographies and essays
H	Miscellaneous (reports, official documents)
J	Science (academic prose)
K	General fiction
L	Mystery and detective fiction
M	Science fiction
N	Martial art fiction
P	Romantic fiction
R	Humour

The XML data files are not easy to work with by themselves. In a future article, I’ll describe how to parse the data into an SQL database, making analysis much simpler. For now, let me summarize some quick number-crunching:

Total number of characters, including non-Chinese: 1,508,495
Total number of words, including non-Chinese: 1,001,826
Total number of Chinese characters: 1,314,160
Total number of Chinese words: 827,822
Number of unique Chinese characters: 4,722
Number of unique Chinese words: 42,621
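
Word totals of this kind could be produced by a short script along the following lines. This is a sketch under stated assumptions, not the script used for this article: it assumes each segmented word sits in its own w element (with punctuation in separate c elements), and it treats a word as Chinese only if every character falls in the basic CJK block (U+4E00 to U+9FFF). The sample XML is invented for illustration, so check the documentation in the distribution for the real element names before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the element and attribute names are assumptions.
SAMPLE_XML = """<text><file ID="A01"><p><s n="0001">
<w POS="r">我们</w><w POS="v">学习</w><w POS="n">中文</w><c POS="ew">。</c>
</s><s n="0002">
<w POS="r">我们</w><w POS="v">喜欢</w><w POS="nx">DNA</w><c POS="ew">。</c>
</s></p></file></text>"""


def is_chinese_word(word: str) -> bool:
    """Assumption: a Chinese word consists only of basic CJK characters."""
    return bool(word) and all("\u4e00" <= ch <= "\u9fff" for ch in word)


def word_stats(xml_text: str) -> dict:
    """Tally all words, Chinese words, and distinct Chinese words."""
    root = ET.fromstring(xml_text)
    words = [(w.text or "").strip() for w in root.iter("w")]
    chinese = Counter(w for w in words if is_chinese_word(w))
    return {
        "total_words": len(words),
        "chinese_words": sum(chinese.values()),
        "unique_chinese_words": len(chinese),
    }


stats = word_stats(SAMPLE_XML)
print(stats)
```

For the real corpus, the same function could be applied to each of the 15 files in turn and the counters merged, although the exact figures will depend on how "Chinese word" is defined.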

The most fascinating result from this analysis is that only 4,722 distinct characters are in the entire corpus. That’s still a lot if your goal is to learn them all. But it’s quite different from the 10,000 or 40,000 (the Kangxi dictionary has over 47,000, according to Wikipedia), or even 80,000 characters (the Zhonghua Zihai lists 85,000) that people claim you need to know, probably when trying to argue that Chinese is impossible to learn.
