There has been a scarcity of posts on the blog lately, as I’ve been working on a web application for the site. This is a page anyone can use to estimate their knowledge of Chinese words. The start page for the test is here.

Wordtest - screenshot of test page

The look and feel of the test is similar to a typical flashcard program, with buttons to show the pinyin and English for a presented word, and buttons for marking the word known or unknown. The main difference is that unknown words are not repeated. For the test, 165 sample words are selected at random from the top 36,000 words in the Lancaster Corpus of Mandarin Chinese (LCMC), and pinyin and English definitions from CC-CEDICT are added. After submitting the results (and answering a few brief questions), you can see the detailed score, which shows how the estimated known word count is calculated. The samples are split into 11 segments by frequency (15 words per segment), and the known and unknown word counts are extrapolated from the samples in each segment.
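As a rough illustration of that extrapolation, here is a minimal Python sketch. The function name and the segment sizes are made up for the example; the real segment boundaries used by the test are not listed in this post, so treat the numbers as placeholders rather than the actual setup.

```python
def estimate_known_words(segments):
    """Extrapolate a known-word count from per-segment samples.

    Each segment is a tuple: (segment_size, sampled, known_in_sample).
    """
    total = 0.0
    for segment_size, sampled, known_in_sample in segments:
        # Scale the known fraction of the sample up to the whole segment.
        total += segment_size * known_in_sample / sampled
    return total


# Hypothetical data: three frequency segments, 15 sampled words each.
example_segments = [
    (1000, 15, 15),  # most frequent words, all 15 samples known
    (2000, 15, 9),
    (3000, 15, 3),   # rarer words, few samples known
]
print(estimate_known_words(example_segments))
```

Each segment contributes its own scaled count, so a few unknown words in a large low-frequency segment move the total more than the same number of misses in a small high-frequency one.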

Wordtest - results page

For the next few months, I will be collecting results (anonymously, except for a random token to enable repeat testing), which I plan to use in a research project I’ve been working on. My interest is in word knowledge over a wide range of frequencies, and how that varies between individuals. I’m not affiliated with a university, so this whole project is just a hobby. Nevertheless, I hope to end up with some published results.

FAQ

Where are the traditional characters?

The best frequency list in traditional characters that I’ve managed to find is at Taiwan’s Ministry of Education site. According to their licensing application, “本資料檔案僅係授權使用,而非販售賣斷” (File archives are authorized for use only, and not sold outright). That’s a rather brief statement compared to a typical English-language EULA, so it’s not clear whether I’d be able to use the results of trials that used their data. I will submit an application to them using their form, and see if it gets approved. In the meantime, the code for handling traditional characters is ready, so if I find another decent frequency list, I can use that one instead.

Why are words like “苏联” (Soviet Union) ranked as frequent words?

The LCMC was compiled around the years 1990-1993, and most of its non-fiction texts are from news reports around 1991, so words that were topical at the time show up as frequent. In compiling my word frequency list from the LCMC data, I excluded texts from categories D (Religion) and H (reports, official documents), to lessen the chance of these anomalies.

The definition it gave me isn’t right. Where can I submit a change?

All definitions are from CC-CEDICT. To submit a request for a change, look up the word in the MDBG dictionary, then click on the “correct this entry” icon for the found word.

I just want to take the test. Do I need to answer the survey questions?

No, you can leave them all blank and just take the test for fun. At some point after the research period is over, the survey questions will be removed.

How is standard deviation calculated?

theor. variance(N; K; S) = K(N-K)(N-S)/(N-1)/S
standard deviation = square root of theor. variance

N = number of items in the range
K = number of known words (the true number, not the estimate from the trial)
S = number of samples (15 in this trial)

It’s an equation I derived myself from the combinatorics, but it’s possible that this equation, or something similar, has been derived before by someone else. Note that in this situation, the actual value of K is unknown to us, and using the value obtained by extrapolating the samples is a rough approximation. This probably violates a number of statistical laws. In other words, this is a rough estimate!
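For anyone who wants to plug in numbers, the formula can be sketched in Python as below. The function names and the example values are only illustrative, and substituting the extrapolated K for the true K is the same rough approximation described above. For what it's worth, the expression has the same form as the variance of a hypergeometric count scaled up by N/S.

```python
import math


def theoretical_variance(n, k, s):
    """Variance of the extrapolated known-word count for one range.

    n: number of items in the range
    k: number of known words in the range
    s: number of samples drawn from the range
    """
    return k * (n - k) * (n - s) / (n - 1) / s


def standard_deviation(n, k, s):
    # The standard deviation is the square root of the variance.
    return math.sqrt(theoretical_variance(n, k, s))


# Hypothetical range of 1000 words, with 9 of 15 sampled words marked known.
n, s, known_in_sample = 1000, 15, 9
k_estimate = n * known_in_sample / s  # plug-in estimate of K, not the true K
print(k_estimate, standard_deviation(n, k_estimate, s))
```

The variance drops to zero when every word in the range is sampled (S = N) or when K is 0 or N, which matches the intuition that there is nothing left to guess in those cases.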

I get a blank page, or an error

Please use the Contact form to report any problems. This is a new application that hasn’t been heavily tested against a large number of browsers. To maximize the performance of the application, it relies heavily on JavaScript to run the test, avoiding any network communication until the trial is finished. Any issues are likely due to an error in the JavaScript or a bug in the web page coding.

Why would I put my email address in the survey?

This field is entirely optional. There are two uses for that field. One is to send out a one-time announcement at the conclusion of the project. The other is as a way to follow up, so that if I get data that throws my whole hypothesis into question, I have some chance at finding out why. If you’re at all worried about giving it out, definitely leave it blank.