My online vocabulary extraction tool has been available on the web for a while now. I have gotten a lot of use out of it myself, since my main goal has been to build vocabulary and improve my reading ability. The application generally works OK, but it suffers from some technical issues. Because it loads the entire CC-CEDICT dictionary every time it runs, it puts a heavy load on the shared hosting provider, to the point where the script crashes unpredictably, especially on larger texts. It also requires manual intervention to keep the dictionary up to date, and adding more dictionaries takes a lot of additional effort.

Meanwhile, for the past year I’ve been working on a similar program that can be used offline. It has been working well, is a little faster, and makes it easier to drop in newer versions of the CC-CEDICT dictionary. I have spent a few months adding a little more polish to it, and am now releasing it as open source software. At this point, it is available for Windows systems. The source code is also available, which should allow it to be used on nearly any system. More details are at the project page and the documentation page. Here are some screenshots to demonstrate its functionality:


Chinese Word Extractor - Preferences

Dictionaries can be changed within the program, for example, to switch from a word list to a character usage list. Filtered word lists remove the words they contain from the results. “Extra Column” files hold additional data about words that can be included in the results. This software can also switch between simplified and traditional characters, a feature the web application is missing.
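To make the filter and “Extra Column” ideas concrete, here is a rough Python sketch. It is not taken from the program's source; the function name, the in-memory data structures, and the sample tag values are all made up for illustration.

```python
def apply_filter_and_extra(words, filtered, extra):
    """Drop filtered words and attach an extra-column value to the rest.

    words    -- extracted words, in the order they were found
    filtered -- set of words to leave out of the results (hypothetical)
    extra    -- dict mapping a word to its extra-column value (hypothetical)
    """
    rows = []
    for word in words:
        if word in filtered:
            continue  # the word is on the filter list, so it is skipped
        # Words with no extra data get an empty cell.
        rows.append((word, extra.get(word, "")))
    return rows


# Made-up sample data, only to show the shape of the result.
words = ["我", "喜欢", "学习", "中文"]
filtered = {"我"}
extra = {"学习": "tag-a", "中文": "tag-b"}

rows = apply_filter_and_extra(words, filtered, extra)
for word, tag in rows:
    print(word, tag)
```

In this sketch a filter list behaves like a list of words you already know, and the extra column is just a lookup keyed on the word.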

Chinese Word Extractor - Directories for user-added material

Additional dictionaries, filters, and word data can be dropped into the appropriate directory.
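For anyone curious what a CC-CEDICT entry looks like before dropping a newer file in, each entry line has the form `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`, and lines starting with `#` are comments. The Python below is a minimal sketch of reading one such line. The function name and the returned dictionary keys are my own choices here, and whether user-added dictionaries must follow exactly this format is an assumption; check the documentation page.

```python
import re

# One CC-CEDICT entry: traditional, simplified, [pinyin], /gloss/gloss/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_cedict_line(line):
    """Return a dict for one entry line, or None for comments and non-entries."""
    if line.startswith("#"):
        return None  # comment or header line
    match = ENTRY.match(line)
    if match is None:
        return None  # blank or malformed line
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "glosses": glosses.split("/"),
    }


entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/")
print(entry["simplified"], entry["glosses"])
```

Because every entry carries both forms, a lookup table can be keyed on either the simplified or the traditional column.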

Chinese Word Extractor - Results pane

Texts can be opened from a file dialog, or can be typed or pasted into an editor window. After analyzing the text, the results will be displayed as a tab-separated table, suitable for copying and pasting into spreadsheet programs.
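As an illustration of why tab-separated output pastes cleanly into a spreadsheet, here is a small Python sketch that builds such a table with the standard `csv` module. The column names and the sample row are hypothetical and may not match what the program actually displays.

```python
import csv
import io


def to_tsv(rows, header):
    """Render a header and rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical columns and a made-up sample row.
header = ("Word", "Pinyin", "Definition", "Count")
rows = [("学习", "xue2 xi2", "to learn; to study", 3)]

table = to_tsv(rows, header)
print(table, end="")
```

Each tab becomes a column boundary and each line becomes a row when the text is pasted into a spreadsheet.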

Enjoy!