Chinese Word Extractor

Introduction

Chinese Word Extractor is a program that splits Chinese text into individual words and summarizes information about each unique word. The results are presented as a tab-delimited table, so they can be copied and pasted directly into a spreadsheet program such as Excel.

Screenshot: output after analysis

The program can be extended in three ways: dictionaries, extra columns, and filtered words. Dictionaries can be changed by adding files to the appropriate directories. The distribution includes a copy of CC-CEDICT, but other dictionaries can be used in its place or alongside it.
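
For readers preparing or checking a dictionary file, the following minimal sketch parses one line in the standard CC-CEDICT text layout (Traditional, Simplified, bracketed pinyin, slash-separated definitions). It is only an illustration: the function name parse_cedict_line is hypothetical, and how Chinese Word Extractor itself loads dictionary files is not described here.

```python
import re

# CC-CEDICT entry layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Hypothetical helper: parse one CC-CEDICT line.

    Returns a dict for an entry line, or None for comment lines,
    blank lines, and lines that do not match the entry layout.
    """
    if not line.strip() or line.startswith("#"):
        return None
    match = ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": glosses.split("/"),
    }


entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/")
print(entry["simplified"], entry["pinyin"], entry["definitions"])
```

Returning None for comments and malformed lines lets a caller skip them without special handling.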

The word summary produced by text analysis can be extended by adding extra word data files, whose contents are incorporated into the output as extra columns.
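
The idea of an extra column can be sketched as a lookup keyed by word. The example below assumes, purely for illustration, that a word data file holds one "word, tab, value" pair per line and that the word is the first column of each output row; the actual file format is covered in the Help document, and the names load_word_data and add_column are hypothetical.

```python
def load_word_data(lines):
    """Hypothetical loader: build a word -> value map from 'word<TAB>value' lines."""
    data = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word, value = line.split("\t", 1)
        data[word] = value
    return data


def add_column(rows, word_data, default=""):
    """Append one column to each tab-delimited row, looked up by the word in column 1."""
    result = []
    for row in rows:
        word = row.split("\t", 1)[0]
        result.append(row + "\t" + word_data.get(word, default))
    return result


# Assumed sample data, not taken from the program's distribution.
levels = load_word_data(["学习\tHSK1\n", "中文\tHSK1\n"])
rows = ["学习\t3", "中文\t2", "喜欢\t1"]
print("\n".join(add_column(rows, levels)))
```

Words missing from the data file receive an empty cell, so every row keeps the same number of columns when pasted into a spreadsheet.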

To filter words out of the output (for example, to eliminate words already learned), add word lists; any word that matches a list entry is dropped from the output.
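
As a rough sketch of this kind of filtering, the example below assumes a word list with one word per line and output rows whose first column is the word. Both are assumptions made for illustration, and the names load_word_list and filter_rows are hypothetical rather than part of the program.

```python
def load_word_list(lines):
    """Hypothetical loader: collect words from lines, one word per line, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def filter_rows(rows, known_words):
    """Keep only the tab-delimited rows whose first column is not in known_words."""
    return [row for row in rows if row.split("\t", 1)[0] not in known_words]


# Assumed sample data: two words already learned.
known = load_word_list(["我\n", "中文\n", "\n"])
rows = ["我\t2", "喜欢\t1", "中文\t2", "学习\t1"]
print("\n".join(filter_rows(rows, known)))
```

Holding the list in a set keeps each membership check fast, even for lists with many thousands of words.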

Download

Windows

Current version: Chinese_Word_Extractor_0_3_2-win32.zip (2013-01-20)

Linux

On Linux, the program can be executed as a Python 2.7 script. The instructions from this comment have been reported to work on Ubuntu 11. A similar setup may work for other flavors of Linux.

Help

See the Help document for more details.

Source Code

The source code for Chinese Word Extractor is also available via GitHub, at https://github.com/cer28/ChineseWordExtractor