Chinese Word Segmentation on MicroBlog Corpora

Organizer: Institute of Computational Linguistics, Peking University

1. Preface

After years of intensive research, Chinese word segmentation has achieved quite high precision. However, segmentation performance on MicroBlog corpora remains unsatisfactory. This CIPS-SIGHAN-2012 bake-off task focuses on the performance of Chinese word segmentation algorithms on MicroBlog corpora.

2. Task Descriptions

This evaluation involves the following task:

Open evaluation on the simplified Chinese word segmentation task. This task provides no training set, and participants are free to use data learned or models trained from any resources.

3. Evaluation Metrics

The metrics used in this bake-off task are:

Precision = (Number of words correctly segmented)/(Number of words segmented) * 100%

Recall = (Number of words correctly segmented)/(Number of words in the reference) * 100%

F measure = 2*P*R / (P+R), where P is Precision and R is Recall
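
The three metrics above can be illustrated with a minimal sketch. It is not the official scoring tool of this bake-off. It assumes that a word counts as "correctly segmented" when its character span matches a span in the reference, and that the reference and the system output cover the same character sequence. The function names `to_spans` and `score` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def score(reference, segmented):
    """Return (precision, recall, f_measure); P and R are percentages."""
    ref_spans = to_spans(reference)
    seg_spans = to_spans(segmented)
    # Assumption: a word is correct if the same span occurs in the reference.
    correct = len(ref_spans & seg_spans)
    precision = correct / len(seg_spans) * 100 if seg_spans else 0.0
    recall = correct / len(ref_spans) * 100 if ref_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    reference = ["美女", "主持", "热场"]
    segmented = ["美女", "主", "持", "热场"]
    p, r, f = score(reference, segmented)
    print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this toy case the system output contains 4 words, 2 of which match the 3 reference words, so the script prints P=50.00 R=66.67 F=57.14.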

4. Data

Only a small amount of segmented data is provided, as a format reference for segmentation systems; it consists of the original data and the corresponding segmented data. The segmentation standard follows the Specification for Corpus Processing at Peking University: http://www.icl.pku.edu.cn/icl_groups/corpus/coprus-annotation.htm

5. Test Corpus

The test corpus consists of approximately 5,000 texts from MicroBlog.

6. Result Submission

The system output file should be named:

Result-#ID.txt, where #ID is the abbreviation of the name of the participating site.

Each participating site should also submit a system description file, which should be named:

Description-#ID.txt

The system description should include the following information:

Hardware and software environment: operating system and its version, CPU type and frequency, memory size, etc.

Execution Time: the time from accepting the input to generating the output.

Technology outline: an outline of the main technology and parameters of the participating system.

Training Data: since this is an open test, all extra training data used should be described here.

7. Data Format

Input data format: the input data is an unsegmented plain-text file, as follows:

【拍客】最给力的美女主持亲妮动作热场引观众爆掌声-芝麻拍客 http://t.cn/aEZfpo

Output data format: the output data should be the test file after word segmentation, with a line break inserted between words, as follows:

拍客

给力

美女

主持

亲妮

动作

热场

观众

掌声

-

芝麻拍客

http://t.cn/aEZfpo

8. Encoding

Both the input and output files are UTF-8 encoded.
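
As an illustration of the naming, format, and encoding requirements in Sections 6 to 8, the following minimal sketch writes a list of segmented words, one per line, to a UTF-8 encoded result file. The function names `write_result` and `read_result` and the site abbreviation `DEMO` are hypothetical. How separate MicroBlog texts are delimited in the result file is not specified here, so the sketch handles a single text only.

```python
import os
import tempfile


def write_result(words, site_id, out_dir):
    """Write one word per line to Result-<site_id>.txt, encoded as UTF-8."""
    path = os.path.join(out_dir, f"Result-{site_id}.txt")
    # newline="\n" keeps line breaks identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")
    return path


def read_result(path):
    """Read a result file back into a list of words, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


if __name__ == "__main__":
    words = ["拍客", "给力", "美女", "http://t.cn/aEZfpo"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        result_path = write_result(words, "DEMO", tmp_dir)
        print(os.path.basename(result_path))
        print(read_result(result_path) == words)
```

Opening the file with an explicit `encoding="utf-8"` avoids depending on the platform default encoding, which may differ between systems.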

9. Evaluation Technical Report

Each participant should submit an evaluation technical report to CIPS-SIGHAN-CLP2012 (http://www.cipsc.org.cn/clp2012).

Important Dates

2012-05-15 Registration opens

2012-07-01 Distribution of simplified Chinese training data (500 MicroBlog texts)

2012-08-01 Submission format verification

2012-09-27 Distribution of test data

2012-09-30 Evaluation ends

2012-10-20 Evaluation results release

Contact Information

For any questions about this bake-off task, please contact:

Huiming Duan, Zhifang Sui

Institute of Computational Linguistics, Department of Electronic Engineering and Computer Science, Peking University. 5 Yiheyuan Rd, Haidian District, Beijing, 100871, China

Email: duenhm@water.pku.edu.cn, szf@pku.edu.cn