Chinese Word Segmentation on MicroBlog Corpora
Organizer: Institute of Computational Linguistics, Peking University
After years of intensive research, Chinese word segmentation has reached quite high precision. However, segmentation performance on MicroBlog corpora is still unsatisfactory. This CIPS-SIGHAN-2012 bake-off task focuses on the performance of Chinese word segmentation algorithms on MicroBlog corpora.
2. Task Descriptions
This evaluation involves the following task:
Open evaluation on the simplified Chinese word segmentation task. This task provides no training set, and participants are free to use data and models derived from any resources.
3. Evaluation Metrics
The metrics used in this bake-off task are:
Precision = (Number of words correctly segmented)/(Number of words segmented) * 100%
Recall = (Number of words correctly segmented)/(Number of words in the reference) * 100%
F measure = 2*P*R / (P+R), where P is Precision and R is Recall
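The three metrics above can be sketched in a few lines of Python. This is only an illustration, not the official scoring script: it assumes that a word counts as "correctly segmented" when its start and end character offsets match a word in the reference, and the names `word_spans` and `prf` are hypothetical. Scores are returned as fractions between 0 and 1 rather than percentages.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character offsets they occupy."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def prf(reference, system):
    """Return (precision, recall, f_measure) for one text.

    Assumption: a system word is correct when its character span
    also appears in the reference segmentation.
    """
    ref_spans = word_spans(reference)
    sys_spans = word_spans(system)
    correct = len(ref_spans & sys_spans)
    precision = correct / len(sys_spans) if sys_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: the system wrongly splits the second word in two.
reference = ["我", "喜欢", "微博"]
system = ["我", "喜", "欢", "微博"]
p, r, f = prf(reference, system)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")
```

In this example two of the four system words match the three reference words, so precision is 2/4 and recall is 2/3.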
Only a small amount of segmented data is provided, as a reference for the output format of segmentation systems; it consists of the original data and the corresponding segmented data. The segmentation standard follows the “Specification for Corpus Processing at Peking University”: http://www.icl.pku.edu.cn/icl_groups/corpus/coprus-annotation.htm
5. Test Corpus
The test corpus consists of approximately 5,000 texts from MicroBlog.
6. Results Submission
The result file produced by the system should be named:
Result-#ID.txt, where #ID is the abbreviation of the name of the participating site.
The participating site should also submit a system description file, which should be named as:
The system description should include the following information:
• The hardware and software environments, including the operating system and its version, CPU type and frequency, memory size, etc.
• Execution Time: the time from accepting the input to generating the output.
• Technology outline: an outline of the main technology and parameters of the participating system.
• Training Data: For the open training tests, all the extra training data should be described here.
7. Data Format
Input data format: the input data is an unsegmented plain text file, as follows:
Output Data Format
The output data should be the test file after word segmentation, where line breaks are inserted between words as follows:
Both the input and output files are UTF-8 encoded.
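Before submitting, a participant might want to sanity-check a result file against these format requirements. The sketch below is not an official verification tool; the function names are hypothetical. It checks only two things: that both files decode as UTF-8, and that the output contains exactly the same non-whitespace characters as the input, in the same order, so that segmentation has only inserted separators.

```python
def strip_whitespace(text):
    """Remove all whitespace, including line breaks and full-width spaces."""
    return "".join(text.split())


def check_submission(input_bytes, output_bytes):
    """Return a list of problems found; an empty list means both checks passed."""
    try:
        source = input_bytes.decode("utf-8")
        result = output_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]
    problems = []
    if strip_whitespace(source) != strip_whitespace(result):
        problems.append("output does not contain the same characters as the input")
    return problems


# Made-up example data; real use would read the raw bytes of both files.
raw_input = "我喜欢微博".encode("utf-8")
raw_output = "我\n喜欢\n微博\n".encode("utf-8")
print(check_submission(raw_input, raw_output))
```

Comparing the texts with whitespace removed keeps the check independent of the exact separator, at the cost of ignoring any whitespace that was part of the original text.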
9. Evaluation Technical Report
Each participant should submit an evaluation technical report to CIPS-SIGHAN-CLP2012 (http://www.cipsc.org.cn/clp2012).
• 2012-05-15 Registration opens
• 2012-07-01 Distribution of simplified training data of 500 MicroBlog texts
• 2012-08-01 Submission format verification
• 2012-09-27 Distribution of test data
• 2012-09-30 Evaluation ends
• 2012-10-20 Evaluation results release
For any questions about this bake-off task, please contact:
Huiming Duan, Zhifang Sui
Institute of Computational Linguistics, Department of Electronic Engineering and Computer Science, Peking University. 5 Yiheyuan Rd, Haidian District, Beijing, 100871, China