The 3rd Chinese Parsing Evaluation:

Task description and data sets

1. Introduction

The first and second Chinese parsing evaluations (CIPS-ParsEval-2009 [1] and CIPS-SIGHAN-ParsEval-2010 [2]) were held successfully in 2009 and 2010, respectively. The evaluation results at the clause and sentence levels show that complex sentence parsing is still a major challenge for Chinese. This time we will focus on the sentence parsing task proposed by the second CIPS-ParsEval and examine in detail the difficulties of Chinese complex sentence parsing with respect to two typical sources of sentence complexity: event combination at the sentence level and concept composition at the clausal level. We will introduce a new lexicon-based Combinatory Categorial Grammar (CCG) ([3], [4]) annotation scheme into the evaluation, and make a parallel comparison of parser performance with the traditional Phrase Structure Grammar (PSG) used in the Tsinghua Chinese Treebank (TCT).

This evaluation includes two sub-tasks, i.e. PSG parsing evaluation and CCG parsing evaluation. Each sub-task has two tracks. One is the Close track, in which model parameter estimation is conducted solely on the provided training data. The other is the Open track, in which any datasets besides the given training data can be used to estimate model parameters. These two tracks will be evaluated separately.

In addition, we will evaluate the following two kinds of methods separately in the Close track.

1) Single system: parsers that use a single parsing model to finish the parsing task.

2) System combination: participants are allowed to combine multiple models to improve performance. Collaborative decoding methods will be regarded as combination methods.

2. Evaluation tasks

Task 1: CCG Parsing Evaluation

Input: A Chinese sentence with correct word segmentation annotation. Each sentence contains more than two words. The following is an example:

  • 小型(small) 木材(wood) 加工场(factory) (is) (busy) (-modality) 制作(build) (several) (-classifier) 木制品(woodwork) (period) (A small wood factory is busy building several woodworks.)

Parsing goal: Assign appropriate CCG category tags to the words and generate CCG derivation tree for the sentence.

Output: The CCG derivation tree with CCG category tags for the sentence.

  • (S{decl} (S (NP (NP/NP 小型) (NP (NP/NP 木材) (NP 加工场) ) ) (S\NP ([S\NP]/[S\NP] ) (S{Cmb=LW}\NP (S\NP (S\NP ) ([S\NP]\[S\NP] ) ) (S\NP ([S\NP]/NP 制作) (NP (NP/NP ([NP/NP]/M ) (M ) ) (NP 木制品) ) ) ) ) ) (wE ) )

Task 2: PSG Parsing Evaluation

Input: A Chinese sentence with correct word segmentation annotation. Each sentence contains more than two words. The following is an example:

  • 小型(small) 木材(wood) 加工场(factory) (is) (busy) (-modality) 制作(build) (several) (-classifier) 木制品(woodwork) (period) (A small wood factory is busy building several woodworks.)

Parsing goal: Assign appropriate part-of-speech (POS) tags to the words and generate phrase structure tree for the sentence.

Output: The phrase structure tree with POS tags for the sentence.

  • (zj (dj (np (b 小型) (np (n 木材) (n 加工场) ) ) (vp (d ) (vp-LW (ap (a ) (uA ) ) (vp (v 制作) (np (mp (m ) (qN ) ) (n 木制品) ) ) ) ) ) (wE ) )

3. Evaluation metrics

The PSG and CCG parsers both involve two parsing stages. One is the syntactic category assignment stage, covering POS tags and CCG categories. The other is the parse tree generation stage, covering PSG parse trees and CCG derivation trees. Therefore, we design two different sets of metrics for them.

3.1 Syntactic category evaluation metrics

Basic metrics are syntactic category tagging precision (SC_P), recall (SC_R) and F1-score (SC_F1).

  • SC_P= (# of correctly tagged words) / (# of automatically tagged words) * 100%
  • SC_R= (# of correctly tagged words) / (# of gold-standard words) * 100%
  • SC_F1= 2*SC_P*SC_R / (SC_P + SC_R)

The correctly tagged words must have the same syntactic categories as the gold-standard ones.

To obtain detailed evaluation results for different syntactic categories, we can classify all tagged words into different sets and compute different SC_P, SC_R and SC_F1 for them. The classification condition is as follows.

If SC_Token_Ratio >= 10%, the syntactic tag forms its own class labeled with its SC tag; otherwise, the low-frequency SC-tagged words are grouped into a special class labeled Oth_SC. Here, SC_Token_Ratio = (word token # of one specific SC in the test set) / (word token # in the test set) * 100%.
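As an illustration only (not the official scoring tool), the following Python sketch shows one way these metrics and the 10% classification threshold might be computed, assuming gold and system outputs are given as aligned lists of (word, tag) pairs; all function and variable names here are our own.

  from collections import Counter

  def sc_metrics(gold, auto):
      """Syntactic category tagging precision, recall and F1 (in percent)."""
      correct = sum(1 for g, a in zip(gold, auto) if g == a)
      sc_p = correct / len(auto) * 100
      sc_r = correct / len(gold) * 100
      sc_f1 = 2 * sc_p * sc_r / (sc_p + sc_r) if sc_p + sc_r else 0.0
      return sc_p, sc_r, sc_f1

  def tag_classes(gold_tags, threshold=0.10):
      """Map each SC tag to its own class if its token ratio is >= 10%,
      otherwise to the special low-frequency class Oth_SC."""
      counts = Counter(gold_tags)
      total = sum(counts.values())
      return {tag: (tag if n / total >= threshold else "Oth_SC")
              for tag, n in counts.items()}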

3.2 Parsing tree evaluation metrics

Basic metrics are labeled constituent precision (LC_P), recall (LC_R) and F1-score (LC_F1).

  • LC_P = (# of correctly labeled constituents) / (# of automatically parsed constituents) * 100%
  • LC_R = (# of correctly labeled constituents) / (# of gold-standard constituents) * 100%
  • LC_F1 = 2*LC_P*LC_R / (LC_P + LC_R)

The correctly labeled constituents must have the same syntactic tags and the same left and right boundaries as the gold-standard ones.

To obtain detailed evaluation results for different syntactic constituents, we can classify them into 5 sets and compute different LC_P, LC_R and LC_F1 for them.

(1) Complex event constituents

(2) Concept compound constituents

(3) Clausal and phrasal constituents

(4) Single-node constituents

(5) All other constituents

The classification is based on the syntactic constituent and grammatical relation tags annotated in TCT. Please refer to the next section for more detailed information.

We compute the weighted average of the F1-scores of the first four sets (Tot4_F1) to obtain the final ranking scores for the submitted parser systems. The computation formula is as follows: Tot4_F1 = Σ_{i ∈ [1,4]} LC_F1_i * LC_Ratio_i.

LC_Ratio_i is the distributional ratio of the ith constituent set in the test set. Its computation formula is: LC_Ratio_i = (# of constituents in the ith set) / (# of all constituents) * 100%.

For comparative analysis, we also compute the weighted average of the F1-scores of all five sets as a ranking reference.
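As a non-authoritative sketch of how these scores could be computed, assume each parse tree has been reduced to a set of labeled constituents (label, start, end); the representation and names below are our own, and LC_Ratio is used as a fraction rather than a percentage.

  def lc_metrics(gold_spans, auto_spans):
      """Labeled constituent precision, recall and F1 (in percent)."""
      correct = len(gold_spans & auto_spans)
      lc_p = correct / len(auto_spans) * 100
      lc_r = correct / len(gold_spans) * 100
      lc_f1 = 2 * lc_p * lc_r / (lc_p + lc_r) if lc_p + lc_r else 0.0
      return lc_p, lc_r, lc_f1

  def tot4_f1(set_f1, set_counts):
      """Weighted average of the F1-scores of the first four constituent sets,
      each weighted by its share of all constituents in the test set."""
      total = sum(set_counts)          # constituents of all five sets
      return sum(f1 * set_counts[i] / total for i, f1 in enumerate(set_f1[:4]))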

To estimate the possible performance upper bound of the automatic parsers, we also design the following complementary metrics:

(1) Unlabeled constituent precision (ULC_P)= (# of constituents with correct boundaries) / (# of automatically parsed constituents) * 100%

(2) Unlabeled constituent recall (ULC_R)= (# of constituents with correct boundaries) / (# of gold standard constituents) * 100%

(3) Unlabeled constituent F1-score (ULC_F1)= 2*ULC_P*ULC_R / (ULC_P + ULC_R)

(4) Non-crossed constituent precision (NoCross_P)= (# of constituents non-crossed with the gold standard constituents) / (# of automatically parsed constituents) * 100%
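For illustration, the boundary-only metrics might be computed as in the following sketch, again assuming constituents are (label, start, end) tuples with our own function names; two spans cross when each contains exactly one endpoint of the other.

  def ulc_metrics(gold_spans, auto_spans):
      """Unlabeled constituent precision, recall and F1 (in percent)."""
      gold_b = [(s, e) for _, s, e in gold_spans]
      auto_b = [(s, e) for _, s, e in auto_spans]
      correct = sum(1 for b in auto_b if b in gold_b)
      ulc_p = correct / len(auto_b) * 100
      ulc_r = correct / len(gold_b) * 100
      ulc_f1 = 2 * ulc_p * ulc_r / (ulc_p + ulc_r) if ulc_p + ulc_r else 0.0
      return ulc_p, ulc_r, ulc_f1

  def nocross_p(gold_spans, auto_spans):
      """Share of parsed constituents that do not cross any gold constituent."""
      gold_b = [(s, e) for _, s, e in gold_spans]
      def crosses(a, g):
          return a[0] < g[0] < a[1] < g[1] or g[0] < a[0] < g[1] < a[1]
      ok = sum(1 for _, s, e in auto_spans
               if not any(crosses((s, e), g) for g in gold_b))
      return ok / len(auto_spans) * 100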

4. Evaluation data

All the news and academic articles annotated in TCT version 1.0 [6] are selected as the basic training data for the evaluation; they contain about 480,000 Chinese words. 1000 sentences extracted from the TCT-2010 version are used as the basic test data. Based on these, the final training and test data sets are built through the following automatic transformation procedures.

First, we binarize all TCT annotation trees and obtain a new binarized TCT version. Two new grammatical relation tags, RT and LT, are added to describe the dummy nodes inserted for left and right punctuation combination structures. These trees provide the basic parse tree structures for the PSG and CCG parsing evaluations.
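For illustration only (this is not the actual transformation tool), a simple right-branching binarization over trees represented as (label, children) tuples might look like the sketch below; it copies the parent label onto the inserted dummy nodes and omits the RT/LT tag assignment for punctuation combination structures.

  def binarize(tree):
      """Right-binarize an n-ary constituent tree given as (label, children)."""
      label, children = tree
      children = [c if isinstance(c, str) else binarize(c) for c in children]
      while len(children) > 2:
          # merge the two rightmost children under an inserted dummy node
          dummy = (label + "*", children[-2:])
          children = children[:-2] + [dummy]
      return (label, children)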

Second, we classify all TCT constituents into 5 sets, according to the syntactic constituent (SynC) and grammatical relation (GR) tags annotated in TCT.

1. Complex event constituents, if one of the following conditions is matched.

a) TCT SynC tag=fj and TCT GR tag ∈ {BL, LG, DJ, YG, MD, TJ, JS, ZE, JZ, LS}

b) TCT SynC tag=jq

2. Concept compound constituents, if both of the following conditions are matched

a) TCT GR tag ∈ {LH, LW, SX, CD, FZ, BC, SB}

b) TCT SynC tag ∈ {np, vp, ap, bp, dp, mp, sp, tp, pp}

3. Clausal and phrasal constituents, if both of the following conditions are matched

a) TCT GR tag ∈ {ZW, PO, DZ, ZZ, JY, FW, JB, AD}

b) TCT SynC tag ∈ {dj, np, sp, tp, mp, vp, ap, dp, pp, mbar, bp}

4. Single-node constituents, if TCT SynC tag=dlc

5. All other constituents

These classes provide the basic information for computing the detailed parsing tree evaluation metrics.
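The five-way classification above can be expressed compactly; the following sketch (our own code, with tag sets copied from the conditions above) maps a constituent's SynC and GR tags to its class number.

  EVENT_GR = {"BL", "LG", "DJ", "YG", "MD", "TJ", "JS", "ZE", "JZ", "LS"}
  COMPOUND_GR = {"LH", "LW", "SX", "CD", "FZ", "BC", "SB"}
  COMPOUND_SYNC = {"np", "vp", "ap", "bp", "dp", "mp", "sp", "tp", "pp"}
  PHRASE_GR = {"ZW", "PO", "DZ", "ZZ", "JY", "FW", "JB", "AD"}
  PHRASE_SYNC = {"dj", "np", "sp", "tp", "mp", "vp", "ap", "dp", "pp", "mbar", "bp"}

  def constituent_class(sync, gr):
      """Return the constituent set (1-5) for a TCT SynC tag and GR tag."""
      if (sync == "fj" and gr in EVENT_GR) or sync == "jq":
          return 1   # complex event constituents
      if gr in COMPOUND_GR and sync in COMPOUND_SYNC:
          return 2   # concept compound constituents
      if gr in PHRASE_GR and sync in PHRASE_SYNC:
          return 3   # clausal and phrasal constituents
      if sync == "dlc":
          return 4   # single-node constituents
      return 5       # all other constituents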

Finally, we build the evaluation data sets for two parsing tasks through the following approaches:

1. For PSG parsing evaluation, we automatically transform the TCT annotation data as follows:

a) For the syntactic constituents belonging to classes 1, 2 and 5 above, we retain both original TCT tags;

b) For the syntactic constituents belonging to classes 3 and 4 above, we retain only the original TCT SynC tags.

2. For CCG parsing evaluation, we automatically transform the TCT annotation data into CCG format by using the TCT2CCG tool [7].

To evaluate the effect of training corpus size on parser performance, we divide all training data into N parts. In the nth training round (n ∈ [1, N]), the first n parts of the annotated corpus are used for training, yielding N different parsing models. Based on them, N different test results can be obtained on the same test data set. Therefore, variation trend diagrams of the different evaluation metrics over different training corpus sizes can be built. In this evaluation, we set N = 10.
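A minimal sketch of this setup, assuming the training corpus is a list of annotated sentences (names and splitting details are our own; the official partition may differ):

  def training_rounds(sentences, n_parts=10):
      """Yield (n, cumulative training data) for n = 1..N, where round n
      uses the first n of the N equal-sized parts (10%, 20%, ..., 100%)."""
      size = (len(sentences) + n_parts - 1) // n_parts
      parts = [sentences[i * size:(i + 1) * size] for i in range(n_parts)]
      for n in range(1, n_parts + 1):
          yield n, [s for part in parts[:n] for s in part]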

5. Evaluation procedure

Every participant in the Close track will be provided with 10 different training data sets. They are asked to submit 10 different test results produced by 10 different parsing models trained on these 10 training data sets.

All participants should name each submitted result file according to the following naming format: <Participant ID>-<Task name>-<Training mode>-<Parser model>-<System name>-<Training data scale>.CPT, where:

(1) <Participant ID> represents the participant ID obtained in on-line registration.

(2) <Task name> represents the name of the participating parsing task: PSG or CCG.

(3) <Training mode> represents the different training tracks: Close or Open.

(4) <Parser model> represents the different parsing model building methods used by the participants: Single or Multiple (system combination).

(5) <System name> represents the abbreviated name given by participants to their system, with no more than 5 characters.

(6) <Training data scale> represents the number of training data parts used in the training procedure: [1-10].

Here is a naming example: 01-PSG-Closed-Single-CCP-1.CPT. It provides the following information: the participant with ID 01 takes part in the Close track of the PSG parsing evaluation task, uses a single parsing model to train the parser, names the parser system CCP, and uses 10% of the training data to obtain the test results. All this information provides good support for the final data analysis and summarization.
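As an illustrative helper (not an official checker), the sketch below composes and validates file names in this format; the regular expression and helper names are our own, and it accepts both "Close" and "Closed" for the training mode since both spellings appear above.

  import re

  NAME_RE = re.compile(
      r"^(?P<pid>\d{2})-(?P<task>PSG|CCG)-(?P<mode>Closed?|Open)-"
      r"(?P<model>Single|Multiple)-(?P<system>\w{1,5})-(?P<scale>10|[1-9])\.CPT$")

  def make_name(pid, task, mode, model, system, scale):
      return f"{pid}-{task}-{mode}-{model}-{system}-{scale}.CPT"

  def check_name(filename):
      """Return the parsed fields as a dict, or None if the name is invalid."""
      m = NAME_RE.match(filename)
      return m.groupdict() if m else None

  # e.g. check_name("01-PSG-Closed-Single-CCP-1.CPT") yields the fields of the example above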

References

[1]. Qiang Zhou, Yuemei Li. Evaluation report of CIPS-ParsEval-2009. In Proc. of the First Workshop on Chinese Syntactic Parsing Evaluation, Beijing, China, Nov. 2009, pp. III-XIII. (2009)

[2]. Qiang Zhou, Jingbo Zhu. Chinese Syntactic Parsing Evaluation. Proc. of CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2010), Beijing, August 2010, pp 286-295. (2010)

[3]. Steedman, Mark. Surface Structure and Interpretation. MIT Press, Cambridge, MA. (1996).

[4]. Steedman, Mark. The Syntactic Process. MIT Press, Cambridge, MA. (2000)

[5]. Clark, S., Copestake, A., Curran, J.R., Zhang, Y., Herbelot, A., Haggerty, J., Ahn, B.G., Wyk, C.V., Roesner, J., Kummerfeld, J., Dawborn, T.: Large-scale syntactic processing: Parsing the web. Final Report of the 2009 JHU CLSP Workshop (Oct 2009)

[6]. Qiang Zhou. Chinese Treebank Annotation Scheme. Journal of Chinese Information, 18(4), p1-8. (2004)

[7]. Qiang Zhou. Automatically transform the TCT data into a CCG bank: designation specification Ver 3.0. Technical Report CSLT-20110512, Center for speech and language technology, Research Institute of Information Technology, Tsinghua University. (2011).