
Chinese Word Sense Induction

Organizers: Sun, Le      Dong, Qiang      Zhang, Zhenzhong

1. Motivation

The use of word senses instead of word forms has been shown to improve performance in information retrieval [Uzuner et al., 1999], information extraction [Chai and Biermann, 1999] and machine translation [Vickrey et al., 2005]. Word Sense Disambiguation (WSD) generally requires large-scale, manually annotated lexical resources. Word Sense Induction (WSI), which derives senses from unannotated text, can overcome this limitation, and it has become one of the most important topics in current computational linguistics research.

Compared with European languages such as English, WSI and WSD have received little study for Chinese. In addition, Chinese word senses have their own characteristics, so methods that work well for English may not work well for Chinese. This task is intended to promote the exchange of ideas among participants and to improve the performance of Chinese WSI systems.

2. Task Description

The input corpus: For each target word, participants will be given the total number of senses for this word and a corpus of word instances.

The data format: instance-ID                instance

The output: For a given target word, cluster its instances into classes, with each class representing a different sense.

The data format: target-word-ID  target-word-instance-ID  class1[/weight]  class2[/weight]  …
The default weight is 1.

Examples:

Suppose that the target word “蓝” (“blue”) has two different senses in Chinese. A unique symbol will be used to represent each of the two senses (e.g., C0 and C1). The target word ID is the word itself. For example, the word ID for “蓝” is “蓝”.

The input corpus looks like this:

0001     “蓝”袜子
0002     “蓝”先生
0003     “蓝”天
0004     “蓝”光的波长为470-475nm
…

The output looks like this:

蓝    0001    c0
蓝    0002    c1
蓝    0003    c0/0.8    c1/0.2
蓝    0004    c0
…

The first line of the output means that the system assigns the sense ID “C0” to the target word “蓝” for Instance 0001 with the default weight of 1. The third line means that the system assigns the sense IDs “C0” and “C1” to the target word “蓝” for Instance 0003 with weights of 0.8 and 0.2 respectively.
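The output format can be read back with a few lines of code. The following is a minimal sketch, assuming that fields are separated by whitespace (the task description does not name the delimiter); the function name `parse_answer_line` is hypothetical and not part of any official tool.

```python
def parse_answer_line(line):
    """Parse 'word instance-id class[/weight] ...' into (word, instance_id, weights).

    Assumption: fields are separated by spaces or tabs.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected word, instance ID and at least one class: {line!r}")
    word, instance_id, *labels = fields
    weights = {}
    for label in labels:
        sense, sep, weight = label.partition("/")
        # A class given without "/weight" gets the default weight of 1.
        weights[sense] = float(weight) if sep else 1.0
    return word, instance_id, weights


if __name__ == "__main__":
    print(parse_answer_line("蓝 0001 c0"))
    print(parse_answer_line("蓝 0003 c0/0.8 c1/0.2"))
```

Sense labels are kept exactly as written in the line, so “c0” and “C0” would be treated as different labels by this sketch.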

Evaluation:

The Word Sense Induction task will be evaluated as a hard clustering task, meaning that each instance of a target word can belong to only one class. If a system assigns more than one sense to the target word, we assume that it belongs to the sense with the highest weight. For example, in the third line of the example above, the word “蓝” for Instance 0003 belongs to sense “C0”, because it has the highest weight (0.8). In case of ties, we assume that the system is proposing a new sense that is a combination of the tied senses. For example, if the system output is “蓝 0005  c0/0.5  c1/0.5”, we assume that the system proposes a new sense C0_C1.
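The reduction from weighted labels to one hard label can be sketched as follows. This is an illustration under stated assumptions, not the organizers' scorer: the name `hard_label` is hypothetical, weights are compared for exact equality, and tied senses are joined in sorted order with an underscore, following the C0_C1 example.

```python
def hard_label(weights):
    """Reduce the weighted sense labels of one instance to a single cluster label.

    weights: dict mapping sense label -> weight, e.g. {"c0": 0.8, "c1": 0.2}.
    """
    if not weights:
        raise ValueError("an instance needs at least one sense label")
    top = max(weights.values())
    # Keep every sense that shares the highest weight (exact comparison is an
    # assumption); sorting makes the combined name independent of input order.
    tied = sorted(sense for sense, w in weights.items() if w == top)
    return "_".join(tied)


if __name__ == "__main__":
    print(hard_label({"c0": 0.8, "c1": 0.2}))  # single winner
    print(hard_label({"c0": 0.5, "c1": 0.5}))  # tie: combined sense
```

Treating a tie as its own combined label means that tied instances form a separate cluster rather than being merged into either of the tied senses.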

We consider the gold standard as a solution to the clustering problem. All examples tagged with a given sense in the gold standard form a class. For the system output, the clusters are formed by instances assigned to the same sense tag (the sense tag with the highest weight for that instance). We will compare clusters output by the system with the classes in the gold standard and compute F-score as usual (Zhao and Karypis, 2005). F-score is computed with the formula below.

Suppose Cr is a class of the gold standard and Si is a cluster generated by the system. Then

  1. F-score(Cr, Si) = 2*P*R/(P+R)
  2. P = (number of instances in both Cr and Si) / (size of cluster Si)
  3. R = (number of instances in both Cr and Si) / (size of class Cr)

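Then for a given class Cr, the F-score is that of its best-matching cluster:

    F-score(Cr) = max_{Si} F-score(Cr, Si)

The overall F-score is the average over all classes, weighted by class size:

    F-Score = sum_{r=1}^{c} (nr / n) * F-score(Cr)

where c is the total number of classes, nr is the size of class Cr, and n is the total number of instances.

Participants will be required to induce the senses of the target word using only the dataset provided by the organizers.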
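As an illustration of the evaluation measure, here is a minimal sketch of the clustering F-score for a single target word. It assumes the formulation of Zhao and Karypis (2005), in which each gold class is scored by its best-matching cluster and the class scores are averaged with weights nr / n. The function name and the dictionary inputs are hypothetical, and the description does not state how scores for different target words are combined.

```python
from collections import Counter


def clustering_f_score(gold, system):
    """F-score of a system clustering against gold classes for one target word.

    gold, system: dicts mapping instance ID -> class label / cluster label.
    """
    if set(gold) != set(system):
        raise ValueError("gold and system must label the same instances")
    class_size = Counter(gold.values())
    cluster_size = Counter(system.values())
    # Number of instances shared by each (class, cluster) pair.
    overlap = Counter((gold[i], system[i]) for i in gold)

    best = dict.fromkeys(class_size, 0.0)
    for (cls, clu), common in overlap.items():
        p = common / cluster_size[clu]  # precision of the cluster for this class
        r = common / class_size[cls]    # recall of the class by this cluster
        best[cls] = max(best[cls], 2 * p * r / (p + r))

    n = len(gold)
    # Average of the best per-class scores, weighted by class size.
    return sum(class_size[c] / n * best[c] for c in class_size)


if __name__ == "__main__":
    gold = {"0001": "A", "0002": "B", "0003": "A", "0004": "A"}
    system = {"0001": "c0", "0002": "c1", "0003": "c0", "0004": "c0"}
    print(clustering_f_score(gold, system))
```

Pairs with no shared instances have an F-score of 0 and cannot be the maximum for a non-empty class, so the sketch only visits pairs that actually overlap.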

3. Datasets

We will provide a sample test set of 2500 examples of 50 target words to illustrate the data format.

The dataset consists of 100 target words and 5000 instances in total. All data comes from the internet.

The test data: instances of the target words, without sense labels.

The data format:

Target-word-1-ID   total number of senses of target word 1
Instance-ID    instance
…
Target-word-2-ID   total number of senses of target word 2
Instance-ID    instance
…

For example,

蓝  2
0001  “蓝”袜子
…
打  15
0010   “打”水
0011   “打”球
…
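A reader for this layout might look like the sketch below. It rests on two assumptions that the description does not state: fields are separated by whitespace, and an instance line can be told apart from a target-word header because its first field consists only of digits. The function name `parse_test_data` is hypothetical.

```python
def parse_test_data(lines):
    """Return {target_word: (sense_count, [(instance_id, text), ...])}.

    Assumptions: whitespace-separated fields; instance IDs are all digits,
    while target-word IDs (the words themselves) are not.
    """
    data = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("…"):
            continue  # skip blank lines and the ellipsis used in the examples
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected two fields: {line!r}")
        first, rest = parts
        if first.isdigit():
            if current is None:
                raise ValueError(f"instance before any target word: {line!r}")
            data[current][1].append((first, rest))
        else:
            current = first
            data[current] = (int(rest), [])
    return data


if __name__ == "__main__":
    sample = ["蓝 2", "0003  “蓝”天", "…", "打 15", "0010   “打”水", "0011   “打”球"]
    for word, (sense_count, instances) in parse_test_data(sample).items():
        print(word, sense_count, instances)
```

The sense count read from each header line is the number of clusters the system is expected to produce for that word.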

4. Important Dates

  1. 2010-01-01 Registration for conference opens
  2. 2010-04-01 Distribution of training data
  3. 2010-06-09 Distribution of testing data
  4. 2010-06-11 Evaluation ends
  5. 2010-06-18 Release of results
  6. 2010-07-01 (originally 2010-06-25) System description papers due
  7. 2010-07-22 (originally 2010-07-02) Paper reviews due
  8. 2010-07-31 (originally 2010-07-09) Camera-ready papers due

5. Contact Information

The home page is http://www.cipsc.org.cn/clp2010/task4_en.htm
Questions about the task may be sent by e-mail to sunle@iscas.ac.cn or zhenzhong08@iscas.ac.cn.

References

  • Ozlem Uzuner, Boris Katz, and Deniz Yuret. Word sense disambiguation for information retrieval. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, page 985, 1999.
  • Joyce Yue Chai and Alan W. Biermann. The use of word sense disambiguation in an information extraction system. In AAAI/IAAI, pages 850-855, 1999.
  • David Vickrey, Luke Biewald, Marc Teyssier, and Daphne Koller. Word-sense disambiguation for machine translation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 771-778, 2005.
  • Eneko Agirre, Olatz Ansa, David Martinez, and Eduard Hovy. Enriching WordNet concepts with topic signatures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations. ACL, 2001.
  • Ying Zhao and George Karypis. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10, pages 141-168, 2005.

 

 

  


     2010 (C) Copyright CIPS-SIGHAN2010, All rights reserved
