Named Entity Recognition and Disambiguation in Chinese

1. Introduction

Named Entity Recognition and Disambiguation is an important task in Natural Language Processing (NLP), and it is especially challenging in Chinese. First, common words can also be used as named entities. For example, 高明 (brilliant), a common adjective, is also a person name in China. Distinguishing common words from named entities is therefore difficult, particularly because Chinese words show less morphological variation than words in many other languages. Second, different types of named entities can share the same name. For example, 金山 (Gold Hill) can be the name of a person, a location or an organization. Finally, many people in China share the same name; for instance, many people are named 王刚 (Wang Gang). To further investigate these issues, SIGHAN 2012 establishes a task for Named Entity Recognition and Disambiguation.

Similar tasks have been explored previously. The KBP (Knowledge Base Population) task in TAC (Text Analysis Conference) includes a named entity disambiguation task, which it calls entity linking. KBP provides a knowledge base (KB) of named entities that maps names to entities; one name can be mapped to many entities. The goal of KBP is (1) to link names occurring in a document to the corresponding entities in the KB, and (2) to cluster names referring to the same entity if that entity is not included in the KB. Another related task is WPS (Web People Search). WPS does not provide a knowledge base; instead, it requires names referring to the same entity to be clustered together.

This task can be seen as a combination of the related tasks in KBP and WPS. First, each test name in a document should be judged to be either a common word or a named entity. If a name is predicted to be a named entity, participants should further determine which named entity in the KB it refers to. Finally, if some names are predicted to be named entities that do not occur in the KB, participants should instead cluster these names by the named entities they refer to.

2. Task Description

We provide a knowledge base, NameKB, for this task. If a name N is shared by m entities, NameKB has m entries for N, each describing one entity. The following is an example of the NameKB entries for the name 雷雨 (thunderstorm). For concreteness, we use 雷雨 as the example test name throughout this document.

<?xml version="1.0" encoding="UTF-8" ?>
<EntityList name="雷雨">
  <Entity id="01">...</Entity>
  <Entity id="02">...</Entity>
  <Entity id="03">...</Entity>
  <Entity id="04">
    <text>男,汉族,硕士研究生学历,出生于1961年9月,陕西 中共商南县委书记,商州人,1980年8月参加革命工作,1982年7月加入中国共产党,现任中共商南县委书记。曾任任共青团商洛地委副书记;洛南县政府副县长;任中共商南县委副书记;中共山阳县委常委、县政府常务副县长,等。</text>
  </Entity>
  <Entity id="05">...</Entity>
  <Entity id="06">...</Entity>
</EntityList>

Figure 1. Knowledge base for 雷雨: 雷雨.xml
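As an illustration of how a NameKB file might be read, the following is a minimal Python sketch. It assumes that each `<Entity>` element carries its description in a direct `<text>` child, as Figure 1 suggests; the function name `parse_namekb` and the shortened sample content are hypothetical and not part of the released data.

```python
import xml.etree.ElementTree as ET


def parse_namekb(xml_bytes):
    """Return (name, {entity_id: description}) for one NameKB file.

    Assumes the layout suggested by Figure 1: an <EntityList name="...">
    root whose <Entity id="..."> children each hold a <text> element.
    """
    root = ET.fromstring(xml_bytes)
    name = root.get("name")
    entities = {}
    for entity in root.iter("Entity"):
        # An entity without a <text> child gets an empty description.
        description = entity.findtext("text", default="")
        entities[entity.get("id")] = description.strip()
    return name, entities


# Shortened, made-up sample in the same shape as Figure 1.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<EntityList name="雷雨">
  <Entity id="01"><text>(description of entity 01)</text></Entity>
  <Entity id="04"><text>男,汉族,硕士研究生学历</text></Entity>
</EntityList>
""".encode("utf-8")

name, entities = parse_namekb(SAMPLE)
print(name, sorted(entities))
```

For a real file, the bytes could be obtained with `open("雷雨.xml", "rb").read()` before calling `parse_namekb`.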

For each test name, we provide a document collection T. Each document t in T contains at least one occurrence of the test name (e.g. 雷雨). For each occurrence of 雷雨, you need to determine which entry in NameKB is the corresponding entity. If the occurrence does not refer to any entity, you need to classify it as Other. If the occurrence refers to an entity that is not in NameKB, you need to cluster such occurrences according to the entities they refer to, and then name the clusters Out_XX, where XX is a number (e.g. Out_01, Out_02, ...).
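The three possible outcomes described above can be sketched as a small labelling step. The code below is only an illustration: the `decisions` structure, the `kind` values and the function name `assign_labels` are hypothetical, and the sketch assumes that Out_XX numbers are two digits and are handed out in order of first appearance.

```python
def assign_labels(decisions):
    """Map per-document decisions to task labels.

    decisions: doc_id -> (kind, value), where kind is one of
      "kb"    (value is a NameKB entity id such as "04"),
      "other" (the name is not a named entity; value is ignored),
      "out"   (value is any hashable key identifying a cluster of
               documents that refer to the same entity outside NameKB).
    """
    cluster_names = {}  # cluster key -> "Out_XX"
    labels = {}
    for doc_id in sorted(decisions):
        kind, value = decisions[doc_id]
        if kind == "kb":
            labels[doc_id] = value
        elif kind == "other":
            labels[doc_id] = "Other"
        elif kind == "out":
            if value not in cluster_names:
                # Assumption: clusters are numbered from 01 upwards.
                cluster_names[value] = f"Out_{len(cluster_names) + 1:02d}"
            labels[doc_id] = cluster_names[value]
        else:
            raise ValueError(f"unknown decision kind: {kind!r}")
    return labels


decisions = {
    "001": ("kb", "01"),
    "002": ("out", "cluster-a"),
    "003": ("other", None),
    "004": ("out", "cluster-b"),
    "005": ("out", "cluster-a"),
}
labels = assign_labels(decisions)
print(labels)
```

How the decisions themselves are made (linking, common-word detection, clustering) is left to the participating system.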

3. Data Description

We provide the following data:

1. Knowledge base, NameKB. We provide an XML file for each test name. This file contains several entries, each describing an entity with that name. The file is named N.xml, where N is the test name. For example, the file for 雷雨 is 雷雨.xml. Please refer to Figure 1 for an example.

2. Document collection T for each test name. All documents containing the name N are placed in the folder N. For example, all documents containing 雷雨 are in the folder 雷雨. Every file in the folder is a plain text file named XXX.txt, where XXX is a three-digit number.
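The folder layout described above could be read with a few lines of Python. This is a sketch under assumptions: the documents are taken to be UTF-8 encoded (the encoding of the input files is not stated here), `load_documents` is a hypothetical helper, and the two sample sentences are invented purely so that the example runs.

```python
import tempfile
from pathlib import Path


def load_documents(folder):
    """Return {file name without extension: document text} for one test name."""
    docs = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the documents are UTF-8 encoded.
        docs[path.stem] = path.read_text(encoding="utf-8")
    return docs


# Build a tiny stand-in for the folder 雷雨 with two invented documents.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "雷雨"
    folder.mkdir()
    (folder / "001.txt").write_text("今天下午有雷雨。", encoding="utf-8")
    (folder / "002.txt").write_text("话剧《雷雨》今晚上演。", encoding="utf-8")
    docs = load_documents(folder)

print(sorted(docs))
```

The dictionary keys are exactly the identifiers expected in the first column of the result file.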

Result File Format

You are expected to return the results as follows. For each name N, output a plain text file, N.txt. For example, 雷雨.txt should be output for name 雷雨. Every file should contain multiple rows, and each row should consist of two columns separated by one space.

The first column is the file name without extension (e.g. 123 if the file name is 123.txt). The second column is the entity identifier, in one of the following formats:

1. If N refers to an entry in NameKB, output the id XX of that entity in NameKB, where XX is a two-digit number;

2. If N does not refer to any entity, output Other;

3. Otherwise (N refers to an entity that is not in NameKB), output Out_XX.

The format of one example output file named 雷雨.txt is as follows:

001 01

002 Out_02

003 Other

004 06


Note: all result files should be returned in UTF-8 encoding.

Note that we assume every occurrence of the test name within a document refers to the same entity. Therefore you should return only *one* result for each file.
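A result file in the format above could be produced as in the following sketch. The helper names `format_results` and `write_results` are hypothetical, and two details are assumptions rather than stated requirements: Out_XX is taken to use two digits, as in the examples, and each row, including the last, ends with a newline.

```python
import re
from pathlib import Path

# Accepted second-column values: a two-digit NameKB id, Other, or Out_XX.
# Assumption: XX in Out_XX is two digits, as in the examples.
LABEL_RE = re.compile(r"\d{2}|Other|Out_\d{2}")


def format_results(predictions):
    """Render {doc_id: label} as rows of two columns separated by one space."""
    rows = []
    for doc_id in sorted(predictions):
        label = predictions[doc_id]
        if not LABEL_RE.fullmatch(label):
            raise ValueError(f"invalid label for {doc_id}: {label!r}")
        rows.append(f"{doc_id} {label}")
    return "\n".join(rows) + "\n"


def write_results(name, predictions, out_dir="."):
    """Write N.txt for test name N in UTF-8 and return its path."""
    path = Path(out_dir) / f"{name}.txt"
    path.write_text(format_results(predictions), encoding="utf-8")
    return path


predictions = {"001": "01", "002": "Out_02", "003": "Other", "004": "06"}
text = format_results(predictions)
print(text, end="")
```

Because the dictionary has one entry per document, the output contains exactly one row per file; calling `write_results("雷雨", predictions)` would create 雷雨.txt in the current directory.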

4. Evaluation method

Take 雷雨 for example. Assume NameKB has 6 entries for 雷雨, and T is the document collection for 雷雨. As described above, a participating system assigns a label to each document containing 雷雨, and the returned results fall into three classes, namely SL_XX, SOther and SOut_XX. The gold-standard results, which we annotate for each test document, are also divided into three classes, which for evaluation purposes are denoted L_XX, Other and Out_XX.

We use the following methods to compute precision and recall for every document t in T. (Here each document t stands for the instance of 雷雨 it contains.)

(1) if t in T is predicted as SL_XX, we use the following formulae.

(2) if t in T is predicted as SOther, we use the following formulae.

(3) if t in T is predicted as SOut_XX, we use the following formulae.

(4) Over all the documents containing 雷雨, the overall precision and recall are calculated as follows.

Note that cases (1), (2) and (3) are mutually exclusive.

(5) The overall precision and recall for all test names are calculated as follows (the set of all test names is denoted N, and each name in N is denoted n).

5. Contact Information

Houfeng Wang, Sujian Li

No 5, Yiheyuan Road, Haidian District, Beijing, China

Institute of Computational Linguistics, Department of Computer Science and Technology, Peking University.

Postal code 100871

Email: {wanghf, lisujian}