Named Entity Recognition and Disambiguation in Chinese

1. Introduction

Named Entity Recognition and Disambiguation is an important task in Natural Language Processing (NLP). In Chinese, it is especially important and challenging. First, common words can also be used as named entities. For example, 高明 (brilliant), a common adjective, is also a person name in China. Distinguishing common words from named entities is therefore difficult, given that Chinese words show less morphological variation than words in many other languages. Second, different types of named entities can share the same name. For example, 金山 (Gold Hill) can be the name of a person, a location, or an organization. Finally, it is common in China for many people to share the same name. For instance, many people are named 王刚 (Wang Gang). To further investigate these issues, SIGHAN 2012 establishes a task on Named Entity Recognition and Disambiguation.

Similar tasks have been explored previously. The KBP (Knowledge Base Population) track at TAC (Text Analysis Conference) includes a named entity disambiguation task, which it calls entity linking. KBP provides a knowledge base (KB) of named entities that maps names to entities; one name can map to many entities. The goal of KBP is (1) to link names occurring in documents to the corresponding entities in the KB, and (2) to cluster names referring to the same entity when that entity is not included in the KB. Another related task is WPS (Web People Search). WPS does not provide a knowledge base; instead, it requires names referring to the same entity to be clustered together.

This task can be seen as a combination of the related tasks in KBP and WPS. First, each test name in a document should be judged to be a common word or a named entity. If a name is predicted to be a named entity, participants should further determine which named entity in the KB it refers to. Finally, if a name is predicted to be a named entity that does not occur in the KB, participants should instead cluster such names by the named entities they refer to.

2. Task Description

We provide a knowledge base, NameKB, for this task. Suppose the name "N" is shared by m entities; then NameKB has m entries for name N, each entry describing one entity. The following is an example: the NameKB file for the name 雷雨 (thunderstorm). For concreteness, we use 雷雨 as the example test name throughout this document.

<?xml version="1.0" encoding="UTF-8" ?>
<EntityList name="雷雨">
  <Entity id="01">
    <text>通江县第二中学教师,男,大学本科,西华师范大学英语语言文学专业毕业。高二英语备课组长。自参工以来一直从事高中英语教学工作,长期从事班主任工作,所任班级历届成绩显著。论文《虚拟语气的用法》、《浅谈分词短语作状语》、《定语从句中介词+关系词介词的选定》在国家级、省级刊物上发表。指导向桀等多名学生参加历届全国中学生英语能力大赛并获优秀指导奖。</text>
  </Entity>
  <Entity id="02">
    <text>重庆市黔江区太极乡党委副书记、乡长。主持政府全面工作,主管财政、金融、审计、统计、非公有制经济、城乡统筹、乡镇企业、招商引资、烤烟、蚕桑工作。</text>
  </Entity>
  <Entity id="03">
    <text>罗源县中房镇下湖村人。1978年8月加入中国共产党。1981年,毕业于上海同济大学规划专业。同年起,任福州市城乡设计院规划室主任、工程师,兼任福州市土木建筑学会秘书长。曾获省4项、市1项建筑规划设计奖。1993年,任福州市政府城市改造办公室主任科员、福州市房地产开发总公司副总工程师。2000年,任福建武夷工程总公司建兴公司总经理(副处级)。2001年,任重庆武夷公司总经理(处级)。</text>
  </Entity>
  <Entity id="04">
    <text>男,汉族,硕士研究生学历,出生于1961年9月,陕西商州人,1980年8月参加革命工作,1982年7月加入中国共产党,现任中共商南县委书记。曾任共青团商洛地委副书记;洛南县政府副县长;中共商南县委副书记;中共山阳县委常委、县政府常务副县长,等。</text>
  </Entity>
  <Entity id="05">
    <text>四川省蒲江县教育局党组书记、局长。主持县教育局全面工作。主管教育督导、计财、基建和教仪电教等工作。</text>
  </Entity>
  <Entity id="06">
    <text>女,1975年8月生,回族,广西南宁人,中共党员,1997年7月广西师范大学汉语言专业毕业,2006年获教育硕士学位,中学中级教师,1997年7月进入桂林中学任教语文至今。</text>
  </Entity>
</EntityList>

Figure 1. Knowledge base for 雷雨: 雷雨.xml
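
As a minimal illustration of how such a file can be consumed, the following Python sketch parses one NameKB file with the standard xml.etree.ElementTree module, assuming only the structure shown in Figure 1. The function name load_name_kb is ours; it is not part of the task release.

import xml.etree.ElementTree as ET

def load_name_kb(path):
    """Return (name, {entity_id: description}) for one NameKB file such as 雷雨.xml."""
    tree = ET.parse(path)                 # the file is UTF-8 encoded XML
    root = tree.getroot()                 # <EntityList name="雷雨">
    entries = {}
    for entity in root.findall("Entity"):
        text_node = entity.find("text")
        entries[entity.get("id")] = text_node.text if text_node is not None else ""
    return root.get("name"), entries

if __name__ == "__main__":
    name, entries = load_name_kb("雷雨.xml")
    print(name, "has", len(entries), "entries")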

For each test name, we provide a document collection T. Each document t in T contains at least one occurrence of the test name (e.g. 雷雨). For each document containing 雷雨, you need to determine which entry in NameKB is the corresponding entity. If the instance of 雷雨 does not refer to any entity, you need to classify it as Other. If the instance of 雷雨 refers to an entity not in NameKB, you need to cluster the names according to the entities they refer to, and name the clusters Out_XX, where XX is a number (e.g. Out_01, Out_02, ...).
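
To make the possible outcomes concrete, here is a deliberately naive Python sketch of the per-document decision. It links a document to the KB entry with the highest character-overlap score and falls back to Other below a threshold; clustering out-of-KB instances into Out_XX groups would be a separate step. The scoring function, the threshold, and the name toy_label are ours, purely for illustration, and are not a recommended approach or part of the task.

def toy_label(doc_text, kb_entries, threshold=0.05):
    """Return a KB id (e.g. "01") or "Other" for one document.

    kb_entries maps entity ids to their description texts, as in Figure 1.
    """
    def overlap(a, b):
        # character-level Dice coefficient, a crude similarity measure
        sa, sb = set(a), set(b)
        return 2.0 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 0.0

    best_id, best_score = None, 0.0
    for entity_id, description in kb_entries.items():
        score = overlap(doc_text, description)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else "Other"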

3. Data Description

We provide the following data:

1. Knowledge base, NameKB. We provide an XML file for each test name. The file contains several entries, each describing an entity with that name. The file is named N.xml, where N is the test name. For example, the file for 雷雨 is 雷雨.xml. Please refer to Figure 1 for an example.

2. Document collection T, one for each test name. All documents containing the name N are placed under the folder N. For example, all documents containing 雷雨 are under the folder 雷雨. Every file in the folder is a plain text file named XXX.txt, where XXX is a three-digit number; a sketch of reading this layout follows below.
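
As a concrete reading of this layout, the following Python sketch collects the documents of one test-name folder. It assumes the documents are UTF-8 plain text; the function name load_documents is ours.

import os

def load_documents(name_dir):
    """Return {doc_id: text} for every XXX.txt under one test-name folder (e.g. 雷雨/)."""
    documents = {}
    for filename in sorted(os.listdir(name_dir)):
        if not filename.endswith(".txt"):
            continue                              # ignore anything that is not a document
        doc_id = filename[:-len(".txt")]          # "001" for "001.txt"
        with open(os.path.join(name_dir, filename), encoding="utf-8") as f:  # assumed UTF-8
            documents[doc_id] = f.read()
    return documents

if __name__ == "__main__":
    docs = load_documents("雷雨")
    print(len(docs), "documents found for 雷雨")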

Result File Format

You are expected to return the results as follows. For each name N, output a plain text file, N.txt. For example, 雷雨.txt should be output for the name 雷雨. Every file should contain one row per document, and each row should consist of two columns separated by one space.

The first column is the file name without extension (e.g. 123 if the filename is 123.txt). The second column is the entity identifier, in one of the following formats:

1. If N refers to an entry in NameKB, output the entity id XX as given in NameKB, where XX is a two-digit number;

2. If N does not refer to any entity, output Other;

3. Otherwise (N refers to an entity not in NameKB), output Out_XX.

The format of one example output file named 雷雨.txt is as follows:

001 01

002 Out_02

003 Other

004 06

……

Note: all result files should be returned in UTF-8 encoding.

Note that we assume every occurrence of the test name in a document refers to the same entity. Therefore you should return only *one* result for each file.
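
To make the expected output format concrete, the following Python sketch writes one result file in the format above. The predictions shown mirror the example rows above; write_result_file is our own helper, not part of the task release.

def write_result_file(name, labels):
    """Write <name>.txt in UTF-8, with one "doc_id label" row per document."""
    with open(name + ".txt", "w", encoding="utf-8") as out:
        for doc_id in sorted(labels):
            out.write("%s %s\n" % (doc_id, labels[doc_id]))   # two columns, one space

if __name__ == "__main__":
    # Hypothetical predictions for four documents of 雷雨, matching the example above.
    predictions = {"001": "01", "002": "Out_02", "003": "Other", "004": "06"}
    write_result_file("雷雨", predictions)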

4. Evaluation Method

Take 雷雨 as an example. Assume NameKB has 6 entries for 雷雨, and T is the document collection for 雷雨. As described above, the result that a participating system returns for each document containing 雷雨 falls into one of three classes, denoted SL_XX, SOther and SOut_XX. The gold-standard result that we annotate for each test document likewise falls into one of three classes, which for evaluation purposes are denoted L_XX, Other and Out_XX.

We use the following methods to compute precision and recall for every document t in T. (Here each document t stands for one instance of 雷雨, since all occurrences within a document are assumed to refer to the same entity.)

(1) If t in T is predicted as SL_XX, we use the following formulae.

(2) If t in T is predicted as SOther, we use the following formulae.

(3) If t in T is predicted as SOut_XX, we use the following formulae.

(4) According to all the instance documents of 雷雨, the overall precision and recall are calculated as follows.

Note that the cases (1), (2) and (3) are mutually exclusive.

(5) The overall precision and recall over all test names are calculated as follows (the set of all test names is denoted N, and each name in N is denoted n).
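
The exact formulae for cases (1) through (5) are defined by the organizers and are not reproduced in this copy. Purely as an illustration of how such document-level scores can be computed, the Python sketch below implements one common B-cubed-style reading, in which documents are grouped into clusters by their labels and each document is scored by the overlap of its system and gold clusters; under this reading, step (4) would average the per-document scores for one name and step (5) would then average over names. This is an assumption of ours, not the official metric, and bcubed and clusters_by_label are our own names.

from collections import defaultdict

def clusters_by_label(labels):
    """Group document ids by label, e.g. {"01": {"001", "004"}, "Other": {"003"}}."""
    clusters = defaultdict(set)
    for doc_id, label in labels.items():
        clusters[label].add(doc_id)
    return clusters

def bcubed(system, gold):
    """Return (precision, recall) for one test name, averaged over its documents.

    system and gold map document ids to labels (a KB id, "Other", or "Out_XX");
    the system is assumed to label every document that appears in gold.
    """
    sys_clusters = clusters_by_label(system)
    gold_clusters = clusters_by_label(gold)
    precisions, recalls = [], []
    for doc_id in gold:
        s = sys_clusters[system[doc_id]]    # documents sharing this system label
        g = gold_clusters[gold[doc_id]]     # documents sharing this gold label
        overlap = len(s & g)
        precisions.append(overlap / len(s))
        recalls.append(overlap / len(g))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

if __name__ == "__main__":
    # Toy labels for four documents of 雷雨, invented for illustration only.
    gold = {"001": "01", "002": "Out_01", "003": "Other", "004": "06"}
    system = {"001": "01", "002": "Out_02", "003": "01", "004": "06"}
    print(bcubed(system, gold))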

5. Contact Information

Houfeng Wang, Sujian Li

No 5, Yiheyuan Road, Haidian District, Beijing, China

Institute of Computational Linguistics, Department of Computer Science and Technology, Peking University.

Postal code 100871

Email: {wanghf, lisujian}@pku.edu.cn