 |
| [Title] | A New Method for Rapidly Acquiring Domain New Words (一种快速获取领域新词语的新方法)
| |
| [English Title] | A New Approach for Domain New Words Detection
| |
| [Author] | LIU Hua (刘华);
| |
| [English Name] | LIU Hua (College of Chinese Language and Culture, Jinan University, Guangzhou, Guangdong 510610, China);
| |
| [Keywords] | artificial intelligence; natural language processing; new words; recognition; clustering;
| |
| [English Keywords] | artificial intelligence; natural language processing; new words; detection; clustering;
| |
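| [Abstract] | This paper proposes a new method for recognizing new words. The method directly extracts the manually indexed keywords on categorized web pages and stores them in per-category word lists according to the category of the web-page column they come from, so that new word recognition and clustering are completed quickly in one pass. The method is simple and fast. Using it, we extracted 229,237 entries from 600 million characters of web pages in 15 categories, of which 175,187 are new words, a new-word rate of 76.42%. The games category has the highest new-word rate and the politics_society category the lowest. The new words are mainly named entities, with fixed structure and strong semantic completeness and specificity. They help resolve ambiguous segmentation and the unknown-word problem, and can improve text-representation tasks such as classification and keyword indexing.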
| |
| [English Abstract] | This paper puts forward a new method for domain new words detection, which directly extracts the keywords that specialists have labeled on web pages and stores them in classified word lists according to the column of the source web page. This simple approach detects and clusters new words quickly. Using the approach, from 600 million characters of web pages covering 15 domains, we extracted 229,237 words, including 175,187 new words, for a new-word ratio of 76.42%. The new words are mostly named entities, which have fixed structure and complete meaning...
| |
| [Journal] | 2006, Issue 5 |
|
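
The pipeline summarized in the abstract (collect the manually indexed keywords of each page, file them under the page's category, then measure how many are new) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: it assumes the keywords sit in an HTML `<meta name="keywords">` tag, that they are separated by commas, semicolons, or the Chinese enumeration comma, and that "new" means absent from some reference lexicon. The names `KeywordMetaParser`, `build_category_wordlists`, and `new_word_rate` are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class KeywordMetaParser(HTMLParser):
    """Collects keywords from <meta name="keywords" content="..."> tags.

    Assumption: the manually indexed keywords are exposed through this tag;
    the record above does not say where on the page they are stored.
    """

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "keywords":
            return
        # Assumed separators: ASCII and full-width commas/semicolons, plus "、".
        parts = re.split(r"[,;,;、]+", attr.get("content") or "")
        self.keywords.extend(p.strip() for p in parts if p.strip())


def build_category_wordlists(pages):
    """Map each category to the set of keywords found on its pages.

    `pages` is an iterable of (category, html_text) pairs; the category is
    taken to be the site column the page was published under.
    """
    wordlists = defaultdict(set)
    for category, html_text in pages:
        parser = KeywordMetaParser()
        parser.feed(html_text)
        wordlists[category].update(parser.keywords)
    return dict(wordlists)


def new_word_rate(words, known_lexicon):
    """Fraction of `words` that do not appear in `known_lexicon`."""
    if not words:
        return 0.0
    new_count = sum(1 for w in words if w not in known_lexicon)
    return new_count / len(words)
```

Because every keyword is filed under the category of its source page, clustering comes for free from the site structure, and the new-word rate is a simple ratio. For the totals reported in the abstract, 175,187 / 229,237 gives approximately 76.42%.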