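| Product Named Entity Recognition for Business Information Extraction
[Authors]刘非凡; 赵军; 吕碧波; 徐波; 于浩; 夏迎炬;
[Summary]The growth of market informatization has made business information extraction and market content management an increasingly active research topic in information science. Product named entity recognition, one of its key techniques, is attracting growing attention. This paper defines product named entities for business information extraction, systematically analyzes the characteristics and difficulties of the recognition task, proposes a product named entity recognition method based on the hierarchical hidden Markov model (HHMM), and implements a prototype system that recognizes and tags product named entities in Chinese free text. Experiments show that the system achieves satisfactory results in both the digital electronics and mobile phone domains, with overall F-measures of 79.7%, 86.9%, and 75.8% for product name, product model, and product brand entities, respectively. A comparison with a maximum entropy model confirms that the HHMM has stronger representational power for multi-scale nested sequences.
[Abstract]Electronic business has recently fueled increasing research interest in business information extraction and market intelligence management. As one of the key techniques, product named entity recognition (product NER) has also begun to draw more attention in the field of natural language processing. In this paper, the characteristics and challenges of product NER are explored and analyzed carefully, and a hierarchical hidden Markov model (HHMM) based approach to product NER from Chinese free text is presented. Experim...
[Keywords]computer applications; Chinese information processing; product named entity recognition; business information extraction; hierarchical hidden Markov model;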
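| Topic Detection Based on Divide-and-Conquer Multi-Level Clustering with Multi-Strategy Optimization
[Authors]骆卫华; 于满泉; 许洪波; 王斌; 程学旗;
[Summary]Topic detection and tracking is an evaluation-driven research area that aims to organize and use streams of language text according to events. Since it was proposed in 1996, it has received increasingly wide attention. Building on existing mature algorithms, this paper proposes a topic detection algorithm based on divide-and-conquer multi-level clustering. Its core idea is to partition all the data into groups with some internal relatedness, cluster each group separately to obtain the topics within each group (micro-clusters), and then cluster all the micro-clusters to obtain the final topics. Multiple optimization strategies are applied during clustering to safeguard clustering quality. A system based on this algorithm was tested on the TDT4 Chinese corpus, and the results show that the algorithm is among the best-performing ones currently available.
[Abstract]Topic Detection and Tracking is an evaluation-driven research area that aims to organize and utilize streams of text according to events. Since it was proposed in 1996, it has attracted more and more attention. This paper proposes a divide-and-conquer multi-level clustering algorithm with multi-strategy optimization, which builds on a study of today's mature algorithms. The core idea of the algorithm is to divide all data into groups (each group has intrinsic relevance), and cluster in each group to ...
[Keywords]computer applications; Chinese information processing; topic detection and tracking; divide-and-conquer multi-level clustering; hierarchical clustering;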
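The two-stage idea described in this entry (cluster within groups, then cluster the resulting micro-clusters) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the fixed-size grouping of consecutive documents, the bag-of-words cosine similarity, the greedy single-pass clustering, the threshold values, and the names `cosine`, `single_pass`, and `detect_topics`. The paper's actual grouping criterion and optimization strategies are not given in the abstract.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(items, threshold):
    """Greedy clustering (an assumed stand-in for the paper's method).
    Each item joins the most similar existing cluster when the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    `items` is a list of (vector, member_ids); the result has the same shape."""
    clusters = []  # each cluster is [summed term counts, list of member ids]
    for vec, ids in items:
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c[0])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([Counter(vec), list(ids)])
        else:
            best[0].update(vec)   # summed counts serve as the centroid
            best[1].extend(ids)
    return [(c[0], c[1]) for c in clusters]

def detect_topics(docs, group_size=3, threshold=0.3):
    """Stage 1: cluster each group of consecutive documents into micro-clusters.
    Stage 2: cluster all micro-clusters into final topics."""
    vectors = [(Counter(d.split()), [i]) for i, d in enumerate(docs)]
    micro = []
    for start in range(0, len(vectors), group_size):
        micro.extend(single_pass(vectors[start:start + group_size], threshold))
    topics = single_pass(micro, threshold)
    return [sorted(ids) for _, ids in topics]

docs = [
    "earthquake rescue city",
    "election vote president",
    "earthquake city damage",
    "vote president campaign",
    "earthquake rescue damage",
    "election campaign vote",
]
print(detect_topics(docs))
```

On this toy input each group of three documents yields two micro-clusters, and the second stage merges micro-clusters from different groups that share vocabulary. Clustering small groups first keeps each pass cheap, which is one plausible motivation for the divide-and-conquer design.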
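| Syntactic-Semantic Classification of Modern Tibetan Verbs and Related Sentence Patterns
[Authors]江荻;
[Summary]Going beyond the simple description of verb classes in traditional Tibetan grammar, this paper establishes verb categories and related syntactic rules organized around syntax and semantics. It distinguishes 12 major classes of Tibetan verbs, each with its own requirements on the number of arguments and their syntactic properties. The syntactic-semantic classification of verbs can therefore reflect, in detail and comprehensively, the grammatical frames of the various Tibetan sentence types, including word order, case markers, and syntactic particles. The classification results can be applied directly to building a Tibetan grammatical information dictionary and are an important foundation for the computational processing of Tibetan.
[Abstract]This paper discusses the classification of Tibetan verbs according to verbal semantic types and syntactic types, which can describe the number of arguments and the properties of sentence components. There are 12 types of verbs in Tibetan: stative verbs, action verbs, cognition verbs, perception verbs, verbs of change, directional verbs, narrative verbs, copulas, verbs of possession, existential verbs, interrelation verbs, and causative verbs, each of which requires different case markers for nouns in different position...
[Keywords]computer applications; Chinese information processing; Tibetan; syntactic-semantic verb classification; syntactic structure; syntactic markers;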
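| Evaluation of Chinese Automatic Word Segmentation and Part-of-Speech Tagging
[Authors]杨尔弘; 方莹; 刘冬明; 乔羽;
[Summary]This paper describes the 2003 "863 Chinese and Interface Technology" integrated evaluation of Chinese automatic word segmentation and part-of-speech tagging, covering the evaluation content, the evaluation method, the selection and generation of test items, the test metrics, and the test results, and it summarizes the segmentation and tagging errors of the participating systems. It focuses on a flexible automatic testing method used in the evaluation, which to some extent overcomes the difficulty of defining a specific segmentation unit. It also analyzes the evaluation results and makes suggestions for future evaluations.
[Abstract]This paper presents the results from the '863 Chinese and Interface Technology' integrated evaluation of automatic Chinese word segmentation and part-of-speech (POS) tagging held in Beijing in 2003. It describes the evaluation content, evaluation method, selection of test questions, test metrics, etc., and summarizes the types of word segmentation and POS tagging errors in the test results. In this paper, we also present a flexible automatic method used in the evaluation. Finally, we give some analysis of the result an...
[Keywords]computer applications; Chinese information processing; automatic word segmentation; part-of-speech tagging; evaluation;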
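| Active Learning for Chinese Word Segmentation Based on a Multigram Language Model
[Authors]冯冲; 陈肇雄; 黄河燕; 关真珍;
[Summary]Word segmentation is a fundamental problem in Chinese processing. To overcome the difficulty that traditional methods have in adapting to the many specialized domains and changing language phenomena found in Web text analysis, this paper takes unsupervised segmentation as its basic framework, uses the EM algorithm to build an n-multigram language model, and proposes a confidence-based active learning segmentation algorithm. The system thus relies mainly on large amounts of unlabeled data while actively selecting a small amount of the most valuable data for manual annotation. Experimental results show that the algorithm outperforms several related unsupervised segmentation algorithms.
[Abstract]Word segmentation is a fundamental task in Chinese processing. To overcome the difficulties of traditional methods in coping with various application domains and evolving language phenomena, this paper adopts an unsupervised learning framework, using the EM algorithm to train an n-multigram language model. A new certainty-based active learning segmentation algorithm is proposed, which combines labeled data with unlabeled data to optimize the language model. In experiments it outperforms other unsupervised word seg...
[Keywords]computer applications; Chinese information processing; word segmentation; unsupervised machine learning; active learning; EM algorithm;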
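| HowNet-Based Computation of the Semantic Orientation of Words
[Authors]朱嫣岚; 闵锦; 周雅倩; 黄萱菁; 吴立德;
[Summary]With the rapid development of Internet technology and the explosion of online information, techniques for automatically analyzing attitude and orientation in large-scale text have broad application prospects in many fields, such as enterprise business intelligence systems and government public opinion analysis. Research on positive and negative semantic orientation also offers new ideas and tools for natural language processing research such as text classification, automatic summarization, and text filtering. The basis of document-level semantic orientation research is determining the positive or negative orientation of words. Based on HowNet, this paper proposes two methods for computing the semantic orientation of words: one based on semantic similarity and one based on semantic relevance fields. Experiments show that the methods work well on common Chinese words, with a frequency-weighted classification accuracy above 80%, and are of practical value.
[Abstract]Nowadays, with the development of the Internet and the information explosion, automated techniques for analyzing authors' attitudes towards specific events will contribute greatly to business intelligence and public opinion surveys. Semantic orientation inference has become a meaningful tool, which can provide useful information for text classification, summarization, filtering, etc. Measuring the semantic orientation of words would greatly contribute to predicting the author's attitude in a passage. In this paper, a simple ...
[Keywords]computer applications; Chinese information processing; attitude classification; semantic orientation; HowNet;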
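| A Pattern-Classification-Based Method for Determining Chinese Tense
[Authors]林达真; 李绍滋;
[Summary]Chinese tense is a difficult problem in Chinese information processing. Rule-based methods have serious problems with sentences that contain no tense feature words and with sentences that contain several. Taking a statistical perspective, this paper proposes a pattern-classification-based method for determining tense. The method jointly evaluates the contribution of every word in a sentence to tense determination, so it can handle sentences with no tense feature words as well as sentences with several. Because it uses a linear discriminant function, it can analyze multi-dimensional data and is fast in both training and classification. In an open test, the precision and recall of tense determination for single sentences are 79.8% and 95.3%, respectively.
[Abstract]As far as NLP is concerned, the tense of the Chinese language is especially hard to tackle. One of the outstanding characteristics of the Chinese language is that its tense is usually implied rather than obvious. Hence, the rule-based solution is far from suitable for the recognition of tense in situations where tense-informing words are missing or more than one such word is present. In this paper, we introduce a pattern-classification-based solution, which evaluates each single word in terms of its contribut...
[Keywords]computer applications; Chinese information processing; Chinese; tense; feature words; linear discriminant function; perceptron criterion function;
| HowNet-Based Text Inference
[Authors]石晶; 戴国忠;
[Summary]Text inference plays a very important role in natural language processing applications. This paper presents an inference method based on HowNet that represents HowNet knowledge as a semantic network and performs inference by "marker passing". Its distinguishing feature is that it adopts the idea of the construction-integration model, generating knowledge structures dynamically and building inference paths between the words of a text in a guided way. Tests on instances of 16 inference classes show that, given enough context, the method produces fairly satisfactory inferences at low cost.
[Abstract]Text inference is central to natural language applications. This paper presents an inference method based on HowNet, which organizes knowledge as a semantic network and infers by marker passing. The method introduces the construction-integration model, generates knowledge structures dynamically, and builds guided paths between text words. Examples of 16 inference classes are used to test it. The results show that satisfactory inferences can be drawn at low cost if enough context is given.
[Keywords]computer applications; Chinese information processing; text inference; construction-integration model; marker passing; semantic network;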
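| A Character Segmentation Algorithm for Address Lines on Handwritten Chinese Envelopes
[Authors]韩智; 刘昌平; 殷绪成;
[Summary]In a handwritten Chinese envelope processing system, character segmentation of the address line is the key step toward address line recognition. Based on the characteristics of the characters in postal envelope address lines, this paper proposes a targeted character segmentation algorithm. First, structure-based methods such as projection, connected component extraction, and stroke crossing number analysis are applied to the address line image to produce an initial segmentation, yielding a sequence of basic segments. Adjacent basic segments are then combined to form multiple candidate segmentation paths, and the paths are evaluated using recognition confidence and prior knowledge from the postal destination address database to obtain the optimal segmentation path. Tested on real envelope images captured by a postal sorting machine, the algorithm achieves an address-line-only recognition accuracy of 78.61% and a sorting accuracy of 95.42% when address line recognition is combined with postal code recognition.
[Abstract]Character segmentation for mail addresses has become a crucial step for address recognition in automatic mail sorting systems. In this paper, a character segmentation algorithm is proposed according to the characteristics of handwritten mail address characters. First, a simple segmentation process is carried out using structure-based methods, including vertical projection, connected component extraction, and stroke crossing number analysis, to extract the block sequence from the mail address image. Next c...
[Keywords]artificial intelligence; pattern recognition; postal envelope address; offline handwritten Chinese characters; character segmentation; OCR;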
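| An Extraction and Integration System for Clue Events Based on Spatio-Temporal Analysis
[Authors]吴平博; 陈群秀; 马亮;
[Summary]Information extraction technology can provide high-quality retrieval services. Targeting online news events, this paper extracts and integrates the key event information that interests people. The system uses the following methods and strategies: (1) extraction rules are built from sentence pattern templates, and event information is then extracted directly from text in which temporal and spatial phrases have been recognized and normalized, which skips deep syntactic analysis and reduces the difficulty of implementing the system; (2) the normalized spatio-temporal information of events is used to link the same event across different documents for event merging; (3) when a document shifts from one event to another, the document is segmented by event, which solves the problem of merging information about different events within one document. Preliminary experimental results show that the methods and strategies adopted are effective.
[Abstract]Information extraction (IE) technology can provide high-quality service for retrieval. Targeting events in web news, this paper builds a system that can extract and integrate key information about events that interest people. The methodologies and strategies of the system are as follows: (1) Extraction rules are built in terms of sentence patterns, then event information is directly extracted from the text in which temporal phrases (TP) and space phrases (SP) are recognized and normalized. The extraction system can...
[Keywords]computer applications; Chinese information processing; information extraction; sentence pattern templates; clue events; spatio-temporal information; event merging;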