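Research and Implementation of a Text Classification System Based on the Vector Space Model
[Authors]陈治纲; 何丕廉; 孙越恒; 郑小慎;
[Chinese abstract]Text classification is an important research topic in information processing. It can effectively address information clutter and helps locate the information needed. This paper jointly considers several test criteria (frequency, dispersion, and concentration) and proposes a new feature extraction algorithm that overcomes the feature "overfitting" problem caused by traditional feature extraction from a single or one-sided test criterion. On this basis, a text classification system using a two-level classification scheme is implemented. Experimental results show that, compared with the class-centroid method, the two-level scheme achieves higher precision and recall.
[Abstract]Text classification is an important research task in natural language processing. It can efficiently resolve the problem of information chaos and help to locate the required information. Traditional approaches to text classification commonly extract feature terms using a single test criterion, which leads to the problem of "overfitting". This paper comprehensively takes test criteria such as frequency, distribution, and concentration into account and proposes a new algorithm for feature extraction...
[Keywords]computer applications; Chinese information processing; text classification; test criteria; feature extraction; two-level classification scheme;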
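| A Cloud Model Method for Evaluating the Performance of Information Retrieval Strategies
[Authors]康海燕; 李彦芳; 林培光; 樊孝忠;
[Chinese abstract]In information retrieval, the evaluation methods in common use reflect only the average performance of a retrieval strategy, not its stability or randomness, so the evaluation of retrieval strategies is not comprehensive enough. This study proposes a cloud-model-based method for evaluating retrieval strategies. The method establishes a natural conversion between qualitative evaluation and quantitative data, realized through rigorous mathematical methods. Evaluating a retrieval strategy with it reflects not only the strategy's average performance but also its stability. Experimental data show that the method is feasible and that its evaluation results are closer to the actual situation. The method can also be used to evaluate text classification strategies.
[Abstract]At present, the most popular methods of strategy evaluation in information retrieval systems cannot reflect stability and randomness, so the traditional methods are not comprehensive enough for strategy evaluation. This research presents a new method of strategy evaluation based on the cloud model. The method can reflect not only the average performance of a strategy but also its stability and randomness. It sets up a transformation between qualitative concepts and quantitative data. This kind of transformation is carried out through ...
[Keywords]computer applications; Chinese information processing; information retrieval; cloud model; strategy performance evaluation;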
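| Internet-Oriented Detection of New Chinese Words
[Authors]邹纲; 刘洋; 刘群; 孟遥; 于浩; 西野文人; 亢世勇;
[Chinese abstract]With the rapid development of society, new words constantly emerge in daily life. Collecting and organizing them is an important research topic in Chinese information processing. This paper proposes a method for detecting new words automatically. Web pages collected from the Internet are analyzed on a large scale to build a huge set of words and character strings, from which new words are detected automatically. The detected results are then further filtered by word-formation rules, and the new words present in the collected corpus are finally extracted. The system implemented with this method can find new words of any length and in any domain. It is currently being used in compiling the Modern Chinese New Word Information (Electronic) Dictionary (《现代汉语新词语信息(电子)词典》), where in practice it has greatly reduced the burden of finding new words manually.
[Abstract]With the fast development of society, more and more new words appear in our lives. Collecting those new words is one of the important topics in Chinese natural language processing. This paper presents a method for detecting these new words automatically. By analysing webpages grabbed from the Internet, a large set of words and strings is built, from which new words are detected and then filtered by rules. Finally, the new words that exist in the grabbed webpages are extracted. The system built in this ...
[Keywords]computer applications; Chinese information processing; new words; automatic detection;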
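| Chinese Person Name Identification Combined with the Decision Tree Method
[Authors]王振华; 孔祥龙; 陆汝占; 刘绍明;
[Chinese abstract]Chinese person name identification is an important subproblem of proper name identification in natural language processing. This paper divides the identification process into three steps: extraction, classification, and disambiguation. Using probability information on the characters used in Chinese surnames and given names, potential Chinese names are extracted from the text together with their contextual lexical, syntactic, and semantic features. Deciding whether a potential name is a real name is treated as a binary classification problem, and a decision tree algorithm makes the preliminary decision. Finally, ambiguities in the preliminary results are resolved. Experimental results show that both the recall and the precision of the method can exceed 90%.
[Abstract]Chinese person name identification is a subfield of Named Entity Identification in natural language processing. In this paper the identification is divided into three stages: extraction, classification, and disambiguation. The candidate Chinese person names are extracted using statistical information. The morphological, syntactic, and semantic features of the context are also extracted to compose the classification samples. Judging a candidate is treated as a classification problem. We classify every candi...
[Keywords]artificial intelligence; natural language processing; Chinese person name identification; decision tree;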
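| Co-occurrence Word Extraction Based on the Lexical Attraction and Repulsion Model
[Authors]郭锋; 李绍滋; 周昌乐; 林颖; 李胜睿;
[Chinese abstract]Co-occurrence word extraction plays a very important role in information mining and natural language processing. Traditional extraction methods, however, are confined to a single statistic. Their results are very imprecise and need further manual curation. This paper proposes a co-occurrence word extraction algorithm based on the lexical attraction and repulsion model and improves it by combining several commonly used statistics. In an open test, the user interest rate of the extracted co-occurrence words is 60.87%. Applied to a Web-based co-occurrence word retrieval system, the algorithm achieves fairly good results in both speed and extraction precision.
[Abstract]Co-occurrence word retrieval is very important in information mining and natural language processing. But traditional co-occurrence word retrieval methods use only a single statistical method, so the result is very imprecise and needs a lot of manual collation. In this paper we present a co-occurrence word extraction algorithm based on the lexical attraction and repulsion model, and combine some common statistical methods with the algorithm to improve its effectiveness. In the open test, our system's Interesting p...
[Keywords]computer applications; Chinese information processing; co-occurrence words; lexical attraction and repulsion model; co-occurrence distance;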
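| Second-Order Distributional Semantic Acquisition Based on a Small Corpus and a Machine-Readable Dictionary
[Authors]郝秀兰; 杨尔弘;
[Chinese abstract]This paper proposes an unsupervised method for acquiring verb semantics based on a small corpus and a machine-readable dictionary (MRD). The method does not need a sense-tagged corpus. Instead it uses V+N collocations obtained from the corpus and knowledge obtained from the usage examples in the MRD definitions of polysemous words. Data sparseness is addressed in two ways. First, the word similarity measure is extended from direct co-occurrence to co-occurrence of co-occurring words, so that similarity is computed from co-occurrence clusters rather than from co-occurring words. Second, IS-A relations between nouns are acquired from the MRD definitions. With these techniques, two words can be considered similar even if they share no words. Experiments show that the method can acquire knowledge from a very small corpus and reaches a disambiguation accuracy of 85.7% without restricting word senses.
[Abstract]This paper presents a system for unsupervised verb semantic knowledge acquisition using a small corpus and a machine-readable dictionary (MRD). The system does not depend on a sense-tagged corpus, but learns a set of typical usages listed in the MRD usage examples for each of the senses of a polysemous verb in the MRD definitions and uses verb-object co-occurrences acquired from the corpus. This paper addresses the problem of data sparseness in two ways. First, extending word similarity measures from dire...
[Keywords]artificial intelligence; natural language processing; machine-readable dictionary; second-order distribution; semantics; knowledge acquisition;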
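| Associative Chinese Text Classification Combined with Class Frequency
[Authors]钱铁云; 王元珍; 冯小年;
[Chinese abstract]This paper proposes ARCTC, an algorithm that combines the class frequency of words with association-based Chinese text classification. The algorithm treats documents as transactions and keywords as items. In view of the characteristics of text transactions, it uses the class frequency of words to filter out words with little relevance to classification, and then applies an improved association rule mining algorithm to mine the correlations between items and classes. The mined rules are used to form sets of class feature words, which can be intersected with the word set of a document whose class label is unknown. The class with the largest intersection is the assigned class. Experiments show that the algorithm improves training and testing time while achieving good recall, precision, and F-measure.
[Abstract]In this paper, a new algorithm that integrates class frequency into association-rule-based document classification is introduced into Chinese text categorization. This algorithm views each document as a transaction and each term as an item. The class frequency of a term is used to filter out the words that are irrelevant to classification, and the association rule mining algorithm is used to mine the correlation between item and category. Class feature word sets are formed based on the rules, and unlab...
[Keywords]computer applications; Chinese information processing; association-based classification; Chinese text classification; class frequency of words; class feature word set;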
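The final step described in the ARCTC entry above (intersect each class's feature-word set with the words of an unlabeled document and pick the class with the largest intersection) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `classify`, the English feature words, and the tie-breaking rule (first class wins) are assumptions, and the rule-mining stage that would produce the feature sets is not shown.

```python
def classify(doc_terms, class_feature_sets):
    """Return the class whose feature-word set overlaps most with the document.

    doc_terms: iterable of terms from a document with unknown class label.
    class_feature_sets: dict mapping class label -> set of feature words
        (in ARCTC these sets would come from mined association rules;
        here they are hand-written, hypothetical examples).
    Ties are broken in favor of the class listed first (an assumption).
    """
    doc = set(doc_terms)
    best_label, best_overlap = None, -1
    for label, features in class_feature_sets.items():
        overlap = len(doc & features)  # size of the intersection
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


# Hypothetical feature-word sets for two classes.
class_feature_sets = {
    "sports": {"match", "team", "score", "coach"},
    "finance": {"stock", "market", "bank", "rate"},
}

print(classify(["the", "team", "won", "the", "match"], class_feature_sets))
```

Using plain set intersection keeps classification cheap, since no weights or similarity scores are computed at test time.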
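| A Knowledge Description Language for E-Commerce
[Authors]何坚; 覃征; 贾晓琳; 谢国彤;
[Chinese abstract]In response to the new trends toward automation, intelligence, and mobility in e-commerce, ontology is applied to model e-commerce knowledge and a layered framework for e-commerce knowledge description is proposed. Combining description logics and frame systems, an e-commerce Knowledge Description Language (KDL) based on XML and ontology technology is designed. The paper introduces the syntax of KDL, analyzes its semantic features from the perspective of first-order logic, and provides a method for mapping KDL descriptions to first-order logic expressions. Finally, examples demonstrate that KDL has a well-defined syntax, precise semantics, and strong logical reasoning ability.
[Abstract]In view of the increasingly intelligent and mobile character of E-commerce, an XML-based and ontology-supported E-commerce Knowledge Description Language (KDL) is first presented, which has a three-tier structure (Core KDL, Extended KDL and Complex KDL) and takes advantage of the strengths of ontology, XML, description logics, and frame-based systems. We then introduce the XML-based syntax of KDL and give methods for translating KDL into first-order logic. At last, the reasoning ability of KDL pro...
[Keywords]artificial intelligence; natural language processing; ontology; e-commerce; description logics; frame systems; first-order logic;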
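| Syntactic Structural Ambiguity Resolution Based on HNC Theory
[Authors]张克亮;
[Chinese abstract]Disambiguation is the core problem faced in natural language understanding and processing. Disambiguation based on word groups and phrases cannot guarantee correct results. Successful disambiguation rests on a correct understanding of the context. HNC theory represents concepts as primitives and makes them hierarchical, networked, and formalized. These strategies, together with the sentence category and sentence format systems built on them, offer the greatest possibility for resolving ambiguity in natural language. The general principle of HNC-based disambiguation is to work at the sentence level, make full use of the sentence category knowledge supplied by the sentence context, and combine macro-level with micro-level disambiguation. For the classic syntactically ambiguous structure V+NP1+的+NP2, this paper describes its threefold ambiguity and proposes three criteria and ten corollaries for resolving it.
[Abstract]Disambiguation has always been the focus of natural language understanding and processing. Successful disambiguation relies on the correct understanding of a given context. The HNC theory is characterized by its formalized representation of conceptual primitives, its arrangement of concepts in a hierarchical network, and its development of the sentence category (SC) and sentence format (SF) systems. All this provides the utmost possibility for resolving ambiguity in natural languages. The overall principle...
[Keywords]artificial intelligence; natural language processing; HNC theory; syntactic structural ambiguity; V+NP1+的+NP2; resolution strategy; resolution criteria;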
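| A Survey of Research on Software Architecture for Language Engineering
[Authors]冯冲; 陈肇雄; 黄河燕;
[Chinese abstract]Software architecture for language engineering has gradually become one of the main research areas of language engineering. It targets general-purpose natural language applications and provides them with architecture-level reference solutions. Its research covers architecture-related computational resources, language resources, methods, and applications. In a sense, it can be regarded as a domain-specific software architecture (DSSA) within the field of language engineering. This paper outlines the field's development and research significance, then explains and analyzes its basic concepts and the main current research progress, and looks ahead to further trends.
[Abstract]Providing reference architectures for general natural language applications, software architecture for language engineering has gradually become one of the main research fields of language engineering in the past several years. This paper gives a short review of this new area, introduces its primary concepts, and discusses some representative advances. Based on an analysis of the current work, we present some promising directions for future research.
[Keywords]artificial intelligence; natural language processing; survey; language engineering; software architecture;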