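Text Classification Based on the FIFA Algorithm
[Authors]朱靖波; 姚天顺;
[Abstract (translated)]This paper proposes a simple and effective text classification method that uses content-topic analysis based on the FIFA algorithm to classify text automatically. It describes in detail the basic process of automatic text classification and the FIFA algorithm, and finally presents and evaluates experimental results on automatic text classification.
[Abstract]We present a simple and effective approach to text classification. The approach applies the FIFA algorithm to text classification. The basic process of text classification and the FIFA algorithm are described in detail. Finally, experimental results and evaluations are discussed.
[Keywords]FIFA algorithm; topic identification; text classification; natural language processing;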
|
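A Quantitative Study of the Discriminating Power of Feature Words in the Vector Space Model
[Authors]游荣彦; 邓志才; 李传宏;
[Abstract (translated)]This paper introduces the concept of a word's frequency across text classes, defines the degree of distinction of a word in text classification, and discusses the properties of this measure. It proposes a new method for selecting feature words, defines the weights of feature words, and establishes a set of weighted-distance classification rules for the vector space model. Experimental results show that the method is effective and useful.
[Abstract]This paper presents the concept of the frequencies of a word distributed over the classes of texts, gives a definition of the degree of distinction of a word in text categorization, discusses the properties of the degree of distinction, puts forward a new approach to selecting feature words, defines the weights of all selected feature words, and finally establishes weighted-distance categorization rules for the VSM. The experimental results show that the method is effective and useful.
[Keywords]text classification; vector space model; Bayes posterior probability; weighted distance;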
|
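A Multi-Level Text Classification Method Based on the Vector Space Model
[Authors]刘少辉; 董明楷; 张海俊; 李蓉; 史忠植;
[Abstract (translated)]This paper studies and improves the term-weighting method of the classical vector space model (VSM) and, on that basis, proposes a multi-level text classification method based on the VSM. The classes are organized into a tree according to a given hierarchy, and all training documents in a class are merged into a single class document. When class models are extracted, comparisons are made only among class documents at the same level under the same node. When a document is classified automatically, the matching top-level class is found starting from the root node, and the search then recurses downward until the matching leaf subclass is found. Experiments and a deployed system show that the method achieves high precision and recall.
[Abstract]This paper studies and improves the classical approach to calculating term weights in the Vector Space Model. Furthermore, an approach to multi-hierarchy text classification based on the Vector Space Model is proposed. In this approach, all classes are organized as a tree according to given hierarchical relations, and all the training documents in a class are combined into a class document. To construct the class models, it is only necessary to compare among the class documents attached to the ...
[Keywords]text classification; vector space model; information gain; feature extraction;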
|
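A Translation Algorithm for Complex Long Sentences Based on Multi-Strategy Analysis
[Authors]黄河燕; 陈肇雄;
[Abstract (translated)]In the research and development of practical machine translation systems, translating complex long sentences is one of the main difficulties. This paper proposes a multilingual, general-purpose algorithm for translating complex long sentences based on multi-strategy analysis. Combining example-based pattern matching with rule-based analysis, the algorithm draws on multiple relevant linguistic features of the source sentence, including syntactic and semantic features, sentence length, punctuation, function words, and contextual conditions, to segment and simplify complex long sentences and to compose the generated translation. In addition, by designing the same knowledge representation for different languages, the algorithm is made general across translation systems for different languages.
[Abstract]The processing of complex long sentences is a difficult problem in the implementation of a practical machine translation system. In this paper, an approach to processing complex long sentences based on multiple strategies is proposed, in which rule-based analysis and case-based consistent matching are combined by using various linguistic features, including sentence length, punctuation, function words, contextual conditions and so on, which is used to segment the complex sentence into several simple senten...
[Keywords]machine translation; multi-strategy analysis; segmentation and simplification of long sentences;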
|
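Acquiring Sense Knowledge of Chinese Words Based on Rough Sets
[Authors]杨尔弘; 郝秀兰; 李盛;
[Abstract (translated)]Because word order in natural language is flexible, automatically acquiring natural language knowledge is difficult. Based on the attribute-value reduction method of rough set theory, combined with memory-based learning (MBL), this paper proposes a method for acquiring sense knowledge of polysemous Chinese verbs. The knowledge acquired with this method can be used for word sense disambiguation.
[Abstract]Owing to the flexibility of natural language form, the automatic acquisition of knowledge about natural language is difficult. In this paper, we present a method for acquiring knowledge about Chinese multi-sense verbs based on attribute value reduction in Rough Set theory and MBL. The knowledge acquired can be used in tasks such as word sense disambiguation.
[Keywords]rough sets; memory-based learning; polysemous Chinese verbs; knowledge acquisition; natural language processing;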
|
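A Preliminary Study of Machine Tagging of Emotional Meaning in Chinese
[Authors]应英; 周锋; 周昌乐;
[Abstract (translated)]This paper brings affective computing into the machine understanding of Chinese. Building on existing research on Chinese machine understanding, it uses a multiple relaxation iteration method to study the tagging of emotional meaning in Chinese. By exploiting contextual information, an experimental system was built that tags the emotion of words fairly accurately, providing a basis for the subsequent understanding of the emotional meaning of sentences and broadening the scope of research on Chinese machine understanding.
[Abstract]This paper introduces emotions and their computation to the machine understanding of Chinese. Based on existing research on the machine understanding of Chinese, the problem of tagging emotional meaning in Chinese was studied using the multiple relaxation iteration algorithm. We designed an experimental system and obtained accurate emotion tagging by using context information. It is the first step toward the machine understanding of the emotional meaning of Chinese sentences, thus the foundation ...
[Keywords]machine understanding of Chinese; emotion tagging; multiple relaxation iteration algorithm; learning and error-correction mechanism;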
|
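Chinese Phrase Boundary Identification Based on Neural Networks
[Authors]奚晨海; 孙茂松;
[Abstract (translated)]Identifying phrase boundaries is the basis of shallow parsing or chunk parsing and is important for processing real text. Supported by a Chinese treebank of 64426 words, this paper designs and implements a neural-network-based model for automatically identifying Chinese phrase boundaries. Preliminary experimental results show that the model's boundary precision is 93.24% (closed test) and 92.56% (open test).
[Abstract]Predicting the location of Chinese phrase boundaries is the basis of shallow parsing or chunk parsing. It is also very important for processing real texts. With the support of our Chinese treebank of 64426 words, this paper designs and implements a method for automatically predicting Chinese phrase boundary locations based on a neural network. The preliminary results show that the precision is 93.24% (closed test) and 92.56% (open test).
[Keywords]automatic identification of Chinese phrase boundaries; neural networks; Chinese information processing;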
|
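Formal Structure Analysis of Chinese Text and Its Indexing Algorithms
[Authors]单永明;
[Abstract (translated)]This paper discusses, from a formal perspective, the formal structure of Chinese text and the related basic concepts. It gives a method for delimiting and marking a text's title, subheadings, paragraphs, and their hierarchical structure, and introduces the concepts of normal and quasi-normal text. On this basis it discusses the indexing of formal text structure and gives two indexing algorithms. The methods and results presented are of direct practical significance for full-text indexing and structural analysis of Chinese text.
[Abstract]In this paper, we discuss Chinese text structure from a formal point of view. Formal descriptions of the heading, subheading and paragraph, as well as their structural relations in a text, are presented, and a systematic tagging method for Chinese text structure is proposed. We then introduce the concepts of normal text and quasi-normal text. On this basis, an indexing algorithm for formal Chinese text structure is discussed. The methods and results presented in the paper are of direct and...
[Keywords]Chinese information processing; text structure analysis; indexing tree; automatic indexing algorithm;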
|
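Using Common Sense about the Text Domain to Improve the Performance of a Support Vector Machine Text Classifier
[Authors]李辉; 史忠植; 许卓群;
[Abstract (translated)]This paper proposes a method for improving the generalization performance of Chinese text classifiers. In general, a text classifier can be obtained by training on a collection of texts with a machine learning method. This paper introduces common-sense knowledge about the semantic invariance of text, fuses it into the text classifier, and proposes a method for improving the classifier. Combined with a support vector machine, an improved text classifier is designed and implemented. Experiments on Chinese text classification show that using this common-sense knowledge effectively improves the classifier's generalization performance.
[Abstract]In this paper, a method to improve the generalization performance of a Chinese text classifier is put forward. Generally speaking, a text classifier is obtained by training on a text set with a machine learning method. A kind of common sense about text semantic invariance is introduced, and a method to improve the text classifier is put forward by fusing this common sense into it. In combination with a Support Vector Machine, we design and implement the improved text classifier. The experiment shows that the generaliz...
[Keywords]text classification; replacement of document sub-segments with the same semantics; artificial document samples; compatibility condition; support vector machine;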
|
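An Automatic Translation Quality Evaluation Method for Spoken Language
[Authors]程葳; 徐波;
[Abstract (translated)]Automatic evaluation of translation quality is very important for machine translation research. However, existing methods mainly target written-language translation and do not take the characteristics of spoken-language translation into account. This paper therefore proposes a new automatic evaluation method for spoken language. By defining information segments, annotating weights, and designing several matching strategies, it brings automatic evaluation results closer to human scores and also improves the adaptability of the evaluation process to different output translations. Experiments show that the algorithm is highly sensitive to changes in translation quality and can produce evaluations of output translations that are fairly close to manual judgments.
[Abstract]The automatic evaluation of output quality for machine translation systems is a difficult problem. Most research in this area is for written language, which is quite different from spoken language. Therefore, this paper provides a new automatic evaluation algorithm for speech translation systems. It defines blocks according to word position, tags information weights and designs several matching methods. All these bring the automatic evaluation results closer to human scores. And the automatic ev...
[Keywords]machine translation evaluation; spoken language translation; automatic evaluation;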
|