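Design of Automatic Code-Table Correction for Chinese Character Input Methods (汉字输入法码本自动更正设计研究)
[Authors]陆剑江; 钱培德;
[Summary]This paper studies the design and implementation of automatic code-table correction in Chinese character input methods. It proposes the concept and design of a code-table rule base, explains the working principle of the correction system, and discusses in detail the rule-base-driven automatic correction scheme and its workflow. Finally, from a practical standpoint, it proposes a strategy for integrating the correction system with the input method.
[Abstract]This paper mainly studies the design and realization of automatic verification of the code table of a Chinese input method. It presents the concept and design of the rule base and describes the working principle of the verification system. It then discusses in detail the design scheme and working procedure of automatic verification based on the rule base. Finally, it gives out the integration strategy of the automatic verify system and the Chinese input method from the point of the application in realizat...
[Keywords]input method; code table; rule base; automatic correction;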
|
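A Survey of Automatic Question Answering (自动问答综述)
[Authors]郑实福; 刘挺; 秦兵; 李生;
[Summary]Automatic question answering is a very active research direction in natural language processing that draws on a wide range of natural language processing techniques. This paper reviews the current state of question answering and the techniques commonly used in question answering systems. Such a system generally has three main components: question analysis, information retrieval, and answer extraction. The paper describes the main function of each component and the methods commonly used in it, and closes with a discussion of how question answering systems are evaluated.
[Abstract]Question Answering is a hot research field in Natural Language Processing, which draws on many kinds of NLP technology. This paper introduces the current research status and the methods that are often used in Question Answering. In general, a Question Answering system is made up of three parts: Question Analysis, Information Retrieval and Answer Extraction. This paper describes in detail the main functions of these three parts and the common approaches used in them. At last, this paper introduces the evaluat...
[Keywords]automatic question answering; question classification; information retrieval; answer extraction;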
|
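"CAU" Words and Their Knowledge Graph Analysis (“CAU”词及其知识图分析)
[Authors]刘小冬; 张蕾;
[Summary]Expert systems are an important branch of artificial intelligence research. An expert system consists mainly of two parts: a knowledge base and an inference engine. The knowledge in the knowledge base consists mainly of "IF-THEN" rules. The knowledge graph is a new method of knowledge representation. In a knowledge graph, sentences with an "IF-THEN" structure are represented by a causal operator or a causal relation (CAU relation). This paper selects a number of representative Chinese "CAU" operators with causal meaning, analyzes them on the basis of knowledge graph theory, and classifies them, with the aim of preparing for the construction of knowledge bases in expert systems.
[Abstract]Expert systems form one of the most important research areas in Artificial Intelligence. The main parts of an expert system are the knowledge base and the inference engine. In the knowledge base, the main knowledge is expressed by "IF THEN" statements. In knowledge graphs, a new form of knowledge representation, the "IF THEN" statements are tied up with causal operators (CAU relations). In this paper, we picked out some Chinese operators with "CAU" meaning, and investigated these operators. The goal is to build knowl...
[Keywords]expert system; knowledge graph; knowledge base; causal words;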
|
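Chinese Web Page Classification Based on Statistical Word Segmentation (基于统计分词的中文网页分类)
[Authors]黄科; 马少平;
[Summary]This paper applies a statistical bigram word segmentation method to Chinese web page classification. Without any word list given in advance, it builds a list of two-character words from statistics, segments the text in web pages with that list, and then classifies the pages. Texts of different types and sources on the Internet differ considerably in vocabulary style and type, new words appear constantly, and large amounts of same-type text are easy to obtain as training corpora. These factors all favor statistical segmentation. Experiments test the effect of using a statistically built two-character word list for Chinese web page classification. They show that, when the statistical threshold is chosen appropriately, segmenting with the constructed word list before classification effectively improves classification accuracy. The paper also analyzes how single characters and segmented words affect text classification differently, and why.
[Abstract]Word segmentation is an important step in Chinese natural language processing. This paper explores the problem of classifying Chinese web pages based on statistical word segmentation. We first construct a Chinese word list of binary words automatically from training Chinese web pages. Then the texts in testing Chinese web pages are segmented with the word list. Web pages are classified based on the segmentation results. Experiments show that statistical word segmentation can efficiently improve classification pr...
[Keywords]text classification; statistical word segmentation; machine learning; computer networks;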
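[Illustration]The word-list construction described in this record can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes the "statistic" is a raw count of adjacent character pairs compared against a threshold, and that segmentation is a greedy left-to-right match. The function names `build_bigram_wordlist` and `segment`, the parameter `min_count`, and the toy corpus are all hypothetical.

```python
from collections import Counter


def build_bigram_wordlist(texts, min_count):
    """Collect adjacent two-character strings seen at least min_count times.

    Assumption: raw frequency is the selection statistic; the paper may use
    a different measure.
    """
    counts = Counter()
    for text in texts:
        for i in range(len(text) - 1):
            counts[text[i:i + 2]] += 1
    return {bigram for bigram, n in counts.items() if n >= min_count}


def segment(text, wordlist):
    """Greedy left-to-right segmentation.

    Emit a two-character word when it is in the word list; otherwise emit a
    single character and move on by one position.
    """
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in wordlist:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# Toy training corpus (hypothetical); real input would be web page text.
corpus = ["网页分类", "网页检索", "文本分类"]
wordlist = build_bigram_wordlist(corpus, min_count=2)
print(segment("网页文本分类", wordlist))
```

In this toy run only 网页 and 分类 reach the threshold, so the remaining characters fall back to single-character tokens. Raising or lowering `min_count` changes which pairs count as words, which is the threshold sensitivity the summary mentions.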
|
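A New Statistics-Based Method for Automatic Text Classification (一种新的基于统计的自动文本分类方法)
[Authors]刘斌; 黄铁军; 程军; 高文;
[Summary]Automatic text classification means having a computer determine, under a given classification scheme, the categories associated with a text according to its content. To improve classification performance, this paper proposes a multi-level feature extraction method for Chinese text and a kernel-based distance-weighted KNN algorithm. The multi-level method extracts statistical features of a document at three levels (Chinese characters, a common word list, and a specialized word list) and so better reflects the document's statistical distribution. The kernel-based distance-weighted KNN algorithm addresses multi-peak sample distributions, overlapping class boundaries, and the classifier's need for precise classification decisions. In practice, the Internet and text repositories provide large amounts of coarsely classified training text, but sample quality is generally poor; the paper addresses this with a sample importance analysis technique. An experimental system demonstrates the effectiveness of the new method.
[Abstract]Automatic text classification is defined as the task of assigning predefined category labels to documents. To improve the classification performance, this article puts forward the multi-level feature selection method and the kernel-based distance-weighted KNN algorithm. We extract the statistical text features on three different levels, namely Chinese characters, the common wordlist and the professional wordlist, which can represent more statistical characteristics of the document set. The kernel-based weighted KNN algorith...
[Keywords]automatic text classification; multi-level feature extraction; kernel-based distance-weighted KNN algorithm; sample importance analysis;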
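[Illustration]The general idea of distance-weighted KNN voting with a kernel can be sketched as below. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the record does not say which kernel or distance is used, so a Gaussian kernel over Euclidean distance is assumed here, and the names `gaussian_kernel`, `kernel_weighted_knn`, `bandwidth`, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def gaussian_kernel(distance, bandwidth):
    """Map a distance to a weight in (0, 1]; closer neighbours weigh more."""
    return math.exp(-(distance ** 2) / (2 * bandwidth ** 2))


def kernel_weighted_knn(train, query, k, bandwidth=1.0):
    """Classify query by summing kernel weights of its k nearest neighbours.

    train is a list of (feature_vector, label) pairs. The label with the
    largest total weight wins, rather than the label with the most votes.
    """
    neighbours = sorted(
        ((math.dist(vector, query), label) for vector, label in train),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        votes[label] += gaussian_kernel(distance, bandwidth)
    return max(votes, key=votes.get)


# Toy feature vectors (hypothetical): two nearby "sports" samples and
# three distant "finance" samples.
train = [
    ((0.0, 0.0), "sports"),
    ((0.1, 0.0), "sports"),
    ((1.0, 1.0), "finance"),
    ((1.1, 1.0), "finance"),
    ((0.9, 1.0), "finance"),
]
print(kernel_weighted_knn(train, (0.2, 0.1), k=5))
```

With `k=5` an unweighted majority vote would favour "finance" (three neighbours against two), while the kernel weighting favours the two much closer "sports" samples. That difference is what makes distance weighting useful where class boundaries overlap.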
|
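Basic Methods and Implementation Techniques for Anaphora Resolution (指代消解的基本方法和实现技术)
[Authors]王厚峰;
[Summary]Anaphora is a common phenomenon in natural language and occurs widely in discourse and dialogue. As applications of discourse processing become more widespread, anaphora resolution has taken on unprecedented importance and has become a popular research problem in natural language processing. Addressing anaphora and its resolution, this paper explains the basic concepts and analyzes typical anaphoric phenomena in language and the basic linguistic knowledge that resolution requires. It also introduces several representative computational models of anaphora resolution and a number of implementation techniques adopted over the past 10 years.
[Abstract]Anaphora occurs throughout discourse and dialogue. Its high frequency makes anaphora resolution a key problem in discourse processing, one that is attracting the attention of a growing number of researchers. In this article, some issues of anaphora resolution are discussed, such as basic concepts, special referring phenomena, and the knowledge necessary for anaphora resolution. Some typical computational models of anaphora resolution and implementation techniques are given as well.
[Keywords]anaphora resolution; antecedent; salience;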
|
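A Context-Based Query Expansion Method for Chinese Information Retrieval (一种基于上下文的中文信息检索查询扩展)
[Authors]贺宏朝; 何丕廉; 高剑峰; 黄昌宁;
[Summary]In research and practice on Chinese information retrieval, the words used in a query may not match the words used in the document collection, so some relevant documents fail to be retrieved; this is a key problem affecting retrieval effectiveness. Query expansion can resolve this term mismatch to some extent. However, experiments show that ordinary simple query expansion does not reliably improve the effectiveness of Chinese information retrieval. This paper proposes and implements a context-based query expansion method that selects expansion terms according to the context of the query, making it a relatively "intelligent" form of query expansion. Experiments on the TREC-9 Chinese information retrieval test collection show that, relative to ordinary simple query expansion, the context-based method achieves a statistically significant improvement in retrieval effectiveness.
[Abstract]Term mismatch between queries and documents is a fundamental problem in Chinese Information Retrieval (IR), which affects the effectiveness of retrieval results. Query expansion in IR can deal with this kind of problem to some degree. However, experiments show that the common query expansion in IR cannot get steady retrieval results. In this paper, we propose and realize query expansion based on the context, which can choose the expansion words according to the context of the query. Experiment results with TREC 9 s...
[Keywords]query expansion; context-based; Chinese information retrieval;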
|
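Automatic Identification of Chinese Base Phrases (汉语基本短语的自动识别)
[Authors]张昱琪; 周强;
[Summary]This paper applies instance-based MBL (Memory-Based Learning) to identify the boundaries and categories of 9 of the more common types of Chinese base phrases, and uses the internal structure of phrases together with lexical information to resolve the boundary ambiguities and phrase-type ambiguities that arise in prediction. The experiments also compare results with and without lexical information in the feature vector. The results are fairly satisfactory: for these 9 base phrase types, precision reaches 95.2% and recall reaches 93.7%.
[Abstract]This paper proposes a hybrid model to identify Chinese base phrases. In the first step, we use a memory-based learning (MBL) approach to chunk nine types of Chinese base phrases and compare the results obtained from different feature vectors. In the second series of experiments we used grammar rules that represent the inner structures of base phrases, together with lexical information, to correct the incorrect predictions from the first step. The experiments reported in this paper show competitive results: the prec...
[Keywords]partial parsing; base phrase; instance-based learning; phrase structure; lexical disambiguation;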
|
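A Recognition Algorithm for Similar Chinese Characters (一种相似汉字的识别算法)
[Authors]蔺志青; 郭军;
[Summary]This paper proposes a general algorithm for recognizing similar Chinese characters based on the part-space method. The algorithm does not require similar-character groups to be determined in advance, nor manual selection of a part space for each group; it automatically decides whether a character to be recognized needs to enter the similar-character recognition process and how to select the part space. Experimental results demonstrate the effectiveness of the algorithm.
[Abstract]An algorithm for recognizing similar Chinese characters based on part space is proposed in this paper. With this algorithm, the sets of similar characters and the part space for each set need not be decided in advance; whether an input character should enter the similar-character recognition procedure, and how to select the part space, are decided automatically. The validity of the algorithm was verified by our experiments.
[Keywords]similar Chinese characters; character recognition; part space;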
|
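A Fast Error-Tolerant Algorithm for Syllable-to-Character Conversion in Speech Recognition (语音识别音字转换中的快速容错算法)
[Authors]李明琴; 王作英; 陆大■;
[Summary]This paper studies error-tolerant algorithms for syllable-to-character conversion in Chinese continuous speech recognition, which correct the substitution, insertion, and deletion errors of acoustic recognition. To address the computational cost of error tolerance, it proposes two fast algorithms: one targets errors that occur in isolation, and the other targets keywords. The fast algorithms effectively restrict the search space of the error-tolerant algorithm and improve computational efficiency. Applied in a telephone dialogue system, the fast error-tolerant algorithm raises character accuracy from 78.97% to 86.68% and keyword detection accuracy from 80.56% to 88.52%, and its running time meets real-time requirements.
[Abstract]An error-tolerant algorithm in the decoding module of Mandarin continuous speech recognition is examined to correct substitution, insertion and deletion errors in acoustic recognition. In order to reduce the computational complexity of the error-tolerant algorithm, two fast algorithms are proposed. In both of these algorithms the search space is effectively reduced, and computation efficiency is greatly improved. In the first algorithm only single isolated errors are corrected, and in the other algorithm only errors i...
[Keywords]error-tolerant algorithm; robust speech recognition; dialogue system; keyword detection;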
|