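SketchEditor: A Chinese Word Processing System Based on Pen Interaction
[Authors]韩勇; 须德;
[Summary]Pen and paper are among the most common means of communication in daily life. Traditional pen-and-paper tools nevertheless have many shortcomings: content recorded on paper is hard to modify or reprocess, and large amounts of information lack effective maintenance and retrieval. These weaknesses are precisely the strengths of computers in information processing. Research on pen-based interfaces aims to make such traditional work computable while preserving the naturalness of the traditional way of working. In this work, we describe interaction tasks such as text input and editing in terms of the input events of a pen device. Combining an improved Rubin recognition algorithm with rule-based recognition, we simulate the everyday paper-and-pen way of working and design and implement a new user-centered pen-based text editor.
[Abstract]Pen and paper are one of the most important means of communication in our daily life. Traditional pen and paper have their own faults: content on paper is difficult to modify or reprocess, and information kept on paper lacks effective means of maintenance and search, whereas computers are superior in those respects. Research on pen user interfaces aims to make this traditional work computable through the study of hardware and software in related fields. In our research work, we describe interaction tasks ...
[Keywords]computer applications; Chinese information processing; human-computer interaction; pen-based interface; gesture recognition; POST-WIMP;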
|
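Statistical Machine Translation Based on Multi-Layer Filtering
[Authors]周玉; 宗成庆; 徐波;
[Summary]This paper proposes an algorithm based on multi-layer filtering that automatically extracts and aligns bilingual chunks from aligned Chinese-English sentences. Chunks with different characteristics are handled at different layers. Unlike traditional algorithms, it requires no tagging, parsing, or lexical analysis of the sentences, and does not even require word segmentation of the Chinese sentences. Preliminary experiments show that the algorithm performs well: the accuracy of chunk extraction reaches F = 0.70 and the accuracy of chunk alignment reaches F = 0.80. Moreover, when the aligned bilingual chunks obtained by the algorithm are used in a statistical machine translation system and compared with a word-based system, the chunk-based system clearly improves translation quality, by almost 10%.
[Abstract]In this paper we propose a new algorithm called multi-layer filtering to extract aligned bilingual chunks automatically from Chinese-English parallel texts. Various layers are used to extract bilingual chunks according to the different features possessed by different chunks in the bilingual corpus. Our chunking and alignment algorithm does not rely on information from tagging, parsing or syntactic analysis as most conventional algorithms do. The preliminary experimental results show that our algorith...
[Keywords]artificial intelligence; machine translation; multi-layer filtering; bilingual chunk identification and alignment;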
|
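Fuzzy Matching in Machine Translation Evaluation
[Authors]刘洋; 刘群; 林守勋;
[Summary]At present, most automatic machine translation evaluation methods do not consider that unmatched words may contain neglected information. This paper proposes a method for automatically searching for fuzzy-matched word pairs between the reference translation and the candidate translation, and gives a method for computing their similarity. The whole process of fuzzy matching and similarity computation is illustrated with an example. Experiments show that our method is good at finding neglected, meaningful word pairs. More importantly, introducing fuzzy matching significantly improves the performance of BLEU. Fuzzy matching can also be used to improve other automatic machine translation evaluation methods.
[Abstract]Most current automatic metrics for machine translation evaluation do not consider that among unmatched words there may be neglected information. In this paper, we describe a strategy to find fuzzy-matched word pairs between reference and candidate translations automatically and propose an approach to compute the similarity. The whole process of finding fuzzy-matched word pairs and computing their similarity is demonstrated in detail. Experiments show that our method is capable of finding neglected meaningful...
[Keywords]artificial intelligence; machine translation; machine translation evaluation; fuzzy matching;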
|
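Chinese Statistical Parsing Incorporating Rich Linguistic Knowledge
[Authors]熊德意; 刘群; 林守勋;
[Summary]Knowledge acquisition has long been a bottleneck in natural language processing, and treebank-based statistical parsing is no exception. The linguistic knowledge latent in a treebank is very rich, but it cannot be obtained directly; specific strategies are often needed to incorporate it into a model. Our Chinese statistical parsing model incorporates this latent linguistic knowledge in three ways: 1) re-annotating the non-recursive noun phrases and non-recursive verb phrases in the treebank; 2) designing a new head mapping table; 3) introducing a context configuration framework to describe binary dependency structures more specifically. With these three kinds of latent linguistic knowledge incorporated, the model's F1 score improves by 2.37% and its exact-match rate improves by 5.36%.
[Abstract]Knowledge acquisition is always regarded as a bottleneck in many NLP tasks, such as machine translation and information extraction. Treebank-based statistical parsing is no exception. The latent linguistic knowledge in a treebank is very rich but cannot be acquired directly. In our model, the following three ways are used to incorporate such rich linguistic features for Chinese statistical parsing. First of all, non-recursive noun and verb phrases are annotated in the Penn Chinese Treebank because...
[Keywords]artificial intelligence; natural language processing; statistical parsing; non-recursive phrases; head mapping table; context configuration framework;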
|
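A Reliability-Based Method for Person Name Recognition
[Authors]罗智勇; 宋柔;
[Summary]Proper noun recognition is an important factor affecting the accuracy of automatic Chinese word segmentation, and also one of its difficulties. Taking person name recognition as an example, this paper analyzes the drawbacks in probability estimation of the currently popular proper noun recognition methods based on corpora and statistical language models. Combining rules with statistics, it proposes a reliability-based person name recognition method and gives an incremental model training method that overcomes the limit on the size of manually annotated corpora. Tests on People's Daily text from January 1998 and December 2000 (about 3.79 million characters in total) show that the reliability-based method achieves some improvement in recognition over the traditional probability estimation method.
[Abstract]Recognition of proper nouns is one of the most important parts of a word segmentation system for modern Chinese. This paper first analyzes the shortcomings of traditional proper noun recognition methods in statistical language models and other corpus-based models. Second, we put forward a recognition strategy for person names based on reliability. We also train the model with a bootstrapping method, free from the limit of manually tagged corpora. A large-scale test on a real corpus shows that this method successfully...
[Keywords]computer applications; Chinese information processing; automatic word segmentation; person name recognition; statistical methods; reliability;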
|
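Applying the Co-training Machine Learning Method to Chinese Chunk Recognition
[Authors]刘世岳; 李珩; 张俐; 姚天顺;
[Summary]Chinese chunk recognition is implemented with co-training, a semi-supervised machine learning method. The paper first gives the definition of a Chinese chunk and a formal definition of the co-training algorithm. It then proposes a consistency-based co-training selection method that combines an enhanced hidden Markov model (Transductive HMM) and a transformation-rule-based classifier (fnTBL) into one classification system, and compares it with self-training. Chinese chunk recognition is carried out on a small-scale Chinese treebank corpus and a large-scale unlabeled Chinese corpus. The results improve on those obtained with the small-scale treebank corpus alone: the F-scores reach 85.34% and 83.41%, improvements of 2.13% and 7.21% respectively.
[Abstract]In this paper we discuss the application of the semi-supervised machine learning method co-training to Chinese text chunking. First, we give the definition of a Chinese chunk, then the formal definition of the co-training algorithm. We propose an example selection method based on consistency, using two classifiers, Transductive HMM and fnTBL, combined into one classification system to perform the Chinese text chunking task with a small-scale labeled Chinese treebank and a large-scale unlabeled Chinese corpus. The ...
[Keywords]computer applications; Chinese information processing; co-training algorithm; Chinese chunk; classifier;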
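The consistency-based selection step mentioned in this entry can be illustrated roughly as follows. This is only a sketch under a stated assumption: the abstract does not spell out the selection criterion, so the code assumes that an unlabeled example is kept when the two classifiers assign it the same label. The function `select_agreed` and the two toy classifiers are hypothetical stand-ins, not Transductive HMM or fnTBL.

```python
def select_agreed(unlabeled, predict_a, predict_b):
    """Return (example, label) pairs on which both classifiers agree.

    Assumption: "consistency" is read here as label agreement between
    the two classifiers; the source abstract does not define it further.
    """
    selected = []
    for x in unlabeled:
        label_a = predict_a(x)
        label_b = predict_b(x)
        if label_a == label_b:
            selected.append((x, label_a))
    return selected


# Toy stand-in classifiers (hypothetical, for illustration only).
def predict_a(word):
    return "NP" if word[0].isupper() else "VP"


def predict_b(word):
    return "NP" if word[0].isupper() or word.endswith("s") else "VP"


unlabeled = ["Beijing", "runs", "eat"]
selected = select_agreed(unlabeled, predict_a, predict_b)
```

In a co-training loop, the agreed examples would be added to the labeled pool before both classifiers are retrained; examples on which the classifiers disagree (here `"runs"`) stay unlabeled.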
|
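Research on Identifying Prepositional Phrase Boundaries in Modern Chinese
[Authors]王立霞; 孙宏林;
[Summary]Right-boundary ambiguity of prepositional structures is one of the most prominent structural ambiguities in Chinese, and it makes syntactic analysis of Chinese very difficult. The goal of this paper is to identify prepositional phrase boundaries automatically without introducing complex syntactic analysis, in the hope that this step, as part of parser preprocessing, can assist parsing. Experiments on "在", the most commonly used Chinese preposition, reach accuracies of 97% in the closed test and 93% in the open test. Compared with earlier work of the same kind, accuracy is considerably higher, and some previously unresolved problems are solved.
[Abstract]The right boundary ambiguity of prepositional phrases (PP) is one of the most prominent phenomena in Chinese structural ambiguities. The goal of this paper is to recognize the boundaries of prepositional phrases automatically without introducing parsing techniques. A statistical algorithm is applied for the recognition of prepositional phrases in which each word after a preposition in a sentence is viewed as a candidate for the right boundary of a PP and the likelihood of each position being the right boundary of...
[Keywords]computer applications; Chinese information processing; right boundary; probabilistic information; deleted interpolation;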
|
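A Chinese Automatic Summarization System Based on HowNet Concept Acquisition
[Authors]王萌; 何婷婷; 姬东鸿; 王晓荣;
[Summary]This paper proposes a method for automatic summarization of Chinese text. Unlike the usual methods based on word-frequency statistics, it uses concepts (word senses) rather than words as features. Concept statistics replace the traditional word-form frequency statistics: a concept vector space model is built, sentence importance is computed, sentence redundancy is computed, and summary sentences are extracted. The summaries are evaluated in two ways: an intrinsic evaluation that compares machine summaries with expert summaries, and an extrinsic evaluation that applies classification to the different summarization methods and compares the classification accuracy.
[Abstract]The paper presents an approach for Chinese text summarization. Unlike conventional statistical methods, we use concepts (word senses) as features instead of words. Sentence weights are computed from paragraph weights and a thematic conceptual vector space model; once all sentence weights have been computed, the sentences are ordered by weight. Sentences with high weight are selected as summary sentences. In order to evaluate the summarization system, we use two d...
[Keywords]computer applications; Chinese information processing; HowNet; automatic summarization; concept vector space model;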
|
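Research on a Chinese Information Retrieval System Based on a Topic Language Model
[Authors]张俊林; 孙乐; 孙玉芳;
[Summary]Accurate estimation of the document language model is very important for improving the performance of language-model-based retrieval systems. In this paper we propose an information retrieval system based on a topic language model. We first design an "improved two-stage K-Means clustering algorithm" to cluster the document collection; introducing the Aspect Model and combining it with the clustering results yields a topic-based language model. This new language model captures in some depth how words are distributed under different topics and how the topics contained in a document are distributed. Linearly interpolating the topic language model with the document's own language model gives a more accurate estimate of the document language model. Experiments show that the proposed method significantly improves retrieval performance: compared with the Jelinek-Mercer model, the topic language model retrieval system improves average precision by about 16.17% and recall by about 9.64%.
[Abstract]Accurate estimation of the document language model is important to the performance of a language-model-based IR system. In this paper we propose a topic-based approach to language modeling for ad-hoc Information Retrieval. An improved two-stage k-means clustering method is designed to deal with the document collection and the clustered results are regarded as the topic information contained in the collection. By combining the aspect model and text clustering technology, we can derive a more accurate docu...
[Keywords]artificial intelligence; natural language processing; topic language model; information retrieval;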
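The linear interpolation of a topic language model with a document language model, as described in this entry, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `interpolate`, the weight `lam`, and the toy distributions are assumptions, and the abstract does not say how the interpolation weight is chosen.

```python
def interpolate(doc_lm, topic_lm, lam=0.7):
    """Mix two unigram models: p(w) = lam * p_doc(w) + (1 - lam) * p_topic(w).

    doc_lm and topic_lm map words to probabilities; lam is a hypothetical
    interpolation weight in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(doc_lm) | set(topic_lm)
    return {
        w: lam * doc_lm.get(w, 0.0) + (1.0 - lam) * topic_lm.get(w, 0.0)
        for w in vocab
    }


# Toy distributions, made up for illustration.
doc_lm = {"retrieval": 0.5, "model": 0.5}
topic_lm = {"retrieval": 0.2, "model": 0.3, "cluster": 0.5}
mixed = interpolate(doc_lm, topic_lm, lam=0.5)
```

A word that never occurs in the document (`"cluster"` here) still receives probability mass from the topic model, which is the smoothing effect that makes the interpolated estimate more robust than the document model alone.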
|
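Research on Two-Way Finite Automata for Chinese Characters
[Authors]蔡增玉; 谷文祥;
[Summary]Computer input of Chinese characters is one of the key problems in Chinese information processing, and mathematical models of Chinese character input are important for research on it. This paper studies the mathematical model of Chinese character input from the viewpoint of automata theory, introduces control operations into the input model, and gives models of the deterministic and the nondeterministic two-way finite automaton for Chinese characters. Compared with earlier mathematical models, the new models can describe the control operations of Chinese character input and are more expressive; they generalize earlier mathematical models of Chinese character keyboard input.
[Abstract]Chinese character input plays a key role in Chinese information processing, and the input model is very important for the study of Chinese character input. This paper studies the construction of Chinese character input models based on automata theory, handles control operations in the input model, and introduces the concepts of the two-way deterministic Chinese finite automaton and the two-way nondeterministic Chinese finite automaton. Compared to the previous models, our models can handle...
[Keywords]computer applications; Chinese information processing; Chinese character input; mathematical model; two-way finite automaton;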
|