[ Wednesday, March 10, 2010 ]
中国中文信息学会
Chinese Information Processing Society of China
Journal of Chinese Information Processing (中文信息学报)
Browse by year and issue (latest data: 2006, No. 5)
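Research Report on the 2005 Statistical Machine Translation Workshop
[Authors] 徐波; 史晓东; 刘群; 宗成庆; 庞薇; 陈振标; 杨振东; 魏玮; 杜金华; 陈毅东; 刘洋; 熊德意; 侯宏旭; 何中军

[Abstract (translated from Chinese)] From July 13 to 15, 2005, the Institute of Automation and the Institute of Computing Technology of the Chinese Academy of Sciences, together with the Department of Computer Science of Xiamen University, jointly held China's first statistical machine translation workshop. This paper describes the systems tested by the participating institutions and their experimental results, and analyzes them. The results show that although research on statistical machine translation started late in China, it has progressed quickly: the evaluated systems reached fairly good translation quality within a short time, and while they still trail the rule-based systems that took part in earlier 863 evaluations, the gap is no longer large. Judging from the current state and trends of international research, statistical machine translation still has considerable room to grow as data resources expand and computing performance improves. In the next few years, integrating syntactic and semantic information into the mainstream phrase-based statistical translation approach is bound to become the trend in machine translation.

[Abstract] The Institute of Automation and the Institute of Computing Technology of the Chinese Academy of Sciences, together with the Department of Computer Science of Xiamen University, jointly held the first Statistical Machine Translation Workshop in China from July 13 to 15, 2005. This paper describes the systems tested by the participating institutions and analyzes the results of their experiments. The test results show that although research on statistical machine translation started late in China, it is developing rapidly. The tested systems got qu...
[Keywords] artificial intelligence; machine translation; statistical machine translation; phrase-based translation model; machine translation evaluation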



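A Chinese Automatic Summarization System Based on Rules and Statistics
[Authors] 傅间莲; 陈群秀

[Abstract (translated from Chinese)] Automatic summarization is an important topic in natural language processing. Building on traditional methods, this paper proposes an approach to Chinese automatic summarization. For discourse structure analysis, we propose a topic segmentation algorithm based on the similarity of consecutive paragraphs, which makes the generated summary more comprehensive in content and more balanced in structure. Several rules are then applied to post-process the draft summary so that the final summary is more readable. Finally, a new summary evaluation method (F-new-measure) is proposed and used to test the system. The tests show that the evaluation scores remain fairly stable across different summary compression ratios.

[Abstract] As automatic summarization is an important research topic in natural language processing, the paper presents an approach for Chinese text summarization on the basis of traditional methods. For text structure analysis, an algorithm is proposed for multi-topic text partitioning based on sequential paragraph similarity, which gives the abstract of a multi-topic article more general content and a more balanced structure. Furthermore, a series of rules are combined to enhance the readability of the outpu...
[Keywords] computer applications; Chinese information processing; automatic summarization; vector space model; topic segmentation; readability; evaluation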
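
The abstract above names topic segmentation by consecutive-paragraph similarity but gives no details. The following is only a minimal sketch of that general idea, not the authors' algorithm: it assumes paragraphs are already tokenized (whitespace-separated here; Chinese text would need word segmentation first), uses raw term counts as the vector space model, and starts a new topic whenever the cosine similarity of adjacent paragraphs falls below a hypothetical `threshold`.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_topics(paragraphs, threshold=0.2):
    """Group paragraph indices into topics.

    A new topic starts whenever two adjacent paragraphs are less
    similar than `threshold` (an assumed, illustrative value).
    """
    vectors = [Counter(p.split()) for p in paragraphs]
    topics = [[0]] if paragraphs else []
    for i in range(1, len(paragraphs)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            topics.append([i])      # similarity dropped: topic boundary
        else:
            topics[-1].append(i)    # still the same topic
    return topics


demo = [
    "trie array lookup",
    "trie array build",
    "summary sentence score",
    "summary sentence rank",
]
print(split_topics(demo))  # [[0, 1], [2, 3]]
```

A summarizer could then draw sentences from each topic group in turn, which is one way to obtain the balanced coverage the abstract describes.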



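A New Method for Rapidly Acquiring Domain-Specific New Words
[Authors] 刘华

[Abstract (translated from Chinese)] This paper proposes a new method for new-word identification. The method directly extracts the manually indexed keywords from categorized web pages and stores them in per-category word lists according to the category of the page's column, thereby completing new-word identification and clustering quickly. The method is simple and fast. Using it, we extracted 229,237 entries from 600 million characters of web pages in 15 categories, of which 175,187 are new words, a new-word rate of 76.42%; the games category has the highest new-word rate and the current affairs/society category the lowest. The new words are mostly named entities, with fixed structure, complete meaning, and strong specificity. They help resolve segmentation ambiguity and out-of-vocabulary words, and can improve text representation tasks such as classification and keyword indexing.

[Abstract] The paper puts forward a new method for detecting domain-specific new words, which directly extracts keywords labeled by specialists in web pages and stores them in classified word lists according to the column of the source web page. The simple approach can detect and cluster new words quickly. Using the approach, from web pages totaling 600 million characters and covering 15 domains, we extracted 229,237 words, including 175,187 new words, a new-word ratio of 76.42%. New words are mostly named entities, which have steady structure and integrated meaning...
[Keywords] artificial intelligence; natural language processing; new words; identification; clustering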



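Optimization of the Double-Array Trie Algorithm and Its Application
[Authors] 王思力; 张华平; 王斌

[Abstract (translated from Chinese)] This paper proposes an optimization strategy for the Double-Array Trie algorithm: while constructing the arrays from the trie, nodes with more child branches are processed first. The strategy keeps the algorithm's lookup efficiency unchanged while further reducing data sparseness and improving space utilization. Based on the optimized algorithm we implemented a dictionary management program and compared it experimentally with dictionaries that use other indexing mechanisms. The results show that the dictionary built on the optimized Double-Array Trie not only outperforms dictionaries using other indexing mechanisms in lookup speed but also occupies relatively little storage space.

[Abstract] This paper proposes an improved strategy for the Double-Array Trie algorithm: the node with the most child nodes is processed first when constructing the array. This strategy reduces data sparseness while preserving search efficiency. We implement a program for lexicon management based on the improved Double-Array Trie and compare it with other index mechanisms. The results clearly show that the improved Double-Array Trie algorithm has a much higher search speed and needs a smaller space for...
[Keywords] computer applications; Chinese information processing; double array; trie; dictionary; word segmentation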
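
To make the strategy in the abstract above concrete, here is a small sketch of a double-array trie whose builder places pending nodes with more children first. It is an illustration under assumptions, not the authors' implementation: the priority queue, the linear search for a free `base` value, the `"\0"` end-of-word marker, and the names `build_double_array` and `contains` are all choices made for this example.

```python
import heapq
from itertools import count

END = "\0"  # assumed end-of-word marker; words must not contain it


def build_double_array(words):
    """Build (base, check, codes) with most-children-first placement."""
    root = {}
    for w in words:                      # 1. build an ordinary dict trie
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = {}
    alphabet = sorted({c for w in words for c in w} | {END})
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}  # codes start at 1

    base, check = [0], [0]               # state 0 is the root; -1 means free

    def ensure(n):
        while len(check) <= n:
            base.append(0)
            check.append(-1)

    tie = count()                        # tie-breaker so dicts are never compared
    heap = [(-len(root), next(tie), 0, root)]
    while heap:                          # 2. pop the node with the most children
        _, _, state, node = heapq.heappop(heap)
        if not node:
            continue                     # leaf: nothing to place
        cs = sorted(codes[ch] for ch in node)
        b = 0
        while True:                      # first base where every child slot is free
            ensure(b + cs[-1])
            if all(check[b + c] == -1 for c in cs):
                break
            b += 1
        base[state] = b
        for ch, child in node.items():
            t = b + codes[ch]
            check[t] = state             # slot t now belongs to `state`
            heapq.heappush(heap, (-len(child), next(tie), t, child))
    return base, check, codes


def contains(base, check, codes, word):
    """Exact-match lookup: one array probe per character."""
    state = 0
    for ch in word + END:
        c = codes.get(ch)
        if c is None:
            return False
        t = base[state] + c
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return True


base, check, codes = build_double_array(["中国", "中文", "中文信息", "信息"])
print(contains(base, check, codes, "中文信息"))  # True
print(contains(base, check, codes, "中"))        # False
```

Placing wide nodes early tends to leave the narrow ones to fill the remaining gaps, which is the intuition behind the reduced sparseness; the lookup loop is unaffected by the build order.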



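Research on Fast Lookup Algorithms for Chinese Dictionaries
[Authors] 李江波; 周强; 陈祖舜

[Abstract (translated from Chinese)] Chinese dictionary lookup is a fundamental component of Chinese information processing systems and has a significant effect on system efficiency. This paper briefly reviews research on Chinese dictionary lookup algorithms, designs and implements a lookup algorithm based on the double-array TRIE mechanism, and proposes a lookup algorithm based on a double-coding mechanism. Taking character-by-character binary search as the baseline, we tested both dictionary lookup mechanisms in two applications: direct word lookup and word-segmentation lookup. The experiments show that the double-array TRIE algorithm is markedly faster, at about 5 times the speed of character-by-character binary search. The double-coding algorithm gives a moderate speedup, and its adjustment mechanism is more flexible.

[Abstract] The dictionary mechanism serves as one of the basic components in Chinese information processing systems. Its performance significantly influences the performance of these systems. In this paper, we first review the algorithms for Chinese dictionary lookup, then design and implement a Chinese dictionary based on the Double-Array TRIE mechanism, and present a new Chinese dictionary based on a Double Coding mechanism. In the end, we compare their space and time complexity experimentally with the binary-seek-by-charact...
[Keywords] computer applications; Chinese information processing; Chinese dictionary lookup; double-array TRIE; double-coding algorithm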



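A Chinese High-Frequency Word Extraction Algorithm Based on Information Entropy
[Authors] 任禾; 曾隽芳

[Abstract (translated from Chinese)] To extend the segmentation dictionary and improve segmentation accuracy, this paper proposes a Chinese high-frequency word extraction algorithm based on information entropy; its output can be used to identify out-of-vocabulary words and expand existing dictionaries. We first preprocess the text, converting noise characters and non-Chinese characters into separators, so that the text can be treated as a set of Chinese strings delimited by separators. We then collect frequency statistics for all substrings of these strings, and finally use those statistics to compute the information entropy of each substring to decide whether it is a word. Experiments show that the algorithm is simple to implement and extracts high-frequency words from text fairly effectively, with an acceptance rate of up to 91.68%.

[Abstract] Aiming to extend the dictionary for word segmentation so as to improve its accuracy, this paper presents a high-frequency Chinese word extraction algorithm based on information entropy. We first transform noisy words and characters to separators, so that a text can be viewed as a collection of Chinese strings isolated by separators. Then we compute the frequencies of all the substrings of these Chinese strings. Finally, we judge whether each substring is a word by computing its information entropy. Preliminary ex...
[Keywords] artificial intelligence; natural language processing; word segmentation; Chinese word extraction; information entropy; high-frequency words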
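
The three steps in the abstract above (separators, substring frequencies, entropy test) can be sketched as follows. The abstract does not say which entropy is computed, so this example assumes the common left/right neighbor (branching) entropy; the function names, the CJK range used to detect Chinese characters, and the `max_len`, `min_freq`, and `min_entropy` parameters are likewise illustrative assumptions rather than the paper's settings.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import count


def entropy(neighbors: Counter) -> float:
    """Shannon entropy (bits) of a neighbor-count distribution."""
    total = sum(neighbors.values())
    return -sum(n / total * math.log2(n / total) for n in neighbors.values())


def extract_words(text, max_len=4, min_freq=2, min_entropy=1.0):
    # Step 1: everything outside the basic CJK block acts as a separator.
    segments = re.split(r"[^\u4e00-\u9fff]+", text)
    freq = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    edge = count()  # each segment boundary counts as a distinct neighbor
    # Step 2: count every substring of length 2..max_len and its neighbors.
    for seg in segments:
        n = len(seg)
        for i in range(n):
            for j in range(i + 2, min(i + max_len, n) + 1):
                s = seg[i:j]
                freq[s] += 1
                left[s][seg[i - 1] if i > 0 else next(edge)] += 1
                right[s][seg[j] if j < n else next(edge)] += 1
    # Step 3: keep frequent substrings whose contexts vary on both sides.
    return sorted(
        s for s, f in freq.items()
        if f >= min_freq
        and min(entropy(left[s]), entropy(right[s])) >= min_entropy
    )


sample = "机器翻译很好,机器翻译不错。统计机器翻译,研究机器翻译!"
print(extract_words(sample))  # ['机器翻译']
```

In the sample, fragments such as 机器 are just as frequent as 机器翻译 but are always followed by the same character, so their right-side entropy is zero and they are rejected.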



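Chinese Person Name Recognition Combining Boundary Templates and Local Statistics
[Authors] 李中国; 刘颖

[Abstract (translated from Chinese)] This paper proposes a Chinese person name recognition algorithm based on document-level information. From an annotated corpus we extract the words at the left and right boundaries of person names and the frequencies of characters used in names as the system's knowledge source. Recognition proceeds as follows. First, frequency-weighted boundary templates identify candidate names, and the results are propagated through the whole document to recall names missed because of data sparseness. Then local contextual statistics and a few heuristic rules are applied to correct the boundaries of the recognized names. The algorithm has linear time complexity. In a large-scale open test (1,354 news reports, about 3.04 million characters, containing 37,000 person names), precision was 94.52% and recall 98.97%, a very satisfactory result.

[Abstract] In this paper an effective algorithm for Chinese person name recognition is proposed. The left and right boundary words of person names and the character frequencies of person names are extracted from a tagged corpus and used as the knowledge for recognition. First we use these boundary templates to find possible person names. Then these recognized person names are used to match the missed occurrences in the text. Finally, the local frequency obtained from the whole text is used to check and correct the name boundaries...
[Keywords] computer applications; Chinese information processing; person name recognition; named entity recognition; boundary templates; local statistics; lexical analysis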



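Automatic Recognition of Chinese Place Names Combining SVM and Rules
[Authors] 李丽双; 黄德根; 陈春荣; 杨元生

[Abstract (translated from Chinese)] Based on an analysis of the characteristics of place names in Chinese text, this paper proposes a method for automatic recognition of Chinese place names that combines support vector machines (SVM) with rules. Feature-vector attributes are extracted character by character and converted into binary vectors to build a training set; with a polynomial kernel function, an SVM machine learning model for recognizing place names is obtained. By analyzing recognition errors, a rule base is built to post-process the results, compensating for the low recall caused by the incomplete knowledge acquired by the learned model. Experiments show that combining SVM with rules is effective for recognizing place names in Chinese text: in open tests the system's recall, precision, and F-value reach 89.57%, 93.52%, and 91.50%, respectively.

[Abstract] By analyzing the characteristics of place names in Chinese texts, a method of automatic recognition of Chinese place names is presented, which combines support vector machines (SVMs) with rules. Firstly, feature vectors based on characters are extracted and transformed into binary vectors. A training set is established, and the machine learning models for automatic identification of Chinese place names are obtained using polynomial kernel functions. Then, through careful error analysis, a rule base is constructed and...
[Keywords] computer applications; Chinese information processing; Chinese place name recognition; support vector machine; machine learning; rule-based post-processing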
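
As a quick consistency check on the figures reported above, the sketch below recomputes the F-value from the stated precision and recall. It assumes the F-value is the balanced F-measure (harmonic mean of precision and recall, beta = 1); the abstract does not state the formula, and the function name `f_measure` is chosen for this example.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision 93.52% and recall 89.57%, as reported in the abstract.
print(f"{f_measure(93.52, 89.57):.2f}")  # 91.50
```

Under that assumption the result agrees with the reported 91.50%.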



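A Handwritten Chinese Character Radical Set for Continuous Character Recognition and Its Statistical Regularities
[Authors] 赵巍; 李春娣; 刘家锋; 唐降龙

[Abstract (translated from Chinese)] Targeting continuous recognition of handwritten character sequence input, this paper analyzes the characteristics of Chinese characters and online handwritten text, and proposes and constructs a set of handwritten Chinese character radicals. Based on this set, we completed radical decomposition and coding for the 6,763 Chinese characters of GB2312-80 and tested the radical set. Statistics over the coded data show that the distribution of characters by number of handwritten radicals is log-normal. The paper analyzes and discusses the character-forming capacity of handwritten radicals from the perspectives of statistics and character recognition technology; the design of the radical set meets the design requirements in both radical selection and character decomposition. Experiments show that a radical recognizer built on the handwritten radicals achieves radical recognition rates of 70.21% on handwritten characters and 58.49% on continuous characters.

[Abstract] The paper introduces a handwritten Chinese character radical set established for research on continuous handwritten character sequence recognition. Based on the set, radical-based splitting and coding of 6,763 Chinese characters was carried out. From the statistical data we find that the distribution of Chinese character counts with regard to radical counts fits the log-normal distribution model. Furthermore, the composing power of handwritten radicals is analyzed and d...
[Keywords] artificial intelligence; pattern recognition; continuous character recognition; handwritten Chinese character radicals; log-normal distribution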



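A Deformable Elastic Matching Method for Handwritten Characters Based on Higher-Order Statistics
[Authors] 马瑞; 杨静宇

[Abstract (translated from Chinese)] Conventional elastic matching in handwritten character recognition suffers from misrecognition caused by over-matching. To address this, a deformable elastic matching method based on higher-order statistics is proposed. Because higher-order statistics capture fine variations in character shape, independent component analysis is used to extract the intrinsic directions of variation for each character class, which are then applied in the deformation model of elastic matching. Any shape variation of a character is represented as a linear superposition of this set of independent components. Through the deformation model, the class template character is deformed step by step toward the input character to be recognized, yielding an optimal match between the two characters. In the experiments the recognition rate improved to 92.81%, indicating the effectiveness of the method.

[Abstract] Aiming at the problem of misrecognitions due to overfitting in conventional elastic matching for handwritten character recognition, a deformable elastic matching approach based on high-order statistics is proposed in this paper. According to the handwriting variations in shape details contained in high-order statistics, the intrinsic deformations within each character class are extracted from the actual deformations by independent component analysis. Then they are applied to the deformable model. Thus any deforma...
[Keywords] artificial intelligence; pattern recognition; handwritten character recognition; higher-order statistics; elastic matching; intrinsic deformation; independent component analysis (ICA)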



© Chinese Information Processing Society of China 1981-2007