Chinese Information Processing Society of China (中国中文信息学会)
Journal of Chinese Information Processing (中文信息学报)
Browse by year and issue (latest data: 2003, Issue 5)
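A Study of Unknown-Word Handling Algorithms in Part-of-Speech Tagging
[Authors] Zhang Xiaofei (张孝飞); Chen Zhaoxiong (陈肇雄); Huang Heyan (黄河燕); Cai Zhi (蔡智);

[Summary] Part-of-speech (POS) ambiguity is an important class of ambiguity that natural language understanding must resolve, and resolving the POS ambiguity of unknown words is especially difficult. Based on the hidden Markov model (HMM), this paper proposes a new method for handling unknown words in POS tagging by converting the tagging of an unknown word into the computation of a word emission probability. Apart from a tagged monolingual corpus, the method uses no other resources (such as grammar dictionaries or grammar rules). Accuracy is about 97% in closed tests and about 95% in open tests, which is essentially at a practical level. A comparison with other HMM-based POS tagging methods is also given, and the results show that the proposed method achieves considerably higher tagging accuracy.

[Abstract] Part-of-speech (POS) ambiguity is a very important ambiguity phenomenon in natural language processing that urgently needs to be resolved. Furthermore, disambiguating the part of speech of new words is very difficult. In this paper, a new HMM-based approach is proposed that solves this problem by converting POS tagging into the calculation of a word's emission probability. This approach uses nothing more than a tagged corpus (e.g. no grammar dictionaries...
[Keywords] computer application; Chinese information processing; natural language understanding; part-of-speech ambiguity; hidden Markov model; corpus;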
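
The summary above does not say how the emission probability of an unknown word is estimated, so the following is only a minimal sketch of the general idea: an HMM tagger whose Viterbi search falls back to a per-tag emission estimate when a word is unseen. The fallback used here (treating unknown words like words seen exactly once in the corpus), the add-one smoothing of transitions, and all names (`train`, `log_emit`, `viterbi`, `UNK_SMOOTH`) are assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter, defaultdict

UNK_SMOOTH = 1e-3  # assumed constant so no tag is ruled out for an unknown word

def train(tagged_sentences):
    """Count transitions and emissions from sentences of (word, tag) pairs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    word_freq = Counter()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            word_freq[word] += 1
            prev = tag
    # Assumption: an unknown word behaves like words seen exactly once.
    rare = Counter()
    for tag, words in emit.items():
        rare[tag] = sum(c for w, c in words.items() if word_freq[w] == 1)
    return trans, emit, word_freq, rare

def log_emit(word, tag, model):
    """Log emission probability, with the unknown-word fallback."""
    _, emit, word_freq, rare = model
    total = sum(emit[tag].values())
    if word in word_freq:
        p = emit[tag][word] / total
    else:
        p = (rare[tag] + UNK_SMOOTH) / (total + 1)
    return math.log(p) if p > 0 else float("-inf")

def log_trans(prev, tag, model, tags):
    """Add-one smoothed log transition probability."""
    trans = model[0]
    return math.log((trans[prev][tag] + 1) / (sum(trans[prev].values()) + len(tags)))

def viterbi(words, model):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = list(model[1])
    best = {t: (log_trans("<s>", t, model, tags) + log_emit(words[0], t, model), [t])
            for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p][0] + log_trans(p, t, model, tags))
            score = best[prev][0] + log_trans(prev, t, model, tags) + log_emit(word, t, model)
            step[t] = (score, best[prev][1] + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]
```

With a toy corpus, `viterbi(["the", "zebra", "sleeps"], model)` tags the unseen word "zebra" from context and from the fallback estimate alone, which mirrors the summary's claim that nothing beyond a tagged corpus is needed.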



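Character Extraction from Complex Color Text Images
[Authors] Chen Youxin (陈又新); Liu Changsong (刘长松); Ding Xiaoqing (丁晓青);

[Summary] Extracting and recognizing characters from complex color text images has become a difficult yet interesting problem. This paper presents an innovative and practical region-growing algorithm for color image segmentation: the color run-length adjacency graph algorithm (CRAG). Applied to a color text image, the algorithm first finds the color connected components of the image, then clusters the average colors of these components to obtain several cluster centers, then splits the image into corresponding color layers according to the color centers, and finally uses connected-component analysis to identify the required text layer. The region-growing algorithm modifies and extends the traditional BAG algorithm and applies it to color printed text images, making full use of the color and position information in the image. Experimental results show that the new method extracts many common kinds of decorative lettering from color printed images well and at high speed, while preserving the original colors of the text and background for later image restoration.

[Abstract] Today there are many documents with text characters printed on colored and/or complex backgrounds. To recognize these characters, they must first be extracted from the images. In this paper, two novel techniques are proposed that together constitute a robust character extraction algorithm. First, we search for color connected components by applying a new region-growing algorithm, the color run-length adjacency graph algorithm (CRAG), then divide the image into several layers by clustering the central color of all the co...
[Keywords] artificial intelligence; pattern recognition; character extraction; image segmentation; CRAG algorithm; region growing; color text image;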



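A Study of Cache-Based Adaptive Chinese Language Models
[Authors] Qu Weimin (曲卫民); Zhang Junlin (张俊林); Sun Le (孙乐); Sun Yufang (孙玉芳);

[Summary] Although cache-based adaptive language models improve, to some extent, the adaptability of language models across domains, their underlying assumption is too simple: that a word which appeared earlier in an article tends to recur later. The assumption ignores the influence of common words and the interactions between different words. This paper improves the original model in two respects. First, the TF-IDF formula replaces the original simple frequency count. Second, an extended cache-based bigram model is built, with weight filtering used to reduce the model's computational cost. Experiments show that both improvements substantially raise the performance of the original model and strengthen its adaptivity.

[Abstract] Though cache-based language models adapt better to cross-domain environments, the hypothesis they make is too simple. It assumes that a word that has appeared in an article often reappears later in the same article, but it does not take into account the influence of stop words or the interaction between different words. To address this problem, we have made two improvements to the model. First, we use the TF-IDF scheme instead of simple statistics. Second, we adopt an extended cache-based 2-gram mode...
[Keywords] computer application; Chinese information processing; language model; adaptation; TF-IDF formula; extended bigram model;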
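
The first improvement described above, replacing raw cache frequencies with TF-IDF weights, can be sketched as follows. This is an illustrative unigram version under stated assumptions: the IDF table comes from a hypothetical background collection, the interpolation weight `lam` and the `base_prob` callable are placeholders, and the function names are not taken from the paper. The extended bigram cache and weight filtering are not shown.

```python
import math
from collections import Counter

def idf_table(documents):
    """Inverse document frequency from a background collection of token lists."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {w: math.log(n / c) for w, c in df.items()}

def cache_prob(word, history, idf, default_idf=0.0):
    """TF-IDF weighted cache probability of `word` given the earlier tokens.

    Words that occur in every background document get IDF 0, so common
    words no longer dominate the cache. `default_idf` (an assumption) is
    the weight given to history words missing from the IDF table.
    """
    tf = Counter(history)
    weights = {w: c * idf.get(w, default_idf) for w, c in tf.items()}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return weights.get(word, 0.0) / total

def adapted_prob(word, history, base_prob, idf, lam=0.2):
    """Linear interpolation of a static model with the TF-IDF cache."""
    return (1 - lam) * base_prob(word) + lam * cache_prob(word, history, idf)
```

In a history such as `["the", "stock", "the", "stock"]`, a plain frequency cache would give "the" and "stock" equal weight, whereas the TF-IDF cache gives "the" zero weight whenever it appears in every background document.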



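A Prosodic Boundary Detection Model Based on Prosodic Features and Syntactic Information
[Authors] Wu Xiaoru (吴晓如); Wang Renhua (王仁华); Liu Qingfeng (刘庆峰);

[Summary] Automatic detection of prosodic phrase boundaries is important both for prosodic annotation of corpora in speech synthesis and for automatic segmentation of prosodic phrases in speech recognition. By analyzing the acoustic, prosodic, and other parameters that affect prosodic phrase boundaries, this paper obtains a set of acoustic feature parameters, prosodic context parameters, and syntactic information that correlate strongly with those boundaries. It also borrows the idea of prosody prediction from speech synthesis: the fundamental frequency (F0) of each syllable is predicted under the assumption that no syllable boundary is a prosodic phrase boundary. Finally, a decision tree takes the prosodic context information at the syllable boundary, the syntactic information, and the prediction result as input and decides whether the current syllable boundary is a prosodic phrase boundary. Experiments show that the method works well for detecting prosodic phrase boundaries in text-dependent speech and can markedly improve the efficiency and consistency of corpus annotation in speech synthesis.

[Abstract] Automatic detection of prosodic boundaries in continuous speech is very useful for labeling corpora in TTS systems and for separating phrases in speech recognition. We propose an automatic break detection algorithm for Mandarin Chinese speech. Our labeling model includes the following steps. First, acoustic parameters are analyzed to select useful parameters for the detection model. Then the relationship between syntactic information and prosodic words is obtained by a statistical method. At the same time, the F0 value is es...
[Keywords] computer application; Chinese information processing; automatic detection of prosodic boundaries; prosody prediction; decision tree; classification and regression tree;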



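Automatic Alignment of Noun Phrases in a Chinese-English Bilingual Corpus
[Authors] Liu Dongming (刘冬明); Zhao Jun (赵军); Yang Erhong (杨尔弘);

[Summary] This paper proposes a method that, starting from a sentence-aligned Chinese-English bilingual corpus, automatically segments and aligns Chinese and English noun phrases. Its main feature is that high-frequency and low-frequency phrases are handled separately, without requiring strict recognition of Chinese noun phrases. For high-frequency phrases, the association between English phrases and Chinese words in the bilingual corpus is used, and an iterative re-estimation algorithm aligns the bilingual phrases. For low-frequency phrases, alignment relies on the correspondence between source words and their translations in a bilingual dictionary, combined with a set of hand-written syntactic rules. The method captures alignment information globally and has very high coverage.

[Abstract] In this paper, a method is proposed to align bilingual noun phrases automatically in a sentence-aligned Chinese-English bilingual corpus. The characteristic of our method is that it deals with high-frequency and low-frequency noun phrases separately, without recognizing Chinese noun phrases accurately. High-frequency noun phrases in the English corpus are aligned to those in the Chinese corpus using an iterative re-evaluation algorithm according to the co-occurrence between English phrases and Chinese words in b...
[Keywords] artificial intelligence; machine translation; noun phrase recognition; phrase alignment; iterative re-estimation; similarity;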



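An Analysis of the Results of the "Chinese Web Page Automatic Categorization Contest"
[Authors] Feng Shicong (冯是聪); Wang Jimin (王继民);

[Summary] A "Chinese Web Page Automatic Categorization Contest" was held at the recent National Symposium on Search Engine and Web Mining, with 10 teams from across the country taking part. After describing the rules and process of the contest, this paper analyzes the results in detail, giving a concrete picture of the current state of Chinese web page categorization technology: existing classifiers show no marked difference in performance, and categorizing Chinese web pages is much harder than categorizing plain text. The paper also attempts to release a standard sample set of Chinese web page categorization instances, in the hope that, with continued refinement, it will eventually serve as a basic corpus for research on Chinese web page categorization.

[Abstract] A Chinese Web page automatic categorization contest was held at the national symposium on Search Engine and Web Mining, and ten teams took part. After describing the contest rules, this paper analyzes the contest results in detail, giving an explicit view of the present technologies of Chinese Web page automatic categorization: no explicit difference is shown among the classifiers that have been developed, and Chinese Web page categorization is more difficult than plain text categorization....
[Keywords] computer application; Chinese information processing; machine learning; Chinese web page automatic categorization; TREC evaluation;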



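Online Adaptation of a Cache Language Model Based on Dialog-Turn Decay
[Authors] He Wei (何伟); Li Honglian (李红莲); Yuan Baozong (袁保宗); Lin Biqin (林碧琴);

[Summary] Corpora for a specific task domain are currently sparse and hard to collect, which seriously hinders the portability of dialog systems. The aim of this paper is to use a small amount of training data collected online to adapt the language model quickly and thereby raise the recognition rate of a dialog system in a new task domain. The paper modifies the traditional cache model and proposes a cache language model based on history-unit decay, which collects adaptation data incrementally online and is linearly interpolated with a general language model. In a dialog system the history unit is the dialog turn, so the model can also be called a dialog-turn-decaying cache language model. In experiments on two entirely different task domains, a Summer Palace tour guide and train ticket booking, with fewer than 1,000 adaptation sentences, the recognition error rate relative to the unadapted model fell by 47.8% and 74.0% respectively in supervised mode and by 30.1% and 51.1% respectively in unsupervised mode.

[Abstract] The substantial investment required to develop a spoken language system for each specific task hampers the widespread use of speech technology. In this paper, to develop toolkits for porting a spoken language system to a new application rapidly and simply, an improved cache model, a history-unit-based decaying cache model, is provided for online language model adaptation of spoken language systems. To capture the dialog state change, each user's utterance and system response are collected and ...
[Keywords] computer application; Chinese information processing; spoken dialog system; language model; cache adaptation;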
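
The two ingredients named in the summary above, a cache whose contents decay once per dialog turn and linear interpolation with a general language model, can be sketched as below. The exponential decay form, the decay rate, the interpolation weight `lam`, and the class and function names are all assumptions made for illustration; the summary does not give the paper's actual decay function or parameter values.

```python
from collections import defaultdict

class DecayingCache:
    """Unigram cache in which every earlier dialog turn loses weight per new turn."""

    def __init__(self, decay=0.8):
        self.decay = decay                 # assumed exponential decay per turn
        self.weights = defaultdict(float)  # word -> decayed count

    def add_turn(self, tokens):
        """Age all earlier turns by one step, then add the new turn at full weight."""
        for w in self.weights:
            self.weights[w] *= self.decay
        for w in tokens:
            self.weights[w] += 1.0

    def prob(self, word):
        """Relative decayed weight of `word`, or 0.0 for an empty cache."""
        total = sum(self.weights.values())
        return self.weights.get(word, 0.0) / total if total else 0.0

def interpolate(word, cache, general_prob, lam=0.3):
    """Linear interpolation of the cache with a general language model."""
    return lam * cache.prob(word) + (1 - lam) * general_prob(word)
```

Because the cache is updated one turn at a time, adaptation is incremental: calling `add_turn` after each recognized utterance is enough, and no retraining of the general model is involved in this sketch.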



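Chinese-English Sentence Category Transfer of HNC Action-Effect Sentences
[Authors] Zhang Keliang (张克亮); Huang Zengyang (黄曾旸), Institute of Acoustics, Chinese Academy of Sciences;

[Summary] Action-effect sentences are a special subclass of action sentences and a highly distinctive and important sentence category among the 57 basic sentence categories of HNC. From the viewpoint of the HNC conceptual network, action-effect sentences are formed mainly directly by causative verbs and compelling verbs, or indirectly by general action verbs (including generic-action verbs) through the "de (得)" construction. Action-effect sentences formed by these three classes of verbs follow different sentence-category transfer and format transfer rules, so Chinese-English machine translation needs different sentence-category transfer frames to ensure that the syntactic and semantic structure of the translated sentence is correct. Preliminary experiments show that these sentence-category and format transfer rules for action-effect sentences have good applicability and coverage.

[Abstract] In the light of the HNC conceptual network, action-effect sentences in the Chinese language arise directly from causative verbs and compelling verbs, and indirectly from general acting verbs, i.e. via the use of the "de (得)" construction. In Chinese-English machine translation, action-effect sentences arising from the three conceptual types of verbs mentioned above follow different SC (sentence category) and SF (sentence format) transfer rules. Therefore, different transfer frames (TransFrame) should be adopted so as to ensure the g...
[Keywords] artificial intelligence; machine translation; HNC theory; action-effect sentence; sentence category transfer; sentence format transfer; transfer frame;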



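A Study of a Pictographic Coding Method for Oracle Bone Script (JiaGuWen)
[Authors] Xiao Ming (肖明); Zhao Hui (赵慧); Gan Zhongwei (甘仲惟);

[Summary] Because of its distinctive character shapes and great age, oracle bone script (JiaGuWen) has never been effectively encoded. Drawing on modern coding ideas, this paper uses a fuzzy mathematical model to analyze the features of oracle bone script components (roots) and applies fuzzy clustering to them. It uses 32 symbols (25 English letters and 7 Arabic digits) as code elements, mapped to the more than 500 roots in oracle bone script, and achieves a coding scheme with one code per character. On this basis, entropy theory from information theory is used to analyze the efficiency and soundness of the coding; the optimal code length is found to be approximately 3, which provides a theoretical basis for scientifically encoding the more than 5,000 oracle bone characters.

[Abstract] This paper studies JiaGuWen symbol coding using fuzzy mathematical theory and sets up a method for clustering JiaGuWen symbol code roots and coding JiaGuWen characters. On this basis, we use entropy from information theory to analyze the efficiency and rationality of the coding, and thus provide a theoretical foundation for the scientific coding of JiaGuWen characters.
[Keywords] computer application; Chinese information processing; oracle bone script; root; pictographic code; fuzzy clustering; entropy; code length;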
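
The reported code length of about 3 is consistent with a simple counting argument, sketched below. The sketch assumes all characters are equally likely and uses round figures of 5,000 characters and 32 code symbols; the paper's own entropy analysis presumably uses more detailed statistics that the summary does not state, and the function names are illustrative.

```python
import math

def min_code_length(num_items, alphabet_size):
    """Smallest fixed length L such that alphabet_size**L >= num_items."""
    length = 0
    capacity = 1
    while capacity < num_items:
        capacity *= alphabet_size
        length += 1
    return length

def uniform_entropy_bound(num_items, alphabet_size):
    """Symbols per item needed if all items are equally likely: log_q(N)."""
    return math.log(num_items) / math.log(alphabet_size)

# 32 symbols give 32**2 = 1,024 two-symbol codes and 32**3 = 32,768 three-symbol codes.
print(min_code_length(5000, 32))                  # 3
print(round(uniform_entropy_bound(5000, 32), 2))  # 2.46
```

Two symbols cannot distinguish 5,000 characters while three can, so under these assumptions the bound of about 2.46 symbols rounds up to a fixed length of 3.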



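A Study of a Combined Clustering Algorithm for Acoustic Modeling in Continuous Speech Recognition
[Authors] Han Zhaobing (韩兆兵); Jia Lei (贾磊); Zhang Shuwu (张树武); Xu Bo (徐波);

[Summary] A key problem in triphone-based continuous speech recognition is the robust estimation of a large number of acoustic model parameters from limited training data. Two main context-dependent clustering algorithms have been proposed to address it: agglomerative clustering (AGG) and tree-based clustering (TB). This paper analyzes the strengths and weaknesses of both algorithms, improves each of them, and then proposes a combined clustering algorithm in the maximum likelihood framework. Experimental results on large-vocabulary continuous speech recognition (LVCSR) show that, compared with decision tree clustering alone, the proposed combined algorithm significantly improves the recognition rate.

[Abstract] A crucial issue in triphone-based continuous speech recognition is the large number of parameters to be estimated against the limited availability of training data. To cope with the problem, two major context-clustering methods, agglomerative (AGG) and tree-based (TB), have been widely investigated. We analyze both algorithms with respect to their advantages and disadvantages, develop several methods to improve on them, and introduce a novel combined method in the maximum likelihood framework. For LVCSR, the ...
[Keywords] computer application; Chinese information processing; speech recognition; agglomerative clustering; decision tree clustering; acoustic modeling;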



© Chinese Information Processing Society of China, 1981-2007