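| A Document Similarity Model Based on a Mixture Language Model
[Authors]李晓光; 于戈; 王大玲
[Summary]Existing document similarity models fit document characteristics incompletely and lack a theoretical basis. To overcome these two weaknesses, this paper builds on statistical language models and proposes a document similarity model based on a Mixture Language Model (MLM). MLM describes document features with a statistical language model and treats the relevant influencing factors as latent sub-models; the document language model is a mixture of these sub-models, so it reflects document features accurately and comprehensively. Because MLM determines the relevant influencing factors from the specific application and builds the corresponding document description model from them, it is highly flexible and extensible. On the basis of MLM, the paper gives an instance based on the similarity of document topical content; experiments on the TREC9 data set show that MLM outperforms the vector space model (VSM).
[Abstract]To overcome the incompleteness with which current document similarity models capture document characteristics and their lack of a theoretical basis, this paper proposes using a mixture language model (MLM) to evaluate document-to-document similarity. In MLM, the characteristics of a document are described by a statistical language model, the factors that influence those characteristics are treated as latent models, and the document language model is a mixture of these latent models. MLM not only mode...
[Keywords]artificial intelligence; natural language processing; document similarity; statistical language model; mixture model; EM algorithm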
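The mixture idea in this record can be illustrated with a minimal sketch. It is not the authors' model: the two sub-models (the document's own unigram model and a shared background model), the fixed weight `lam`, and the symmetric-KL similarity are all assumptions chosen for illustration, and the listing does not say how the paper estimates its mixture weights (the keywords mention the EM algorithm, which is not used here).

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def mixture(models, weights, vocab):
    """Linear mixture of sub-models; weights are assumed to sum to 1."""
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(weights, models))
        for word in vocab
    }


def kl_divergence(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(pw * math.log(pw / q[word]) for word, pw in p.items() if pw > 0)


def similarity(doc_a, doc_b, background, lam=0.7):
    """Similarity in (0, 1] between two token lists (hypothetical helper).

    Assumes 0 < lam < 1. The background model includes both documents,
    so every mixed probability is positive and the KL terms are finite.
    """
    bg = unigram(background + doc_a + doc_b)
    vocab = set(bg)
    p_a = mixture([unigram(doc_a), bg], [lam, 1 - lam], vocab)
    p_b = mixture([unigram(doc_b), bg], [lam, 1 - lam], vocab)
    distance = 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))
    return math.exp(-distance)


doc_a = "language model document similarity".split()
doc_b = "document similarity language model mixture".split()
doc_c = "speech synthesis duration model".split()
corpus = doc_a + doc_b + doc_c
print(similarity(doc_a, doc_b, corpus) > similarity(doc_a, doc_c, corpus))
```

Further sub-models (for example one per influencing factor) could be passed to `mixture` in the same way, with one weight per sub-model.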
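| A Discussion of Core Problems in a Concept-Matching-Based Chinese Question Answering Model
[Authors]吴晨; 张全
[Summary]To resolve semantic ambiguity in question answering systems and improve their performance, researchers have tried taking concepts, rather than surface language symbols, as the objects the system processes. Introducing concepts, however, raises new problems, such as concept extraction, concept relatedness computation, and the QA-specific problems of question understanding, question solving, and answer generation. Fairly successful algorithms already exist for concept extraction and concept relatedness computation. Building on these, this paper discusses the core problems of implementing such a question answering system that have not yet been addressed and proposes methods for solving them. Experiments and practical use show that a concept-based QA system built on the proposed algorithms performs well: the overall accuracy of automatic processing is close to 40%. The system has also shown considerable value in practical applications.
[Abstract]Concept-based question answering (QA) is a new research topic that takes concepts, instead of lexical terms, as the processing object. Concepts, as a formalized meaning, help resolve word sense ambiguities. However, using concepts brings some new problems, such as concept extraction and the calculation of semantic relatedness between concepts, as well as QA-specific issues such as how to comprehend the query, how to search for answers, and how to generate natural language answers. Most of the...
[Keywords]computer applications; Chinese information processing; Chinese question answering system; language concept space; core problem research; concept matching; algorithm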
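| Automated Processing of the Mongolian Language and Script
[Authors]伊·达瓦; 张玉洁; 上园一知; 大川茂树; 章森; 井佐原均; 白井克彦
[Summary]This paper first describes the significance of digitizing Mongolian and the current state of digitized Mongolian data. It then focuses on the differences in written and spoken Mongolian as used in different regions and countries and on the problems Mongolian poses for computer processing. Finally, it introduces two language resources for Mongolian language information processing that we have built in Japan: a multi-dialect spoken Mongolian corpus and a multi-script, multilingual parallel electronic dictionary of Mongolian with grammatical annotation. The latter was funded by the 2005 China-Japan-Mongolia-Korea international cooperation project "Research on Mongolian Natural Language Processing Technology".
[Abstract]In this paper, we first address the significance of digitizing Mongolian and the current technical situation of this problem. Then we focus on the differences in spoken and written Mongolian among different areas and countries and on the problems of processing Mongolian by computer. Finally, we introduce our work in Japan on designing and creating Mongolian language corpora. This work includes two kinds of corpus: one is the multi-dialectal speech corpus, and the other is the Multilingual Parallel Electronic ...
[Keywords]computer applications; Chinese information processing; Mongolian language and script information processing; text-speech corpus; multi-script multilingual electronic dictionary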
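| A Study of HMM-Based Post-Processing for Manchu Text Recognition
[Authors]赵骥; 李晶皎; 王丽君; 张继生
[Summary]The method combines the recognition information from a Manchu word recognition system with Manchu phrase information. It builds a statistical database of Manchu phrases and candidate word sets and uses a statistical hidden Markov model that, following the Bayes criterion, integrates the posterior probability of the candidate Manchu words with the prior probability of phrases. With a data structure that is sound, effective, and easy to implement, it applies dynamic programming to detect and correct the rejected and misrecognized words in the output of the word recognition system, which effectively raises the recognition rate of the Manchu text recognition system. Experiments show that post-processing performance depends not only on the language model but also on accurate probability estimation. In addition, the error-correcting ability of post-processing increases when the recognition rate of the word recognition system is high.
[Abstract]This study proposes a post-processing method to improve the performance of Manchu character recognition. An evaluation model based on the Bayes rule is used to estimate the probability of the candidate Manchu words, taking both the posterior probability of each candidate and the prior probability of Manchu phrases into account. A hidden Markov model and the Viterbi dynamic programming algorithm are adopted to check the output of the character recognition and to correct the rejected and mistaken words. This efficient...
[Keywords]computer applications; Chinese information processing; Manchu; post-processing; fuzzy matrix; Bayes criterion; feature vector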
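The combination of recognizer scores and phrase probabilities described in this record can be sketched as a Viterbi search over candidate lists. This is an illustrative sketch under stated assumptions, not the paper's implementation: the candidate scores, the bigram table, the `floor` probability for unseen pairs, and the placeholder tokens `w1` to `w4` are all hypothetical.

```python
import math


def viterbi_correct(candidates, bigram, floor=1e-6):
    """Pick the most probable word sequence from per-position candidates.

    candidates: non-empty list of dicts, one per position, mapping a
                candidate word to the recognizer's score (a probability > 0).
    bigram:     dict mapping (previous_word, word) to a phrase probability;
                unseen pairs fall back to `floor` (an assumed smoothing choice).
    """
    # best[word] = (log score of the best path ending in word, that path)
    best = {word: (math.log(p), [word]) for word, p in candidates[0].items()}
    for position in candidates[1:]:
        new_best = {}
        for word, p in position.items():
            # Choose the predecessor that maximizes path score plus transition.
            score, path = max(
                (
                    (prev_score + math.log(bigram.get((prev, word), floor)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[word] = (score + math.log(p), path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# The recognizer prefers "w4" at the second position, but the phrase
# statistics strongly favor "w3" after "w1".
candidates = [{"w1": 0.9, "w2": 0.1}, {"w3": 0.4, "w4": 0.6}]
bigram = {("w1", "w3"): 0.9, ("w1", "w4"): 0.01}
print(viterbi_correct(candidates, bigram))
```

In this toy input the path `w1 w3` wins because the product of recognizer scores and the phrase probability (0.9 × 0.9 × 0.4) exceeds that of `w1 w4` (0.9 × 0.01 × 0.6).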
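| A Language Model Adaptation Method Based on Trigram Style Feature Classification
[Authors]梁奇; 郑方; 徐明星; 吴文虎
[Summary]Starting from the differences between written and spoken language, this paper proposes a style adaptation method for language models. The adaptation uses several interpolation algorithms that operate at the level of counts. The interpolation algorithm that takes Katz smoothing into account assigns weights according to the reliability of each trigram unit. The adaptation algorithm based on trigram style feature classification assigns weights dynamically according to each trigram unit's style tendency, and several weight-generating functions are tried. Syllable-to-character conversion experiments on spoken-language corpora show that these adaptation algorithms improve the baseline models to varying degrees. The algorithm that considers both unit reliability and style tendency works best: relative to the paper's two baselines, it reduces the Chinese character error rate by 50.2% and 23.7% respectively.
[Abstract]In this paper, a language-style-based adaptation method for language models is proposed, based on the differences between spoken and written language. Several interpolation methods based on trigram counts are used for the adaptation. An interpolation method that considers Katz smoothing computes weights according to the confidence score of a trigram. An adaptation method based on the classification of a trigram's style feature computes weights dynamically according to the trigram's language style tendency with several ...
[Keywords]computer applications; Chinese information processing; statistical language model; trigram; adaptation; language style; interpolation algorithm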
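Count-level interpolation with a per-trigram weight, as described in this record, might look like the sketch below. The weight function used here (the spoken share of a trigram's relative frequency) is an assumption for illustration only; the listing does not give the paper's weight-generating functions or its Katz-based reliability term, and all names are hypothetical.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count every trigram of consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def style_tendency(trigram, spoken, written):
    """Share of the trigram's relative frequency that comes from spoken data.

    Returns a value in [0, 1]; 0.5 if the trigram occurs in neither corpus.
    """
    f_spoken = spoken[trigram] / max(1, sum(spoken.values()))
    f_written = written[trigram] / max(1, sum(written.values()))
    if f_spoken + f_written == 0:
        return 0.5
    return f_spoken / (f_spoken + f_written)


def adapted_counts(spoken, written):
    """Interpolate counts per trigram, weighting by style tendency."""
    result = {}
    for trigram in set(spoken) | set(written):
        w = style_tendency(trigram, spoken, written)
        result[trigram] = w * spoken[trigram] + (1 - w) * written[trigram]
    return result


spoken = trigram_counts("well you know it is you know fine".split())
written = trigram_counts("it is therefore concluded that it is fine".split())
counts = adapted_counts(spoken, written)
print(counts[("well", "you", "know")])
```

A trigram seen only in the spoken corpus gets weight 1 and keeps its spoken count, while a trigram shared by both corpora receives a blend of the two counts. The adapted counts would then be smoothed and normalized like ordinary trigram counts.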
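| HMM-Based Trainable Chinese Speech Synthesis
[Authors]吴义坚; 王仁华
[Summary]This paper applies the HMM-based trainable speech synthesis method to Chinese speech synthesis. Modeling and training are improved by selecting and optimizing the HMM modeling parameters appropriately and by designing, based on the characteristics of Chinese speech, a set of context attributes and a question set for model clustering. In a comparative evaluation, 98.5% of the synthesized utterances had better quality after the improvements. In addition, to address the weak sense of rhythm in the synthesized speech, a two-layer model based on states and on initial/final units is proposed for duration modeling and prediction; the RMSE of duration prediction on out-of-set data falls from 29.56 ms to 27.01 ms. In the final system, the synthesized speech is stable and fluent overall and has a fairly strong sense of rhythm. Because the synthesis system needs very little storage, it is particularly well suited to embedded applications.
[Abstract]In this paper, HMM-based trainable speech synthesis is applied to Chinese. Appropriate HMM parameters are selected and optimized, and the contextual features and the corresponding question set for tree-based HMM clustering are designed by considering the characteristics of Chinese, to improve HMM modeling and training. In the evaluation results, the preference score of the synthetic speech after these improvements is 98.5%. Furthermore, in order to improve the rhythm of syntheti...
[Keywords]computer applications; Chinese information processing; speech synthesis; HMM; trainable speech synthesis; duration model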
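| A New Topic-Based Language Model Adaptation Method
[Authors]任纪生; 王作英
[Summary]A topic-based language model adaptation method should update the language model weighting coefficients as quickly as possible and reduce the number of language model calls, in order to meet the real-time requirements of speech recognition. This paper uses a clustering-based method to obtain a quantized representation of consecutive adjacent bigram word pairs and uses this representation to characterize both the speech recognition prediction history and the center of each text topic. The language model weighting coefficients are updated according to the similarity between the recognition history vector and each topic center vector, and the global language model is discarded. Experiments show that, compared with the traditional EM-based adaptation method, the method clearly improves speech recognition performance and real-time behavior: the recognition error rate falls by a relative 5.1%, which indicates that the method can judge fairly accurately which text topic the test content belongs to.
[Abstract]A topic-based language model adaptation algorithm should meet the real-time needs of speech recognition; this goal can be reached by speeding up the updating of the language model weighting coefficients and reducing the use of language models. In this paper, a novel quantized representation scheme for consecutive adjacent bigram word pairs is proposed via clustering and is then used to characterize the speech recognition predictive history and each text topic center. The global language model was not ...
[Keywords]computer applications; Chinese information processing; language model; topic adaptation; speech recognition; text classification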
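The weight update in this record, driven by the similarity between a history vector and each topic center, can be sketched as follows. The choice of cosine similarity, the normalization of similarities into weights, and the toy two-dimensional vectors are assumptions for illustration; the listing does not specify the paper's similarity measure or its vector quantization.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_weights(history, centers):
    """Turn history-to-center similarities into mixture weights summing to 1."""
    sims = [max(cosine(history, center), 0.0) for center in centers]
    total = sum(sims)
    if total == 0:
        # No evidence for any topic: fall back to uniform weights.
        return [1.0 / len(centers)] * len(centers)
    return [s / total for s in sims]


def adapted_probability(topic_probs, weights):
    """Mix per-topic model probabilities for one word; no global model is used."""
    return sum(w * p for w, p in zip(weights, topic_probs))


history = [1.0, 0.0]                 # hypothetical recognition history vector
centers = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical topic center vectors
weights = topic_weights(history, centers)
print(weights, adapted_probability([0.2, 0.8], weights))
```

Because the weights come from a closed-form similarity rather than an iterative estimate, each update costs one pass over the topic centers, which is in the spirit of the real-time goal the record describes.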
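| Design and Implementation of a Multilingual Graphical Processing Platform for Uighur, Kazakh, and Kirghiz under Linux
[Authors]苏国平; 缪成; 夏国平
[Summary]Considering the characteristics of the Uighur, Kazakh, and Kirghiz scripts (hereafter "UKK scripts") and the special requirements of mixed multilingual processing of UKK scripts together with Western scripts, this paper analyzes NLS (National Language Support) in the Linux I18N architecture and proposes the design goals and overall architecture of a Linux-based multilingual graphical processing platform. The platform consists of more than ten modules in four subsystems: the UKK localization environment, UKK display, adaptive UKK input, and UKK print output. The paper describes in detail the implementation techniques of the main modules of each subsystem. Tests on redhat linux 8.0 and turbolinux show that the platform handles mixed input, display, editing, typesetting, and printing of UKK, Chinese, and Western text well in applications including the desktop environment, editors, web browsing, database software, multimedia software, and graphics software.
[Abstract]According to the linguistic characteristics of Uighur, Kazakh and Khalkhas (abbreviated as UKK in the following) and the special requirements for supporting those minority languages together with Chinese and English, in this paper we present the design goals and general framework of a multilingual GUI processing platform under Linux, based on an analysis of national language support in the I18N system. The platform consists of four sub-systems, including localization, display, auto-...
[Keywords]computer applications; Chinese information processing; multilingual; graphical processing platform; Linux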
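| Design and Implementation of an Internationalized Graphical User Interface Based on Qt
[Authors]刘汇丹; 芮建武; 姚延栋; 吴健
[Summary]Developing once for use in many languages is the main goal of internationalized software development. The world's scripts are diverse, however, and their writing directions differ: besides English, written horizontally from left to right, and Arabic, written horizontally from right to left, there are vertically arranged scripts such as Mongolian. This places higher demands on graphical user interfaces. Existing computer systems output such vertical scripts horizontally, which runs strongly against the habits of the ethnic minority users concerned. After analyzing the partial support that the existing Qt library provides for right-to-left scripts such as Arabic, we designed and implemented an internationalized graphical user interface that supports four direction modes; it can now accommodate almost all of the world's scripts. This is significant for software internationalization and for minority language information processing.
[Abstract]There are various scripts in the world, with different writing directions. It is a challenge to develop a graphical user interface that adapts to the writing direction of the script being processed. In this paper, the requirements of a graphical user interface adaptable to various scripts are analyzed, and four kinds of run-time modes are presented in accordance with the writing directions of the scripts. Then the mechanism of the Qt library to support scripts like Arabic, which is written from right to left, is...
[Keywords]computer applications; Chinese information processing; graphical user interface; Qt library; internationalization; minority script processing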
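| Research on Intelligent Digit-Code Input Technology for Chinese Characters
[Authors]顾平; 朱巧明; 李培峰; 钱培德
[Summary]Addressing the characteristics of digit-based encoding, this paper proposes an intelligent input technique for digit-coded Chinese characters that leaves the coding scheme unchanged and instead improves the input rules and incorporates a language model. The paper first discusses how to design the structure of the character and word codebook so that it supports flexible and varied input modes. It then designs a dynamic self-learning language model, with emphasis on the application and improvement of data smoothing algorithms in the language model. Finally, using a sample input method program, it tests input performance under different conditions before and after the improvement. Experiments show that the technique not only reduces the average code length of the input method but also markedly raises the first-candidate hit rate.
[Abstract]An intelligent digital-code-based input technique for Chinese characters is proposed, which improves the input rules without modifying the original coding scheme and incorporates a language model. The paper first discusses how to design the Chinese character and word codes to support the various input modes, then designs a dynamic self-learning language model and analyzes the data smoothing algorithm in the language model. Finally, experimental results on input performance are given, by comp...
[Keywords]computer applications; Chinese information processing; Chinese character input; digit encoding; intelligent input; dynamic self-learning language model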