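Chinese Name Recognition Based on Statistical Methods
[Authors]黄德根; 杨元生; 王省; 张艳丽; 钟万勰;
[Summary]Recognizing proper nouns is important for automatic word segmentation. This paper explores how to recognize Chinese personal names, mainly using a statistical approach. It establishes a supervised learning mechanism, introduces concepts such as the reliability of a sentence segmentation result, and builds a statistical model on that basis. The system reaches 95.97% precision and 95.52% recall in closed tests, and 92.37% precision and 88.62% recall in open tests.
[Abstract]Identification of Chinese names is one of the important techniques for improving the accuracy of automatic word segmentation. This paper proposes an effective statistics-based model for identifying Chinese names. It establishes a reward-punishment mechanism and a supervised learning mechanism, and introduces a reliability measure for word segmentation in the model. The experiments show that precision and recall reach 95.97% and 95.52% respectively in the closed test, while precision and recall are 92.37% an...
[Keywords]two-word co-occurrence frequency; single-word frequency; learning mechanism; Chinese name recognition;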
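The precision and recall figures quoted for this entry follow the usual definitions: precision is the share of identified names that are correct, and recall is the share of true names that were identified. A minimal sketch of that arithmetic is given here, assuming names are compared as exact (sentence, offset, string) matches; the function name and the toy data are hypothetical and do not come from the paper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of identified names."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # names that match the reference exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical toy data: (sentence index, start offset, name) triples.
gold = {(0, 0, "王小明"), (1, 3, "李华"), (2, 5, "张伟"), (3, 1, "陈静")}
predicted = {(0, 0, "王小明"), (1, 3, "李华"), (2, 4, "张伟")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.4f} recall={r:.4f}")
```

In this toy case two of the three predictions are correct and two of the four reference names are found, so a wrong boundary lowers both figures at once.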
|
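A Self-Organizing Solution to Word Segmentation Ambiguity Based on Unsupervised EM Training
[Authors]王伟; 钟义信; 孙建; 杨力;
[Summary]This paper presents an approach to resolving word segmentation ambiguity based on unsupervised training, together with a word segmentation algorithm. Following the idea of EM, all segmentation candidates of each sentence (or those within a certain range) form the training set, and a new language model can be estimated from this training set and an initial language model. The final language model is obtained after several iterations. Using a statistical segmentation algorithm based on this final language model, the correct segmentation accuracy reaches 85.36% (measured per sentence) on a test set in which every sentence contains at least one ambiguity.
[Abstract]This paper presents a word segmentation ambiguity resolution scheme based on unsupervised training. Following the idea of EM, a language model is built incrementally by collecting the fractional counts of patterns (such as bigram pairs) from the augmentations of all the segmentation candidates of a sentence. The learned language model is incorporated into a statistical segmenter. Experiments show that this scheme can resolve 85.36% of ambiguities on a test set in which each sentence has at least one ambigu...
[Keywords]EM algorithm; segmentation ambiguity; unsupervised;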
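The idea of collecting fractional counts over all segmentation candidates can be illustrated with a small sketch. It is not the authors' implementation. As a simplifying assumption it uses a unigram word model where the abstract mentions bigram patterns, and it enumerates candidates exhaustively, which only works for short sentences. The function names, the `max_len` and `floor` parameters, and the toy corpus are all hypothetical.

```python
from collections import defaultdict

def segmentations(sentence, max_len=3):
    """Enumerate every split of `sentence` into words of at most `max_len` characters."""
    if not sentence:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(sentence)) + 1):
        for rest in segmentations(sentence[i:], max_len):
            results.append([sentence[:i]] + rest)
    return results

def seg_score(seg, probs, floor=1e-6):
    """Unigram probability of one segmentation; unseen words get a small floor value."""
    score = 1.0
    for word in seg:
        score *= probs.get(word, floor)
    return score

def em_step(corpus, probs, max_len=3):
    """One EM iteration: weight each candidate by its posterior, then renormalize counts."""
    counts = defaultdict(float)
    for sentence in corpus:
        candidates = segmentations(sentence, max_len)
        scores = [seg_score(seg, probs) for seg in candidates]
        total = sum(scores)
        for seg, score in zip(candidates, scores):
            for word in seg:
                counts[word] += score / total  # fractional count (E-step)
    norm = sum(counts.values())
    return {word: c / norm for word, c in counts.items()}  # M-step

def best_segmentation(sentence, probs, max_len=3):
    """Pick the highest-scoring candidate under the learned model."""
    return max(segmentations(sentence, max_len), key=lambda seg: seg_score(seg, probs))

# Hypothetical unsegmented toy corpus; an empty model means every word starts at the floor.
corpus = ["研究生命", "研究生", "生命", "研究"]
probs = {}
for _ in range(5):
    probs = em_step(corpus, probs)
best = best_segmentation("研究生命", probs)
print(best)
```

A plain unigram model trained this way tends to favor longer words, which is one reason a practical system would bound the candidate set and use richer patterns such as the bigrams named in the abstract.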
|
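A New Pitch Smoothing Method for Mandarin Tone Recognition
[Authors]朱小燕; 王昱; 刘俊;
[Summary]Mandarin Chinese is a tonal language. Tones can be described by the pitch contour. The traditional pitch smoothing methods (linear smoothing, median smoothing and ordinary linear interpolation) cannot handle well the case where consecutive pitch frequency values contain random errors. This paper proposes a new pitch smoothing method that obtains a more accurate pitch contour through search. The method is simple, reliable, fast and efficient. Experiments show that it reduces the recognition error rate by about 40% compared with traditional methods.
[Abstract]Mandarin is a tonal language. Tones are recognized using pitch contour information, which can be expressed by fundamental frequencies. The classic approaches to fundamental frequency smoothing, such as linear smoothing, median smoothing and linear interpolation, do not work well when the fundamental frequency is detected incorrectly in several continuous frames. In this paper, a new smoothing approach is presented, in which a search method is used to get a more accurate pitch con...
[Keywords]pitch detection; smoothing; tone recognition; speech recognition;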
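The limitation of the classic methods described for this entry can be seen in a short sketch of median smoothing, one of the baselines the abstract names. This is not the search-based method the paper proposes, whose details are not given here. The window size, function name and F0 values are hypothetical.

```python
from statistics import median

def median_smooth(f0, window=3):
    """Replace each value with the median of a centered window (truncated at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(f0)):
        lo, hi = max(0, i - half), min(len(f0), i + half + 1)
        smoothed.append(median(f0[lo:hi]))
    return smoothed

# Hypothetical F0 contour in Hz with one isolated doubling error and one isolated halving error.
isolated = [200, 202, 404, 206, 208, 104, 212, 214]
# The same kind of error spread over two consecutive frames.
bursty = [200, 202, 404, 408, 206, 208]

smoothed_isolated = median_smooth(isolated)
smoothed_bursty = median_smooth(bursty)
print(smoothed_isolated)
print(smoothed_bursty)
```

With a three-frame window the isolated errors are removed, but the two consecutive erroneous frames outvote their correct neighbor and survive, which matches the failure case the abstract describes.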
|
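Automatic Extraction and Recognition of Numeric Annotations on Topographic Maps
[Authors]徐战武; 张涛; 刘肖琳;
[Summary]Automatic vectorization of scanned topographic maps is an important and difficult problem in the GIS field that urgently needs solving. Topographic maps contain many numeric annotations in a wide variety of fonts, which indicate attributes and other features of terrain and ground objects; extracting and recognizing these numbers correctly is an important part of map processing. This paper analyzes the shortcomings of existing extraction methods and proposes a new algorithm for automatically extracting and recognizing numeric annotations: it first determines candidate digits from prior knowledge of their size, then uses a BP neural network with an OCON structure to recognize the true digits, and finally uses neighborhood relations to extract extended numbers. Experiments show that the algorithm is fast, efficient and reliable.
[Abstract]Automatic vectorization of scanned topographic maps is an important and difficult problem that needs to be solved urgently. A topographic map includes plenty of numbers in various fonts, which indicate properties and other features of the terrain. Extracting and recognizing these numbers correctly is an important part of map processing. This paper analyzes the disadvantages of many present extraction methods and presents a new algorithm for extracting and recognizing numbers. The algorith...
[Keywords]topographic map; numeric annotation; BP network;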
|
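Research on Whole-Word Coding of Mongolian
[Authors]S·苏雅拉图;
[Summary]Based on the agglutinative way in which written Mongolian records its vocabulary and on the rule of writing whole words by spelling out written syllables, the author proposes a whole-word coding method for Mongolian. Drawing on computability theory, the paper proposes a non-keyboard-mapping coding method for alphabetic scripts, dividing whole-word coding into a writing-input code and a computational code. The writing-input code is designed to imitate the inherent spelling and writing rules of traditional Mongolian whole words, achieving the goal of optimal human-computer keyboard interaction. The computational code can carry the complex feature knowledge of a whole word while keeping that information computable, thereby laying a feasible scientific foundation for complex-feature unification and parallel processing of Mongolian whole words.
[Abstract]Based on the agglutinative pattern by which the Mongolian language records words and on the syllable-based writing rules of words, a whole-word coding method for the Mongolian language is proposed. Using computability theory to divide whole-word coding into two parts, writing input coding and computational coding, a method of non-keyboard mapping for alphabetic languages is proposed. The best human-computer interaction pattern is reached by imitating the natural spelling and writing rules of traditional Mongolian whole word in ...
[Keywords]Mongolian whole word; writing-input code; computational code; computability; complex feature loading;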
|
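Building a Chinese-English Statistical Translation Model from Parallel Web Pages
[Authors]聂建云; 陈江;
[Summary]The goal of building a translation model is to automatically extract translation relationships from parallel texts (or translation examples). This paper describes our attempt to build a Chinese-English statistical translation model. The parallel texts we use are semi-structured parallel texts obtained automatically from the World Wide Web. During training, we make as much use as possible of the HTML structure information in the texts. Experiments show that the trained translation model reaches 80% accuracy. For applications such as cross-language information retrieval, this accuracy is roughly sufficient. This work shows that queries submitted to search engines can be translated with tools that cost less than machine translation.
[Abstract]A statistical translation model tries to capture translation relationships from a set of parallel texts (or translation examples). This paper describes our attempt to train such translation models from a set of semi-structured parallel texts in Chinese and English. These texts are gathered from the Web by an automatic mining tool, PTMiner. Our work takes advantage of the HTML structure of the texts. Some special processing is necessary for Chinese. Our experiments show that we can obtain a translation precision ...
[Keywords]Chinese-English query translation; parallel web pages; sentence alignment; statistical translation model; cross-language information retrieval;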
|
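A Multi-Step Processing Strategy for Improving the Accuracy of Chinese Automatic Word Segmentation
[Authors]赵铁军; 吕雅娟; 于浩; 杨沐昀; 刘芳;
[Summary]Automatic Chinese word segmentation still faces many difficulties on large-scale real text. Two key problems are the recognition of unknown words and the elimination of segmentation ambiguity. This paper describes a multi-step processing strategy intended to reduce the difficulty of segmentation and improve its accuracy. The whole procedure has seven parts: eliminating pseudo-ambiguities, full segmentation of sentences, partial deterministic segmentation, processing of numeral strings, processing of reduplicated words, statistics-based unknown word recognition, and integrated processing that uses part-of-speech information to resolve segmentation ambiguity. Open test results show that segmentation precision can exceed 98%.
[Abstract]Automatic word segmentation of Chinese sentences is difficult when the processing mechanism faces large-scale real texts. The two crucial issues in Chinese segmentation are the identification of unknown words and the disambiguation of segmentation strings. This paper describes a strategy based on multi-step processing for decreasing the difficulty and improving the accuracy of segmentation. The processing steps include seven parts, i.e., disambiguation of pseudo ambiguities, full segmentation of a s...
[Keywords]Chinese automatic word segmentation; ambiguity; multi-step processing;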
|
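Optimizing a Part-of-Speech Tagset with Genetic Algorithms
[Authors]孙宏林; 陆勤; 俞士汶;
[Summary]In the past, the choice of a part-of-speech tagset relied mainly on experts' experience, and there were no automatic or semi-automatic methods to assist the process. This paper proposes a new method that uses a genetic algorithm to search for an optimized tagset. The method can automatically search a set of candidate tagsets for an optimal or near-optimal one, and its parameters can be adjusted to suit the needs of a particular task. Experiments show that genetic algorithms provide a systematic and effective aid for tagset selection.
[Abstract]POS tagset selection in the past was mainly done manually by experts using human knowledge, since there was no automatic or semi-automatic way to assist the selection process. This paper proposes a novel method to search for an optimal POS tagset using genetic algorithms (GA). The experiment shows that GA provides an efficient optimization of the POS tagset and allows for the adjustment of parameters according to user requirements. It provides a systematic way to help people in making an intelligent choice on the se...
[Keywords]part-of-speech tagging; word class; tagset; genetic algorithm;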
|
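The National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences, Formally Becomes a Core Member of the International Speech Translation Research Association
[Authors]宗成庆;
[Summary]
[Abstract]
[Keywords]automation; pattern recognition; national key laboratory; speech translation;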
|
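Chinese Unknown Word Recognition Based on Decomposition and Dynamic Programming
[Authors]吕雅娟; 赵铁军; 杨沐昀; 于浩; 李生;
[Summary]Unknown word recognition is a major problem in automatic Chinese word segmentation. Aiming to recognize Chinese personal names, Chinese place names and transliterated foreign names as a whole, this paper uses a decomposition strategy to reduce the overall processing difficulty and uses dynamic programming to search for the best path, which resolves conflicts among unknown words fairly well. Tests on real corpora show that the method improves both the precision and the recall of unknown word recognition across the board.
[Abstract]Unknown word resolution is a major problem for automatic Chinese segmentation. Aiming to handle Chinese personal names, Chinese place names and translated names from other languages, this paper puts forward a leveled unknown word resolution strategy that uses dynamic programming to search for the best path. This method successfully resolves the conflicts among these unknown word identifications. Experiments on a real corpus show that the proposed method achieves high performance.
[Keywords]unknown word recognition; decomposition; dynamic programming;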
|