|
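Determining the Reading Order of Text in Layouts with Image-Text Exclusion
[Authors] Jia Juan (贾娟); Chen Kunqiu (陈堃銶); Zhou Donghao (周东浩);
[Summary] Determining the reading order of text in a layout where text and images exclude each other is a difficult point in typesetting and layout understanding. In particular, the multi-column text wrapping around excluded regions that is characteristic of Chinese and other East Asian scripts produces complex spatial relationships that make the reading order ambiguous. To address this problem, a new page layout model is established and a new layout object, PMRegion, is introduced. The paper gives a fast layer-by-layer page decomposition for constructing layout objects and an ordered-tree-based algorithm for determining reading order. The approach has been applied successfully in a professional Chinese and Japanese typesetting system with satisfactory results, and it has theoretical and practical significance for further research on document image understanding.
[Abstract] Detecting the reading order of text in a layout where text is excluded by images is a key problem in document image understanding (DIU) and text typesetting. Especially in Chinese and other oriental languages, text regions in which words are wrapped to the next line when they meet a graphic boundary make the reading order ambiguous. A new layout model, which uses a new page object called PMRegion, is defined. Based on an ordered tree, an algorithm for reading order detection after top-down page decomposition for constructing layout objects ...
[Keywords] computer application; Chinese information processing; reading order; image-text exclusion; ordered tree;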
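| Design of a Real-Time Chinese Screen Paraphrase Engine Based on Multiple Code Pages
[Authors] Li Peifeng (李培峰); Zhu Qiaoming (朱巧明); Qian Peide (钱培德);
[Summary] Chinese characters are currently represented on computers by several code pages, and the coexistence of multiple code pages will persist for a long time. Supporting this coexistence requires automatic recognition of Chinese code pages. The real-time screen paraphrase engine is the core technology of today's online dictionaries and teaching software, but it currently cannot work across code pages and captures words incompletely or incorrectly. To address these problems, this paper describes the system architecture of a Chinese real-time screen paraphrase engine that combines code page auto-recognition based on Chinese internal codes with an optimized automatic screen word-capture technique, and it explains the design of the data dictionary and the key techniques used in that design. In a test on a sample of five million Chinese characters, an online dictionary using this engine achieved a code page recognition rate above 99% for meaningful short strings (excluding single characters).
[Abstract] Nowadays, Chinese characters are represented in computers by various code pages, and this situation will persist for a long time. In order to use all kinds of Chinese code pages, including GB2312, GBK, GB18030, BIG-5, HKSCS and ISO10646/Unicode, at the same time, a technology for automatic recognition of Chinese code pages is required. The Chinese real-time screen paraphrase engine is the key technology for building many kinds of online dictionaries, teaching software and so on. This paper describes the system architecture ...
[Keywords] computer application; Chinese information processing; automatic recognition of Chinese code pages; screen word capture; ISO10646;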
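| A Survey of Content-Based Anti-Spam Email Filtering
[Authors] Wang Bin (王斌); Pan Wenfeng (潘文锋);
[Summary] The spam email problem is growing more serious and has drawn wide attention from researchers. Content-based filtering is one of the mainstream techniques for dealing with spam. Current content-based spam filtering mainly comprises rule-based methods and methods based on probability and statistics. This paper surveys the corpora and evaluation methods used in spam filtering research, and summarizes the filtering techniques in current use together with comparative experiments among them, including Ripper, decision trees, Rough Set, Rocchio, Boosting, Bayes, kNN, SVM and Winnow. The experimental results show that Boosting, Flexible Bayes, SVM and Winnow are currently the better spam filtering methods. Their results on the evaluation corpora have reached a high level, but much work remains before they are truly practical.
[Abstract] The volume of junk email on the Internet has grown tremendously in the past few years and is causing serious problems. Content-based filtering is one of the mainstream technologies used so far. This paper aims to provide an overview of the state of the art in this research field, including benchmark corpora, evaluation methods and filtering approaches. Many filtering approaches, including Ripper, Decision Trees, Rough Sets, Rocchio, Boosting, Bayes, kNN, SVM and Winnow, are discussed and compared in this...
[Keywords] computer application; Chinese information processing; survey; spam email; anti-spam; information filtering; text categorization;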
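| Unsupervised Word Sense Disambiguation Based on k-means Clustering
[Authors] Chen Hao (陈浩); He Tingting (何婷婷); Ji Donghong (姬东鸿);
[Summary] Unsupervised word sense disambiguation avoids the heavy workload of manual sense annotation, can scale to large-scale disambiguation of polysemous words, and has broad application prospects. This paper proposes an unsupervised word sense disambiguation method that builds context vectors from second-order context, clusters them with the k-means algorithm, and finally disambiguates senses by computing similarity. The experiments were carried out on the basis of term extraction, and two groups of tests on several high-frequency Chinese polysemous words achieved good average accuracies of 82.67% and 80.87%.
[Abstract] Unsupervised WSD (word sense disambiguation) avoids the high labor cost of manual annotation and can be adapted to large-scale tasks, so it has extensive applications in many fields. This paper presents an unsupervised approach which constructs context vectors by means of second-order context, clusters them by k-means, and disambiguates by calculating similarity. Our experiments are based on term extraction, and the average accuracy of this method is 82.62% and 80.87% for 8 ambiguous words in an open test.
[Keywords] computer application; Chinese information processing; word sense disambiguation; HowNet; second-order context; k-means clustering;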
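| Automatic Identification of Chinese Prepositional Phrases
[Authors] Gan Junwei (干俊伟); Huang Degen (黄德根);
[Summary] This paper builds an algorithm for identifying Chinese prepositional phrases that combines rules with statistics. First, reliable collocations are extracted automatically from collocation templates formed by a preposition and the right boundary of its prepositional phrase, and these collocations are used to identify prepositional phrases. Then the remaining unprocessed prepositional phrases are identified by combining a part-of-speech-based trigram boundary statistical model with rules. In a cross test on a corpus containing 7323 prepositional phrases, precision reached 87.48% and recall reached 87.27%.
[Abstract] This paper proposes a hybrid algorithm to identify Chinese prepositional phrases. The algorithm is composed of two steps. First, it automatically extracts reliable frames according to frame templates that consist of the preposition and the right border of the prepositional phrase, and then identifies part of the prepositional phrases using these frames. Second, the algorithm integrates a statistical model based on part-of-speech and rules to identify the prepositional phrases that haven't bee...
[Keywords] computer application; Chinese information processing; phrase identification; prepositional phrase;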
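| Anaphora Resolution of Chinese Personal Pronouns Using a Preference Selection Strategy
[Authors] Li Guochen (李国臣); Luo Yunfei (罗云飞);
[Summary] Anaphora is a common phenomenon in natural language, and anaphora resolution is an important task in text information processing. As applications involving discourse processing become more widespread, anaphora resolution is showing unprecedented importance. Based on the characteristics of anaphora in Chinese personal pronouns, this paper proposes a corpus-based resolution method that uses a decision tree machine learning algorithm combined with a preference selection strategy. The method takes full account of several attributes related to anaphora and of the interactions among them. Experiments show that the method achieves some success in resolving Chinese personal pronouns, especially third-person pronouns.
[Abstract] Anaphora is a common phenomenon in natural language, and anaphora resolution plays an important role in text information processing. With the increasing development of discourse processing, anaphora resolution shows unprecedented importance. In this paper, according to the features of Chinese personal pronouns, we present a corpus-based approach. It adopts the decision tree algorithm and combines it with a preference selection strategy. The method takes into ac...
[Keywords] computer application; Chinese information processing; corpus; personal pronoun; anaphora resolution; optimal selection;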
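| Quantitative Identification of Dependency Relations Between Words
[Authors] Wang Jianhui (王建会); Wang Lei (王雷); Hu Yunfa (胡运发);
[Summary] This paper extends and improves existing algorithms for the quantitative identification of dependency relations between words, taking full account of the influence of the probability distribution of terms. It explicitly distinguishes collocation, coordination and subordination relations between terms and proposes a different identification algorithm for each according to its characteristics. It proposes a string matching model. Taking full account of the discrete distribution of the relative positions of two terms, the influence of their distance, and the characteristics of their probability distributions, it proposes a model of dependency strength between terms and uses it to build a dependency tree between words. It also proposes an update strategy that prunes an already built dependency tree and mines latent dependency relations. Experimental results in application show that the proposed algorithm can effectively identify dependency relations between words.
[Abstract] In order to identify dependency relations between words efficiently and accurately on a statistical basis, this paper remedies some of the shortcomings of present algorithms by making the best of the distribution characteristics of words, distinguishing the collocation, coordination and subordination relations between words, identifying them by different strategies, presenting a new model of matching between strings and a new model of dependency strength between words, constructing...
[Keywords] computer application; Chinese information processing; word collocation; dependency relation; quantitative identification;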
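| Using Logic and Discourse Knowledge to Constrain Template Matching: Applying Knowledge of Logical Structure and Discourse Structure in Information Extraction
[Authors] Yuan Yulin (袁毓林);
[Summary] Taking the corpus of reference [2] as its main object, this paper discusses how the logical structure and discourse structure of sentences constrain the type of information template, and how they constrain the recovery of information items that are missing from the current sentence or expressed by pronouns and similar forms. It first explains what is meant by knowledge of argument-structure-based logical structure and discourse structure, and then analyzes how negation operators and tense-aspect elements change the type of an event and its matching relation with the relevant event templates. Next, it discusses how syntactic operations such as embedding of a verb's argument structure and nominalization change the distribution of the relevant arguments and the corresponding information items. Finally, it discusses how to use discourse structure knowledge to recover information items that are missing from the sentence or expressed by pronouns or demonstratives.
[Abstract] This paper demonstrates how to use knowledge of logical and discourse structure to constrain template matching in information extraction (IE), and to recover information items that are missing or expressed by pronouns or deixis. It first explains what the knowledge of argument-structure-based logical structure and discourse structure is. Then it illustrates how the negative and aspect operators can change the event type of a sentence and the matching relation between the sentence and th...
[Keywords] computer application; Chinese information processing; information extraction; argument structure; logical structure; discourse structure; pronoun; demonstrative;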
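| A Mutual-Information-Based Smoothing Technique for Statistical Language Models
[Authors] Huang Yongwen (黄永文); He Zhongshi (何中市);
[Summary] Data smoothing is mainly used to address the data sparseness problem that statistical language models face in practical applications. Existing smoothing techniques handle data sparseness effectively, but they do not effectively analyze whether the frequency distribution of events that have already occurred is reasonable. For the bigram model, this paper proposes a smoothing technique based on mutual information. The basic idea is to discount or compensate the probability of each bigram in the model according to how high or low its mutual information is, with the principle of minimizing perplexity used to reflect the reasonableness of the model. Experimental results show that the technique outperforms the commonly used Katz smoothing.
[Abstract] Smoothing techniques are mainly used to solve the problem of sparse data in statistical language models. Present smoothing techniques solve the data sparseness problem effectively but do not further analyze the reasonableness of the frequency distribution of observed events. This paper presents a new smoothing technique based on mutual information for the bigram model. The model parameters, the bigram probabilities, are discounted or compensated according to the mutual information, whose...
[Keywords] computer application; Chinese information processing; statistical language model; smoothing technique; mutual information; perplexity;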
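| Fast License Plate Image Matching Based on Genetic and BP Algorithms
[Authors] Liu Shuan (刘栓); Meng Qingchun (孟庆春);
[Summary] A genetic-algorithm-based BP neural network algorithm is applied to license plate image matching in intelligent transportation, combining the advantages of the genetic algorithm and the BP algorithm. The neural network image matching algorithm first uses genetic learning for global optimization and then uses the BP algorithm for precise training, optimizing the learning and training of the BP (backpropagation) neural network weights. Experimental results show that the designed algorithm meets the matching requirements well and correctly matches target images against sample images. Its matching probability reaches 92%, compared with only 79% for the traditional BP neural network, and its matching speed is also clearly better than that of the traditional BP neural network and other improved algorithms. The algorithm is accurate, convergent and fast in matching.
[Abstract] An improved algorithm based on the genetic algorithm and the BP algorithm is used for image matching of license plates. It combines the virtues of the two kinds of algorithms: it first optimizes the learning process of the BP network with the genetic algorithm, then trains the network accurately with the BP algorithm, and the matching result is obtained according to the resulting weights and thresholds. Experimental results show that this improved image matching algorithm can better meet the matching requirement. And in matching rate it can reac...
[Keywords] artificial intelligence; pattern recognition; genetic algorithm; BP (backpropagation) neural network; image matching; wavelet transform;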