中国中文信息学会
Chinese Information Processing Society of China
Browse by year and issue (latest data: 2001, No. 6)
Design and Implementation of a Bilingual Cross-Language Categorization Model
[Authors] Lin Hongfei (林鸿飞); Wang Jianfeng (王剑峰)

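[Abstract (from the Chinese)] Sharing information resources in different languages on the Internet through cross-language categorization is an important method of knowledge mining. This paper presents a model of bilingual cross-language categorization and a method for implementing it. The main idea is that neither machine translation nor manual annotation is required: a text feature extraction mechanism extracts category feature terms and text feature terms; translation mapping rules based on concept expansion automatically generate the category and text feature vectors; on this basis, latent semantic analysis unifies the bilingual texts at the semantic level; and texts are classified by the semantic similarity between category and text, which yields relatively high precision.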

[Abstract] Applying multilingual text categorization to share information sources on the Internet is essential to knowledge discovery. This paper presents a model for bilingual text categorization. It uses a text feature extraction mechanism to extract the features of classes and texts, and it generates the feature vectors of classes and texts by a word translation rule based on concept expansion. It then uses Latent Semantic Indexing to integrate the bilingual t...
[Keywords] bilingual cross-language text categorization; concept expansion; latent semantic analysis; vector space model
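
The pipeline in this abstract (map features across languages, build vectors, classify by similarity) can be sketched roughly as follows. This is only an illustration under stated assumptions: the translation table `ZH_TO_EN`, the category vectors in `CLASSES`, and the equal-split weighting are all hypothetical, and the latent semantic analysis step is omitted, so similarity is computed directly on the mapped term vectors.

```python
import math
from collections import Counter

# Hypothetical toy translation table with "concept expansion":
# each Chinese feature term maps to several English candidates.
ZH_TO_EN = {
    "足球": ["football", "soccer"],
    "比赛": ["match", "game"],
    "股票": ["stock", "share"],
    "市场": ["market"],
}

# Hypothetical category feature vectors expressed in English terms.
CLASSES = {
    "sports": Counter({"football": 1.0, "soccer": 1.0, "match": 1.0, "game": 1.0}),
    "finance": Counter({"stock": 1.0, "share": 1.0, "market": 1.0}),
}


def map_to_english(zh_terms):
    """Map Chinese feature terms to an English term vector (term -> weight)."""
    vec = Counter()
    for term in zh_terms:
        candidates = ZH_TO_EN.get(term, [])
        for en in candidates:
            # Assumption: spread one unit of weight evenly over the candidates.
            vec[en] += 1.0 / len(candidates)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text_vec, class_vecs):
    """Return the category whose vector is most similar to the text vector."""
    return max(class_vecs, key=lambda name: cosine(text_vec, class_vecs[name]))
```

A full implementation along the lines the abstract describes would project both kinds of vector into a reduced latent space before computing the similarity.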



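An Unrestricted, Dictionary-Free Word Extraction Method for Chinese Documents
[Authors] Jin Xiangyu (金翔宇); Sun Zhengxing (孙正兴); Zhang Fuyan (张福炎)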

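[Abstract (from the Chinese)] This paper proposes an unrestricted, dictionary-free word extraction model. The model uses a self-growing algorithm to acquire the combination patterns of Chinese characters in Chinese documents, and introduces concepts such as support and confidence to filter lexical entries. Experiments show that, without dictionary support and without learning from a corpus, the algorithm extracts medium- and high-frequency lexical entries from Chinese documents quickly and accurately. It suits Chinese information processing applications that are sensitive to term frequency and demand high computation speed, such as real-time automatic document classification systems.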

[Abstract] This paper presents a domain-independent, dictionary-free lexical acquisition model. It introduces a self-increasing algorithm to acquire the co-occurrence patterns of Chinese characters, and criteria such as support and confidence to filter these co-occurrence patterns into lexical items. Experiments show that it acquires high-frequency lexical items effectively and efficiently without the support of a dictionary or supervised learning from a corpus. The model...
[Keywords] Chinese information processing; automatic word segmentation; unrestricted dictionary-free word extraction; Chinese character combination patterns
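
One way to read "self-growing patterns filtered by support and confidence" is sketched below. The abstract does not give the algorithm's details, so everything here is an assumption: the function name `extract_words`, the thresholds, the definition of confidence as count(pattern) / count(prefix), and the final pruning of sub-patterns are illustrative choices, not the authors' method.

```python
from collections import Counter


def extract_words(text, min_support=3, min_confidence=0.8, max_len=4):
    """Grow character patterns one character at a time, keeping those that
    pass the support and confidence thresholds (assumed definitions)."""
    # Support: raw occurrence count of every character n-gram up to max_len.
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    # Start from frequent single characters.
    frontier = {c for c, k in counts.items() if len(c) == 1 and k >= min_support}
    accepted = {}
    for n in range(2, max_len + 1):
        grown = set()
        for pattern, support in counts.items():
            if len(pattern) != n or support < min_support:
                continue
            prefix = pattern[:-1]
            if prefix not in frontier:
                continue  # only extend patterns that survived the last round
            confidence = support / counts[prefix]
            if confidence >= min_confidence:
                grown.add(pattern)
                accepted[pattern] = (support, confidence)
        frontier = grown
    # Assumed pruning step: drop a pattern when a longer accepted pattern
    # contains it and occurs exactly as often.
    return {
        word: stats
        for word, stats in accepted.items()
        if not any(
            word != other and word in other and accepted[other][0] == stats[0]
            for other in accepted
        )
    }
```

On the toy string `"信息处理系统。信息处理技术。信息处理方法。"` the only surviving pattern is `信息处理`, with support 3 and confidence 1.0. Real text would need punctuation handling and tuned thresholds.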



Statistical Analysis of Ancient Chinese Texts for Information Retrieval
[Authors] Zhang Min (张敏); Ma Shaoping (马少平)

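[Abstract (from the Chinese)] To meet the needs of information retrieval over ancient Chinese books, this paper presents a statistical analysis of ancient Chinese on a large-scale corpus. It first gives a method for dynamically merging knowledge from multiple special-purpose corpora in information processing. On this basis, it analyzes a corpus of ancient Chinese books totaling 35 million characters and concludes that, in ancient Chinese, usage is concentrated on high-frequency characters and quite scattered over low-frequency characters, with an overall pattern of exponential decrease; it also analyzes bigrams. It then compares the results with single characters and two-character sequences in modern Chinese, draws the corresponding conclusions, and classifies the characters of ancient Chinese by frequency of use. Finally, the knowledge learned from these statistics has been put to practical use in an information retrieval system for ancient Chinese books.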

[Abstract] Based on the needs of information retrieval technology for ancient Chinese books, we carried out statistical analyses of ancient Chinese on a large-scale corpus. First, we propose a method to combine corpora from different fields. Using this method, we analyzed the statistics of ancient Chinese on more than 35,000,000 characters. The results show that the commonly used characters are concentrated while the remainder is diffuse, with frequency decreasing exponentially. We then give further analyses of bigrams. Comparisons are ...
[Keywords] information retrieval; ancient book retrieval; character frequency statistics; bigram; Chinese information processing
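
The kind of statistics described here (character frequencies, how much of the text the most frequent characters cover, and bigram counts) can be computed along these lines. This is a minimal sketch: the function names are hypothetical, and the short sample in the usage note is a well-known line used purely for illustration, not the paper's corpus.

```python
from collections import Counter


def char_stats(text):
    """Count single characters and adjacent character pairs (bigrams)."""
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)
    bigrams = Counter(a + b for a, b in zip(chars, chars[1:]))
    return unigrams, bigrams


def coverage(unigrams, top_k):
    """Fraction of all character occurrences covered by the top_k characters."""
    total = sum(unigrams.values())
    top = sum(count for _, count in unigrams.most_common(top_k))
    return top / total if total else 0.0
```

For the 19-character sample `"学而时习之不亦说乎有朋自远方来不亦乐乎"`, the three characters that occur twice cover 6/19 of the text. Plotting `coverage` against `top_k` on a large corpus is one way to see the concentration on high-frequency characters that the abstract reports.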



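A Distance-Weighted Statistical Language Model and Its Application
[Authors] Jin Ling (金凌); Wu Wenhu (吴文虎); Zheng Fang (郑方); Wu Genqing (吴根清)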

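[Abstract (from the Chinese)] In constructing statistical language models, this paper proposes integrating inter-word distance information into the N-gram statistical language model, and calls the result the distance-weighted associated-word statistical language model. The model can take into account the relations between non-adjacent words in a sentence. Based on the principle that "the closer two words are, the closer their relation", it introduces distance information through a distance weighting function to improve the model's predictive power. The paper also applies the model to a Chinese whole-sentence pinyin input method system. Experiments show that, compared with the traditional N-gram statistical language model, the Chinese character error rate is reduced and model performance improves to some extent.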

[Abstract] This paper proposes a novel language model based on the traditional N-gram model, into which inter-word distance information is integrated; the model is therefore referred to as the distance-weighted statistical language model. In this model, the relationship between non-adjacent words is taken into consideration. Based on the principle that words closer in distance have a closer relation, a distance-weighted function is used to integrate the information so as to improve the model's prediction ...
[Keywords] N-gram; associated-word model; distance weighting; data smoothing
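
The idea of letting earlier, non-adjacent words vote on the next word, with nearer words counting more, might look like the sketch below. The abstract does not give the weighting function, so the geometric decay `decay ** (d - 1)`, the window `max_dist`, and the names `train` and `score` are all assumptions. Data smoothing, which the keywords mention, is left out.

```python
from collections import Counter, defaultdict


def train(sentences, max_dist=3):
    """Count word pairs separately for each distance d = 1..max_dist."""
    pair = defaultdict(Counter)   # pair[d][(earlier_word, later_word)]
    left = defaultdict(Counter)   # left[d][earlier_word]
    for words in sentences:
        for i, word in enumerate(words):
            for d in range(1, max_dist + 1):
                if i - d >= 0:
                    pair[d][(words[i - d], word)] += 1
                    left[d][words[i - d]] += 1
    return pair, left


def score(history, candidate, model, max_dist=3, decay=0.5):
    """Distance-weighted average of P(candidate | word d positions back)."""
    pair, left = model
    total, norm = 0.0, 0.0
    for d in range(1, min(max_dist, len(history)) + 1):
        context = history[-d]
        weight = decay ** (d - 1)  # assumed weighting: nearer words count more
        norm += weight
        if left[d][context]:
            total += weight * pair[d][(context, candidate)] / left[d][context]
    return total / norm if norm else 0.0
```

In an input method, a score of this kind could rank the candidate words that share a pinyin reading. With `decay` set to 0, Python evaluates `0 ** 0` as 1, so only the adjacent word contributes and the score reduces to a plain bigram estimate.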



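Knowledge Graph Analysis of Logic Words in Natural Language Processing
[Authors] Zhang Lei (张蕾); Li Xueliang (李学良); Liu Xiaodong (刘小冬)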

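[Abstract (from the Chinese)] Knowledge graphs are a new method of knowledge representation. Starting from an ontological perspective, this paper compares the ontology of knowledge graphs with the ontologies of three knowledge representations, those of Aristotle, Kant, and Peirce. The comparison shows the effectiveness and primitiveness of the knowledge graph method and indicates that knowledge graphs are a more general method of knowledge representation. From the viewpoint of knowledge graph ontology, the paper studies the knowledge graph representation of various kinds of logic words. Taking the characteristics of Chinese into account, it studies and reveals, from a structural perspective, the commonalities and regularities of logic words, and further clarifies the knowledge graph idea that "structure is meaning". The knowledge graph analysis of logic words will lay a foundation for building dictionaries for natural language analysis.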

[Abstract] Knowledge graph theory is a new method of knowledge representation. In this paper, we compare the knowledge graph ontology with other ontologies, such as Aristotle's, Kant's, and Peirce's. The comparison indicates that knowledge graph theory is more primitive than the others. On the basis of this comparison, the classification of logic words in natural language processing is also studied. The logic words are classified into two kinds according to their different structures in knowledge graphs. For each kind of logic word, we analyze the w...
[Keywords] natural language processing; knowledge graph; ontology; logic words; word graph



A Solution for Chinese Character Conversion Between IBM Mainframes and Minicomputers
[Authors] Zhai Linghui (翟凌慧); Ma Shaoping (马少平)

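[Abstract (from the Chinese)] This paper describes the data conversion problem that arises when Chinese data is transmitted via CICS between the IBM mainframe ES/9000 (running the MVS/VSE operating system) and the minicomputer RS/6000 (running the AIX operating system). It analyzes why Chinese characters in EBCDIC code and Chinese characters in ASCII code cannot be converted correctly through CICS configuration alone, and gives two solutions: the first implements Chinese character conversion by combining a CICS program, a Java program, and CICS configuration; the second implements it with only a Java program and CICS configuration.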

[Abstract] This paper describes the data conversion problem that arises when CICS on the mainframe ES/9000 (based on the MVS/VSE operating system) and CICS on the minicomputer RS/6000 (based on the AIX operating system) communicate with each other. We discuss why Chinese characters in EBCDIC and in ASCII cannot be converted to each other with CICS. We then provide two solutions to this problem: one is implemented by combining a CICS program, a Java program, and CICS configuration; the other is achieved through the combination of...
[Keywords] Chinese character conversion; CICS; Java; EBCDIC; IBM; ES/9000; RS/6000; 1381; 935
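
The abstract does not spell out why the conversion fails, so the following sketch illustrates only one structural point that any host-to-workstation converter has to handle. In IBM mixed single-byte/double-byte EBCDIC data, double-byte runs are conventionally bracketed by shift-out (0x0E) and shift-in (0x0F) control bytes. The sketch splits a host field into runs on those bytes. It is an assumption-laden illustration, not the paper's solution: the function names are hypothetical, `cp037` (a single-byte EBCDIC codec in the Python standard library) stands in for the real host code page, and the double-byte values in the usage note are placeholders that are not looked up in any conversion table.

```python
SO, SI = 0x0E, 0x0F  # shift-out / shift-in control bytes in mixed EBCDIC data


def split_mixed_ebcdic(data):
    """Split a mixed host field into ('sbcs', bytes) and ('dbcs', bytes) runs."""
    runs, mode, buf = [], "sbcs", bytearray()
    for byte in data:
        if mode == "sbcs" and byte == SO:
            if buf:
                runs.append((mode, bytes(buf)))
            mode, buf = "dbcs", bytearray()
        elif mode == "dbcs" and byte == SI:
            if len(buf) % 2:
                raise ValueError("odd number of bytes in a double-byte run")
            if buf:
                runs.append((mode, bytes(buf)))
            mode, buf = "sbcs", bytearray()
        else:
            buf.append(byte)
    if mode == "dbcs":
        raise ValueError("double-byte run is missing its shift-in byte")
    if buf:
        runs.append((mode, bytes(buf)))
    return runs


def describe(runs):
    """Decode single-byte runs with the stand-in codec; summarize the rest."""
    out = []
    for mode, chunk in runs:
        if mode == "sbcs":
            out.append(chunk.decode("cp037"))  # stand-in single-byte code page
        else:
            out.append("<%d double-byte chars>" % (len(chunk) // 2))
    return out
```

For a field built as `"ID".encode("cp037") + bytes([SO, 0x4A, 0x41, 0x4B, 0x42, SI]) + "OK".encode("cp037")`, `describe(split_mixed_ebcdic(field))` yields `["ID", "<2 double-byte chars>", "OK"]`. A real converter would map each double-byte pair through the appropriate code page tables, presumably the 935 and 1381 named in the keywords. Because the shift bytes have no counterpart on the ASCII side, the converted field generally differs in length.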



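Design and Implementation of a General-Purpose Interface for Modern Chinese Word Segmentation Systems
[Authors] Lou Ting (娄珽); Song Rou (宋柔); Li Weiliang (李卫亮); Luo Zhiyong (罗智勇)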

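[Abstract (from the Chinese)] Automatic word segmentation of modern Chinese text is an important cornerstone of Chinese information processing, so providing a general-purpose segmentation interface is very important. This paper sets out the goals of a general-purpose segmentation interface and discusses its principle and design. The system has been preliminarily implemented.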

[Abstract] Automatic word segmentation of modern Chinese text is the basis of Chinese information processing, so a general-purpose application interface for word segmentation is important. In this paper, the goal of a general-purpose word segmentation system is presented, and its principle and design schemes are discussed. The development of this system is essentially complete.
[Keywords] Chinese information processing; Chinese word segmentation; general-purpose interface



Some Non-Grammatical Factors in Determining Word Segmentation Units
[Authors] Feng Zhiwei (冯志伟)

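[Abstract (from the Chinese)] This paper proposes the concept of the "formal word" and, on that basis, further studies the non-grammatical factors in determining word segmentation units, including semantic factors and phonetic factors. For semantic factors, it studies the semantic simplicity test, the semantic tightness test, the derived meaning test, and the common usage test; for phonetic factors, it studies the pause criterion and the disyllabification criterion. Finally, it proposes non-linguistic principles for determining segmentation units, such as the visual readability principle, the pluralism principle, and the domain-specificity principle.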

[Abstract] Based on the concept of the formal word, the author investigates the non-grammatical factors in determining segmentation units, which include semantic factors and phonetic factors. He puts forward the corresponding approaches: the semantic simplicity test, the semantic tightness test, the semantic derivation test, the pause test, and the disyllabism test. Some non-linguistic principles are also studied: the easy-reading principle, the pluralism principle, and the domain-oriented principle.
[Keywords] theoretical word; formal word; semantic simplicity test; semantic tightness test; derived meaning test; pause criterion; disyllabification criterion; visual readability principle; pluralism principle; domain-specificity principle



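The Rationality and Urgency of Word-Segmented Writing in Written Chinese and Its Implementation
[Authors] Li Huiyang (李辉阳); Han Zhongyuan (韩忠愿); Zhou Jingye (周经野)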

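[Abstract (from the Chinese)] In light of the development of information processing technology, this paper points out the rationality and urgency of adopting word-segmented writing in written Chinese. It proposes that this idea be incorporated into the relevant Chinese information processing standards and embodied in future information platforms (such as eBook and the WWW). It also gives a feasible solution to a problem faced in practice: how to accommodate the reading and writing habits that people have formed over a long time.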

[Abstract] In light of the development of information processing technology, this paper points out the rationality and urgency of link writing for Chinese words in written text. It also puts forward the necessity of bringing this idea into the CIP standards and embodying it in future information platforms (such as eBook and the WWW). In addition, it provides a feasible solution to the problem of how to accommodate people's long-standing reading and writing habits when implementing link writing.
[Keywords] word-segmented writing; Chinese information platform; Chinese information processing



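Prosodic Phrase Segmentation Based on Statistics of Part-of-Speech Features at Boundary Points
[Authors] Niu Zhengyu (牛正雨); Chai Peiqi (柴佩琪)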

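[Abstract (from the Chinese)] A rule-based text processing system requires a large number of rules to be compiled when it is built, and its robustness on large-scale real text is hard to guarantee. This paper therefore explores the use of statistical methods for prosodic phrase segmentation. The text is first automatically segmented into words and tagged with parts of speech. The boundary patterns and probability information of prosodic phrase break points, obtained from a manually annotated corpus, are then used to predict the break points in the text automatically. Finally, rules are applied for appropriate error correction. In closed and open tests on 1,000 sentences of real text, part-of-speech tagging accuracy is about 95%, the recall of prosodic phrase segmentation is about 60%, and its precision reaches 80%.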

[Abstract] It is often difficult to construct a rule-based parser and adapt it to large-scale real text, so we tried a statistical approach to prosodic phrasing. First, the text is segmented into Chinese words; the word sequences are then tagged automatically by a POS tagger. The boundary patterns and boundary distribution probabilities are used in the algorithm to predict phrase breaks. The boundary distribution probabilities are derived from a hand-annotated corpus. The errors caused by the statistical method are corrected by rule...
[Keywords] prosodic phrase segmentation; automatic part-of-speech tagging; corpus; statistical methods
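
The statistical step described here (learn how often a break follows a given part-of-speech boundary, predict breaks from those probabilities, then measure precision and recall) can be sketched as follows. The abstract does not define the boundary pattern, so the use of a (left tag, right tag) pair, the 0.5 threshold, and all function names are assumptions. The rule-based correction pass is not modeled.

```python
from collections import Counter


def train_boundary_model(annotated):
    """Estimate P(break | left POS, right POS) from hand-annotated sentences.

    Each sentence is a list of (pos_tag, break_after) pairs.
    """
    seen, breaks = Counter(), Counter()
    for sentence in annotated:
        for (pos, has_break), (next_pos, _) in zip(sentence, sentence[1:]):
            seen[(pos, next_pos)] += 1
            if has_break:
                breaks[(pos, next_pos)] += 1
    return {pattern: breaks[pattern] / n for pattern, n in seen.items()}


def predict_breaks(pos_tags, model, threshold=0.5):
    """Predict a break at each word boundary whose probability is high enough."""
    return [
        model.get((left, right), 0.0) >= threshold
        for left, right in zip(pos_tags, pos_tags[1:])
    ]


def precision_recall(predicted, gold):
    """Precision and recall of predicted break points against gold labels."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p and g)
    n_pred, n_gold = sum(predicted), sum(gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall
```

Raising the threshold in this sketch makes the predictor more conservative, so fewer breaks are proposed and recall can only fall. Whether that trade-off lies behind the reported figures (recall about 60%, precision 80%) is not something the abstract states.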


