Lemmatization in Java

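I am looking for a lemmatisation implementation for English in Java. I have already found a few, but I need one that does not need too much memory to run (1 GB at most). Thanks. I do not need a stemmer.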

The Stanford CoreNLP Java library contains a lemmatizer that is a little resource-intensive, but I have run it on my laptop with <512MB of RAM.
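Since the question sets a 1 GB ceiling, it can help to confirm what heap limit the JVM is actually running under before loading any models. The following is a minimal sketch using only the JDK; the class name `HeapBudget` and its helper methods are made up for illustration, and the value reported by `Runtime.maxMemory()` can come out slightly below the `-Xmx` setting depending on the garbage collector.

```java
// Sketch: report the JVM heap ceiling so a run can be checked against a memory budget.
// HeapBudget is a hypothetical helper, not part of CoreNLP.
final class HeapBudget {

    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns true if the given maximum heap size fits within the budget (in MiB). */
    static boolean fitsBudget(long maxHeapBytes, long budgetMiB) {
        return toMiB(maxHeapBytes) <= budgetMiB;
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap the JVM will attempt to use.
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(max) + " MiB");
        System.out.println("Within 1 GiB budget: " + fitsBudget(max, 1024));
    }
}
```

Launching with something like `java -Xmx512m HeapBudget` should report a value at or just under 512 MiB; the same `-Xmx` flag can then be used when running the lemmatizer itself.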

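To use it:

Download the jar files;
Create a new project in your editor of choice, or make an ant script, that includes all of the jar files contained in the archive you just downloaded;
Create a new Java class as shown below (based on the snippet from Stanford's site);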

    import java.util.Properties;

    public class StanfordLemmatizer {

        protected StanfordCoreNLP pipeline;

        public StanfordLemmatizer() {
            // Create StanfordCoreNLP object properties, with POS tagging
            // (required for lemmatization), and lemmatization
            Properties props;
            props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma");

            // StanfordCoreNLP loads a lot of models, so you probably
            // only want to do this once per execution
            this.pipeline = new StanfordCoreNLP(props);
        }

        public List<String> lemmatize(String documentText) {
            List<String> lemmas = new LinkedList<String>();

            // create an empty Annotation just with the given text
            Annotation document = new Annotation(documentText);

            // run all Annotators on this text
            this.pipeline.annotate(document);

            // Iterate over all of the sentences found
            List<CoreMap> sentences = document.get(SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                // Iterate over all tokens in a sentence
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    // Retrieve and add the lemma for each word into the list of lemmas
                    lemmas.add(token.get(LemmaAnnotation.class));
                }
            }

            return lemmas;
        }
    }

Chris's answer about the Stanford Lemmatizer is great! Absolutely beautiful. He even included a pointer to the jar files, so I didn't have to google for them.

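But one of his lines of code had a syntax error (he somehow swapped the closing parentheses and the semicolon at the end of the line that starts with "lemmas.add..."), and he forgot to include the imports.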

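As for the NoSuchMethodError, it is usually caused by the method not being public static, but if you look at the code itself (at http://grepcode.com/file/repo1.maven.org/maven2/com.guokr/stan-cn-nlp/0.0.2/edu/stanford/nlp/util/Generics.java?av=h) that is not the problem. I suspect the problem is somewhere in the build path (I use Eclipse Kepler, so configuring the 33 jar files I use in my project was no problem).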
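If the build path is the suspect, one quick check is to ask the JVM which jar a class was actually loaded from, since an older copy of the same class elsewhere on the classpath is a common cause of this kind of error. This is only a sketch using standard JDK reflection; `ClassOrigin` is a made-up helper name, and classes loaded by the bootstrap loader report no location.

```java
import java.security.CodeSource;

// Sketch: print the jar or directory a class was loaded from.
// ClassOrigin is a hypothetical helper, not part of CoreNLP.
final class ClassOrigin {

    /** Returns the load location of the class, or a marker when none is available. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " -> " + locate(Class.forName(name)));
    }
}
```

Run with the project's classpath and `edu.stanford.nlp.util.Generics` as the argument, this should show which of the jars is supplying that class.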

Below is my minor correction of Chris's code, along with an example (my apologies to Evanescence for butchering their perfect lyrics):

    import java.util.LinkedList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class StanfordLemmatizer {

        protected StanfordCoreNLP pipeline;

        public StanfordLemmatizer() {
            // Create StanfordCoreNLP object properties, with POS tagging
            // (required for lemmatization), and lemmatization
            Properties props;
            props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma");

            /*
             * This is a pipeline that takes in a string and returns various analyzed linguistic forms.
             * The String is tokenized via a tokenizer (such as PTBTokenizerAnnotator),
             * and then other sequence model style annotation can be used to add things like lemmas,
             * POS tags, and named entities. These are returned as a list of CoreLabels.
             * Other analysis components build and store parse trees, dependency graphs, etc.
             *
             * This class is designed to apply multiple Annotators to an Annotation.
             * The idea is that you first build up the pipeline by adding Annotators,
             * and then you take the objects you wish to annotate and pass them in and
             * get in return a fully annotated object.
             *
             * StanfordCoreNLP loads a lot of models, so you probably
             * only want to do this once per execution
             */
            this.pipeline = new StanfordCoreNLP(props);
        }

        public List<String> lemmatize(String documentText) {
            List<String> lemmas = new LinkedList<String>();

            // Create an empty Annotation just with the given text
            Annotation document = new Annotation(documentText);

            // run all Annotators on this text
            this.pipeline.annotate(document);

            // Iterate over all of the sentences found
            List<CoreMap> sentences = document.get(SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                // Iterate over all tokens in a sentence
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    // Retrieve and add the lemma for each word into the
                    // list of lemmas
                    lemmas.add(token.get(LemmaAnnotation.class));
                }
            }

            return lemmas;
        }

        public static void main(String[] args) {
            System.out.println("Starting Stanford Lemmatizer");
            String text = "How could you be seeing into my eyes like open doors? \n"+
                    "You led me down into my core where I've became so numb \n"+
                    "Without a soul my spirit's sleeping somewhere cold \n"+
                    "Until you find it there and led it back home \n"+
                    "You woke me up inside \n"+
                    "Called my name and saved me from the dark \n"+
                    "You have bidden my blood and it ran \n"+
                    "Before I would become undone \n"+
                    "You saved me from the nothing I've almost become \n"+
                    "You were bringing me to life \n"+
                    "Now that I knew what I'm without \n"+
                    "You can've just left me \n"+
                    "You breathed into me and made me real \n"+
                    "Frozen inside without your touch \n"+
                    "Without your love, darling \n"+
                    "Only you are the life among the dead \n"+
                    "I've been living a lie, there's nothing inside \n"+
                    "You were bringing me to life.";
            StanfordLemmatizer slem = new StanfordLemmatizer();
            System.out.println(slem.lemmatize(text));
        }
    }

Here are my results (I was very impressed; it resolved contractions such as the "'s" in "there's", and got almost everything else right):

    Starting Stanford Lemmatizer
    Adding annotator tokenize
    Adding annotator ssplit
    Adding annotator pos
    Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
    Adding annotator lemma

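    [how could you see into my eyes like open doors, you, lead, I, down, into, my, core, where, without, soul, my spirit, sleep, somewhere, cold, until you find it there and lead it back home you wake me up inside, call, my name, and, save me, and, my, my, my, my, my, my, from now on, I, now, I, now, I, now, what, I, now, almost, become, you, be, no matter, leave, I, you, breathe, into, I, and, make me, real, freeze, inside, without, you, touch, without, you, love, darling, only you, among, the dead, I, have, be, live, a, lie, there, be, nothing, inside, you, be, bring, I, life, .]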
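The returned list contains one entry per token, punctuation included, so a typical next step is to tally the lemmas. A minimal sketch using only the JDK follows; `LemmaCounts` is a made-up helper, and the sample list in `main` is hand-written for illustration, not real lemmatizer output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch: count how often each lemma occurs, skipping punctuation-only tokens.
// LemmaCounts is a hypothetical helper, not part of CoreNLP.
final class LemmaCounts {

    /** Counts lemmas that contain at least one letter, ignoring case. */
    static Map<String, Integer> count(List<String> lemmas) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String lemma : lemmas) {
            // Skip nulls and tokens with no letters, such as "?" or ".".
            if (lemma == null || lemma.chars().noneMatch(Character::isLetter)) {
                continue;
            }
            counts.merge(lemma.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hand-written sample input, assumed for illustration only.
        List<String> sample = Arrays.asList("you", "wake", "I", "up", "inside", ".", "you", "save", "I");
        System.out.println(count(sample));
    }
}
```

In a real run the argument would be the list returned by `slem.lemmatize(text)`.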

There is a JNI binding for hunspell, which is the spell checker used in OpenOffice and Firefox. http://hunspell.sourceforge.net/

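You can try a free Lemmatizer API here: http://twinword.com/lemmatizer.php

Scroll down to find the Lemmatizer endpoint.

This will let you turn "dogs" into "dog" and "abilities" into "ability".

If you pass in a POST or GET parameter called "text" with a string like "walked plants":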

    // These code snippets use an open-source library. http://unirest.io/java
    HttpResponse<JsonNode> response = Unirest.post("[ENDPOINT URL]")
        .header("X-Mashape-Key", "[API KEY]")
        .header("Content-Type", "application/x-www-form-urlencoded")
        .header("Accept", "application/json")
        .field("text", "walked plants")
        .asJson();

You get a response like this:

    {
      "lemma": {
        "plant": 1,
        "walk": 1
      },
      "result_code": "200",
      "result_msg": "Success"
    }
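For anyone who would rather not pull in Unirest, the request body for that `application/x-www-form-urlencoded` call is easy to build with the JDK alone. This sketch only shows the encoding step; `FormBody` is a made-up helper name, and actually sending the request (endpoint URL, API key header) is left out because those values are placeholders in the snippet above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a single form-encoded "name=value" pair for a POST body.
// FormBody is a hypothetical helper for illustration.
final class FormBody {

    /** Encodes one field as application/x-www-form-urlencoded using UTF-8. */
    static String encode(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints text=walked+plants
        System.out.println(encode("text", "walked plants"));
    }
}
```

The resulting string can be written as the body of any HTTP client's POST request with the same headers as in the Unirest example.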

Check out Lucene Snowball.
