How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:

    import org.jsoup.Jsoup;

    public class NewClass {
        public String noTags(String str) {
            return Jsoup.parse(str).text();
        }

        public static void main(String args[]) {
            String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">"
                + "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
            NewClass text = new NewClass();
            System.out.println((text.noTags(strings)));
        }
    }

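And I get this result:

    hello world yo googlez

But I want the line breaks kept:

    hello world
    yo googlez

I have looked at jsoup's TextNode#getWholeText(), but I can't figure out how to use it.

If there is a <br> in the markup I am parsing, how can I get a line break in my resulting output?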

The real solution that preserves line breaks should look like this:

    public static String br2nl(String html) {
        if (html == null)
            return html;
        Document document = Jsoup.parse(html);
        // makes html() preserve linebreaks and spacing
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        String s = document.html().replaceAll("\\\\n", "\n");
        return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    }

It satisfies the following requirements:

  1. If the original html contains a newline (\n), it is preserved.
  2. If the original html contains br or p tags, they are translated to newlines (\n).
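The escaping in br2nl is easy to misread, so here is a minimal sketch of just the marker step, using only the JDK (no jsoup). The class and method names are made up for illustration. It shows that "\\n" in Java source is a two-character marker (a backslash plus the letter n), and that the regex "\\\\n" turns that marker back into a real line break while leaving existing line breaks alone.

```java
final class LiteralNewlineMarker {
    // Two characters (a backslash followed by 'n'), NOT a line break.
    // This is the kind of text br2nl inserts next to <br> and <p>.
    static final String MARKER = "\\n";

    // The regex "\\\\n" matches exactly that two-character marker and
    // replaces it with a real line break; real line breaks are untouched.
    static String restore(String text) {
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "hello" + MARKER + "world";
        System.out.println(marked);          // still one line, contains the marker
        System.out.println(restore(marked)); // marker replaced by a line break
    }
}
```

One caveat of this approach: if the page text itself contains a backslash followed by n (for example inside a code sample), that text is converted to a line break as well.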

With

    Jsoup.parse("A\nB").text();

you get the output

    "A B"

and not

    A
    B

For this I'm using:

    descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
    text = descrizione.replaceAll("br2n", "\n");

    Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

    public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

Try this, using jsoup:

    public static String cleanPreserveLineBreaks(String bodyHtml) {
        // get pretty printed html with preserved br and p tags
        String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
        // get plain text with preserved line breaks by disabled prettyPrint
        return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
    }

You can traverse a given element:

    public String convertNodeToText(Element element) {
        final StringBuilder buffer = new StringBuilder();

        new NodeTraversor(new NodeVisitor() {
            boolean isNewline = true;

            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    String text = textNode.text().replace('\u00A0', ' ').trim();
                    if (!text.isEmpty()) {
                        buffer.append(text);
                        isNewline = false;
                    }
                } else if (node instanceof Element) {
                    Element element = (Element) node;
                    if (!isNewline) {
                        if ((element.isBlock() || element.tagName().equals("br"))) {
                            buffer.append("\n");
                            isNewline = true;
                        }
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
            }
        }).traverse(element);

        return buffer.toString();
    }

And for your code:

    String result = convertNodeToText(Jsoup.parse(html));

    text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
    text = text.replaceAll("br2n", "\n");

This works as long as the html itself doesn't contain "br2n".
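To make the "br2n" caveat concrete, here is a small JDK-only sketch of the placeholder step on its own (the jsoup text() call in the middle is left out). The class and method names are hypothetical. It uses the same pattern as the snippet above and shows what happens when the input already contains the placeholder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrPlaceholderDemo {
    // Same pattern as above: any tag starting with "<br", case-insensitive.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    // Step 1: swap every <br> variant for a plain-text placeholder.
    static String markBreaks(String html, String placeholder) {
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    // Step 2 (after tags are stripped): turn the placeholder into a line break.
    static String restoreBreaks(String text, String placeholder) {
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        String marked = markBreaks("a<br>b<BR/>c<br class=\"x\">d", "br2n");
        System.out.println(restoreBreaks(marked, "br2n"));

        // Collision: the literal "br2n" in the input also becomes a line break.
        String collided = markBreaks("see br2n<br>here", "br2n");
        System.out.println(restoreBreaks(collided, "br2n"));
    }
}
```

Note that the pattern matches any tag whose name starts with "br", so a non-standard tag such as <broken> would be treated as a line break too.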

So,

    text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

works more reliably and is simpler.

This is my version of translating html to text (actually a modified version of user121196's answer).

This doesn't just preserve line breaks; it also formats the text and removes excessive line breaks and HTML escape symbols, and you get a much better result from your HTML (in my case I'm receiving it from mail).

It was originally written in Scala, but you can change it to Java easily:

    def html2text(rawHtml: String): String = {
      val htmlDoc = Jsoup.parseBodyFragment(rawHtml, "/")
      htmlDoc.select("br").append("\\nl")
      htmlDoc.select("div").append("\\nl")
      htmlDoc.select("p").prepend("\\nl\\nl")
      htmlDoc.select("p").append("\\nl\\nl")

      org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ), false
      ).
        replaceAll("\\\\nl", "\n").
        replaceAll("\r", "").
        replaceAll("\n\\s+\n", "\n").
        replaceAll("\n\n+", "\n\n").
        trim()
    }
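As a partial illustration of the Java port, here is a sketch of only the string post-processing chain at the end of html2text, written with the JDK alone. The jsoup calls (parseBodyFragment, select, clean, unescapeEntities) are not reproduced here, and the class and method names are made up for this example.

```java
final class TextCleanup {
    // Mirrors the replaceAll chain at the end of html2text.
    static String normalize(String text) {
        return text
            .replace("\\nl", "\n")         // literal backslash-n-l marker -> line break
            .replace("\r", "")             // drop carriage returns
            .replaceAll("\n\\s+\n", "\n")  // whitespace-only gaps between breaks -> one break
            .replaceAll("\n\n+", "\n\n")   // cap remaining runs at one blank line
            .trim();
    }

    public static void main(String[] args) {
        // Marker text as html2text would leave it around two paragraphs.
        String marked = "\\nl\\nlhello\\nl\\nl\\nl\\nlworld\\nl\\nl";
        System.out.println(normalize(marked));
    }
}
```

Because \s also matches a line break, the third step collapses any run of three or more consecutive breaks into a single one, which is why adjacent paragraphs end up separated by one line break rather than a blank line.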

Try this:

    public String noTags(String str) {
        Document d = Jsoup.parse(str);
        TextNode tn = new TextNode(d.body().html(), "");
        return tn.getWholeText();
    }

Use textNodes() to get a list of the text nodes, then concatenate them with \n as the separator. Here's some Scala code I use for this; a Java port should be easy:

    val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
      .asScala.mkString("<br />\n")

    /**
     * Recursive method to replace html br with java \n. The recursive method ensures that the
     * linebreaker can never end up pre-existing in the text being replaced.
     * @param html the html to convert
     * @param linebreakerString the temporary placeholder for line breaks
     * @return the html as String with proper java newlines instead of br
     */
    public static String replaceBrWithNewLine(String html, String linebreakerString) {
        String result = "";
        if (html.contains(linebreakerString)) {
            result = replaceBrWithNewLine(html, linebreakerString + "1");
        } else {
            // replace any html line breaks with java linebreak
            result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text();
            result = result.replaceAll(linebreakerString, "\n");
        }
        return result;
    }

Use it by passing the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder. For example:

    replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string used as the line break placeholder never actually occurs in the source html, because it keeps appending "1" until the placeholder string is no longer found in the html. It doesn't have the formatting issue that the Jsoup.clean methods seem to run into with special characters.
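The placeholder search can also be written as a loop. The following JDK-only sketch (hypothetical class and method names, jsoup step omitted) picks a placeholder that does not occur in the input and then restores line breaks with String.replace, which treats the placeholder literally; replaceAll would interpret it as a regular expression, which matters if the seed contains characters such as ( or .

```java
final class PlaceholderPicker {
    // Keep appending "1" until the candidate no longer occurs in the html.
    static String pick(String html, String seed) {
        if (seed == null || seed.isEmpty()) {
            // An empty seed is contained in every string and would loop forever.
            throw new IllegalArgumentException("seed must not be empty");
        }
        String candidate = seed;
        while (html.contains(candidate)) {
            candidate += "1";
        }
        return candidate;
    }

    // Literal replacement: no regex interpretation of the placeholder.
    static String restore(String text, String placeholder) {
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        System.out.println(pick("a<br>b", "br2n"));
        System.out.println(pick("br2n and br2n1", "br2n"));
    }
}
```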

Based on user121196's and Green Beret's answers with the selects and <pre>s, the only solution that worked for me is:

    org.jsoup.nodes.Element elementWithHtml = ....
    elementWithHtml.select("br").append("<pre>\n</pre>");
    elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
    elementWithHtml.text();

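Based on the other answers and the comments on this question, it seems that most people coming here are really looking for a general solution that provides a nicely formatted plain-text representation of an HTML document. I know I was.

Fortunately, jsoup already provides a fairly comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference, and it deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full: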

    package org.jsoup.examples;

    import org.jsoup.Jsoup;
    import org.jsoup.helper.StringUtil;
    import org.jsoup.helper.Validate;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;
    import org.jsoup.select.NodeTraversor;
    import org.jsoup.select.NodeVisitor;

    import java.io.IOException;

    /**
     * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
     * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
     * scrape.
     * <p>
     * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
     * </p>
     * <p>
     * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
     * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
     * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
     *
     * @author Jonathan Hedley, jonathan@hedley.net
     */
    public class HtmlToPlainText {
        private static final String userAgent = "Mozilla/5.0 (jsoup)";
        private static final int timeout = 5 * 1000;

        public static void main(String... args) throws IOException {
            Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
            final String url = args[0];
            final String selector = args.length == 2 ? args[1] : null;

            // fetch the specified URL and parse to a HTML DOM
            Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

            HtmlToPlainText formatter = new HtmlToPlainText();

            if (selector != null) {
                Elements elements = doc.select(selector); // get each element that matches the CSS selector
                for (Element element : elements) {
                    String plainText = formatter.getPlainText(element); // format that element to plain text
                    System.out.println(plainText);
                }
            } else { // format the whole doc
                String plainText = formatter.getPlainText(doc);
                System.out.println(plainText);
            }
        }

        /**
         * Format an Element to plain-text
         * @param element the root element to format
         * @return formatted text
         */
        public String getPlainText(Element element) {
            FormattingVisitor formatter = new FormattingVisitor();
            NodeTraversor traversor = new NodeTraversor(formatter);
            traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

            return formatter.toString();
        }

        // the formatting rules, implemented in a breadth-first DOM traverse
        private class FormattingVisitor implements NodeVisitor {
            private static final int maxWidth = 80;
            private int width = 0;
            private StringBuilder accum = new StringBuilder(); // holds the accumulated text

            // hit when the node is first seen
            public void head(Node node, int depth) {
                String name = node.nodeName();
                if (node instanceof TextNode)
                    append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
                else if (name.equals("li"))
                    append("\n * ");
                else if (name.equals("dt"))
                    append(" ");
                else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                    append("\n");
            }

            // hit when all of the node's children (if any) have been visited
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                    append("\n");
                else if (name.equals("a"))
                    append(String.format(" <%s>", node.absUrl("href")));
            }

            // appends text to the string builder with a simple word wrap method
            private void append(String text) {
                if (text.startsWith("\n"))
                    width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
                if (text.equals(" ") &&
                        (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                    return; // don't accumulate long runs of empty spaces

                if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                    String words[] = text.split("\\s+");
                    for (int i = 0; i < words.length; i++) {
                        String word = words[i];
                        boolean last = i == words.length - 1;
                        if (!last) // insert a space if not the last word
                            word = word + " ";
                        if (word.length() + width > maxWidth) { // wrap and reset counter
                            accum.append("\n").append(word);
                            width = word.length();
                        } else {
                            accum.append(word);
                            width += word.length();
                        }
                    }
                } else { // fits as is, without need to wrap text
                    accum.append(text);
                    width += text.length();
                }
            }

            @Override
            public String toString() {
                return accum.toString();
            }
        }
    }

Try this, using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    // select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    // select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    // get the HTML from the document, retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");

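For more complex HTML, none of the above solutions worked quite right. I was able to do the conversion successfully while preserving line breaks with:

    Document document = Jsoup.parse(myHtml);
    String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)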