Page content is loaded with JavaScript and Jsoup doesn't see it

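One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to get the JavaScript-generated content when parsing the page with Jsoup?

UPD for Marcin:
I can't paste the page code here because it is too long: http://pastebin.com/qw4Rfqgw

Here is the element whose content I need: <div id='tags_list'></div>

I need to get this information in Java, preferably using Jsoup. The element is filled in with the help of JavaScript: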

    <div id="tags_list">
        <a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
        <a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
        <a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
    </div>

Java code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    public class Test {
        public static void main(String[] args) {
            try {
                Document doc = Jsoup.connect("http://www.bestreferat.ru/referat-32558.html").get();
                Elements tags = doc.select("#tags_list a");
                for (Element tag : tags) {
                    System.out.println(tag.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Jsoup is an HTML parser, not some kind of embedded browser engine. This means it is completely unaware of any content that JavaScript adds to the DOM after the initial page load.
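To make that concrete, here is a minimal sketch that parses an in-memory page with Jsoup. The markup is made up for illustration (a stripped-down stand-in for the page in the question, with a hypothetical `/tag1.html` link), and the class and method names are mine. The script that would fill `#tags_list` is kept as plain text inside the `<script>` element and is never run, so the selector finds no links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    // Hypothetical markup: the div is empty in the HTML and a script fills it in a browser.
    static final String HTML =
          "<html><body>"
        + "<div id='tags_list'></div>"
        + "<script>document.getElementById('tags_list').innerHTML ="
        + " '<a href=\"/tag1.html\">tag</a>';</script>"
        + "</body></html>";

    // Counts the links Jsoup can see inside #tags_list after parsing (no script execution).
    static int visibleTagLinks(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#tags_list a").size();
    }

    public static void main(String[] args) {
        // The script body is treated as data, so the div stays empty.
        System.out.println(visibleTagLinks(HTML));
    }
}
```

A browser loading the same markup would show one link, which is exactly the gap described above.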

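To access that type of content you need an embedded browser component. There are a number of discussions on SO about such components, for example: Is there a way to embed a browser in Java?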

In my case this was solved with com.codeborne.phantomjsdriver. NOTE: this is Groovy code.

pom.xml

    <dependency>
        <groupId>com.codeborne</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version> <here goes last version> </version>
    </dependency>

PhantomJsUtils.groovy

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.phantomjs.PhantomJSDriver

    class PhantomJsUtils {
        // Directory where temporary HTML files are written
        private static String filePath = 'data/temp/';

        // Opens the given location in PhantomJS and parses the rendered source with Jsoup
        public static Document renderPage(String filePath) {
            System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
            WebDriver ghostDriver = new PhantomJSDriver();
            try {
                ghostDriver.get(filePath);
                return Jsoup.parse(ghostDriver.getPageSource());
            } finally {
                ghostDriver.quit(); // always shut the PhantomJS process down
            }
        }

        // Writes an already fetched document to a temp file, then renders that file
        public static Document renderPage(Document doc) {
            String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
            // FileUtils is not imported in the original snippet; presumably a helper from the author's project
            FileUtils.writeToFile(tmpFileName, doc.toString());
            return renderPage(tmpFileName);
        }
    }

ClassInProject.groovy

 Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))

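You need to understand what is happening:

When you request a page from a website, whether with Jsoup or with a browser, what is sent back to you is some HTML. Jsoup is able to parse this.
However, most websites include JavaScript in that HTML, or link to it from that HTML, and that JavaScript fills in the content of the page. Your browser is able to execute the JavaScript and so populate the page. Jsoup is not.

The way to think about it is this: parsing HTML is easy. Executing JavaScript and updating the corresponding HTML is far more complex, and it is the job of a browser.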

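Here are some solutions to this kind of problem:

If you can find the Ajax call that the JavaScript code makes to load the content, you may be able to request the URL of that call with Jsoup. To find it, use the developer tools in your browser. But this is not guaranteed to work:
- the URL may be dynamic and depend on what is on the page at that moment
- if the content is not public, cookies will be involved, and simply requesting the resource URL will not be enough
In those cases you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know and recommend is PhantomJS. It works with JavaScript, and you need to launch it from Java by starting a new process. If you want to stick with Java, this post lists some Java alternatives.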
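Launching PhantomJS as a separate process could look roughly like the sketch below. It assumes a `phantomjs` binary is on the PATH and that you have written a script, here called `render.js` (a hypothetical name), which opens the URL passed as its first argument and prints the rendered page content to stdout. The class and method names are mine, not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PhantomLauncher {
    // Builds the command line: <phantomjs binary> <script> <url>
    static List<String> buildCommand(String phantomBinary, String scriptPath, String url) {
        return List.of(phantomBinary, scriptPath, url);
    }

    // Starts the process, waits for it to finish, and returns what it wrote to stdout.
    static String runAndCapture(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let error output pass through to our stderr
                .start();
        byte[] out = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Process exited with code " + exitCode);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // "phantomjs" and "render.js" are assumptions; adjust both to your setup.
        String html = runAndCapture(
                buildCommand("phantomjs", "render.js", "http://www.bestreferat.ru/referat-32558.html"));
        System.out.println(html.length());
    }
}
```

The returned string can then be handed to `Jsoup.parse(html)` and queried with the same `#tags_list a` selector as before.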

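I actually have a "way"! Maybe it is more of a "workaround" than a "way"... The code below checks for the meta attribute "REFRESH" and for a JavaScript redirect. If either of them is present, the RedirectedUrl variable is set, so you know your target. You can then retrieve the target page and carry on from there.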

    String RedirectedUrl = null;
    Elements meta = page.select("html head meta");
    if (meta.attr("http-equiv").contains("REFRESH")) {
        // content looks like "0; url=http://..."; limit the split so "=" inside the URL is kept
        RedirectedUrl = meta.attr("content").split("=", 2)[1];
    } else {
        if (page.toString().contains("window.location.href")) {
            meta = page.select("script");
            for (Element script : meta) {
                String s = script.data();
                if (!s.isEmpty() && s.startsWith("window.location.href")) {
                    // take the text between the first "=" and the first ";"
                    int start = s.indexOf("=");
                    int end = s.indexOf(";");
                    if (start > 0 && end > start) {
                        s = s.substring(start + 1, end);
                        s = s.replace("'", "").replace("\"", "");
                        RedirectedUrl = s.trim();
                        break;
                    }
                }
            }
        }
    }
    // ... now retrieve the redirected page again ...
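The string handling in that workaround can also be done with regular expressions, which copes with quotes and with "=" inside the target URL. This is only a sketch of the extraction step using the standard library; the class name, method names, and sample inputs are made up, and you would feed it `meta.attr("content")` and `script.data()` from Jsoup.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectSniffer {
    // Matches the "url=..." part of a meta refresh content value, e.g. "0; url=http://host/page?a=b".
    private static final Pattern META_URL =
            Pattern.compile("url\\s*=\\s*['\"]?([^'\"]+)['\"]?", Pattern.CASE_INSENSITIVE);

    // Matches window.location.href = '...' (single or double quotes) anywhere in a script body.
    private static final Pattern JS_HREF =
            Pattern.compile("window\\.location\\.href\\s*=\\s*['\"]([^'\"]+)['\"]");

    // Extracts the redirect target from the content attribute of a meta refresh tag.
    static Optional<String> fromMetaContent(String content) {
        Matcher m = META_URL.matcher(content);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Extracts the redirect target from inline script text, if it assigns window.location.href.
    static Optional<String> fromScript(String script) {
        Matcher m = JS_HREF.matcher(script);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromMetaContent("0; url=http://example.com/page?a=b"));
        System.out.println(fromScript("window.location.href = '/next.html';"));
    }
}
```

Like the original, this only catches redirects that are written literally in the page; a target computed at runtime by the script still needs a real browser engine.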

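Is there a way to get the JavaScript-generated content when parsing the page with Jsoup?

I'm going to guess no, not without building a complete JavaScript interpreter in Java, considering how difficult that would be.

Try: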

    Document doc = Jsoup.connect(url)
            .header("Accept-Encoding", "gzip, deflate")
            .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
            .maxBodySize(0)
            .timeout(600000)
            .get();
