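Can JavaScript read the source of any web page?

I'm working on screen scraping and want to retrieve the source code of a particular page.

How can this be achieved with JavaScript? Please help me.

A simple way to start is to try jQuery: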

 $("#links").load("/Main_Page #jq-p-Getting-Started li");

More in the jQuery documentation.
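The argument passed to .load() above packs two things into one string: the URL to request and, after the first space, a selector that picks the fragment to insert. A minimal sketch of that split, assuming a hypothetical helper named splitLoadArgument (it is not part of jQuery, it only mirrors the behaviour roughly):

```javascript
// Split a ".load()"-style argument into its URL and selector parts.
// `splitLoadArgument` is a hypothetical name used only for illustration.
function splitLoadArgument(arg) {
  var trimmed = arg.trim();
  var firstSpace = trimmed.indexOf(" ");
  if (firstSpace === -1) {
    // No selector given: the whole response would be inserted.
    return { url: trimmed, selector: "" };
  }
  return {
    url: trimmed.slice(0, firstSpace),
    selector: trimmed.slice(firstSpace + 1).trim()
  };
}

var parts = splitLoadArgument("/Main_Page #jq-p-Getting-Started li");
// parts.url is "/Main_Page"
// parts.selector is "#jq-p-Getting-Started li"
```

So the example requests /Main_Page and keeps only the li elements under #jq-p-Getting-Started.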

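Another way to do screen scraping in a more structured fashion is to use YQL, or Yahoo Query Language. It returns the scraped data structured as JSON or XML.
For example,
let's scrape stackoverflow.com: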

 select * from html where url="http://stackoverflow.com"

will give you a JSON array (I chose that option) like this:

  "results": { "body": { "noscript": [ { "div": { "id": "noscript-padding" } }, { "div": { "id": "noscript-warning", "p": "Stack Overflow works best with JavaScript enabled" } } ], "div": [ { "id": "notify-container" }, { "div": [ { "id": "header", "div": [ { "id": "hlogo", "a": { "href": "/", "img": { "alt": "logo homepage", "height": "70", "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png", "width": "250" } ……..

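The beauty of this is that you can do projections, so you end up with the scraped data in a structured form and only the data you need (which ultimately means less bandwidth over the network).
For example,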

 select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

will get you

  "results": { "a": [ { "href": "/questions/414690/iphone-simulator-port-for-windows-closed", "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ", "content": "iphone\n simulator port for windows [closed]" }, { "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application", "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ", "content": "How\n to redirect the web page in flex application ?" }, …..

Now, to get only the questions, we do

 select title from html where url="http://stackoverflow.com" and xpath='//div/h3/a'

Note the title in the projection:

  "results": { "a": [ { "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … " }, { "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … " }, { "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … " }, { "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … " }, { ……

Once you write your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
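The generated URL is essentially the query text, percent-encoded, attached to the endpoint along with the format and callback parameters. A rough sketch of building it by hand follows. The helper names buildYqlQuery and buildYqlUrl are made up for illustration, the endpoint and parameter names are copied from the URL above, and the comma-separated field list is an assumption about YQL syntax:

```javascript
// Endpoint copied from the generated URL in the answer.
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Compose a query against the `html` table.
// `fields` is the projection (defaults to "*"); `xpath` is optional.
// No escaping of quotes is attempted here; this is only a sketch.
function buildYqlQuery(pageUrl, xpath, fields) {
  var projection = fields && fields.length ? fields.join(", ") : "*";
  var query = "select " + projection + ' from html where url="' + pageUrl + '"';
  if (xpath) {
    query += " and xpath='" + xpath + "'";
  }
  return query;
}

// Percent-encode the query and attach the parameters seen in the URL above.
function buildYqlUrl(query, callbackName) {
  var url = YQL_ENDPOINT + "?q=" + encodeURIComponent(query) + "&format=json";
  if (callbackName) {
    url += "&callback=" + encodeURIComponent(callbackName);
  }
  return url;
}

var query = buildYqlQuery("http://stackoverflow.com", "//div/h3/a", ["title"]);
var theAboveUrl = buildYqlUrl(query, "cbfunc");
```

Note that encodeURIComponent leaves single quotes as they are, which matches the unencoded quotes around the xpath expression in the generated URL.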

So in the end you end up doing something like this:

 var titleList = $.getJSON(theAboveUrl);

and play with it.

Beautiful, isn't it?
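To "play with it", you would typically pull the titles out of the returned structure. A small sketch, assuming the results object has the shape shown in the title-only excerpt above. The function name extractTitles is hypothetical, and the single-object fallback is a guess about how a one-match result might look:

```javascript
// Collect the `title` values from a results object shaped like
// { "a": [ { "title": "..." }, ... ] }, as in the excerpt above.
function extractTitles(results) {
  if (!results || !results.a) {
    return [];
  }
  // Assumption: a single match may arrive as one object rather than an array.
  var anchors = Array.isArray(results.a) ? results.a : [results.a];
  return anchors.map(function (anchor) {
    return anchor.title;
  });
}

var sample = {
  a: [
    { title: "First question summary" },
    { title: "Second question summary" }
  ]
};
var titles = extractTitles(sample);
// titles is ["First question summary", "Second question summary"]
```

Because $.getJSON is asynchronous, a function like this would be called from its success callback rather than on its return value.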

You can use JavaScript, as long as you fetch whatever page you want through a proxy on your own domain:

 <html>
 <head>
   <script src="/js/jquery-1.3.2.js"></script>
 </head>
 <body>
   <script>
     // The proxy on your own domain fetches the remote page server-side.
     $.get("http://www.mydomain.com/?url=www.google.com", function(response) {
       alert(response);
     });
   </script>
 </body>
 </html>
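One detail worth handling when calling such a proxy: if the target address has its own query string, it needs to be percent-encoded, otherwise its parameters get mixed up with the proxy's. A minimal sketch, assuming the proxy reads the target from a url parameter as in the example above (buildProxyUrl is a hypothetical helper name):

```javascript
// Build the request URL for a same-domain proxy that takes the
// target page in a `url` query parameter.
function buildProxyUrl(proxyBase, targetUrl) {
  return proxyBase + "?url=" + encodeURIComponent(targetUrl);
}

var proxied = buildProxyUrl(
  "http://www.mydomain.com/",
  "http://www.google.com/search?q=a&b=c"
);
// proxied is
// "http://www.mydomain.com/?url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Da%26b%3Dc"
```

The result can then be passed to $.get in place of the hard-coded string.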

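You could simply use XmlHttp (AJAX) to hit the required URL, and the HTML response from that URL will be available in the responseText property. If it isn't the same domain, your users will get a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"

As a security measure, JavaScript can't read files from different domains. There may be some strange workaround for it, but I'd consider a different language for this task.

Using jQuery: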

 <html>
 <head>
   <script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js"></script>
 </head>
 <body>
   <script>
     // Works only if the target is same-origin or allows cross-origin requests.
     $.get("http://www.google.com", function(response) {
       alert(response);
     });
   </script>
 </body>
 </html>

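If you absolutely need to use JavaScript, you could load the page source with an ajax request.

Note that with JavaScript, you can only retrieve pages that are located under the same domain as the requesting page.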
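What the browser means by "same domain" here is really the same origin: scheme, host and port must all match. A quick sketch of that comparison using the standard URL class (isSameOrigin is a hypothetical helper name, not a browser API):

```javascript
// Compare two absolute URLs by origin (scheme + host + port).
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Same scheme, host and (default) port: same origin.
var samePage = isSameOrigin("http://example.com/a", "http://example.com/b");

// A different scheme or subdomain counts as a different origin.
var otherScheme = isSameOrigin("http://example.com/", "https://example.com/");
var otherHost = isSameOrigin("http://example.com/", "http://www.example.com/");
```

Here samePage is true, while otherScheme and otherHost are both false, so an ajax request between those pairs would be blocked by default.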

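I used ImportIO. They let you request the HTML from any website if you set up an account with them (which is free). They let you make up to 50k requests per year. I didn't take the time to look for alternatives, but I'm sure there are some.

In your JavaScript, you basically just make a GET request like this: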

 var request = new XMLHttpRequest();
 request.onreadystatechange = function() {
   // Wait until the full response has arrived (readyState 4 = DONE).
   if (request.readyState !== 4) return;
   var jsontext = request.responseText;
   alert(jsontext);
 };
 request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);
 request.send();
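When filling in the placeholders, the page address in particular should be percent-encoded so that its own ? and & characters don't break the request. A sketch of assembling the request URL, using only the endpoint and parameter names visible in the snippet above (buildExtractorUrl and its arguments are hypothetical names, and the values shown are dummies):

```javascript
// Base path copied from the request URL in the answer.
var IMPORT_IO_BASE = "https://extraction.import.io/query/extractor/";

// Assemble the GET URL from the public link id, API key and target page.
function buildExtractorUrl(extractorId, apiKey, targetUrl) {
  return IMPORT_IO_BASE + encodeURIComponent(extractorId) +
    "?_apikey=" + encodeURIComponent(apiKey) +
    "&url=" + encodeURIComponent(targetUrl);
}

var extractorUrl = buildExtractorUrl("abc", "key123", "http://example.com/?a=1&b=2");
// extractorUrl is
// "https://extraction.import.io/query/extractor/abc?_apikey=key123&url=http%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2"
```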

You can create an XMLHttpRequest, request the page, and then read the responseText property to get the content.
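A minimal sketch of that approach, wrapped in a function so the source is handed to a callback once the request has finished. The name getPageSource and the (error, text) callback signature are my own choices, and this only works in a browser for URLs the same-origin policy lets you read:

```javascript
// Request a page and pass its source text to `callback(error, text)`.
function getPageSource(url, callback) {
  var request = new XMLHttpRequest();
  request.onreadystatechange = function () {
    // readyState 4 (DONE) means the whole response has arrived.
    if (request.readyState !== 4) {
      return;
    }
    if (request.status >= 200 && request.status < 300) {
      callback(null, request.responseText);
    } else {
      callback(new Error("Request failed with status " + request.status), null);
    }
  };
  request.open("GET", url, true);
  request.send();
}
```

Usage would look like getPageSource("/some/page.html", function (err, html) { ... }), where /some/page.html is a placeholder path on your own site.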

You can use the FileReader API to get a file, and when selecting the file, put the URL of your web page into the selection box. Use this code:

 function readFile() {
   var f = document.getElementById("yourfileinput").files[0];
   if (f) {
     var r = new FileReader();
     r.onload = function(e) {
       alert(r.result);
     };
     r.readAsText(f);
   } else {
     alert("file could not be found");
   }
 }

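You can bypass the same-origin policy by either creating a browser extension or even saving the file as .hta in Windows (HTML Application).

Despite many comments to the contrary, I believe that it is possible to overcome the same-origin requirement with simple JavaScript.

I am not claiming that the following is original, because I believe I saw something similar elsewhere a while ago.

I have only tested this with Safari on a Mac.

The following demonstration fetches the page named in the base tag and moves its innerHTML to a new window. My script adds the html tags, but with most modern browsers this could be avoided by using outerHTML.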

 <html>
 <head>
   <base href='http://apod.nasa.gov/apod/'>
   <title>test</title>
   <style>
     body { margin: 0 }
     textarea { outline: none; padding: 2em; width: 100%; height: 100% }
   </style>
 </head>
 <body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
   <textarea id=t></textarea>
 </body>
 </html>
