Can you provide examples of parsing HTML?

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regular expressions, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]

Library: [library name]

 [example code] 

Please make the library name a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Language: JavaScript
Library: jQuery

    $.each($('a[href]'), function(){
        console.debug(this.href);
    });

(uses Firebug's console.debug for output...)

And loading any html page:

    $.get('http://stackoverflow.com/', function(page){
        $(page).find('a[href]').each(function(){
            console.debug(this.href);
        });
    });

This one uses a different each and chains the calls instead; I think method chaining is cleaner.

Language: C#
Library: HtmlAgilityPack

    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://www.stackoverflow.com");

            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

            foreach (var node in nodes)
            {
                Console.WriteLine(node.InnerHtml);
            }
        }
    }

Language: Python
Library: BeautifulSoup

    from BeautifulSoup import BeautifulSoup

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    soup = BeautifulSoup(html)
    links = soup.findAll('a', href=True)  # find <a> with a defined href attribute
    print links

Output:

 [<a href="http://foo.com">foo</a>, <a href="http://bar.com">bar</a>, <a href="http://baz.com">baz</a>] 

Also possible:

    for link in links:
        print link['href']

Output:

    http://foo.com
    http://bar.com
    http://baz.com

Language: Perl
Library: pQuery

    use strict;
    use warnings;
    use pQuery;

    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";

    pQuery( $html )->find( 'a' )->each( sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    });

Language: shell
Library: lynx (well, it's not a library, but in the shell, every program is kind of a library)

 lynx -dump -listonly http://news.google.com/ 

Language: Ruby
Library: Hpricot

    #!/usr/bin/ruby

    require 'hpricot'

    html = '<html><body>'
    ['foo', 'bar', 'baz'].each { |link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
    html += '</body></html>'

    doc = Hpricot(html)
    doc.search('//a').each { |elm| puts elm.attributes['href'] }

Language: Python
Library: HTMLParser

    #!/usr/bin/python

    from HTMLParser import HTMLParser

    class FindLinks(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            at = dict(attrs)
            if tag == 'a' and 'href' in at:
                print at['href']

    find = FindLinks()

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    find.feed(html)

Language: Perl
Library: HTML::Parser

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::Parser;

    my $find_links = HTML::Parser->new(
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'a' and exists $attr->{href}) {
                    print "$attr->{href}\n";
                }
            },
            "tag, attr"
        ]
    );

    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";

    $find_links->parse($html);

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks, like link extraction.

Whole program:

    #!/usr/bin/perl -w

    use strict;
    use HTML::LinkExtor;
    use LWP::Simple;

    my $url     = 'http://www.google.com/';
    my $content = get( $url );

    my $p = HTML::LinkExtor->new( \&process_link, $url );
    $p->parse( $content );

    exit;

    sub process_link {
        my ( $tag, %attr ) = @_;
        return unless $tag eq 'a';
        return unless defined $attr{ 'href' };
        print "- $attr{'href'}\n";
        return;
    }

Explanation:

  • use strict – turns on "strict" mode – eases potential debugging; not fully relevant to the example
  • use HTML::LinkExtor – loads the interesting module
  • use LWP::Simple – just a simple way to get some HTML for testing
  • my $url = 'http://www.google.com/' – which page we will be extracting URLs from
  • my $content = get( $url ) – fetches the page's HTML
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) – creates the LinkExtor object, giving it a reference to the function that will be used as a callback on every URL, and $url to use as the BASEURL for relative URLs
  • $p->parse( $content ) – pretty obvious, I guess
  • exit – end of program
  • sub process_link – beginning of the function process_link
  • my ( $tag, %attr ) – gets the arguments, which are the tag name and its attributes
  • return unless $tag eq 'a' – skip processing if the tag is not <a>
  • return unless defined $attr{'href'} – skip processing if the <a> tag doesn't have an href attribute
  • print "- $attr{'href'}\n"; – pretty obvious, I guess :)
  • return; – finish the function

That's all.

Language: Ruby
Library: Nokogiri

    #!/usr/bin/env ruby

    require 'nokogiri'
    require 'open-uri'

    document = Nokogiri::HTML(open("http://google.com"))
    document.css("html head title").first.content # => "Google"
    document.xpath("//title").first.content       # => "Google"

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using the DOM API, without using the XPATH or STP API)

    (defvar *html*
      (who:with-html-output-to-string (stream)
        (:html
         (:body
          (loop for site in (list "foo" "bar" "baz")
                do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

    (defvar *dom*
      (chtml:parse *html* (cxml-dom:make-dom-builder)))

    (loop for tag across (dom:get-elements-by-tag-name *dom* "a")
          collect (dom:get-attribute tag "href"))
    => ("http://foo.com/" "http://bar.com/" "http://baz.com/")

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


Selector expression:

 (def test-select (html/select (html/html-resource (java.io.StringReader. test-html)) [:a])) 

Now we can do the following at the REPL (I've added line breaks in test-select):

    user> test-select
    ({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
     {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
     {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})

    user> (map #(get-in % [:attrs :href]) test-select)
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

 (require '[net.cgrand.enlive-html :as html]) 

Test HTML:

    (def test-html
         (apply str (concat ["<html><body>"]
                            (for [link ["foo" "bar" "baz"]]
                              (str "<a href=\"http://" link ".com/\">" link "</a>"))
                            ["</body></html>"])))

Language: Perl
Library: XML::Twig

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Encode ':all';
    use LWP::Simple;
    use XML::Twig;

    #my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
    my $url = 'http://www.google.com';

    my $content = get($url);
    die "Couldn't fetch!" unless defined $content;

    my $twig = XML::Twig->new();
    $twig->parse_html($content);

    my @hrefs = map {
        $_->att('href');
    } $twig->get_xpath('//*[@href]');

    print "$_\n" for @hrefs;

Caveat: this can get wide-character errors with pages like this one (changing the URL to the one in the comment will trigger the error), but the HTML::Parser solution above doesn't share the problem.
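A common fix for Perl's "Wide character in print" warning (my assumption; this is not part of the original answer) is to put a UTF-8 encoding layer on STDOUT before printing the extracted links:

    # Assumption: the warning comes from printing decoded (character)
    # strings to a byte-oriented STDOUT; an encoding layer fixes that.
    binmode STDOUT, ':encoding(UTF-8)';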

Language: Perl
Library: HTML::Parser
Purpose: How do I remove unused, nested HTML span tags with a Perl regex?
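The original answer links to that question rather than inlining code. Purely as an illustration of the technique, here is a minimal HTML::Parser sketch that echoes a document while dropping every span tag; the sample markup and the decision to drop all spans (not just unused ones) are my assumptions:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use HTML::Parser;

    # Sketch: copy every event through untouched, except <span>/</span> tags.
    my $out = '';
    my $p = HTML::Parser->new(
        start_h   => [ sub {
            my ($tagname, $text) = @_;
            $out .= $text unless $tagname eq 'span';
        }, 'tagname, text' ],
        end_h     => [ sub {
            my ($tagname, $text) = @_;
            $out .= $text unless $tagname eq 'span';
        }, 'tagname, text' ],
        default_h => [ sub { $out .= shift }, 'text' ],
    );

    $p->parse('<p>Hello <span><span>nested</span></span> world</p>');
    $p->eof;
    print "$out\n";   # prints: <p>Hello nested world</p>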

Language: Java
Library: XOM, TagSoup

My sample has deliberately malformed and inconsistent XML in it.

    import java.io.IOException;

    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Element;
    import nu.xom.Node;
    import nu.xom.Nodes;
    import nu.xom.ParsingException;
    import nu.xom.ValidityException;

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.SAXException;

    public class HtmlTest {
        public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
            final Parser parser = new Parser();
            parser.setFeature(Parser.namespacesFeature, false);
            final Builder builder = new Builder(parser);
            final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
            final Element root = document.getRootElement();
            final Nodes links = root.query("//a[@href]");
            for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
                final Node node = links.get(linkNumber);
                System.out.println(((Element) node).getAttributeValue("href"));
            }
        }
    }

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include the namespace, like so:

    root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));

Language: C#
Library: System.Xml (standard .NET)

    using System.Collections.Generic;
    using System.Xml;

    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();
        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");
        FindHrefs(xd.FirstChild, matches);
    }

    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].InnerXml);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }

Language: JavaScript
Library: DOM

    var links = document.links;
    for (var i in links) {
        var href = links[i].href;
        if (href != null) console.debug(href);
    }

(uses Firebug's console.debug for output...)

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

    (require net/url
             (planet ashinn/html-parser:1)
             (planet clements/sxml2:1))

    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->sxml))
    (define links ((sxpath "//a/@href/text()") doc))

The same example as above, using packages from the new package system: html-parsing and sxml:

    (require net/url
             html-parsing
             sxml)

    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->xexp))
    (define links ((sxpath "//a/@href/text()") doc))

Note: install the required packages from the command line using 'raco', with:

 raco pkg install html-parsing 

and:

 raco pkg install sxml 

Language: Python
Library: lxml.html

    import lxml.html

    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"

    tree = lxml.html.document_fromstring(html)
    for element, attribute, link, pos in tree.iterlinks():
        if attribute == "href":
            print link

lxml also has a CSS selector class for traversing the DOM, which can make using it feel very similar to using jQuery:

    for a in tree.cssselect('a[href]'):
        print a.get('href')

Language: PHP
Library: SimpleXML (and DOM)

    <?php
    $page = new DOMDocument();
    $page->strictErrorChecking = false;
    $page->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xml = simplexml_import_dom($page);

    $links = $xml->xpath('//a[@href]');
    foreach ($links as $link)
        echo $link['href']."\n";

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

    ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
    [request start];
    NSError *error = [request error];
    if (!error) {
        NSData *response = [request responseData];
        NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
        [request release];
    }
    else
        @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

    ...

    - (id)query:(NSString *)xpathQuery withResponse:(NSData *)resp {
        NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
        if (nodes != nil)
            return nodes;
        return nil;
    }

Language: Perl
Library: HTML::TreeBuilder

    use strict;
    use HTML::TreeBuilder;
    use LWP::Simple;

    my $content = get 'http://www.stackoverflow.com';
    my $document = HTML::TreeBuilder->new->parse($content)->eof;

    for my $a ($document->find('a')) {
        print $a->attr('href'), "\n" if $a->attr('href');
    }

Language: Python
Library: HTQL

    import htql

    page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
    query = "<a>:href,tx"

    for url, text in htql.HTQL(page, query):
        print url, text

Simple and intuitive.

Language: Ruby
Library: Nokogiri

 #!/usr/bin/env ruby require "nokogiri" require "open-uri" doc = Nokogiri::HTML(open('http://www.example.com')) hrefs = doc.search('a').map{ |n| n['href'] } puts hrefs 

Which outputs:

    /
    /domains/
    /numbers/
    /protocols/
    /about/
    /go/rfc2606
    /about/
    /about/presentations/
    /about/performance/
    /reports/
    /domains/
    /domains/root/
    /domains/int/
    /domains/arpa/
    /domains/idn-tables/
    /protocols/
    /numbers/
    /abuse/
    http://www.icann.org/
    mailto:iana@iana.org?subject=General%20website%20feedback

This is a minor spin on the one above, resulting in output that is usable in a report. I only return the first and last elements in the list of hrefs:

 #!/usr/bin/env ruby require "nokogiri" require "open-uri" doc = Nokogiri::HTML(open('http://nokogiri.org')) hrefs = doc.search('a[href]').map{ |n| n['href'] } puts hrefs .each_with_index # add an array index .minmax{ |a,b| a.last <=> b.last } # find the first and last element .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output 1 http://github.com/tenderlove/nokogiri 100 http://yokolet.blogspot.com 

Language: Java
Library: jsoup

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class HtmlTest {
        public static void main(final String[] args) {
            final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
            final Elements links = document.select("a[href]");
            for (final Element element : links) {
                System.out.println(element.attr("href"));
            }
        }
    }

Language: PHP
Library: DOM

    <?php
    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    $doc->loadHTMLFile('http://stackoverflow.com/questions/773340');

    $xpath = new DOMXpath($doc);
    $links = $xpath->query('//a[@href]');
    for ($i = 0; $i < $links->length; $i++)
        echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put an @ symbol before $doc->loadHTMLFile, as in @$doc->loadHTMLFile(...), to suppress warnings from invalid HTML parsing.

Using phantomjs, save this file as extract-links.js:

    var page = new WebPage(),
        url = 'http://www.udacity.com';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var results = page.evaluate(function() {
                var list = document.querySelectorAll('a'), links = [], i;
                for (i = 0; i < list.length; i++) {
                    links.push(list[i].href);
                }
                return links;
            });
            console.log(results.join('\n'));
        }
        phantom.exit();
    });

Run it with:

 $ ../path/to/bin/phantomjs extract-links.js 

Language: ColdFusion 9.0.1+

Library: jSoup

    <cfscript>
    function parseURL(required string url){
        var res = [];
        var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
        var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
        //var dom = jSoupClass.parse(html); // if you already have some html to parse.
        var dom = jSoupClass.connect( arguments.url ).get();
        var links = dom.select("a");
        // changed LT to LTE so the last link is not skipped (CF arrays are 1-based)
        for(var a=1; a LTE arrayLen(links); a++){
            var s = {};
            s.href = links[a].attr('href');
            s.text = links[a].text();
            if(s.href contains "http://" || s.href contains "https://") arrayAppend(res, s);
        }
        return res;
    }
    //writeoutput(writedump(parseURL(url)));
    </cfscript>

    <cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structures, each structure containing an HREF and a TEXT member.

Language: JavaScript / Node.js

Library: Request and Cheerio

    var request = require('request');
    var cheerio = require('cheerio');

    var url = "https://news.ycombinator.com/";
    request(url, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            var anchorTags = $('a');

            anchorTags.each(function(i, element) {
                console.log(element["attribs"]["href"]);
            });
        }
    });

The request library downloads the HTML document, and Cheerio lets you use jQuery CSS selectors to target elements within it.