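Parsing HTML content in VBA

I have a question about HTML parsing. I have a website with some products, and I would like to pull text from each product page into my current spreadsheet. The spreadsheet is fairly large; it contains the ItemNbr in the 3rd column, I expect the text to go in the 14th column, and one row corresponds to one product (item).

My idea is to fetch the "Material" on the web page, which is in the innerText after the tag. The id number changes from one page to another (sometimes).

Here is the structure of the website: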

    <div style="position:relative;">
      <div></div>
      <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
        <tbody>
          <tr class="jqgfirstrow" role="row" style="height:auto">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td ...</td>
            <td ...</td>
          </tr>
          <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
            <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
            <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
          </tr>
          <tr ...>
          </tr>
        </tbody>
      </table>
    </div>

I would like to get "600D polyester." as the result.

My (non-working) code snippet looks like this:

    Sub ParseMaterial()
        Dim Cell As Integer
        Dim ItemNbr As String

        Dim AElement As Object
        Dim AElements As IHTMLElementCollection
        Dim IE As MSXML2.XMLHTTP60
        Set IE = New MSXML2.XMLHTTP60

        Dim HTMLDoc As MSHTML.HTMLDocument
        Dim HTMLBody As MSHTML.HTMLBody

        Set HTMLDoc = New MSHTML.HTMLDocument
        Set HTMLBody = HTMLDoc.body

        For Cell = 1 To 5 'I iterate through the file row by row
            ItemNbr = Cells(Cell, 3).Value 'ItemNbr is in the 3rd column of my spreadsheet

            IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
            IE.send

            While IE.ReadyState <> 4
                DoEvents
            Wend

            HTMLBody.innerHTML = IE.responseText

            Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
            For Each AElement In AElements
                If AElement.Title = "Material" Then
                    Cells(Cell, 14) = AElement.nextNode.value 'I write the material in the 14th column
                End If
            Next AElement

            Application.Wait (Now + TimeValue("0:00:2"))
        Next Cell
    End Sub

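Thanks for your help!

Just a few things that will hopefully get you heading in the right direction:

  • A bit of clean-up: remove the loop that tests the readyState property. The value returned by readyState will never change in this context. The request is synchronous (the third argument to Open is False), so the code pauses after the send instruction and resumes only once the server response has been received, or the request has failed. The readyState property will be set accordingly by the time execution resumes. You should still test the ready state, but the loop is unnecessary.

  • Target the right HTML elements: you are searching through the tr elements, whereas the logic of how you use those elements in your code actually points at the td elements.

  • Make sure the properties are actually available on the objects you are using them with: to help with this, try to declare all your variables as specific object types rather than as generic Object. This will activate IntelliSense. If you have a hard time finding the actual name of the type as defined in the relevant library, declare the variable as a generic Object, run your code, and then inspect the type of the object, for instance by printing TypeName(your_object) to the debug (Immediate) window. This should put you on your way.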
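As a minimal sketch of that last tip (not part of your code): the Sub name PrintElementTypes and the hard-coded HTML fragment are made up for illustration, and I am assuming a late-bound "htmlfile" document so that no library reference is needed. It only prints whatever type names your setup reports for the document, a tr and a td, which you can then use in your Dim statements.

```vba
' Hypothetical helper: discover the concrete type names behind generic Objects.
Sub PrintElementTypes()
    Dim doc As Object    ' generic on purpose; we want to find out the real type
    Dim row As Object
    Dim cell As Object

    ' Late-bound MSHTML document (assumption: "htmlfile" is available on this machine)
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = "<table id=""list-table""><tr>" & _
        "<td title=""Material"">Material</td><td>600D polyester.</td></tr></table>"

    Set row = doc.getElementsByTagName("tr")(0)
    Set cell = doc.getElementsByTagName("td")(0)

    ' Output goes to the Immediate window (Ctrl+G in the VBA editor)
    Debug.Print TypeName(doc)
    Debug.Print TypeName(row)
    Debug.Print TypeName(cell)
End Sub
```

Once you know the names, you can look them up in the Object Browser (F2) to see which properties each type really exposes.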

I have also included some code that may help. If you still can't get this to work and you are able to share your URL, please do so.

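    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library"
    Sub getInfoWeb()
        Dim cell As Integer
        Dim ItemNbr As String
        Dim xhr As MSXML2.XMLHTTP60
        Dim doc As MSHTML.HTMLDocument
        Dim table As MSHTML.HTMLTable
        Dim tableCells As MSHTML.IHTMLElementCollection
        Dim tableCell As Object 'late bound, so NextSibling.innerHTML is resolved at run time

        Set xhr = New MSXML2.XMLHTTP60

        For cell = 1 To 5
            ItemNbr = Cells(cell, 3).Value
            Set doc = Nothing 'do not reuse the document from the previous row

            With xhr
                'Synchronous request: execution resumes only after .send has completed
                .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
                .send

                If .readyState = 4 And .Status = 200 Then
                    Set doc = New MSHTML.HTMLDocument
                    doc.body.innerHTML = .responseText
                Else
                    MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                           vbNewLine & "HTTP request status: " & .Status
                End If
            End With

            'Parse only if this request succeeded and the table is present
            If Not doc Is Nothing Then
                Set table = doc.getElementById("list-table")
                If Not table Is Nothing Then
                    Set tableCells = table.getElementsByTagName("td")
                    For Each tableCell In tableCells
                        If tableCell.getAttribute("title") = "Material" Then
                            Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                        End If
                    Next tableCell
                End If
            End If
        Next cell
    End Sub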

EDIT: a follow-up to the further information you provided in the comments below, along with the additional comments I have added:

    'Determine your product number
        'Open an xhr for your source url, and retrieve the product number from there - search for the tag whose
        'text includes the "productnummer:" substring, and extract the product number from the outer string
        'OR
        'if the product number consistently consists of the fctkeywords you are entering in your source url
        'with two "0" appended - just build the product number like that
    'Open a new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
    'Load the response in an XML document, and retrieve the material information

    'Requires a reference to "Microsoft XML, v6.0"
    Sub getInfoWeb()
        Dim xhr As MSXML2.XMLHTTP60
        Dim doc As MSXML2.DOMDocument60
        Dim xmlCell As MSXML2.IXMLDOMElement
        Dim xmlCells As MSXML2.IXMLDOMNodeList
        Dim materialValueElement As MSXML2.IXMLDOMElement

        Set xhr = New MSXML2.XMLHTTP60

        With xhr
            .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
            .send

            If .readyState = 4 And .Status = 200 Then
                Set doc = New MSXML2.DOMDocument60
                doc.LoadXML .responseText
            Else
                MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                       vbNewLine & "HTTP request status: " & .Status
                Exit Sub 'nothing to parse
            End If
        End With

        Set xmlCells = doc.getElementsByTagName("cell")

        For Each xmlCell In xmlCells
            If xmlCell.Text = "Materiaal" Then
                'The value sits in the cell right after the "Materiaal" label
                Set materialValueElement = xmlCell.NextSibling
            End If
        Next

        If materialValueElement Is Nothing Then
            MsgBox "No ""Materiaal"" cell found in the response"
        Else
            MsgBox materialValueElement.Text
        End If
    End Sub
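If the second option from the comments applies, building the request URL could look roughly like the sketch below. BuildSpecUrl and its keyword parameter are names I made up for illustration, and the "append two zeros" rule is only an assumption that you would need to check against a few of your products first.

```vba
' Hypothetical helper: builds the spec URL from the fctkeywords value.
' Assumption: product number = keyword with "00" appended (verify for your items).
Function BuildSpecUrl(ByVal keyword As String) As String
    Const BASE_URL As String = _
        "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p"
    Dim productNumber As String

    productNumber = keyword & "00"

    BuildSpecUrl = BASE_URL & "?itemc=" & productNumber & _
                   "&_search=false&rows=-1&page=1&sidx=&sord=asc"
End Function
```

For example, BuildSpecUrl("100317") gives the same URL that is hard-coded in the Sub above (itemc=10031700), so you could pass its result to .Open inside your loop over the spreadsheet rows.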

EDIT2: an alternative that automates Internet Explorer

    ' Requires references to "Microsoft Internet Controls" and "Microsoft HTML Object Library"
    Sub searchWebViaIE()
        Dim ie As SHDocVw.InternetExplorer
        Dim doc As MSHTML.HTMLDocument
        Dim anchors As MSHTML.IHTMLElementCollection
        Dim anchor As MSHTML.HTMLAnchorElement
        Dim prodSpec As MSHTML.HTMLAnchorElement
        Dim tableCells As MSHTML.IHTMLElementCollection
        Dim materialValueElement As MSHTML.HTMLTableCell
        Dim tableCell As MSHTML.HTMLTableCell

        Set ie = New SHDocVw.InternetExplorer

        With ie
            .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
            .Visible = True

            'Wait until the page has finished loading
            Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
                DoEvents
            Loop

            Set doc = .document

            'Click the "Product Specificatie" link so that the specification table is loaded
            Set anchors = doc.getElementsByTagName("a")
            For Each anchor In anchors
                If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                    anchor.Click
                    Exit For
                End If
            Next anchor

            Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
                DoEvents
            Loop
        End With

        'Keep a reference to the "Product Specificatie" link
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                Set prodSpec = anchor
            End If
        Next anchor

        Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")

        If Not tableCells Is Nothing Then
            For Each tableCell In tableCells
                If tableCell.innerHTML = "Materiaal" Then
                    'The value sits in the cell right after the "Materiaal" label
                    Set materialValueElement = tableCell.NextSibling
                End If
            Next tableCell
        End If

        If materialValueElement Is Nothing Then
            MsgBox "No ""Materiaal"" cell found on the page"
        Else
            MsgBox materialValueElement.innerHTML
        End If
    End Sub