Can I load an entire HTML document into a document fragment in Internet Explorer?

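This is something I've been having some difficulty with. I have a local client-side script that needs to let a user fetch a remote web page and search the resulting page for forms. To do this (without regular expressions), I need to parse the document into a fully traversable DOM object.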

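Some limitations I'd like to stress:

I don't want to use libraries (like jQuery). They are too heavy for what I need to do here.
Under no circumstances should scripts from the remote page be executed (for security reasons).
DOM APIs, such as getElementsByTagName, need to be available.
It only needs to work in Internet Explorer, but in version 7 at the very least.
Let's pretend I don't have access to a server. I do, but I can't use it for this.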

What I've tried

Assuming I have a complete HTML document string (including the DOCTYPE declaration) in the variable html, here's what I've tried so far:

    var frag = document.createDocumentFragment(),
        div  = frag.appendChild(document.createElement("div"));

    div.outerHTML = html;
    //-> results in an empty fragment

    div.insertAdjacentHTML("afterEnd", html);
    //-> HTML is not added to the fragment

    div.innerHTML = html;
    //-> Error (expected, but I tried it anyway)

    var doc = new ActiveXObject("htmlfile");
    doc.write(html);
    doc.close();
    //-> JavaScript executes

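I've also tried extracting the <head> and <body> nodes from the HTML and adding them to an <HTML> element inside the fragment, still no luck.

Does anyone have any ideas?

Fiddle: http://jsfiddle.net/JFSKe/6/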

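A DocumentFragment does not implement the DOM methods needed here (such as getElementsByTagName). Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution has to be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline frame.

All external resources and scripts will be disabled. See "Explanation of the code" for more information.

Code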

    /*
     @param String html    The string with HTML which has to be converted to a DOM object
     @param func callback  (optional) Callback(HTMLDocument doc, function destroy)
     @returns              undefined if callback exists, else: Object
                            HTMLDocument doc  DOM fetched from Parameter:html
                            function destroy  Removes HTMLDocument doc.         */
    function string2dom(html, callback){
        /* Sanitise the string */
        html = sanitiseHTML(html); /*Defined below*/

        /* Create an IFrame */
        var iframe = document.createElement("iframe");
        iframe.style.display = "none";
        document.body.appendChild(iframe);

        var doc = iframe.contentDocument || iframe.contentWindow.document;
        doc.open();
        doc.write(html);
        doc.close();

        function destroy(){
            iframe.parentNode.removeChild(iframe);
        }
        if(callback) callback(doc, destroy);
        else return {"doc": doc, "destroy": destroy};
    }

    /* @name sanitiseHTML
       @param String html  A string representing HTML code
       @return String      A new string, fully stripped of external resources.
                           All "external" attributes (href, src) are prefixed by data- */
    function sanitiseHTML(html){
        /* Adds a <!--"'--> before every matched tag, so that unterminated quotes
           aren't preventing the browser from splitting a tag. Test case:
           '<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */
        var prefix = "<!--\"'-->";
        /* Attributes should not be prefixed by these characters. This list is not
           complete, but will be sufficient for this function.
           (see http://www.w3.org/TR/REC-xml/#NT-NameChar) */
        var att = "[^-a-z0-9:._]";
        var tag = "<[a-z]";
        var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*";
        var etag = "(?:>|(?=<))";

        /*
          @name ae
          @description          Converts a given string in a sequence of the
                                 original input and the HTML entity
          @param String string  String to convert
          */
        var entityEnd = "(?:;|(?!\\d))";
        var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
                    "(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
                    ")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
                    ".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
                    /*Placeholder to avoid tricky filter-circumventing methods*/
        var charMap = {};
        var s = ents[" "]+"*"; /* Short-hand space */
        /* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */
        function ae(string){
            var all_chars_lowercase = string.toLowerCase();
            if(ents[string]) return ents[string];
            var all_chars_uppercase = string.toUpperCase();
            var RE_res = "";
            for(var i=0; i<string.length; i++){
                var char_lowercase = all_chars_lowercase.charAt(i);
                if(charMap[char_lowercase]){
                    RE_res += charMap[char_lowercase];
                    continue;
                }
                var char_uppercase = all_chars_uppercase.charAt(i);
                var RE_sub = [char_lowercase];
                RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
                RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
                if(char_lowercase != char_uppercase){
                    RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);
                    RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
                }
                RE_sub = "(?:" + RE_sub.join("|") + ")";
                RE_res += (charMap[char_lowercase] = RE_sub);
            }
            return(ents[string] = RE_res);
        }
        /*
          @name by
          @description  second argument for the replace function.
          */
        function by(match, group1, group2){
            /* Adds a data-prefix before every external pointer */
            return group1 + "data-" + group2;
        }
        /*
          @name cr
          @description            Selects a HTML element and performs a
                                      search-and-replace on attributes
          @param String selector  HTML substring to match
          @param String attribute RegExp-escaped; HTML element attribute to match
          @param String marker    Optional RegExp-escaped; marks the prefix
          @param String delimiter Optional RegExp escaped; non-quote delimiters
          @param String end       Optional RegExp-escaped; forces the match to
                                      end before an occurrence of <end> when
                                      quotes are missing
         */
        function cr(selector, attribute, marker, delimiter, end){
            if(typeof selector == "string") selector = new RegExp(selector, "gi");
            marker = typeof marker == "string" ? marker : "\\s*=";
            delimiter = typeof delimiter == "string" ? delimiter : "";
            end = typeof end == "string" ? end : "";
            var is_end = end && "?";
            var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi");
            html = html.replace(selector, function(match){
                return prefix + match.replace(re1, by);
            });
        }
        /*
          @name cri
          @description            Selects an attribute of a HTML element, and
                                   performs a search-and-replace on certain values
          @param String selector  HTML element to match
          @param String attribute RegExp-escaped; HTML element attribute to match
          @param String front     RegExp-escaped; attribute value, prefix to match
          @param String flags     Optional RegExp flags, default "gi"
          @param String delimiter Optional RegExp-escaped; non-quote delimiters
          @param String end       Optional RegExp-escaped; forces the match to
                                      end before an occurrence of <end> when
                                      quotes are missing
         */
        function cri(selector, attribute, front, flags, delimiter, end){
            if(typeof selector == "string") selector = new RegExp(selector, "gi");
            flags = typeof flags == "string" ? flags : "gi";
            var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi");

            end = typeof end == "string" ? end + ")" : ")";
            var at1 = new RegExp('(")('+front+'[^"]+")', flags);
            var at2 = new RegExp("(')("+front+"[^']+')", flags);
            var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags);

            var handleAttr = function(match, g1, g2){
                if(g2.charAt(0) == '"') return g1+g2.replace(at1, by);
                if(g2.charAt(0) == "'") return g1+g2.replace(at2, by);
                return g1+g2.replace(at3, by);
            };
            html = html.replace(selector, function(match){
                return prefix + match.replace(re1, handleAttr);
            });
        }

        /* <meta http-equiv=refresh content="  ; url= " > */
        html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "<!-- meta http-equiv=refresh stripped-->");

        /* Stripping all scripts.
           The "!" is optional so that both the standard "<![CDATA[" marker and
           the "<[CDATA[" spelling matched by the original pattern are covered. */
        html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<!?\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "<!--CDATA script-->");
        html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
        cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */

        cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */
        cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */

        cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */
        cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */

        /* <param name=movie value= >*/
        cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value");

        /* <style> and < style=  > url()*/
        cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)");
        cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")"));

        /* IE7- CSS expression() */
        cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)");
        cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
        return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix);
    }

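Explanation of the code

The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function has been completely rewritten, though, in order to achieve maximum efficiency and reliability.

Additionally, a new set of RegExps has been added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--"'-->. This prefix is necessary to correctly parse nested "event handlers" in combination with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.

These RegExps are created dynamically by the internal functions cr / cri (Create Replace [Inline]). These functions accept a list of arguments, then create and execute an advanced RE replacement. To make sure that HTML entities don't break a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by the function ae (Any Entity).
The actual replacements are done by the function by (replace by). In this implementation, by adds data- before all matched attributes.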
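The idea behind ae can be illustrated with a much smaller function. The following is only a sketch, not the ae used by sanitiseHTML: the name entityTolerant is made up for this example, and it skips the caching (ents, charMap) that the real function performs. It builds a RegExp source in which every letter may be written literally or as a decimal or hexadecimal character reference.

```javascript
// Hypothetical helper, simplified from the idea behind ae(): every character
// of the word may appear literally, or as a numeric character reference
// (decimal or hexadecimal, lower- or upper-case variant of the letter).
function entityTolerant(word) {
    var entityEnd = "(?:;|(?!\\d))"; // a reference ends at ";" or before a non-digit
    var source = "";
    for (var i = 0; i < word.length; i++) {
        var lower = word.charAt(i).toLowerCase();
        var upper = word.charAt(i).toUpperCase();
        // Literal form, escaped when it is not a plain letter or digit
        var alternatives = [lower.replace(/[^a-z0-9]/g, "\\$&")];
        var variants = lower === upper ? [lower] : [lower, upper];
        for (var j = 0; j < variants.length; j++) {
            var code = variants[j].charCodeAt(0);
            alternatives.push("&#0*" + code + entityEnd);
            alternatives.push("&#x0*" + code.toString(16) + entityEnd);
        }
        source += "(?:" + alternatives.join("|") + ")";
    }
    return source;
}

// Anchored, case-insensitive pattern for the word "refresh"
var refreshPattern = new RegExp("^" + entityTolerant("refresh") + "$", "i");
var matchesLiteral = refreshPattern.test("refresh");
var matchesDecimal = refreshPattern.test("&#114;efresh");  // 114 is "r"
var matchesHex     = refreshPattern.test("&#x52;EFRESH");  // 0x52 is "R"
var matchesOther   = refreshPattern.test("refreshed");
```

With such a pattern, a filter that looks for http-equiv=refresh is not bypassed simply by writing one letter of the value as a character reference.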

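All occurrences of <script>//<![CDATA[ .. //]]></script> are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it is safe to go on to the next replacement:
The remaining <script>...</script> tags are removed.
The <meta http-equiv=refresh .. > tag is removed.
All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.
An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame is made invisible and appended to the document, so that the DOM can be accessed. document.write() is used to write the HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
If a callback function has been specified, the function is called with two arguments. The first argument is a reference to the generated document object. The second argument is a function which destroys the generated DOM tree when called. This function should be called once you don't need the tree any more.
If no callback function has been specified, the function returns an object with two properties (doc and destroy), which behave the same as the previously mentioned arguments.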
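The two script-stripping passes and the data- prefixing described above can be sketched in isolation. This is a deliberately simplified illustration, not the sanitiseHTML implementation: the names stripScriptsSketch and prefixAttributesSketch are made up for this example, the patterns assume tags without > inside attribute values, and entity-encoded or unquoted edge cases are not handled.

```javascript
// Hypothetical, simplified helpers; they only illustrate the order of the steps.

function stripScriptsSketch(html) {
    // Pass 1: scripts whose body is wrapped in a commented-out CDATA section.
    // Such a body may contain the text "</script>", so it has to go first.
    html = html.replace(
        /<script[^>]*>\s*\/\/\s*<!\[CDATA\[[\S\s]*?\]\]>\s*<\/script[^>]*>/gi,
        "<!--CDATA script-->"
    );
    // Pass 2: every remaining <script>...</script> pair.
    return html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
}

function prefixAttributesSketch(html) {
    // For each opening tag, rename href, src and on* attributes to data-href,
    // data-src, data-on*. The leading character class requires a character
    // that cannot be part of an attribute name, so "data-href" is left alone.
    return html.replace(/<[a-z][^>]*>/gi, function (tag) {
        return tag.replace(/([^-a-z0-9:._])((?:href|src|on[a-z]+)\s*=)/gi, "$1data-$2");
    });
}

var cdataScript = '<script>//<![CDATA[\nvar s = "</script>";\n//]]></script>';
var stripped = stripScriptsSketch("<p>a</p>" + cdataScript + "<script>evil()</script>");
var prefixed = prefixAttributesSketch('<a href="x.html" onclick="evil()">x</a><img src="a.png">');
```

If only the second pass were applied to cdataScript, the match would end at the "</script>" inside the string literal and leave the rest of the script body in the document, which is why the CDATA pass runs first.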

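Additional notes

Setting the designMode property to "On" stops a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can set designMode to "On" on the frame's document (iframe.contentWindow.document.designMode = "On") instead of using the script-stripping feature.
I wasn't able to find a reliable source for the htmlfile ActiveXObject. According to this source, htmlfile is slower than IFrames and more susceptible to memory leaks.
All affected attributes (href, src, ...) are prefixed by data-. Getting and changing these attributes is shown here for data-href:
elem.getAttribute("data-href") and elem.setAttribute("data-href", "...")
elem.dataset.href and elem.dataset.href = "..." (only where dataset is available; old versions of Internet Explorer do not support it).
External resources have been disabled. As a result, the page may look completely different:
~~<link rel="stylesheet" href="main.css" />~~ No external styles
~~<script>document.body.bgColor="red";</script>~~ No scripted styles
<img src="128x128.png" /> No images: the size of the element may be completely different.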

Examples

sanitiseHTML(html)
Paste this bookmarklet into the location bar. It offers an option to inject a textarea that shows the sanitised HTML string.

 javascript:void(function(){var s=document.createElement("script");s.src="html-sanitizer.js";document.body.appendChild(s)})();

Code examples for string2dom(html):

    string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){
        alert(doc.title); /* Alert: "Test" */
        destroy();
    });

    var test = string2dom("<div id='secret'></div>");
    alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */
    test.destroy();

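Notable references

SO: JS RE to change all relative to absolute URLs: the function sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
Elements, Embedded content: a complete list of standard embedded elements
Elements, Previous HTML elements: an additional list of (deprecated) elements (such as <applet>)
The htmlfile ActiveX object: "Slower than the iframe sandbox. Leaks memory if not managed"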

Not sure why you're messing with documentFragments; you can simply set the HTML text as the innerHTML of a new div element. You can then use that div element for getElementsByTagName etc. without adding the div to the DOM:

    var htmlText = '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>';

    var d = document.createElement('div');
    d.innerHTML = htmlText;

    console.log(d.getElementsByTagName('div'));

If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:

    function makeDocumentFragment(htmlText) {
        var range = document.createRange();
        var frag = range.createContextualFragment(htmlText);
        var d = document.createElement('div');
        d.appendChild(frag);
        return d;
    }

Assuming the HTML is also valid XML, you can use loadXML().

DocumentFragment doesn't support getElementsByTagName; it is only available on Document and Element nodes.

You may need to use a library like jsdom, which provides an implementation of the DOM through which you can search using getElementsByTagName and other DOM APIs. You can configure it not to execute scripts. Yes, it's "heavy", and I don't know whether it works in IE 7.

I'm not sure whether IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved.

    var doc = document.implementation.createHTMLDocument("")
      , doc_elt = doc.documentElement
      , first_elt
    ;
    doc_elt.innerHTML = your_html_here;
    first_elt = doc_elt.firstElementChild;
    if ( // are we dealing with an entire document or a fragment?
           doc_elt.childElementCount === 1
        && first_elt.tagName.toLowerCase() === "html"
    ) {
        doc.replaceChild(first_elt, doc_elt);
    }

    // doc is an HTML document
    // you can now reference stuff like doc.title, etc.

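Just wandered across this page, and I'm a bit late to be of any use :) but the following should help anyone with a similar problem in future... however, IE7/8 should really be ignored by now, and there are much better methods supported by more modern browsers.

The following works in nearly everything I've tested. The only two downsides are:

I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't be available as expected further down the tree (unless the code is modified to cater for this).
The doctype will be ignored. However, I don't think this makes much difference, as my experience is that the doctype doesn't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).

Basically, the system relies on the fact that <tag> and <namespace:tag> are treated differently by the user agent. As has been found, certain special tags cannot exist within a div element, and so they are removed. Namespaced elements can be placed anywhere (unless stated otherwise). Whilst these namespaced tags won't actually behave like the real tags in question, considering we are only really using them for their structural position in the document, this doesn't really cause a problem.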
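The string rewrite at the heart of this approach can be tried on its own, without a browser. The following is a minimal sketch under a few assumptions: the function name namespaceSpecialTags is made up for this example, and a \b word boundary is added to the pattern (it is not part of the code in this answer) so that a tag such as <header> is not mistaken for <head>.

```javascript
// Hypothetical helper: prefix the "special" tags with a fake namespace so that
// they survive being assigned to the innerHTML of a div.
function namespaceSpecialTags(src, ns) {
    var special = ['html', 'head', 'body', 'title'];
    // $1 = optional "/", $2 = tag name, $3 = the rest of the tag (attributes)
    var re = new RegExp('<(/?)(' + special.join('|') + ')\\b([^>]*)>', 'gi');
    return src.replace(re, '<$1' + ns + '$2$3>');
}

var namespaced = namespaceSpecialTags(
    '<html><head><title>T</title></head><body id="b"><header>h</header></body></html>',
    'faux:'
);
// namespaced is now:
// <faux:html><faux:head><faux:title>T</faux:title></faux:head><faux:body id="b"><header>h</header></faux:body></faux:html>
```

Attributes are carried over unchanged, which is why a lookup by id still finds the original body element after the rewrite.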

The markup and code are as follows:

    <!DOCTYPE html>
    <html>
    <head>
    <script>

      /// function for parsing HTML source to a dom structure
      /// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9,
      /// Chrome, Safari & Opera.
      function parseHTML(src){

        /// create a random div, this will be our root
        var div = document.createElement('div'),
            /// specify our namespace prefix
            ns = 'faux:',
            /// state which tags we will treat as "special"
            /// (comma, not semicolon, so that re and gtn stay local variables)
            stn = ['html','head','body','title'],
            /// the reg exp for replacing the special tags
            re = new RegExp('<(/?)('+stn.join('|')+')([^>]*)?>','gi'),
            /// remember the getElementsByTagName function before we override it
            gtn = div.getElementsByTagName;

        /// a quick function to namespace certain tag names
        var nspace = function(tn){
          if ( stn.indexOf ) {
            return stn.indexOf(tn) != -1 ? ns + tn : tn;
          }
          else {
            /// match whole names only, so that 'b' is not treated as 'body'
            return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
          }
        };

        /// search and replace our source so that special tags are namespaced
        /// &nbsp; required for IE7/8 to render tags before first text found
        /// <faux:check /> tag added so we can test how namespaces work
        src = '&nbsp;<'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');
        /// inject to the div
        div.innerHTML = src;
        /// quick test to see how we support namespaces in TagName searches
        if ( !div.getElementsByTagName(ns+'check').length ) {
          ns = '';
        }

        /// create our replacement getByName and getById functions
        var createGetElementByAttr = function(attr, collect){
          var func = function(a,w){
            var i,c,e,f,l,o; w = w||[];
            if ( this.nodeType == 1 ) {
              if ( this.getAttribute(attr) == a ) {
                if ( collect ) {
                  w.push(this);
                }
                else {
                  return this;
                }
              }
            }
            else {
              return false;
            }
            if ( (c = this.childNodes) && (l = c.length) ) {
              for( i=0; i<l; i++ ){
                if( (e = c[i]) && (e.nodeType == 1) ) {
                  if ( (f = func.call( e, a, w )) && !collect ) {
                    return f;
                  }
                }
              }
            }
            return (w.length?w:false);
          }
          return func;
        }

        /// apply these replacement functions to the div container, obviously
        /// you could add these to prototypes for browsers that support element
        /// constructors. For other browsers you could step each element and
        /// apply the functions through-out the node tree... however this would
        /// be quite messy, far better just to always call from the root node -
        /// or use div.getElementsByTagName.call( localElement, 'tag' );
        div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
        div.getElementsByName    = createGetElementByAttr('name', true);
        div.getElementById       = createGetElementByAttr('id', false);

        /// return the final element
        return div;
      }

      window.onload = function(){

        /// parse the HTML source into a node tree
        var dom = parseHTML( document.getElementById('source').innerHTML );

        /// test some look ups :)
        var a = dom.getElementsByTagName('head'),
            b = dom.getElementsByTagName('title'),
            c = dom.getElementsByTagName('script'),
            d = dom.getElementById('body');

        /// alert the result
        alert(a[0].innerHTML);
        alert(b[0].innerHTML);
        alert(c[0].innerHTML);
        alert(d.innerHTML);

      }
    </script>
    </head>
    <body>
      <xmp id="source">
        <!DOCTYPE html>
        <html>
        <head>
          <!-- Comment //-->
          <meta charset="utf-8">
          <meta name="robots" content="index, follow">
          <title>An example</title>
          <link href="test.css" />
          <script>alert('of parsing..');</script>
        </head>
        <body id="body">
          <b>in a similar way to createDocumentFragment</b>
        </body>
        </html>
      </xmp>
    </body>
    </html>
