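RegEx match open tags except XHTML self-contained tags

I need to match all of these opening tags:

 <p> <a href="foo"> 

But not these:

 <br /> <hr class="foo" /> 

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

 <([a-z]+) *[^/]*?> 

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?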

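You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. Regex-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see it it is beautiful the final snuffing of the lies of Man ALL IS LOST ALL IS LOST the pon̷y he comes he c̶omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOOOO NΘ stop the ang͇̫͖͉l̗̩̳̟e̠̅s ͎a̧͈͖r̽̾̈͒͑e not rȓͧ̌eaͨl̘̝̙ͤ̾̆ ZA̡͊͠͝LGΌ ISͮ҉̯͈͕̹̘ TO͇̹̺Ɲ̴ȳ̳ TH̘Ë͖̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾Ȩ̬̩̾͛ͪ̈͘ C̷̙̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔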


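Have you tried using an XML parser instead?


Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look; there are no problems with its content. Please do not flag it for our attention.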

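While asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system, it is sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off the Parliament's web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.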

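I think the flaw here is that HTML is a Chomsky Type 2 grammar (context-free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work. But many will try, some will claim success, and others will find the fault and totally mess you up.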

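Don't listen to these guys. You actually can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

  1. Solve the Halting Problem.
  2. Square a circle (simulate the "ruler and compass" method for this).
  3. Work out the Traveling Salesman Problem in O(log n). It needs to be fast or the generator will hang.
  4. The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
  5. Almost there: just divide the whole thing by zero. Easy-peasy.

I haven't quite finished the last part myself, but I know I'm getting close. My code keeps throwing CthulhuRlyehWgahnaglFhtagnException lately, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.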

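Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:

 <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+> 

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

To make it not match self-contained tags, you'd either want to use Kobi's negative look-behind:

 <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)> 

or just combine if and if not.

To downvoters: this is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. The good news is, you can get rid of those using a regex...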
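As a quick check of what the tag pattern above accepts, here is a minimal Python sketch. Python is an assumption (the answer does not name a language), and the names `TAG`, `OPEN_TAG`, and `sample` are made up for illustration. Python's `re` module only allows fixed-width look-behind, so the sketch swaps `(?<!/\s*)` for the narrower `(?<!/)`, which rejects `/>` but not `/ >`.

```python
import re

# The answer's pattern: quoted attribute values may contain '>'.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")

# Variant that skips tags ending in "/>" (fixed-width look-behind,
# a narrower stand-in for the answer's (?<!/\s*)).
OPEN_TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/)>""")

# Hypothetical input: a '>' inside a quoted value, a doubled quote,
# and a self-closing tag.
sample = 'x <a title="a > b"> y <a name="badgenerator""> z <br/>'

all_tags = TAG.findall(sample)
open_tags = OPEN_TAG.findall(sample)

print(all_tags)
print(open_tags)
```

Both patterns still match closing tags such as `</a>`; filtering those out would need an extra `(?!/)` after the `<`.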

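Some people will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid if they want to use strange words). They are lying.

Some people will tell you that regular expressions shouldn't be recursive. They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance.

You can live in their reality or take the red pill.

Like Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the stack-based Regex-Verse and returned with powers and knowledge you can't imagine. Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult.

I think the XML case is quite simple. The RegEx (in the .NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this: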

 7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28 995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F 86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169 OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7 O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52 MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU 1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY 12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37 R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn 3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25 D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS 
mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8 DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3 zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX /ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj 4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6 mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z 0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26 7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29 7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9 r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa 2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8 fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+ +fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx +r/vD34mUADO1P4/AQAA//8= 

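The option to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty, then there was a parsing error and the regex stopped.

If you have problems converting it back to a human-readable regex, this should help:

 using System;
 using System.IO;
 using System.IO.Compression;
 using System.Text;

 static string FromBase64(string str)
 {
     // Decode the base64 text, then inflate the raw DEFLATE stream.
     byte[] byteArray = Convert.FromBase64String(str);

     using (var msIn = new MemoryStream(byteArray))
     using (var msOut = new MemoryStream())
     {
         using (var ds = new DeflateStream(msIn, CompressionMode.Decompress))
         {
             ds.CopyTo(msOut);
         }

         return Encoding.UTF8.GetString(msOut.ToArray());
     }
 }

If you are unsure: no, I'm NOT kidding (but perhaps I'm lying). It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests. It's a tokenizer, not a full-blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods:

regex to tokenize an XML or the full plain regex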

In shell, you can parse HTML using:

  • sed, though:

    1. Turing.sed
    2. Write an HTML parser (homework)
    3. ???
    4. Profit!
  • hxselect from the html-xml-utils package

  • vim / ex (which can easily jump between HTML tags), for example:

    • removing the style tag together with its inner code:

       $ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin 
  • grep, for example:

    • extracting the outer HTML of H1:

       $ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
       <h1>Example Domain</h1>
    • extracting the body:

       $ curl -s http://example.com/ | tr '\n' ' ' | grep -o '<body>.*</body>'
       <body> <div> <h1>Example Domain</h1> ...
  • html2text for plain-text parsing:

    • like parsing tables:

       $ html2text foo.txt | column -ts'|' 
  • using xpath (XML::XPath perl module), see the example here

  • perl or Python (see @Gilles' example)

  • for parsing multiple files at once, see: How to parse hundred html source code files in shell?


Related (why you shouldn't use regex matching):

  • If You Like Regular Expressions So Much, Why Don't You Marry Them?
  • Regular Expressions: Now You Have Two Problems
  • Hacking stackoverflow.com's HTML sanitizer

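I agree that the right tool to parse XML, and especially HTML, is a parser and not a regular expression engine. However, as others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format.

Microsoft actually has a section on Best Practices for Regular Expressions in the .NET Framework, and it specifically talks about considering the input source.

Regular expressions do have limitations, but have you considered the following?

The .NET framework is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

  • See Matching Balanced Constructs with .NET Regular Expressions
  • See .NET Regular Expressions: Regex and Balanced Matching
  • See Microsoft's docs on Balancing Group Definitions

For this reason, I believe you CAN parse XML using regular expressions. Note, however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" allows the regular expression engine to act as a PDA (pushdown automaton).

Quote from article 1 cited above:

.NET Regular Expression Engine

As described above, properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

  • (?<group>) - pushes the captured result on the capture stack with the name group.
  • (?<-group>) - pops the topmost capture with the name group off the capture stack.
  • (?(group)yes|no) - matches the yes part if there exists a group with the name group, otherwise matches the no part.

These constructs allow a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows the non-traditional .NET regular expressions to recognize individual properly balanced constructs.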
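The "increment, decrement, compare to zero" idea can be sketched outside .NET as well. The Python snippet below is only an illustration under stated assumptions: Python's `re` has no balancing groups, so the counter lives in ordinary code, and the names `TOKEN` and `balanced_span` are hypothetical. It assumes well-formed input, as the answer does.

```python
import re

# One token per comment, self-closing tag, closing tag, or opening tag.
TOKEN = re.compile(r"<!--.*?-->|<[^>]*/>|</[^>]*>|<[^>]*>", re.DOTALL)

def balanced_span(html, open_tag_regex):
    """Return the text from the first tag matching open_tag_regex up to
    its balancing close tag, or None if the tags never balance."""
    start = re.search(open_tag_regex, html)
    if start is None:
        return None
    depth = 0
    for tok in TOKEN.finditer(html, start.start()):
        text = tok.group()
        if text.startswith("<!--") or text.endswith("/>"):
            continue            # comments and self-closing tags: no change
        if text.startswith("</"):
            depth -= 1          # "pop"
        else:
            depth += 1          # "push"
        if depth == 0:          # "empty": the construct is balanced
            return html[start.start():tok.end()]
    return None

html = ('<div><br /><ul id="matchMe" type="square"><li>a</li>'
        '<li><ul><li>b</li></ul></li></ul></div>')
print(balanced_span(html, r'<ul\s+id="matchMe"[^>]*>'))
```

A single integer is enough here because only the nesting depth matters, not which tag name was opened; that is the "simple counter" subset of context-free languages the quoted article describes.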

Consider the following regular expression:

 (?=<ul\s+id="matchMe"\s+type="square"\s*>)
 (?>
    <!-- .*? -->                  |
    <[^>]*/>                      |
    (?<opentag><(?!/)[^>]*[^/]>)  |
    (?<-opentag></[^>]*[^/]>)     |
    [^<>]*
 )*
 (?(opentag)(?!))

Use the flags:

  • Singleline
  • IgnorePatternWhitespace (not necessary if you collapse the regex and remove all whitespace)
  • IgnoreCase (not necessary)

Regular Expression Explained (inline)

 (?=<ul\s+id="matchMe"\s+type="square"\s*>)  # match start with <ul id="matchMe"...
 (?>                                         # atomic group / don't backtrack (faster)
    <!-- .*? -->                  |          # match xml / html comment
    <[^>]*/>                      |          # self closing tag
    (?<opentag><(?!/)[^>]*[^/]>)  |          # push opening xml tag
    (?<-opentag></[^>]*[^/]>)     |          # pop closing xml tag
    [^<>]*                                   # something between tags
 )*                                          # match as many xml tags as possible
 (?(opentag)(?!))                            # ensure no 'opentag' groups are on stack

You can try this at A Better .NET Regular Expression Tester.

I used the following sample source:

 <html>
 <body>
     <div>
        <br />
        <ul id="matchMe" type="square">
           <li>stuff...</li>
           <li>more stuff</li>
           <li>
               <div>
                    <span>still more</span>
                    <ul>
                         <li>Another &gt;ul&lt;, oh my!</li>
                         <li>...</li>
                    </ul>
               </div>
           </li>
        </ul>
     </div>
 </body>
 </html>

This found the match:

 <ul id="matchMe" type="square">
    <li>stuff...</li>
    <li>more stuff</li>
    <li>
        <div>
             <span>still more</span>
             <ul>
                  <li>Another &gt;ul&lt;, oh my!</li>
                  <li>...</li>
             </ul>
        </div>
    </li>
 </ul>

although it actually came out like this:

 <ul id="matchMe" type="square"> <li>stuff...</li> <li>more stuff</li> <li> <div> <span>still more</span> <ul> <li>Another &gt;ul&lt;, oh my!</li> <li>...</li> </ul> </div> </li> </ul> 

Lastly, I really enjoyed Jeff Atwood's article: Parsing Html The Cthulhu Way. Funnily enough, it cites the answer to this question that currently has over 4k votes.

I suggest using QueryPath for parsing XML and HTML in PHP. It has basically the same syntax as jQuery, only it runs on the server side.

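While the answers saying that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:

 <([a-z]+) *[^/]*?> 

If you add something to the regex, backtracking can force it to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be

 <([a-z]+)[^>]*(?<!/)> 

where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.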
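The difference between the two patterns discussed above can be checked with a short Python sketch (Python is an assumption; the answer talks about Perl regexes, and both patterns use only syntax that Python's `re` also accepts). The names `ORIGINAL`, `SUGGESTED`, and `sample` are illustrative.

```python
import re

ORIGINAL = re.compile(r"<([a-z]+) *[^/]*?>")      # the asker's pattern
SUGGESTED = re.compile(r"<([a-z]+)[^>]*(?<!/)>")  # this answer's pattern

sample = '<p> <a href="foo"> <br /> <hr class="foo" />'

# Only the opening tags survive; the self-closing ones are skipped.
opening = [m.group(0) for m in SUGGESTED.finditer(sample)]
print(opening)

# Forcing a match over the whole string shows how permissive [^/] is:
# it is allowed to swallow a '>' when the rest of the pattern demands it.
print(ORIGINAL.fullmatch("<a >>") is not None)
print(SUGGESTED.fullmatch("<a >>") is not None)

# The caveat from the answer: a slash followed by a space still passes.
print(SUGGESTED.fullmatch("<a/ >") is not None)
```

`fullmatch` stands in here for "adding something to the regex": it forces the engine to backtrack until the whole input is consumed.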

Try:

 <([^\s]+)(\s[^>]*?)?(?<!/)> 

It is similar to yours, but the last > must not be after a slash, and also accepts h1 .

Sun Tzu, an ancient Chinese strategist, general, and philosopher, said:

It is said that if you know your enemies and know yourself, you can win a hundred battles without a single loss. If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself.

In this case your enemy is HTML and you are either yourself or regex. You might even be Perl with irregular regex. Know HTML. Know yourself.

I have composed a haiku describing the nature of HTML.

 HTML has
 complexity exceeding
 regular language.

I have also composed a haiku describing the nature of regex in Perl.

 The regex you seek
 is defined within the phrase
 <([a-zA-Z]+)(?:[^>]*[^/]*)?>

 <?php
 // Element names that are self-closing.
 $selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

 $html = ' <p><a href="#">foo</a></p> <hr/> <br/> <div>name</div>';

 $dom = new DOMDocument();
 $dom->loadHTML($html);
 $els = $dom->getElementsByTagName('*');

 foreach ( $els as $el ) {
     $nodeName = strtolower($el->nodeName);
     if ( !in_array( $nodeName, $selfClosing ) ) {
         var_dump( $nodeName );
     }
 }

Output:

 string(4) "html"
 string(4) "body"
 string(1) "p"
 string(1) "a"
 string(3) "div"

Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them.
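The same filter-by-name idea can be sketched in Python with the standard library's `html.parser`. This is not the PHP DOM API used above, just a rough counterpart; the class name `OpenTagCollector` and the helper `open_tags` are made up for illustration. Unlike `DOMDocument::loadHTML`, this parser does not add implied `html` and `body` elements, so those names do not appear in the result.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
VOID = set(
    "area,base,basefont,br,col,frame,hr,img,input,isindex,"
    "link,meta,param,embed".split(",")
)

class OpenTagCollector(HTMLParser):
    """Collects names of opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; skip elements that never close.
        if tag not in VOID:
            self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Explicit XHTML-style <tag /> syntax: ignore entirely.
        pass

def open_tags(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names

print(open_tags('<p><a href="#">foo</a></p> <hr/> <br> <div>name</div>'))
```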

I'm sure you already know by now that you shouldn't use regex for this purpose.

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack ?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

You want the first > not preceded by a / . Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naïve implementation of that will end up matching <bar/></foo> in this example document

 <foo><bar/></foo> 

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
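The pitfall described above can be reproduced with a small Python sketch. The answer does not show its pattern, so both `NAIVE` and `BOUNDED` are assumed stand-ins: the first lets `.` run across tag boundaries, the second confines the match to a single tag with `[^>]`.

```python
import re

doc = "<foo><bar/></foo>"

# Assumed naive form: "first > not preceded by /", with '.' free to
# cross tag boundaries.
NAIVE = re.compile(r"<.+?(?<!/)>")

# Assumed fix: never let the match run past the tag's own '>'.
BOUNDED = re.compile(r"<[^>]+(?<!/)>")

naive_hits = NAIVE.findall(doc)
bounded_hits = BOUNDED.findall(doc)

print(naive_hits)
print(bounded_hits)
```

The bounded form still reports the closing tag `</foo>`; excluding closing tags takes a further `(?!/)` right after the `<`.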

The W3C explains parsing in a pseudo regexp form:
W3C Link

Follow the var links for QName , S , and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

If you need this for PHP:

The PHP DOM functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [Will crash on large pages.]

I have never used querypath , so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters – I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted – keep things within perspective of the question, please.

I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node. Check it out and see if it can help you.

Whenever I need to quickly extract something from an HTML document, I use Tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this:

 //p/a[@href='foo'] 

I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken. This code is my main parser (Perl edition):

 $_ = join "",<STDIN>;
 tr/\n\r \t/ /s;
 s/</\n</g;
 s/>/>\n/g;
 s/\n ?\n/\n/g;
 s/^ ?\n//s;
 s/ $//s;
 print

It's called htmlsplit, splits the HTML into lines, with one tag or chunk of text on each line. The lines can then be processed further with other text tools and scripts, such as grep , sed , Perl, etc. I'm not even joking 🙂 Enjoy.

It is simple enough to rejig my slurp-everything-first Perl script into a nice streaming thing, if you wish to process enormous web pages. But it's not really necessary.

I bet I will get downvoted for this.

HTML Split


Against my expectation this got some upvotes, so I'll suggest some better regular expressions:

 /(<.*?>|[^<]+)\s*/g    # get tags and text
 /(\w+)="(.*?)"/g       # get attributes

They are good for XML / XHTML.

With minor variations, it can cope with messy HTML… or convert the HTML -> XHTML first.


The best way to write regular expressions is in the Lex / Yacc style, not as opaque one-liners or commented multi-line monstrosities. I didn't do that here, yet; these ones barely need it.

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin ), this works.

Here is the solution:

 <?php
 // here's the pattern:
 $pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*(\/>|>)/';

 // a string to parse:
 $string = 'Hello, try clicking <a href="#paragraph">here</a> <br/>and check out.<hr /> <h2>title</h2> <a name ="paragraph" rel= "I\'m an anchor"></a> Fine, <span title=\'highlight the "punch"\'>thanks<span>. <div class = "clear"></div> <br>';

 // let's get the occurrences:
 preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);

 // print the result:
 print_r($matches[0]);
 ?>

To test it deeply, I entered in the string auto-closing tags like:

  1. <hr />
  2. <BR/>
  3. <br>

I also entered tags with:

  1. one attribute
  2. more than one attribute
  3. attributes which value is bound either into single quotes or into double quotes
  4. attributes containing single quotes when the delimiter is a double quote and vice versa
  5. "unpretty" attributes with a space before the "=" symbol, after it and both before and after it.

Should you find something that does not work in the proof of concept above, I am happy to analyze the code to improve my skills.

<EDIT> I forgot that the question from the user was to avoid the parsing of self-closing tags. In this case the pattern is simpler, turning into this:

 $pattern = '/<(\w+)(\s+(\w+)\s*\=\s*(\'|")(.*?)\\4\s*)*\s*>/'; 

The user @ridgerunner noticed that the pattern does not allow unquoted attributes or attributes with no value . In this case a fine tuning brings us the following pattern:

 $pattern = '/<(\w+)(\s+(\w+)(\s*\=\s*(\'|"|)(.*?)\\5\s*)?)*\s*>/'; 

</EDIT>

Understanding the pattern

If someone is interested in learning more about the pattern, here is a short walkthrough:

  1. the first sub-expression (\w+) matches the tag name
  2. the second sub-expression contains the pattern of an attribute. It is composed of:
    1. one or more whitespaces \s+
    2. the name of the attribute (\w+)
    3. zero or more whitespaces \s* (blanks here are optional)
    4. the "=" symbol
    5. again, zero or more whitespaces
    6. the delimiter of the attribute value, a single or double quote ('|"). In the pattern, the single quote is escaped because it coincides with the PHP string delimiter. This sub-expression is captured with the parentheses so it can be referenced again to match the closing delimiter of the attribute, which is why it is very important.
    7. the value of the attribute, matched by almost anything: (.*?); the question mark after the asterisk makes the match non-greedy (lazy), so the RegExp engine consumes as few characters as possible and stops at the first point where what follows this sub-expression (the closing delimiter) can match
    8. here comes the fun: the \4 part is a backreference operator , which refers to a sub-expression defined before in the pattern, in this case, I am referring to the fourth sub-expression, which is the first attribute delimiter found
    9. zero or more whitespaces \s*
    10. the attribute sub-expression ends here, with the specification of zero or more possible occurrences, given by the asterisk.
  3. Then, since a tag may end with a whitespace before the ">" symbol, zero or more whitespaces are matched with the \s* subpattern.
  4. The tag to match may end with a simple ">" symbol, or a possible XHTML closure, which makes use of the slash before it: (/>|>). The slash is, of course, escaped since it coincides with the regular expression delimiter.
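The walkthrough above can be tried out with a short Python sketch. It is a port under assumptions, not the author's PHP code: it uses the open-tag-only variant of the pattern (the one ending in `\s*>`), and the names `ATTR_TAG`, `sample`, and `matches` are illustrative. The backreference `\4` behaves the same way in Python's `re`.

```python
import re

# Tag name, then zero or more name = 'value' / name = "value" attributes,
# where \4 must repeat whichever quote opened the value.
ATTR_TAG = re.compile(r"""<(\w+)(\s+(\w+)\s*=\s*('|")(.*?)\4\s*)*\s*>""")

# Hypothetical input with a self-closing tag, mixed quoting, and
# spaces around '='.
sample = (
    '<a href="#p">here</a> <br/> <hr /> '
    '<span title=\'say "hi"\'>x</span> <div class = "clear"></div>'
)

matches = [m.group(0) for m in ATTR_TAG.finditer(sample)]
for tag in matches:
    print(tag)
```

Because the closing delimiter is matched with `\4` rather than a character class, a double quote inside a single-quoted value (and vice versa) does not end the value early.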

Small tip: to better analyze this code, look at the generated page source, since I did not escape any HTML special characters.

There are some nice regexes for replacing HTML with BBCode here . For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

For example:

 $store =~ s/http:/http:\/\//gi;
 $store =~ s/https:/https:\/\//gi;
 $baseurl = $store;
 if (!$query->param("ascii")) {
     $html =~ s/\s\s+/\n/gi;
     $html =~ s/<pre(.*?)>(.*?)<\/pre>/\[code]$2\[\/code]/sgmi;
 }
 $html =~ s/\n//gi;
 $html =~ s/\r\r//gi;
 $html =~ s/$baseurl//gi;
 $html =~ s/<h[1-7](.*?)>(.*?)<\/h[1-7]>/\n\[b]$2\[\/b]\n/sgmi;
 $html =~ s/<p>/\n\n/gi;
 $html =~ s/<br(.*?)>/\n/gi;
 $html =~ s/<textarea(.*?)>(.*?)<\/textarea>/\[code]$2\[\/code]/sgmi;
 $html =~ s/<b>(.*?)<\/b>/\[b]$1\[\/b]/gi;
 $html =~ s/<i>(.*?)<\/i>/\[i]$1\[\/i]/gi;
 $html =~ s/<u>(.*?)<\/u>/\[u]$1\[\/u]/gi;
 $html =~ s/<em>(.*?)<\/em>/\[i]$1\[\/i]/gi;
 $html =~ s/<strong>(.*?)<\/strong>/\[b]$1\[\/b]/gi;
 $html =~ s/<cite>(.*?)<\/cite>/\[i]$1\[\/i]/gi;
 $html =~ s/<font color="(.*?)">(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
 $html =~ s/<font color=(.*?)>(.*?)<\/font>/\[color=$1]$2\[\/color]/sgmi;
 $html =~ s/<link(.*?)>//gi;
 $html =~ s/<li(.*?)>(.*?)<\/li>/\[\*]$2/gi;
 $html =~ s/<ul(.*?)>/\[list]/gi;
 $html =~ s/<\/ul>/\[\/list]/gi;
 $html =~ s/<div>/\n/gi;
 $html =~ s/<\/div>/\n/gi;
 $html =~ s/<td(.*?)>/ /gi;
 $html =~ s/<tr(.*?)>/\n/gi;
 $html =~ s/<img(.*?)src="(.*?)"(.*?)>/\[img]$baseurl\/$2\[\/img]/gi;
 $html =~ s/<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>/\[url=$baseurl\/$2]$4\[\/url]/gi;
 $html =~ s/\[url=$baseurl\/http:\/\/(.*?)](.*?)\[\/url]/\[url=http:\/\/$1]$2\[\/url]/gi;
 $html =~ s/\[img]$baseurl\/http:\/\/(.*?)\[\/img]/\[img]http:\/\/$1\[\/img]/gi;
 $html =~ s/<head>(.*?)<\/head>//sgmi;
 $html =~ s/<object>(.*?)<\/object>//sgmi;
 $html =~ s/<script(.*?)>(.*?)<\/script>//sgmi;
 $html =~ s/<style(.*?)>(.*?)<\/style>//sgmi;
 $html =~ s/<title>(.*?)<\/title>//sgmi;
 $html =~ s/<!--(.*?)-->/\n/sgmi;
 $html =~ s/\/\//\//gi;
 $html =~ s/http:\//http:\/\//gi;
 $html =~ s/https:\//https:\/\//gi;
 $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;
 $html =~ s/\r\r//gi;
 $html =~ s/\[img]\//\[img]/gi;
 $html =~ s/\[url=\//\[url=/gi;

About the question of the RegExp methods to parse (x)HTML, the answer to all of the ones who spoke about some limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion .

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master , so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand… Well, I am sure about it 🙂

Here's the magic pattern:

 $pattern = "/<([\w]+)([^>]*?)(([\s]*\/>)|(>((([^<]*?|<\!\-\-.*?\-\->)|(?R))*)<\/\\1[\s]*>))/s"; 

Just try it.
It's written as a PHP string; the "s" modifier makes the dot match newlines as well.
Here's a sample note on the PHP manual that I wrote in January: Reference

(Take care, in that note I wrongly used the "m" modifier; it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

  1. according to the specific implementation of the RegExp engine, recursion may have a limit in the number of nested patterns parsed , but it depends on the language used
  2. although corrupted (x)HTML does not drive into severe errors, it is not sanitized .

Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performances are really great, both in execution times or in memory usage (nothing to do with other template engines which use the same syntax).

As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

 <\s*(\w+)[^/>]*> 

The parts explained:

< : starting character

\s* : it may have whitespaces before tag name (ugly but possible).

(\w+) : tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.

[^/>]* : anything except > and / until closing >

> : closing >
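A quick way to see what this pattern accepts and rejects is the Python sketch below (Python is an assumption; the pattern itself is copied from above, and `OPEN` and `sample` are illustrative names). It also shows a side effect of excluding `/` everywhere inside the tag: an opening tag whose attribute value contains a slash is skipped too.

```python
import re

OPEN = re.compile(r"<\s*(\w+)[^/>]*>")

sample = '<p> <h1> <a href="foo"> <br /> <hr class="foo" /> </p>'

# findall returns the captured tag names.
names = OPEN.findall(sample)
print(names)

# Side effect of [^/>]*: a slash inside an attribute blocks the match.
print(OPEN.search('<a href="/foo">'))
```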

UNRELATED

And to fellows who underestimate regular expressions saying they are only as powerful as regular languages:

aⁿbaⁿbaⁿ, which is not regular and not even context-free, can be matched with ^(a+)b\1b\1$

Backreferencing FTW !

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and its folly (readability, maintainability, etc.), but if you reduce the scope of its applications it may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes.

For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code . Your feedback will be appreciated.

There is a small article describing this work on my blog: http://roberto.open-lab.com

It seems to me you're trying to match tags without a "/" at the end. Try this:

 <([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)> 

If you're simply trying to find those tags (without ambitions of parsing) try this regular expression:

 /<[^/]*?>/g 

I wrote it in 30 seconds, and tested here: http://gskinner.com/RegExr/

It matches the types of tags you mentioned, while ignoring the types you said you wanted to ignore.

Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.