Regular expression to split HTML tags

I have an HTML string like the following:

<img src="http://foo"><img src="http://bar">

What regular expression pattern would split it into two separate img tags?

How sure are you that your string always looks exactly like that? What about input like this:

 <img alt=">" src="http://foo" > <img src='http://bar' alt='<' >

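What programming language is this? Is there some reason you aren't using a standard HTML-parsing class to handle this? Regexes are only a good approach when you have an extremely well-known set of inputs. They don't work on real HTML, only on rigged demos.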

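Even if you must use a regex, you should still use a proper grammatical one. That is quite easy. I've tested the following programacita on a zillion web pages. It takes care of the cases I outlined above, and one or two others as well.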

    #!/usr/bin/perl
    use 5.10.0;
    use strict;
    use warnings;

    my $img_rx = qr{

        # save capture in $+{TAG} variable
        (?<TAG> (?&image_tag) )

        # remainder is pure declaration
        (?(DEFINE)

            (?<image_tag>
                (?&start_tag)
                (?&might_white)
                (?&attributes)
                (?&might_white)
                (?&end_tag)
            )

            (?<attributes>
                (?:
                    (?&might_white)
                    (?&one_attribute)
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white)
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute>
                (?: (?&required_attribute)
                  | (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
                  # for LEGAL parse only, comment out next line
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute> \b \w+ \b )

            (?<required_attribute>
                alt
              | src
            )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals
            #     below DOES NOT COUNT! It's just
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                align
              | border
              | hspace
              | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

            (?<unquoted_value>
                (?&unwhite_chunk)
            )

            (?<quoted_value>
                (?<quote> ["'] )
                (?: (?! \k<quote> ) . ) *
                \k<quote>
            )

            (?<unwhite_chunk>
                (?:
                    # (?! [<>'"] )
                    (?! > )
                    \S
                ) +
            )

            (?<might_white> \s * )

            (?<start_tag>
                < (?&might_white)
                img
                \b
            )

            (?<end_tag>
                (?&html_end_tag)
              | (?&xhtml_end_tag)
            )

            (?<html_end_tag>    > )
            (?<xhtml_end_tag> / > )

        )

    }six;

    $/ = undef;
    $_ = <>;   # read all input

    # strip stuff we aren't supposed to look at
    s{ <!    DOCTYPE  .*?      > }{}sx;
    s{ <! \[ CDATA \[ .*? \]\] > }{}gsx;

    s{ <script> .*? </script> }{}gsix;
    s{ <!--     .*?       --> }{}gsx;

    my $count = 0;

    while (/$img_rx/g) {
        printf "Match %d at %d: %s\n",
               ++$count, pos(), $+{TAG};
    }

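There you go, nothing to it!

Gee, why would you ever want to use an HTML-parsing class, given how easily HTML can be dealt with in a regex. ☺

Don't do this with regular expressions. Use an HTML/XML parser. You can even run the markup through Tidy first to clean it up. Most languages have a Tidy library. What language are you using?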
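As a rough illustration of the parser route, here is a minimal sketch in Python, chosen only because the question never names a language. It uses the standard-library `html.parser` module; the names `ImgCollector` and `split_img_tags` are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect every <img> start tag, both as raw text and as parsed attributes."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []  # each tag exactly as it was written in the input
        self.attrs = []     # each tag's attributes as a dict

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <IMG ...> is picked up as well.
        if tag == "img":
            self.raw_tags.append(self.get_starttag_text())
            self.attrs.append(dict(attrs))


def split_img_tags(html):
    """Hypothetical helper: return (raw_tags, attrs) for all <img> tags in html."""
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.raw_tags, collector.attrs


raw, attrs = split_img_tags('<img src="http://foo"><img src="http://bar">')
print(raw)  # ['<img src="http://foo">', '<img src="http://bar">']

# The awkward input from earlier in the thread: '>' and '<' inside quoted values.
tricky = """<img alt=">" src="http://foo" > <img src='http://bar' alt='<' >"""
_, tricky_attrs = split_img_tags(tricky)
print([a["src"] for a in tricky_attrs])  # ['http://foo', 'http://bar']
```

The parser tracks quoting itself, so angle brackets inside attribute values need no special handling in your own code.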

This will do it:

 <img\s+src=\"[^\"]*?\">

Or you can do this to allow for any additional attributes:

 <img\s+[^>]*?\bsrc=\"[^\"]*?\"[^>]*>
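To see how the two patterns above behave, here is a small sketch in Python (an arbitrary choice, since the question names no language). The variable names and the `extra` sample string are assumptions made for this example; `tricky` is the awkward input quoted earlier in the thread.

```python
import re

# The two patterns from above, written as Python raw strings.
SIMPLE = re.compile(r'<img\s+src="[^"]*?">')
WITH_ATTRS = re.compile(r'<img\s+[^>]*?\bsrc="[^"]*?"[^>]*>')

html = '<img src="http://foo"><img src="http://bar">'
print(SIMPLE.findall(html))
# ['<img src="http://foo">', '<img src="http://bar">']

# Extra attributes before or after src need the second pattern.
extra = '<img class="a" src="http://foo" alt="x"><img src="http://bar">'
print(WITH_ATTRS.findall(extra))
# ['<img class="a" src="http://foo" alt="x">', '<img src="http://bar">']

# Neither pattern copes with '>' inside a quoted value or with
# single-quoted src attributes, so both find nothing here.
tricky = """ <img alt=">" src="http://foo" > <img src='http://bar' alt='<' >"""
print(SIMPLE.findall(tricky))      # []
print(WITH_ATTRS.findall(tricky))  # []
```

So these patterns are fine for input you control and know to be in exactly this shape, which is the caveat raised earlier in the thread.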

 <img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">

PHP example:

    $prom = '<img src="http://foo"><img src="http://bar">';
    preg_match_all('|<img src=\"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?\">|', $prom, $matches);
    print_r($matches[0]);

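A slightly crazy/brilliant/odd approach would be to split on >< and then add the two characters back to the respective strings after the split.

    $string = '<img src="http://foo"><img src="http://bar">';
    // explode() is used here because split() was removed in PHP 7
    $KimKardashian = explode("><", $string);
    $First = $KimKardashian[0] . '>';
    $Second = '<' . $KimKardashian[1];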
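A variation on the same idea, sketched in Python: split at the zero-width position between a `>` and the `<` that immediately follows it, so nothing has to be glued back on afterwards. The function name `split_adjacent_tags` is made up for this example, and splitting on an empty match assumes Python 3.7 or later.

```python
import re


def split_adjacent_tags(html):
    """Split html wherever a '>' is immediately followed by a '<'."""
    # The lookbehind and lookahead consume no characters, so both angle
    # brackets stay attached to their own tags.
    return re.split(r'(?<=>)(?=<)', html)


parts = split_adjacent_tags('<img src="http://foo"><img src="http://bar">')
print(parts)  # ['<img src="http://foo">', '<img src="http://bar">']
```

Like the original trick, this only works when the tags sit directly next to each other and no attribute value contains `><`.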
