最好的方式来从string中删除的HTML标记在SQL Server中?

我有SQL Server 2005中包含HTML标记的数据,我想剥去所有的,只留下标签之间的文本。 理想情况下,也可以用&lt; 与<等

有没有一个简单的方法来做到这一点,或有人已经有一些示例SQL代码?

我没有能力添加扩展存储的特效等,所以更喜欢纯SQL方法(最好是一个向后兼容SQL 2000)。 我想检索数据与剥离出来的HTML,而不是更新它,所以理想情况下,它将被写为一个函数,以方便重用。

所以例如转换这个:

<B>Some useful text</B>&nbsp; <A onclick="return openInfo(this)" href="http://there.com/3ce984e88d0531bac5349" target=globalhelp> <IMG title="Source Description" height=15 alt="Source Description" src="/ri/new_info.gif" width=15 align=top border=0> </A>&gt;&nbsp;<b>more text</b></TD></TR> 

对此:

 Some useful text > more text 

有一个UDF将这样描述:

用户定义函数去除HTML

 CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 WHILE @Start > 0 AND @End > 0 AND @Length > 0 BEGIN SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'') SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END GO 

编辑:请注意,这是SQL Server 2005,但如果您将关键字MAX更改为4000,它也将在SQL Server 2000中工作。

如果您的HTML格式正确,我认为这是一个更好的解决scheme:

 create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as begin declare @textXML xml declare @result varchar(max) set @textXML = REPLACE( @text, '&', '' ); with doc(contents) as ( select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk) ) select @result = contents.value('.', 'varchar(max)') from doc return @result end go select dbo.StripHTML('This <i>is</i> an <b>html</b> test') 

这不是一个完整的新解决scheme,而是对afwebservant解决scheme的修正:

 --note comments to see the corrections CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT --DECLARE @TempStr varchar(255) (this is not used) SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 WHILE @Start > 0 AND @End > 0 AND @Length > 0 BEGIN IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>') begin SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'') end -- this ELSE and SET is important ELSE SET @Length = 0; -- minus @Length here below is important SET @Start = CHARINDEX('<',@HTMLText, @End-@Length) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start)) -- instead of -1 it should be +1 SET @Length = (@End - @Start) + 1 END RETURN RTRIM(LTRIM(@HTMLText)) END 

来自@Goner Doug的答案,更新了一些东西:
– 尽可能使用REPLACE
– 预定义实体(如&eacute;转换&eacute; (我select了我需要的:-)
– 列表标签<ul> and <li>一些转换

 ALTER FUNCTION [dbo].[udf_StripHTML] --by Patrick Honorez --- www.idevlop.com --inspired by http://stackoverflow.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602 ( @HTMLText varchar(MAX) ) RETURNS varchar(MAX) AS BEGIN DECLARE @Start int DECLARE @End int DECLARE @Length int set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '<br/>',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '<br />',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '<li>','- ') set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '&rsquo;' collate Latin1_General_CS_AS, '''' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&euro;' collate Latin1_General_CS_AS, '€' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&oelig;' collate Latin1_General_CS_AS, 'oe' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&nbsp;' collate Latin1_General_CS_AS, ' ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&copy;' collate Latin1_General_CS_AS, '©' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&laquo;' collate Latin1_General_CS_AS, '«' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&reg;' collate Latin1_General_CS_AS, '®' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&plusmn;' collate Latin1_General_CS_AS, '±' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&sup2;' collate Latin1_General_CS_AS, '²' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&sup3;' collate Latin1_General_CS_AS, '³' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&micro;' collate Latin1_General_CS_AS, 'µ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&middot;' collate Latin1_General_CS_AS, '·' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ordm;' collate Latin1_General_CS_AS, 'º' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&raquo;' collate Latin1_General_CS_AS, '»' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac14;' collate Latin1_General_CS_AS, '¼' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac12;' collate Latin1_General_CS_AS, '½' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac34;' collate Latin1_General_CS_AS, '¾' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ccedil;' collate Latin1_General_CS_AS, 'Ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Egrave;' collate Latin1_General_CS_AS, 'È' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Eacute;' collate Latin1_General_CS_AS, 'É' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ecirc;' collate Latin1_General_CS_AS, 'Ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ouml;' collate Latin1_General_CS_AS, 'Ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&agrave;' collate Latin1_General_CS_AS, 'à' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&acirc;' collate Latin1_General_CS_AS, 'â' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&auml;' collate Latin1_General_CS_AS, 'ä' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&aelig;' collate Latin1_General_CS_AS, 'æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ccedil;' collate Latin1_General_CS_AS, 'ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&egrave;' collate Latin1_General_CS_AS, 'è' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&eacute;' collate Latin1_General_CS_AS, 'é' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ecirc;' collate Latin1_General_CS_AS, 'ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&euml;' collate Latin1_General_CS_AS, 'ë' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&icirc;' collate Latin1_General_CS_AS, 'î' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ocirc;' collate Latin1_General_CS_AS, 'ô' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ouml;' collate Latin1_General_CS_AS, 'ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&divide;' collate Latin1_General_CS_AS, '÷' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&oslash;' collate Latin1_General_CS_AS, 'ø' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ugrave;' collate Latin1_General_CS_AS, 'ù' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&uacute;' collate Latin1_General_CS_AS, 'ú' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ucirc;' collate Latin1_General_CS_AS, 'û' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&uuml;' collate Latin1_General_CS_AS, 'ü' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&lsaquo;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&rsaquo;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS) -- Remove anything between <STYLE> tags SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 END -- Remove anything between <whatever> tags SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END 

尝试这个。 它是由RedFilter发布的修改版本…此SQL删除除BR,B和P以外的所有附加属性的标签:

 CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @Start INT DECLARE @End INT DECLARE @Length INT DECLARE @TempStr varchar(255) SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) SET @Length = (@End - @Start) + 1 WHILE @Start > 0 AND @End > 0 AND @Length > 0 BEGIN IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B') BEGIN SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'') END SET @Start = CHARINDEX('<',@HTMLText, @End) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start)) SET @Length = (@End - @Start) - 1 END RETURN RTRIM(LTRIM(@HTMLText)) END 

下面是这个函数的更新版本,它将RedFilter的答案(Pinal的原始版本)和LazyCoders添加以及goodeye的input错误更正和我自己的补充结合在一起处理HTML内部的<STYLE>标签。

 ALTER FUNCTION [dbo].[udf_StripHTML] ( @HTMLText varchar(MAX) ) RETURNS varchar(MAX) AS BEGIN DECLARE @Start int DECLARE @End int DECLARE @Length int -- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as -- '&' might be double encoded as '&amp;amp;') SET @Start = CHARINDEX('&amp;', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&') SET @Start = CHARINDEX('&amp;', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 END -- Replace the HTML entity &lt; with the '<' character SET @Start = CHARINDEX('&lt;', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<') SET @Start = CHARINDEX('&lt;', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 END -- Replace the HTML entity &gt; with the '>' character SET @Start = CHARINDEX('&gt;', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>') SET @Start = CHARINDEX('&gt;', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 END -- Replace the HTML entity &amp; with the '&' character SET @Start = CHARINDEX('&amp;amp;', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&') SET @Start = CHARINDEX('&amp;amp;', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 END -- Replace the HTML entity &nbsp; with the ' ' character SET @Start = CHARINDEX('&nbsp;', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ') SET @Start = CHARINDEX('&nbsp;', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 END -- Replace any <br> tags with a newline SET @Start = CHARINDEX('<br>', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10)) SET @Start = CHARINDEX('<br>', @HTMLText) SET @End = @Start + 3 SET @Length = (@End - @Start) + 1 END -- Replace any <br/> tags with a newline SET @Start = CHARINDEX('<br/>', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10)) SET @Start = CHARINDEX('<br/>', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 END -- Replace any <br /> tags with a newline SET @Start = CHARINDEX('<br />', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10)) SET @Start = CHARINDEX('<br />', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 END -- Remove anything between <STYLE> tags SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 END -- Remove anything between <whatever> tags SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END 

如何使用XQuery一个class轮:

 select @xml.query('for $x in //. return ($x)//text()') 

这遍历所有元素并仅返回text()。

要避免元素之间的连接而不使用空格的文本,请使用:

 SELECT @xml.query('for $x in //. return concat((($x)//text())[1]," ")') 

这些非常适合当你想build立seach短语,剥离HTML等

只要注意,这返回typesxml,所以CAST或COVERT在适当的文本。 这种数据types的xml版本是无用的,因为它不是一个合适的XML。