Finding non-ASCII characters in varchar columns using SQL Server

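How can rows with non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column, that would be great.

I am doing something like this now, but it is not working: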

select * from Staging.APARMRE1 as ar where ar.Line like '%[^!-~ ]%' 

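For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

  • The identity field for the record. (This will allow the whole record to be reviewed with another query.)
  • The column name
  • The text with the invalid character

    Id  | FieldName | InvalidText       |
    ----+-----------+-------------------+
    25  | LastName  | Solís             |
    56  | FirstName | François          |
    100 | Address1  | 123 Ümlaut street |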

Invalid characters are any outside the range SPACE (decimal 32) through ~ (decimal 126).

Try something like this:

    DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20))
    INSERT @YourTable VALUES (1, 'ok','ok','ok')
    INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok')
    INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok')
    INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD')
    INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD')
    INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182))

    --if you have a Numbers table use that, otherwise make one using a CTE
    ;WITH AllNumbers AS
    (
        SELECT 1 AS Number
        UNION ALL
        SELECT Number+1
            FROM AllNumbers
            WHERE Number<1000
    )
    SELECT
        pk, 'Col1' BadValueColumn, CONVERT(varchar(20),col1) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
        WHERE ASCII(SUBSTRING(y.col1, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col1, n.Number, 1))>127
    UNION
    SELECT
        pk, 'Col2' BadValueColumn, CONVERT(varchar(20),col2) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
        WHERE ASCII(SUBSTRING(y.col2, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col2, n.Number, 1))>127
    UNION
    SELECT
        pk, 'Col3' BadValueColumn, CONVERT(varchar(20),col3) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
        WHERE ASCII(SUBSTRING(y.col3, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col3, n.Number, 1))>127
    order by 1
    OPTION (MAXRECURSION 1000)

OUTPUT:

    pk          BadValueColumn BadValue
    ----------- -------------- --------------------
    2           Col1           BA¶D
    3           Col2           ¶BAD
    4           Col3           B¶AD
    5           Col1           ¶BAD
    5           Col3           ¶BAD
    6           Col1           BAD¶
    6           Col2           B¶AD
    6           Col3           BAD¶¶¶

    (8 row(s) affected)

Here is a solution for a single-column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.

    select line,
      patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) as [Position],
      substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1) as [InvalidCharacter],
      ascii(substring(line,patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line),1)) as [ASCIICode]
    from staging.APARMRE1
    where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,Line) >0
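
The binary collation matters here: without it, a range such as `!-~` is evaluated using the sort rules of the column's collation rather than raw code points, so it does not necessarily correspond to ASCII 33 to 126. A minimal sketch of the same test written with LIKE is below. The table variable `@Sample` and its data are hypothetical, and the sketch assumes a single-byte code page such as 1252 so that `CHAR(233)` yields an accented letter.

```sql
-- Hypothetical sample data; @Sample and its columns are illustrative only.
DECLARE @Sample TABLE (Id int, Line varchar(50));
INSERT @Sample VALUES (1, 'plain ascii');
INSERT @Sample VALUES (2, 'tab' + CHAR(9) + 'inside');   -- control character below 32
INSERT @Sample VALUES (3, 'high' + CHAR(233) + 'byte');  -- assumed e-acute in code page 1252

-- Force a binary collation so the set "space, ! through ~" is compared by code point.
-- Any row containing a character outside that set matches.
SELECT Id, Line
FROM @Sample
WHERE Line COLLATE Latin1_General_BIN LIKE '%[^ !-~]%';
```

Under those assumptions the query should flag the rows with the tab and the high-byte character and skip the plain one.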

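This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127. Then it searches for rows that do not match the list: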

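    -- 96 characters, each preceded by the escape character '|', need 192 bytes;
    -- varchar(128) would silently truncate the list
    declare @str varchar(255)
    declare @i int
    set @str = ''
    set @i = 32
    while @i <= 127
        begin
        set @str = @str + '|' + char(@i)
        set @i = @i + 1
        end

    select  col1
    from    YourTable
    where   col1 like '%[^' + @str + ']%' escape '|'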

I have been running this bit of code with success:

    declare @UnicodeData table (
         data nvarchar(500)
    )
    insert into
        @UnicodeData
    values
        (N'Horse ')
        ,(N'Dog')
        ,(N'Cat')

    select
        data
    from
        @UnicodeData
    where
        data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))

This works well for known columns.

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

    declare
        @sql    varchar(max)    = ''
        ,@table sysname         = 'mytable' -- enter your table here

    ;with ColumnData as (
        select
            RowId               = row_number() over (order by c.COLUMN_NAME)
            ,c.COLUMN_NAME
            ,ColumnName         = '[' + c.COLUMN_NAME + ']'
            ,TableName          = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']'
        from
            INFORMATION_SCHEMA.COLUMNS c
        where
            c.DATA_TYPE         = 'nvarchar'
            and c.TABLE_NAME    = @table
    )
    select
        @sql = @sql + 'select FieldName = ''' + c.ColumnName + ''', InvalidCharacter = [' + c.COLUMN_NAME + '] from ' + c.TableName + ' where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) ' + case when c.RowId <> (select max(RowId) from ColumnData) then ' union all ' else '' end + char(13)
    from
        ColumnData c

    -- check
    -- print @sql
    exec (@sql)

I'm not a fan of dynamic SQL, but it does have its uses for exploratory queries like this.

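There is a user-defined function available on the web, 'Parse Alphanumeric'. Google "UDF parse alphanumeric" and you should find the code for it. This user-defined function removes all characters that do not fall within 0-9, a-z and A-Z.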

    -- scalar UDFs must be called with a schema prefix; dbo is assumed here
    Select * from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name) <> ar.last_name

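That should bring back any records that have a last name with invalid characters. Your bonus-points question is a bit more of a challenge, but I think a case statement could handle it. This is a bit of pseudocode; I'm not entirely sure it would work.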

    -- dbo schema assumed for the UDF; each column is compared against itself
    Select id,
      case
        when dbo.udf_parsealpha(ar.last_name)  <> ar.last_name  then 'last name'
        when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then 'first name'
        when dbo.udf_parsealpha(ar.Address1)   <> ar.Address1   then 'Address1'
      end as FieldName,
      case
        when dbo.udf_parsealpha(ar.last_name)  <> ar.last_name  then ar.last_name
        when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then ar.first_name
        when dbo.udf_parsealpha(ar.Address1)   <> ar.Address1   then ar.Address1
      end as InvalidText
    from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name)  <> ar.last_name
       or dbo.udf_parsealpha(ar.first_name) <> ar.first_name
       or dbo.udf_parsealpha(ar.Address1)   <> ar.Address1

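I wrote this in the forum text box, so I'm not quite sure it will work as is, but it should be close. I'm also not sure how it will behave if a single record has two fields with invalid characters; since a case expression returns its first matching branch, only the first such field would be reported.

As an alternative, you should be able to change the from clause from a single table to a subquery that looks something like this: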

    select id, fieldname, value
    from (
        Select id, 'last_name' as fieldname, last_name as value
        from Staging.APARMRE1 ar
        Union
        Select id, 'first_name' as fieldname, first_name as value
        from Staging.APARMRE1 ar
        ---(and repeat unions for each field)
    ) as fields  -- T-SQL requires an alias on a derived table
    where dbo.udf_parsealpha(value) <> value

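The benefit here is that for every column you only need to extend the union statement, whereas the case-statement version of the script needs three comparisons for every column.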
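
The same unpivot idea can be sketched without the UDF, using `CROSS APPLY (VALUES ...)` together with the binary-collation pattern from the PATINDEX answer. This is only an illustration under stated assumptions: the table variable `@People`, its columns and its data are hypothetical, it needs SQL Server 2008 or later, and it assumes a single-byte code page such as 1252 for the `CHAR()` values.

```sql
-- Hypothetical sample table; names and data are illustrative, not from the original post.
DECLARE @People TABLE (Id int, FirstName varchar(30), LastName varchar(30), Address1 varchar(60));
INSERT @People VALUES (25, 'Luis', 'Sol' + CHAR(237) + 's', '1 Main St');
INSERT @People VALUES (56, 'Fran' + CHAR(231) + 'ois', 'Martin', '2 Main St');
INSERT @People VALUES (77, 'Ann', 'Lee', '3 Main St');

-- Turn each varchar column into a (FieldName, InvalidText) pair per row,
-- then keep only the values containing a character outside space..tilde.
SELECT p.Id, v.FieldName, v.InvalidText
FROM @People AS p
CROSS APPLY (VALUES ('FirstName', p.FirstName),
                    ('LastName',  p.LastName),
                    ('Address1',  p.Address1)) AS v (FieldName, InvalidText)
WHERE v.InvalidText COLLATE Latin1_General_BIN LIKE '%[^ !-~]%'
ORDER BY p.Id;
```

Adding a column means adding one line to the VALUES list, and a record with two offending fields produces two output rows rather than one.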

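I ran the various solutions on some real-world data: 12M rows, varchar length ~30, around 9k dodgy rows, no full-text indexing in play. The patindex solution is the fastest, and it also selects the most rows.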

(Pre-ran km to set the cache to a known state, ran the 3 processes, and finally ran km again; the last 2 runs of km gave times within 2 seconds of each other.)

The patindex solution by Gerhard Weiss: run time 0:38, returns 9144 rows

 select dodgyColumn from myTable fcc WHERE patindex('%[^ !-~]%' COLLATE Latin1_General_BIN,dodgyColumn ) >0 

The substring solution by MT: run time 1:16, returns 8996 rows

    select dodgyColumn
    from myTable fcc
    INNER JOIN dbo.Numbers32k dn ON dn.number<(len(fcc.dodgyColumn ))
    WHERE ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))<32
        OR ASCII(SUBSTRING(fcc.dodgyColumn , dn.Number, 1))>127

The udf solution by Deon Robertson: run time 3:47, returns 7316 rows

 select dodgyColumn from myTable where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn , 1) = 1 

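Here is a UDF I built to detect columns with extended ASCII characters. It is quick, and you can extend the character set you want to check. The second parameter lets you switch between flagging anything outside the standard character set and allowing an extended set: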

    create function [dbo].[udf_ContainsNonASCIIChars]
    (
        @string nvarchar(4000),
        @checkExtendedCharset bit
    )
    returns bit
    as
    begin

        declare @pos int = 1;   -- substring() positions are 1-based
        declare @char varchar(1);
        declare @return bit = 0;

        -- use <= so the last character is checked as well
        while @pos <= len(@string)
        begin
            select @char = substring(@string, @pos, 1)
            if ascii(@char) < 32 or ascii(@char) > 126
                begin
                    if @checkExtendedCharset = 1
                        begin
                            if ascii(@char) not in (9,124,130,138,142,146,150,154,158,160,170,176,180,181,183,184,185,186,192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,209,210,211,212,213,214,216,217,218,219,220,221,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,248,249,250,251,252,253,254,255)
                                begin
                                    -- invalid character found: set the flag and force the loop to end
                                    select @return = 1;
                                    select @pos = (len(@string) + 1)
                                end
                            else
                                begin
                                    select @pos = @pos + 1
                                end
                        end
                    else
                        begin
                            select @return = 1;
                            select @pos = (len(@string) + 1)
                        end
                end
            else
                begin
                    select @pos = @pos + 1
                end
        end

        return @return;

    end

Usage:

    select Address1
    from PropertyFile_English
    where dbo.udf_ContainsNonASCIIChars(Address1, 1) = 1

To find which field has invalid characters:

 SELECT * FROM Staging.APARMRE1 FOR XML AUTO, TYPE 

You can test it with this query:

 SELECT top 1 'char 31: '+char(31)+' (hex 0x1F)' field from sysobjects FOR XML AUTO, TYPE 

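The result will be:

Msg 6841, Level 16, State 1, Line 3 FOR XML could not serialize the data for node 'field' because it contains a character (0x001F) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

This is very useful when you write XML files and get an invalid-character error when validating them.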