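How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?

Is there any easy/general way to clean an XML-based data source before using it in an XmlReader, so that I can gracefully consume XML data that does not conform to the hexadecimal character restrictions placed on XML?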

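Note:

The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding in the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
The removal of invalid hexadecimal characters should only remove hexadecimal-encoded values, because the data often contains href values with a string that happens to be a string match for a hexadecimal character.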

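Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but I want to be able to consume published data sources that contain hexadecimal characters that are invalid per the XML specification.

In .NET, if you have a Stream that represents the XML data source and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised because the XML data contains invalid hexadecimal characters. My current attempt at resolving this is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.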

It may not be perfect (emphasis added since people were missing this disclaimer), but this is what I did in that case. You can adapt it to work with a stream.

    // Requires: using System.Text; using System.Xml;

    /// <summary>
    /// Removes control characters and other non-UTF-8 characters
    /// </summary>
    /// <param name="inString">The string to process</param>
    /// <returns>A string with no control characters or entities above 0x00FD</returns>
    public static string RemoveTroublesomeCharacters(string inString)
    {
        if (inString == null) return null;

        StringBuilder newString = new StringBuilder();
        char ch;

        for (int i = 0; i < inString.Length; i++)
        {
            ch = inString[i];
            // remove any characters outside the valid UTF-8 range as well as all control characters
            // except tabs and new lines
            //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
            //if using .NET version prior to 4, use above logic
            if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
            {
                newString.Append(ch);
            }
        }
        return newString.ToString();
    }

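I like Eugene's whitelist concept. I needed to do something similar to the original poster, but I needed to support all Unicode characters, not just those up to 0x00FD. The XML spec is:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]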

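In .NET, the internal representation of Unicode characters is only 16 bits, so we cannot explicitly "allow" 0x10000-0x10FFFF. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However, if we allowed these surrogate code points in our whitelist, UTF-8 encoding our string might still produce valid XML in the end, as long as proper UTF-8 was produced from the surrogate pairs of UTF-16 characters in the .NET string. I have not explored this, so I went with the safer bet and did not allow the surrogates in my whitelist.

The comments in Eugene's solution are misleading, though: the problem is that the characters we are excluding are not valid in XML. They are perfectly valid Unicode code points. We are not removing "non-UTF-8 characters". We are removing UTF-8 characters that may not appear in well-formed XML documents.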

    public static string XmlCharacterWhitelist( string in_string )
    {
        if( in_string == null ) return null;

        StringBuilder sbOutput = new StringBuilder();
        char ch;

        for( int i = 0; i < in_string.Length; i++ )
        {
            ch = in_string[i];
            if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
                ( ch >= 0xE000 && ch <= 0xFFFD ) ||
                ch == 0x0009 ||
                ch == 0x000A ||
                ch == 0x000D )
            {
                sbOutput.Append( ch );
            }
        }
        return sbOutput.ToString();
    }
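The surrogate question raised in this answer can be explored with a small variation of the whitelist. The following is only a sketch, not the answerer's method: `XmlSanitizer` and `KeepValidXmlChars` are hypothetical names, and the assumption is that a high surrogate immediately followed by a low surrogate represents a code point in 0x10000-0x10FFFF and may therefore be kept, while lone surrogates are dropped.

```csharp
using System;
using System.Text;

public static class XmlSanitizer
{
    // Hypothetical helper: keeps the characters of the XML 1.0 Char production,
    // including supplementary characters stored as UTF-16 surrogate pairs.
    public static string KeepValidXmlChars(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (char.IsHighSurrogate(ch) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed pair encodes a code point in 0x10000-0x10FFFF.
                sb.Append(ch).Append(input[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if ((ch >= 0x0020 && ch <= 0xD7FF) || (ch >= 0xE000 && ch <= 0xFFFD)
                     || ch == 0x0009 || ch == 0x000A || ch == 0x000D)
            {
                sb.Append(ch);
            }
            // Everything else (control characters, lone surrogates, 0xFFFE, 0xFFFF) is dropped.
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // One control character, one valid surrogate pair, one lone high surrogate.
        string sample = "a\u0001b\uD83D\uDE00c\uD800d";
        string cleaned = KeepValidXmlChars(sample);
        Console.WriteLine(cleaned.Length); // 6: 'a', 'b', the pair (2 chars), 'c', 'd'
    }
}
```

Whether the resulting string round-trips correctly still depends on the encoder used when the XML is written out, so treat this as a starting point for testing rather than a verified fix.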

As a way to remove invalid XML characters, I suggest you use the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:

    void Main()
    {
        string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    static string RemoveInvalidXmlChars(string text)
    {
        char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return new string(validXmlChars);
    }

    static bool IsValidXmlString(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch
        {
            return false;
        }
    }

A DRY implementation of this answer's solution (using a different constructor; feel free to use whichever one your application needs):

    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            this._replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = base.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                char ch = buffer[i];
                if (IsInvalidChar(ch))
                {
                    buffer[i] = this._replacementCharacter;
                }
            }
            return readCount;
        }

        private static bool IsInvalidChar(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                   ch != 0x0009 &&
                   ch != 0x000A &&
                   ch != 0x000D;
        }
    }

Modernising dnewcombe's answer, you could take a slightly simpler approach:

    public static string RemoveInvalidXmlChars(string input)
    {
        var isValid = new Predicate<char>(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D);

        return new string(Array.FindAll(input.ToCharArray(), isValid));
    }

or, with Linq:

    public static string RemoveInvalidXmlChars(string input)
    {
        return new string(input.Where(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D).ToArray());
    }

I'd be interested to know how the performance of these methods compares, and how they all compare to a blacklist approach using Buffer.BlockCopy.
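One rough way to answer that question is a Stopwatch-based timing harness. The sketch below is an assumption-laden starting point, not a rigorous benchmark: `SanitizerTiming`, the synthetic input and the iteration count are all made up for illustration, it covers only the whitelist variants (not a Buffer.BlockCopy blacklist), and no results are claimed here.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;

public static class SanitizerTiming
{
    // Same whitelist test as the answers above.
    public static bool IsValid(char c)
    {
        return (c >= 0x0020 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD)
            || c == 0x0009 || c == 0x000A || c == 0x000D;
    }

    public static string WithLoop(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (IsValid(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    public static string WithFindAll(string s)
    {
        return new string(Array.FindAll(s.ToCharArray(), c => IsValid(c)));
    }

    public static string WithLinq(string s)
    {
        return new string(s.Where(c => IsValid(c)).ToArray());
    }

    static void Time(string name, Func<string, string> sanitize, string input, int iterations)
    {
        sanitize(input); // warm-up call so JIT compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) sanitize(input);
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", name, sw.ElapsedMilliseconds);
    }

    public static void Main()
    {
        // Synthetic input (an assumption): mostly valid text with a few control characters.
        string input = string.Concat(
            Enumerable.Repeat("<item>text\u0001more\u000Btext</item>\n", 10000));

        Time("StringBuilder loop", WithLoop, input, 100);
        Time("Array.FindAll", WithFindAll, input, 100);
        Time("LINQ Where", WithLinq, input, 100);
    }
}
```

For numbers worth publishing, a dedicated benchmarking library and input that resembles your real feeds would be more trustworthy than this harness.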

A regex-based approach:

    public static string StripInvalidXmlCharacters(string str)
    {
        // Matches characters outside the XML 1.0 whitelist, plus unpaired surrogates
        var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
        return invalidXmlCharactersRegex.Replace(str, "");
    }

See my blog post for more details.

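Here is dnewcome's answer in a custom StreamReader. It simply wraps a real stream reader and replaces the characters as they are read.

I only implemented a few of the methods to save myself time. I used this together with XDocument.Load and a file stream, and only the Read(char[] buffer, int index, int count) method was called, so it worked like this. You may need to implement additional methods to get it to work for your application. I used this approach because it seems more efficient than the other answers. I also implemented only one of the constructors; you could obviously implement any of the StreamReader constructors you need, since it is just a pass-through.

I chose to replace the characters rather than remove them because it greatly simplifies the solution. This way the length of the text stays the same, so there is no need to keep track of a separate index.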

    // Requires: using System; using System.IO; using System.Runtime.Remoting; using System.Threading.Tasks;
    // (CreateObjRef and InitializeLifetimeService come from MarshalByRefObject on .NET Framework.)
    public class InvalidXmlCharacterReplacingStreamReader : TextReader
    {
        private StreamReader implementingStreamReader;
        private char replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
        {
            implementingStreamReader = new StreamReader(stream);
            this.replacementCharacter = replacementCharacter;
        }

        public override void Close()
        {
            implementingStreamReader.Close();
        }

        public override ObjRef CreateObjRef(Type requestedType)
        {
            return implementingStreamReader.CreateObjRef(requestedType);
        }

        // Override Dispose(bool) rather than hiding Dispose(), so that a using block
        // on a TextReader reference also disposes the wrapped reader.
        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                implementingStreamReader.Dispose();
            }
            base.Dispose(disposing);
        }

        public override bool Equals(object obj)
        {
            return implementingStreamReader.Equals(obj);
        }

        public override int GetHashCode()
        {
            return implementingStreamReader.GetHashCode();
        }

        public override object InitializeLifetimeService()
        {
            return implementingStreamReader.InitializeLifetimeService();
        }

        public override int Peek()
        {
            int ch = implementingStreamReader.Peek();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }

        public override int Read()
        {
            int ch = implementingStreamReader.Read();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = implementingStreamReader.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                char ch = buffer[i];
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    buffer[i] = replacementCharacter;
                }
            }
            return readCount;
        }

        public override Task<int> ReadAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override int ReadBlock(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override Task<int> ReadBlockAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }

        public override string ReadLine()
        {
            throw new NotImplementedException();
        }

        public override Task<string> ReadLineAsync()
        {
            throw new NotImplementedException();
        }

        public override string ReadToEnd()
        {
            throw new NotImplementedException();
        }

        public override Task<string> ReadToEndAsync()
        {
            throw new NotImplementedException();
        }

        public override string ToString()
        {
            return implementingStreamReader.ToString();
        }
    }
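To show how such a reader is meant to be used with XDocument.Load, here is a compact, self-contained sketch. `ReplacingTextReader` and `ReplacingReaderDemo` are hypothetical names introduced only for this example; the class is a trimmed-down stand-in for the wrapper above (same whitelist, same replace-instead-of-remove choice), and the in-memory stream stands in for a real file or network stream.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical minimal wrapper: replaces characters that XML 1.0 does not allow.
public sealed class ReplacingTextReader : TextReader
{
    private readonly TextReader _inner;
    private readonly char _replacement;

    public ReplacingTextReader(TextReader inner, char replacement)
    {
        _inner = inner;
        _replacement = replacement;
    }

    private static bool IsInvalid(int ch)
    {
        return (ch < 0x0020 || ch > 0xD7FF) && (ch < 0xE000 || ch > 0xFFFD)
            && ch != 0x0009 && ch != 0x000A && ch != 0x000D;
    }

    // Passes -1 (end of input) and valid characters through unchanged.
    private int Fix(int ch)
    {
        return (ch != -1 && IsInvalid(ch)) ? _replacement : ch;
    }

    public override int Peek() { return Fix(_inner.Peek()); }

    public override int Read() { return Fix(_inner.Read()); }

    public override int Read(char[] buffer, int index, int count)
    {
        int n = _inner.Read(buffer, index, count);
        for (int i = index; i < index + n; i++)
        {
            buffer[i] = (char)Fix(buffer[i]);
        }
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class ReplacingReaderDemo
{
    public static string LoadRootValue(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = new ReplacingTextReader(new StreamReader(stream), ' '))
        {
            // XDocument.Load would throw on the raw control character without the wrapper.
            XDocument doc = XDocument.Load(reader);
            return doc.Root.Value;
        }
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("<root>bad\u0001char</root>");
        Console.WriteLine(LoadRootValue(bytes)); // bad char
    }
}
```

Note that a StreamReader decodes bytes using its own encoding detection (UTF-8 unless a byte order mark says otherwise), not the encoding named in the XML declaration, so feeds in other encodings need the matching Encoding passed to the StreamReader constructor.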

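The solutions above seem to be for removing invalid characters before converting to XML.

Use this code to remove invalid XML character references from an XML string, e.g. &#x1A;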

    public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
    {
        string pattern = String.Empty;
        switch( XMLVersion )
        {
            case "1.0":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
                break;
            case "1.1":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
                break;
            default:
                throw new Exception( "Error: Invalid XML Version!" );
        }

        Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
        if( regex.IsMatch( Xml ) )
            Xml = regex.Replace( Xml, String.Empty );
        return Xml;
    }

http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
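For numeric character references such as &#x1A; there is also a reader-side option that this thread does not cover: the XmlReaderSettings.CheckCharacters property. As I understand its documentation, setting it to false turns off character checking for character entity references only, so raw invalid characters in the text would still be rejected. The sketch below rests on that assumption; `CheckCharactersDemo` and the sample XML are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CheckCharactersDemo
{
    public static string ReadLenient(string xml)
    {
        // Relax validation of character references such as &#x1A;.
        var settings = new XmlReaderSettings { CheckCharacters = false };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();                     // position on the root element
            return reader.ReadElementContentAsString(); // text content of that element
        }
    }

    public static void Main()
    {
        string value = ReadLenient("<item>before&#x1A;after</item>");
        Console.WriteLine(value.Length);
    }
}
```

This keeps the offending character in the parsed value instead of stripping it, so downstream code still has to decide what to do with it.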

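Use this function to remove invalid XML characters.

    public static string CleanInvalidXmlChars(string text)
    {
        // \x in a .NET regex takes exactly two hex digits, so the ranges use \uXXXX.
        // Code points 0x10000-0x10FFFF are stored as surrogate pairs: well-formed
        // pairs are kept, lone surrogates are removed.
        string re = @"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\uD800-\uDFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }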

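A modified version of the answer above by Neolisk (or of the original answer it is based on).
Changes: if the \0 character is passed as the replacement, invalid characters are removed instead of replaced. It also uses the XmlConvert.IsXmlChar(char) method.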

    /// <summary>
    /// Replaces invalid Xml characters from input file, NOTE: if replacement character is \0,
    /// then invalid Xml character is removed, instead of 1-for-1 replacement
    /// </summary>
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            while (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' != _replacementCharacter)
                    return _replacementCharacter;
                // Peek does not consume input, so discard the invalid character
                // before peeking at the next one (calling Peek() again would recurse forever).
                base.Read();
                ch = base.Peek();
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Read(); // read next one
                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = 0, ch;
            for (int i = 0; i < count && (ch = Read()) != -1; i++)
            {
                readCount++;
                buffer[index + i] = (char)ch;
            }
            return readCount;
        }

        private static bool IsInvalidChar(int ch)
        {
            return !XmlConvert.IsXmlChar((char)ch);
        }
    }

    private static String removeNonUtf8CompliantCharacters( final String inString ) {
        if (null == inString ) return null;
        byte[] byteArr = inString.getBytes();
        for ( int i=0; i < byteArr.length; i++ ) {
            // Java bytes are signed; mask so the value is compared as 0..255
            int ch = byteArr[i] & 0xFF;
            // remove any characters outside the valid UTF-8 range as well as all control characters
            // except tabs and new lines
            if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
                byteArr[i]=' ';
            }
        }
        return new String( byteArr );
    }

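You can pass non-UTF characters through as follows:

    // UTFCHAR is the input string to process
    string sFinalString = "";
    foreach (char ch in UTFCHAR)
    {
        int tmp = ch;
        if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            sFinalString += ch;
        }
        else
        {
            // emit everything else as a decimal numeric character reference
            sFinalString += "&#" + tmp + ";";
        }
    }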

Try this for PHP!

    $goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);
