Objective-C / Cocoa Touch中的HTML字符解码

首先,我发现这个: Objective C HTML escape / unescape ,但它不适用于我。

我的编码字符(来自RSS源,顺便说一句)看起来像这样: &

我搜索了整个网络,发现相关的讨论,但没有解决我的特定编码,我认为他们被称为十六进制字符。

这些被称为字符实体引用 。 当他们采取&#<number>;的形式&#<number>; 他们被称为数字实体引用 。 基本上,它是应该被替换的字节的字符串表示。 在&#038; ,它代表ISO-8859-1字符编码方案中值为38的字符,即&

&符号必须在RSS中编码的原因是它是一个保留的特殊字符。

你需要做的是解析字符串,并用一个匹配&#;之间的值的字节替换实体; 。 我不知道有什么好的方法来做到这一点在客观的C,但这个堆栈溢出的问题可能会有所帮助。

编辑:自从两年前回答这个问题以后,有一些很好的解决方案。 请参阅下面的@Michael Waterfall的答案。

查看我的NSString类别的HTML 。 以下是可用的方法:

 - (NSString *)stringByConvertingHTMLToPlainText; - (NSString *)stringByDecodingHTMLEntities; - (NSString *)stringByEncodingHTMLEntities; - (NSString *)stringWithNewLinesAsBRs; - (NSString *)stringByRemovingNewLinesAndWhitespace; 

Daniel的那个基本上是非常好的,我在那里解决了一些问题:

  1. 删除了NSSCanner的跳过字符(否则两个连续实体之间的空格将被忽略

    [扫描仪setCharactersToBeSkipped:无];

  2. 修复了当有孤立的'&'符号解析(我不知道什么是“正确的”输出,我只是比较它与Firefox):

例如

  &#ABC DF & B&#39; & C&#39; Items (288) 

这里是修改后的代码:

 - (NSString *)stringByDecodingXMLEntities { NSUInteger myLength = [self length]; NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location; // Short-circuit if there are no ampersands. if (ampIndex == NSNotFound) { return self; } // Make result string with some extra capacity. NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)]; // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner. NSScanner *scanner = [NSScanner scannerWithString:self]; [scanner setCharactersToBeSkipped:nil]; NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"]; do { // Scan up to the next entity or the end of the string. NSString *nonEntityString; if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) { [result appendString:nonEntityString]; } if ([scanner isAtEnd]) { goto finish; } // Scan either a HTML or numeric character entity reference. if ([scanner scanString:@"&amp;" intoString:NULL]) [result appendString:@"&"]; else if ([scanner scanString:@"&apos;" intoString:NULL]) [result appendString:@"'"]; else if ([scanner scanString:@"&quot;" intoString:NULL]) [result appendString:@"\""]; else if ([scanner scanString:@"&lt;" intoString:NULL]) [result appendString:@"<"]; else if ([scanner scanString:@"&gt;" intoString:NULL]) [result appendString:@">"]; else if ([scanner scanString:@"&#" intoString:NULL]) { BOOL gotNumber; unsigned charCode; NSString *xForHex = @""; // Is it hex or decimal? if ([scanner scanString:@"x" intoString:&xForHex]) { gotNumber = [scanner scanHexInt:&charCode]; } else { gotNumber = [scanner scanInt:(int*)&charCode]; } if (gotNumber) { [result appendFormat:@"%C", (unichar)charCode]; [scanner scanString:@";" intoString:NULL]; } else { NSString *unknownEntity = @""; [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity]; [result appendFormat:@"&#%@%@", xForHex, unknownEntity]; //[scanner scanUpToString:@";" intoString:&unknownEntity]; //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity]; NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity); } } else { NSString *amp; [scanner scanString:@"&" intoString:&amp]; //an isolated & symbol [result appendString:amp]; /* NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; NSString *semicolon = @""; [scanner scanString:@";" intoString:&semicolon]; [result appendFormat:@"%@%@", unknownEntity, semicolon]; NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon); */ } } while (![scanner isAtEnd]); finish: return result; } 

从iOS 7开始,可以使用带有NSHTMLTextDocumentType属性的NSAttributedString本地解码HTML字符:

 NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;"; NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding]; NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType}; NSAttributedString *decodedString; decodedString = [[NSAttributedString alloc] initWithData:stringData options:options documentAttributes:NULL error:NULL]; 

解码的属性字符串现在将显示为:&&<>™©♥♣♠♦。

注意:这只有在主线程上调用才会起作用。

似乎没有人提到最简单的选择之一: Google Toolbox for Mac
(尽管名字,这也适用于iOS。)

https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h

 /// Get a string where internal characters that are escaped for HTML are unescaped // /// For example, '&amp;' becomes '&' /// Handles &#32; and &#x32; cases as well /// // Returns: // Autoreleased NSString // - (NSString *)gtm_stringByUnescapingFromHTML; 

而且我只能在项目中包含三个文件:头文件,实现文件和GTMDefines.h

我应该在GitHub上发布这个东西。 这是一个NSString的类别,使用NSScanner来实现,并处理十六进制和十进制数字字符实体以及通常的符号。

此外,它处理格式错误的字符串(当你有一个无效的字符序列),相对优雅,这是我发布的应用程序 ,使用此代码是至关重要的。

 - (NSString *)stringByDecodingXMLEntities { NSUInteger myLength = [self length]; NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location; // Short-circuit if there are no ampersands. if (ampIndex == NSNotFound) { return self; } // Make result string with some extra capacity. NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)]; // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner. NSScanner *scanner = [NSScanner scannerWithString:self]; do { // Scan up to the next entity or the end of the string. NSString *nonEntityString; if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) { [result appendString:nonEntityString]; } if ([scanner isAtEnd]) { goto finish; } // Scan either a HTML or numeric character entity reference. if ([scanner scanString:@"&amp;" intoString:NULL]) [result appendString:@"&"]; else if ([scanner scanString:@"&apos;" intoString:NULL]) [result appendString:@"'"]; else if ([scanner scanString:@"&quot;" intoString:NULL]) [result appendString:@"\""]; else if ([scanner scanString:@"&lt;" intoString:NULL]) [result appendString:@"<"]; else if ([scanner scanString:@"&gt;" intoString:NULL]) [result appendString:@">"]; else if ([scanner scanString:@"&#" intoString:NULL]) { BOOL gotNumber; unsigned charCode; NSString *xForHex = @""; // Is it hex or decimal? if ([scanner scanString:@"x" intoString:&xForHex]) { gotNumber = [scanner scanHexInt:&charCode]; } else { gotNumber = [scanner scanInt:(int*)&charCode]; } if (gotNumber) { [result appendFormat:@"%C", charCode]; } else { NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; [result appendFormat:@"&#%@%@;", xForHex, unknownEntity]; NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity); } [scanner scanString:@";" intoString:NULL]; } else { NSString *unknownEntity = @""; [scanner scanUpToString:@";" intoString:&unknownEntity]; NSString *semicolon = @""; [scanner scanString:@";" intoString:&semicolon]; [result appendFormat:@"%@%@", unknownEntity, semicolon]; NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon); } } while (![scanner isAtEnd]); finish: return result; } 

这是我使用RegexKitLite框架的方式:

 -(NSString*) decodeHtmlUnicodeCharacters: (NSString*) html { NSString* result = [html copy]; NSArray* matches = [result arrayOfCaptureComponentsMatchedByRegex: @"\\&#([\\d]+);"]; if (![matches count]) return result; for (int i=0; i<[matches count]; i++) { NSArray* array = [matches objectAtIndex: i]; NSString* charCode = [array objectAtIndex: 1]; int code = [charCode intValue]; NSString* character = [NSString stringWithFormat:@"%C", code]; result = [result stringByReplacingOccurrencesOfString: [array objectAtIndex: 0] withString: character]; } return result; 

}

希望这会帮助别人。

你可以使用这个函数来解决这个问题。

 + (NSString*) decodeHtmlUnicodeCharactersToString:(NSString*)str { NSMutableString* string = [[NSMutableString alloc] initWithString:str]; // #&39; replace with ' NSString* unicodeStr = nil; NSString* replaceStr = nil; int counter = -1; for(int i = 0; i < [string length]; ++i) { unichar char1 = [string characterAtIndex:i]; for (int k = i + 1; k < [string length] - 1; ++k) { unichar char2 = [string characterAtIndex:k]; if (char1 == '&' && char2 == '#' ) { ++counter; unicodeStr = [string substringWithRange:NSMakeRange(i + 2 , 2)]; // read integer value ie, 39 replaceStr = [string substringWithRange:NSMakeRange (i, 5)]; // #&39; [string replaceCharactersInRange: [string rangeOfString:replaceStr] withString:[NSString stringWithFormat:@"%c",[unicodeStr intValue]]]; break; } } } [string autorelease]; if (counter > 1) return [self decodeHtmlUnicodeCharactersToString:string]; else return string; } 

其实,Michael Waterfall的MWFeedParser框架(参考他的回答)已经被rmchaara分享了,他已经用ARC支持更新了它。

你可以在这里找到Github

它真的很好,我用stringByDecodingHTMLEntities方法,并完美地工作。

这是一个Swiet版本的Walty Yeung的回答 :

 extension String { static private let mappings = ["&quot;" : "\"","&amp;" : "&", "&lt;" : "<", "&gt;" : ">","&nbsp;" : " ","&iexcl;" : "¡","&cent;" : "¢","&pound;" : " £","&curren;" : "¤","&yen;" : "¥","&brvbar;" : "¦","&sect;" : "§","&uml;" : "¨","&copy;" : "©","&ordf;" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","&sup2; " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","&Ntilde;" : "Ñ","&ntilde;":"ñ",] func stringByDecodingXMLEntities() -> String { guard let _ = self.rangeOfString("&", options: [.LiteralSearch]) else { return self } var result = "" let scanner = NSScanner(string: self) scanner.charactersToBeSkipped = nil let boundaryCharacterSet = NSCharacterSet(charactersInString: " \t\n\r;") repeat { var nonEntityString: NSString? = nil if scanner.scanUpToString("&", intoString: &nonEntityString) { if let s = nonEntityString as? String { result.appendContentsOf(s) } } if scanner.atEnd { break } var didBreak = false for (k,v) in String.mappings { if scanner.scanString(k, intoString: nil) { result.appendContentsOf(v) didBreak = true break } } if !didBreak { if scanner.scanString("&#", intoString: nil) { var gotNumber = false var charCodeUInt: UInt32 = 0 var charCodeInt: Int32 = -1 var xForHex: NSString? = nil if scanner.scanString("x", intoString: &xForHex) { gotNumber = scanner.scanHexInt(&charCodeUInt) } else { gotNumber = scanner.scanInt(&charCodeInt) } if gotNumber { let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt) result.appendContentsOf(newChar) scanner.scanString(";", intoString: nil) } else { var unknownEntity: NSString? = nil scanner.scanUpToCharactersFromSet(boundaryCharacterSet, intoString: &unknownEntity) let h = xForHex ?? "" let u = unknownEntity ?? "" result.appendContentsOf("&#\(h)\(u)") } } else { scanner.scanString("&", intoString: nil) result.appendContentsOf("&") } } } while (!scanner.atEnd) return result } } 

就好像你需要另一个解决方案! 这个很简单,很有效:

 @interface NSString (NSStringCategory) - (NSString *) stringByReplacingISO8859Codes; @end @implementation NSString (NSStringCategory) - (NSString *) stringByReplacingISO8859Codes { NSString *dataString = self; do { //*** See if string contains &# prefix NSRange range = [dataString rangeOfString: @"&#" options: NSRegularExpressionSearch]; if (range.location == NSNotFound) { break; } //*** Get the next three charaters after the prefix NSString *isoHex = [dataString substringWithRange: NSMakeRange(range.location + 2, 3)]; //*** Create the full code for replacement NSString *isoString = [NSString stringWithFormat: @"&#%@;", isoHex]; //*** Convert to decimal integer unsigned decimal = 0; NSScanner *scanner = [NSScanner scannerWithString: [NSString stringWithFormat: @"0%@", isoHex]]; [scanner scanHexInt: &decimal]; //*** Use decimal code to get unicode character NSString *unicode = [NSString stringWithFormat:@"%C", decimal]; //*** Replace all occurences of this code in the string dataString = [dataString stringByReplacingOccurrencesOfString: isoString withString: unicode]; } while (TRUE); //*** Loop until we hit the NSNotFound return dataString; } @end 

如果字符实体引用是字符串,例如@"2318" ,则可以使用strtoul提取带有正确unicode字符的重新编码的NSString;

 NSString *unicodePoint = @"2318" unichar iconChar = (unichar) strtoul(unicodePoint.UTF8String, NULL, 16); NSString *recoded = [NSString stringWithFormat:@"%C", iconChar]; NSLog(@"recoded: %@", recoded"); // prints out "recoded: ⌘" 

Jugale答案的Swift 3版本

 extension String { static private let mappings = ["&quot;" : "\"","&amp;" : "&", "&lt;" : "<", "&gt;" : ">","&nbsp;" : " ","&iexcl;" : "¡","&cent;" : "¢","&pound;" : " £","&curren;" : "¤","&yen;" : "¥","&brvbar;" : "¦","&sect;" : "§","&uml;" : "¨","&copy;" : "©","&ordf;" : " ª","&laquo" : "«","&not" : "¬","&reg" : "®","&macr" : "¯","&deg" : "°","&plusmn" : "±","&sup2; " : "²","&sup3" : "³","&acute" : "´","&micro" : "µ","&para" : "¶","&middot" : "·","&cedil" : "¸","&sup1" : "¹","&ordm" : "º","&raquo" : "»&","frac14" : "¼","&frac12" : "½","&frac34" : "¾","&iquest" : "¿","&times" : "×","&divide" : "÷","&ETH" : "Ð","&eth" : "ð","&THORN" : "Þ","&thorn" : "þ","&AElig" : "Æ","&aelig" : "æ","&OElig" : "Œ","&oelig" : "œ","&Aring" : "Å","&Oslash" : "Ø","&Ccedil" : "Ç","&ccedil" : "ç","&szlig" : "ß","&Ntilde;" : "Ñ","&ntilde;":"ñ",] func stringByDecodingXMLEntities() -> String { guard let _ = self.range(of: "&", options: [.literal]) else { return self } var result = "" let scanner = Scanner(string: self) scanner.charactersToBeSkipped = nil let boundaryCharacterSet = CharacterSet(charactersIn: " \t\n\r;") repeat { var nonEntityString: NSString? = nil if scanner.scanUpTo("&", into: &nonEntityString) { if let s = nonEntityString as? String { result.append(s) } } if scanner.isAtEnd { break } var didBreak = false for (k,v) in String.mappings { if scanner.scanString(k, into: nil) { result.append(v) didBreak = true break } } if !didBreak { if scanner.scanString("&#", into: nil) { var gotNumber = false var charCodeUInt: UInt32 = 0 var charCodeInt: Int32 = -1 var xForHex: NSString? = nil if scanner.scanString("x", into: &xForHex) { gotNumber = scanner.scanHexInt32(&charCodeUInt) } else { gotNumber = scanner.scanInt32(&charCodeInt) } if gotNumber { let newChar = String(format: "%C", (charCodeInt > -1) ? charCodeInt : charCodeUInt) result.append(newChar) scanner.scanString(";", into: nil) } else { var unknownEntity: NSString? = nil scanner.scanUpToCharacters(from: boundaryCharacterSet, into: &unknownEntity) let h = xForHex ?? "" let u = unknownEntity ?? "" result.append("&#\(h)\(u)") } } else { scanner.scanString("&", into: nil) result.append("&") } } } while (!scanner.isAtEnd) return result } }