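Remove non-UTF8 characters from string

I'm having a problem removing non-UTF8 characters from a string; they are not displayed properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression or something else?

Using a regex approach: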

    $regex = <<<'END'
    /
      (
        (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
        |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
        |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
        |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
        ){1,100}                        # ...one or more times
      )
    | .                                 # anything else
    /x
    END;
    // preg_replace() returns the cleaned string, so keep the result.
    $text = preg_replace($regex, '$1', $text);

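It searches for UTF-8 sequences and captures them into group 1. It also matches single bytes that cannot be identified as part of a UTF-8 sequence, but does not capture those. The replacement is whatever was captured into group 1, which effectively removes all invalid bytes.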
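As a quick check of that behaviour, the sketch below wraps the same pattern in a helper and runs it on the four bytes from the question. The function name `strip_invalid_utf8` is made up for illustration. Note that the pattern only checks the lead/continuation byte structure, so overlong forms such as `\xC0\x80` still pass through.

```php
<?php
// Hypothetical wrapper around the pattern described above.
function strip_invalid_utf8(string $text): string
{
    $regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # 1-byte sequence
    |   [\xC0-\xDF][\x80-\xBF]      # 2-byte sequence
    |   [\xE0-\xEF][\x80-\xBF]{2}   # 3-byte sequence
    |   [\xF0-\xF7][\x80-\xBF]{3}   # 4-byte sequence
    ){1,100}                        # runs of well-formed sequences go to group 1
  )
| .                                 # any other single byte: matched, not captured
/x
END;
    // Replacing every match with group 1 keeps the valid runs and drops the rest.
    return (string) preg_replace($regex, '$1', $text);
}

// The bytes from the question: 0x97 is a stray continuation byte.
echo strip_invalid_utf8("\x97\x61\x6C\x6F"), "\n";
```

Valid multi-byte text such as "caf\xC3\xA9" should come back unchanged, because it is matched by group 1.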

It is also possible to repair the string by encoding the invalid bytes as UTF-8 characters. If the errors are random, however, this may leave some strange symbols behind.

    $regex = <<<'END'
    /
      (
        (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
        |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
        |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
        |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
        ){1,100}                      # ...one or more times
      )
    | ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
    | ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
    /x
    END;

    function utf8replacer($captures) {
      if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
      }
      elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
      }
      else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
      }
    }

    // preg_replace_callback() returns the repaired string, so keep the result.
    $text = preg_replace_callback($regex, "utf8replacer", $text);

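EDIT:

!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".

x != "" seems to be the best choice in this case.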
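The difference between the three checks is easy to see by running them side by side. This is only an illustrative sketch; `compare_checks` and its array keys are names invented here.

```php
<?php
// Hypothetical helper that runs the three checks on one value.
function compare_checks($x): array
{
    return [
        'not_empty' => !empty($x),  // "0" counts as empty
        'loose_ne'  => $x != "",    // true for "0", false for "" and null
        'strict_ne' => $x !== "",   // false only for the exact string ""
    ];
}

foreach (["", "0", "abc", null] as $value) {
    echo var_export($value, true), ' => ', json_encode(compare_checks($value)), "\n";
}
```

A capture group that matched the single character "0" would be skipped by `!empty()`, which is why the loose comparison is the safer test in the callback.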

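I have also sped up the match a little. Instead of matching each character separately, it now matches sequences of valid UTF-8 characters.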

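If you apply utf8_encode() to a string that is already UTF8, it will return garbled UTF8 output.

I wrote a function that addresses all these issues. It is called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO8859-1), Windows-1252 or UTF8, or the string can contain a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I wrote it because a service was giving me a data feed that was all messed up, mixing those encodings in the same string.

Usage: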

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::toUTF8($mixed_string);

    $latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks like the garbled product of having been encoded into UTF8 multiple times.

Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.

    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

    echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8
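To see where garbled strings like the ones in the examples come from, here is a small sketch of the double-encoding effect. It does not use the library. It assumes the mbstring extension and relies on the fact that converting from ISO-8859-1 to UTF-8 is what utf8_encode() does (utf8_encode() itself is deprecated as of PHP 8.2). The helper name `encode_as_if_latin1` is made up.

```php
<?php
// Hypothetical helper: treat the bytes as Latin-1 and encode them to UTF-8,
// regardless of what they really are (the mistake that causes the garbling).
function encode_as_if_latin1(string $text): string
{
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$original = "F\u{E9}d\u{E9}ration";       // "Fédération" as valid UTF-8
$once     = encode_as_if_latin1($original); // each é (C3 A9) becomes "Ã©"
$twice    = encode_as_if_latin1($once);     // garbled one level deeper

echo $once, "\n", $twice, "\n";

// One level of damage can be undone by converting back the other way.
echo mb_convert_encoding($once, 'ISO-8859-1', 'UTF-8'), "\n";
```

Every extra pass multiplies the number of bytes per accented character, so a repair has to undo the conversion once per pass.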

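You can use mbstring:

    $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

...will remove the invalid characters, or replace them with "?", depending on the mbstring.substitute_character setting.

See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored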
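Whether the invalid bytes vanish or turn into a substitute character depends on that setting. A minimal sketch that forces removal, assuming the mbstring extension is loaded (the helper name `strip_invalid_utf8_mb` is hypothetical):

```php
<?php
// Hypothetical helper: drop invalid bytes instead of substituting them.
function strip_invalid_utf8_mb(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // 'none' means: emit nothing for invalid input
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore it for the rest of the script
    return $clean;
}

echo strip_invalid_utf8_mb("\x97alo"), "\n";
```

Restoring the previous value keeps the change local, since mb_substitute_character() affects every later mbstring conversion in the request.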

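Here is my function. It works regardless of the encoding, because it keeps only printable ASCII bytes (plus byte 163, which is £ in Latin-1) and drops everything else:

    function remove_bs($Str)
    {
        $StrArr = str_split($Str);
        $NewStr = '';
        foreach ($StrArr as $Char) {
            $CharNo = ord($Char);
            if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £
            if ($CharNo > 31 && $CharNo < 127) {
                $NewStr .= $Char;
            }
        }
        return $NewStr;
    }

How it works: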

 echo remove_bs('Hello õhowå åare youÆ?'); // Hello how are you?

 $text = iconv("UTF-8", "UTF-8//IGNORE", $text);

This is what I am using. It seems to work pretty well. Taken from http://planetozh.com/blog/2005/01/remove-invalid-characters-in-utf-8/

UConverter is available since PHP 5.5. It is the better choice if you use the intl extension and do not use mbstring.

    function replace_invalid_byte_sequence($str)
    {
        return UConverter::transcode($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence2($str)
    {
        return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
    }

Since PHP 5.4, htmlspecialchars can be used to get rid of invalid byte sequences: with ENT_SUBSTITUTE, each one is replaced by U+FFFD. htmlspecialchars handles large inputs better, and more accurately, than preg_match. Many of the regular-expression implementations you will come across are wrong.

    function replace_invalid_byte_sequence3($str)
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    }
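ENT_SUBSTITUTE leaves a U+FFFD replacement character where each invalid sequence was. If the goal is to strip rather than replace, one option is to remove that character afterwards. The sketch below assumes the input contains no legitimate U+FFFD that should be kept; `substitute_then_strip` is a hypothetical name.

```php
<?php
// Hypothetical helper: substitute invalid sequences, then remove the substitutes.
function substitute_then_strip(string $text): string
{
    // Invalid sequences become U+FFFD; the decode step undoes the HTML escaping.
    $substituted = htmlspecialchars_decode(
        htmlspecialchars($text, ENT_SUBSTITUTE, 'UTF-8')
    );
    // Caveat: this also removes any U+FFFD that was already in the input.
    return str_replace("\u{FFFD}", '', $substituted);
}

echo substitute_then_strip("\x97alo"), "\n";
```

Keeping the U+FFFD marker instead is often preferable when you want to see afterwards where data was damaged.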

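Try this:

    $string = iconv("UTF-8","UTF-8//IGNORE",$string);

According to the iconv manual, the function takes the input charset as its first parameter, the output charset as its second, and the actual input string as its third.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters of the input string that cannot be represented in the output charset. In effect, this filters the input string.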
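A cautious way to use this is sketched below. I am not certain that every iconv implementation honours //IGNORE in the same way; some builds are reported to raise a notice and return false on illegal input, so the sketch checks the result instead of trusting it. The helper name `iconv_strip_or_null` is hypothetical.

```php
<?php
// Hypothetical helper: returns the filtered string, or null when iconv
// did not deliver a valid UTF-8 result on this platform.
function iconv_strip_or_null(string $text): ?string
{
    // "@" silences the notice some iconv builds emit for illegal input.
    $result = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
    if ($result === false) {
        return null;
    }
    // preg_match() with the "u" modifier only succeeds on valid UTF-8 subjects.
    return preg_match('//u', $result) === 1 ? $result : null;
}

$clean = iconv_strip_or_null("\x97alo");
echo $clean === null ? "iconv could not filter the input here\n" : $clean . "\n";
```

When the helper returns null, one of the other approaches in this thread can serve as a fallback.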

The text may contain non-UTF8 characters. Try doing this first:

    $nonutf8 = mb_convert_encoding($nonutf8 , 'UTF-8', 'UTF-8');

You can read more about it here: http://php.net/manual/en/function.mb-convert-encoding.php

    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

I wrote a function that removes invalid UTF-8 characters from a string. I use it to clean the descriptions of 27000 products before generating the XML export file.

    public function stripInvalidXml($value)
    {
        $ret = "";
        if ((string) $value === "") {
            return $ret;
        }

        $length = strlen($value);
        for ($i = 0; $i < $length; $i++) {
            // ord() yields one byte (0-255), so in practice this drops only
            // control bytes below 0x20 other than tab, LF and CR.
            $current = ord($value[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF))) {
                $ret .= chr($current);
            }
        }
        return $ret;
    }
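The ranges in that function are the character ranges XML 1.0 allows, which are defined on code points rather than bytes. A code-point based variant could look like the sketch below. It is not the author's code: the name `strip_invalid_xml_chars` is invented, and it assumes the input is already valid UTF-8, because preg_replace() with the `u` modifier returns null otherwise.

```php
<?php
// Hypothetical code-point based variant of the byte loop above.
function strip_invalid_xml_chars(string $value): string
{
    // Negated class: everything outside the XML 1.0 character ranges.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $value);
    if ($clean === null) {
        // The "u" modifier rejects subjects that are not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

echo strip_invalid_xml_chars("a\x01b\tc\u{E9}"), "\n";
```

Running one of the byte-level cleaners first and this filter second covers both invalid encodings and characters that XML forbids.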

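From a recent patch to Drupal's Feeds JSON parser module:

    // remove everything except valid letters (from any language)
    $raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);

In case you were wondering, it retains spaces as valid characters.

It did what I needed. It removes the now widespread emoji characters that don't fit into MySQL's 'utf8' character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details see https://www.drupal.org/node/1824506#comment-6881382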
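MySQL's 'utf8' character set stores at most three bytes per character, so the characters that trigger error 1366 are the ones above U+FFFF, which need four bytes in UTF-8. A more direct way to strip just those is sketched below; this is an alternative to the Drupal patch, not a copy of it, and `strip_four_byte_chars` is a hypothetical name. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: removes code points above U+FFFF (most emoji live there).
function strip_four_byte_chars(string $text): string
{
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    if ($clean === null) {
        // The "u" modifier rejects subjects that are not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

echo strip_four_byte_chars("ok \u{1F600}!"), "\n";
```

If the data should keep its emoji, switching the column to utf8mb4 avoids the need for stripping altogether.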

So the rule is that the first UTF-8 octet has its high bit set as a marker, followed by 1 to 4 bits that indicate how many additional octets there are; each additional octet must then have its high two bits set to 10.

The pseudo-Python would be:

    newstring = ''
    cont = 0
    for each ch in string:
      if cont:
        if (ch >> 6) != 2:          # high 2 bits are not 10
          # do whatever, e.g. skip it, or skip the whole code point, or?
        else:
          # acceptable continuation of a multi-octet char
          newstring += ch
        cont -= 1
      else:
        if (ch >> 7):               # high bit set?
          c = (ch << 1) & 0xFF      # strip the high bit marker
          while (c & 0x80):         # while the high bit indicates another octet
            c = (c << 1) & 0xFF
            cont += 1
            if cont > 4:
              # more than 4 octets not allowed; cope with error
          if !cont:
            # illegal, do something sensible
        newstring += ch             # or whatever
    if cont:
      # last utf-8 char was not terminated, cope

The same logic should be translatable to PHP. However, it is not clear what kind of stripping should be done once you hit a malformed character.
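One possible PHP translation is sketched below. It settles the open question with an assumption of my own: a malformed or unterminated sequence is dropped, and the byte that broke it is re-examined as a possible new start. It follows the current UTF-8 limit of three continuation octets and checks structure only, so overlong forms and surrogates are not rejected. The name `strip_malformed_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: keeps only structurally well-formed UTF-8 sequences.
function strip_malformed_utf8(string $input): string
{
    $out = '';
    $pending = '';  // bytes collected for the current multi-byte character
    $need = 0;      // continuation bytes still expected
    $len = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $byte = ord($input[$i]);

        if ($need > 0) {
            if (($byte & 0xC0) === 0x80) {      // 10xxxxxx: valid continuation
                $pending .= $input[$i];
                if (--$need === 0) {            // character complete, emit it
                    $out .= $pending;
                    $pending = '';
                }
                continue;
            }
            // Broken sequence: discard it and treat this byte as a fresh start.
            $pending = '';
            $need = 0;
        }

        if ($byte < 0x80) {                      // 0xxxxxxx: plain ASCII
            $out .= $input[$i];
        } elseif (($byte & 0xE0) === 0xC0) {     // 110xxxxx: 1 continuation
            $pending = $input[$i];
            $need = 1;
        } elseif (($byte & 0xF0) === 0xE0) {     // 1110xxxx: 2 continuations
            $pending = $input[$i];
            $need = 2;
        } elseif (($byte & 0xF8) === 0xF0) {     // 11110xxx: 3 continuations
            $pending = $input[$i];
            $need = 3;
        }
        // Anything else (stray continuation, 0xF8-0xFF) is silently dropped.
    }

    // A sequence still pending at the end was never terminated; drop it.
    return $out;
}

echo strip_malformed_utf8("\x97alo"), "\n";
```

Replacing the dropped bytes with a marker such as "?" instead would only require appending it at the two places where bytes are discarded.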

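To remove all Unicode characters outside of the Unicode Basic Multilingual Plane:

    // Code points above \xFF need the \x{...} form and the u modifier.
    $str = preg_replace('/[^\x{0000}-\x{FFFF}]/u', '', $str);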

A slightly different take on the problem, but what I did was use HtmlEncode(string).

Pseudo code here:

    var encoded = HtmlEncode(string);
    encoded = Regex.Replace(encoded, "&#\d+?;", "");
    var result = HtmlDecode(encoded);

Input and output:

    "Headlight\x007E Bracket, &#123; Cafe Racer<> Style,Â Stainless Steel 中文呢？"
    "Headlight~ Bracket, &#123; Cafe Racer<> Style, Stainless Steel 中文呢？"

I know it's not perfect, but it does the job for me.

iconv:

http://php.net/manual/en/function.iconv.php

I haven't used it inside PHP itself, but it has always performed well for me on the command line. You can have it substitute or remove invalid characters.
