How do I remove diacritics (accents) from a string in .NET?

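I'm trying to convert some strings that are in Canadian French. Basically, I'd like to strip the French accent marks from the letters while keeping the letters themselves (e.g. convert é to e, so that crème brûlée becomes creme brulee).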

What is the best way to achieve this?

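I haven't used this method myself, but Michael Kaplan describes a way to do it in his blog post (with a confusing title) about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)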

    static string RemoveDiacritics(string text)
    {
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder();

        foreach (var c in normalizedString)
        {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }

Note that this is a follow-up to his earlier post: Stripping diacritics....

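The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics), then scans the result and keeps only the base characters. It's a little complicated, but really you're looking at a complicated problem.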

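Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.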

This did the trick for me...

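    string accentedStr = "crème brûlée"; // example input; the original snippet left this unassigned
    byte[] tempBytes;
    // ISO-8859-8 has no accented Latin letters, so this relies on the
    // encoder's best-fit fallback picking the closest plain letter.
    tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
    string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);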

Quick and short!

In case anyone is interested, I was looking for something similar and ended up writing the following:

    public static string NormalizeStringForUrl(string name)
    {
        String normalizedString = name.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();

        foreach (char c in normalizedString)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    stringBuilder.Append(c);
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    stringBuilder.Append('_');
                    break;
            }
        }
        string result = stringBuilder.ToString();
        return String.Join("_", result.Split(new char[] { '_' }, StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
    }

In case anyone is interested, here is the Java equivalent:

    import java.text.Normalizer;

    public class MyClass
    {
        public static String removeDiacritics(String input)
        {
            String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
            StringBuilder stripped = new StringBuilder();
            for (int i = 0; i < nrml.length(); ++i)
            {
                if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
                {
                    stripped.append(nrml.charAt(i));
                }
            }
            return stripped.toString();
        }
    }

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

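  • Normalizing to form D splits characters like è into an e and a nonspacing `
  • From this, the nonspacing characters are removed
  • The result is normalized back to form C (I'm not sure whether this is necessary)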

Code:

    using System.Linq;
    using System.Text;
    using System.Globalization;

    // namespace here
    public static class Utility
    {
        public static string RemoveDiacritics(this string str)
        {
            if (str == null) return null;
            var chars =
                from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
                let uc = CharUnicodeInfo.GetUnicodeCategory(c)
                where uc != UnicodeCategory.NonSpacingMark
                select c;

            var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

            return cleanStr;
        }
    }

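I needed something that converts all the major Unicode characters, and the top-voted answer left a few out, so I ported CodeIgniter's convert_accented_characters($str) to C#. This version is easy to customise: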

    using System;
    using System.Text;
    using System.Collections.Generic;

    public static class Strings
    {
        // Each key is a set of source characters; the value is their replacement.
        static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
        {
            { "äæǽ", "ae" }, { "öœ", "oe" }, { "ü", "ue" },
            { "Ä", "Ae" }, { "Ü", "Ue" }, { "Ö", "Oe" },
            { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
            { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
            { "Б", "B" }, { "б", "b" },
            { "ÇĆĈĊČ", "C" }, { "çćĉċč", "c" },
            { "Д", "D" }, { "д", "d" },
            { "ÐĎĐΔ", "Dj" }, { "ðďđδ", "dj" },
            { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
            { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
            { "Ф", "F" }, { "ф", "f" },
            { "ĜĞĠĢΓГҐ", "G" }, { "ĝğġģγгґ", "g" },
            { "ĤĦ", "H" }, { "ĥħ", "h" },
            { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
            { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
            { "Ĵ", "J" }, { "ĵ", "j" },
            { "ĶΚК", "K" }, { "ķκк", "k" },
            { "ĹĻĽĿŁΛЛ", "L" }, { "ĺļľŀłλл", "l" },
            { "М", "M" }, { "м", "m" },
            { "ÑŃŅŇΝН", "N" }, { "ñńņňʼnνн", "n" },
            { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
            { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
            { "П", "P" }, { "п", "p" },
            { "ŔŖŘΡР", "R" }, { "ŕŗřρр", "r" },
            { "ŚŜŞȘŠΣС", "S" }, { "śŝşșšſσςс", "s" },
            { "ȚŢŤŦτТ", "T" }, { "țţťŧт", "t" },
            { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
            { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
            { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" }, { "ýÿŷỳỹỷỵй", "y" },
            { "В", "V" }, { "в", "v" },
            { "Ŵ", "W" }, { "ŵ", "w" },
            { "ŹŻŽΖЗ", "Z" }, { "źżžζз", "z" },
            { "ÆǼ", "AE" }, { "ß", "ss" },
            { "IJ", "IJ" }, { "ij", "ij" },
            { "Œ", "OE" }, { "ƒ", "f" },
            { "ξ", "ks" }, { "π", "p" }, { "β", "v" }, { "μ", "m" }, { "ψ", "ps" },
            { "Ё", "Yo" }, { "ё", "yo" },
            { "Є", "Ye" }, { "є", "ye" },
            { "Ї", "Yi" },
            { "Ж", "Zh" }, { "ж", "zh" },
            { "Х", "Kh" }, { "х", "kh" },
            { "Ц", "Ts" }, { "ц", "ts" },
            { "Ч", "Ch" }, { "ч", "ch" },
            { "Ш", "Sh" }, { "ш", "sh" },
            { "Щ", "Shch" }, { "щ", "shch" },
            { "ЪъЬь", "" },
            { "Ю", "Yu" }, { "ю", "yu" },
            { "Я", "Ya" }, { "я", "ya" },
        };

        // Returns the first character of the replacement, or c itself when
        // there is no mapping (or the mapping is the empty string).
        public static char RemoveDiacritics(this char c)
        {
            foreach (KeyValuePair<string, string> entry in foreign_characters)
            {
                if (entry.Key.IndexOf(c) != -1)
                {
                    return entry.Value.Length > 0 ? entry.Value[0] : c;
                }
            }
            return c;
        }

        // Replaces every mapped character with its full replacement string;
        // characters mapped to "" are dropped, unmapped characters are kept.
        public static string RemoveDiacritics(this string s)
        {
            StringBuilder sb = new StringBuilder(s.Length);
            foreach (char c in s)
            {
                bool replaced = false;
                foreach (KeyValuePair<string, string> entry in foreign_characters)
                {
                    if (entry.Key.IndexOf(c) != -1)
                    {
                        sb.Append(entry.Value);
                        replaced = true;
                        break;
                    }
                }
                if (!replaced)
                {
                    sb.Append(c);
                }
            }
            return sb.ToString();
        }
    }

Usage

    // for strings
    "crème brûlée".RemoveDiacritics (); // creme brulee

    // for chars
    "Ã"[0].RemoveDiacritics (); // A
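For anyone wondering why a lookup table like this catches characters that the normalization answers miss: some letters (Ł, ø, ß and so on) have no canonical decomposition, so form D leaves them as a single character and there is no nonspacing mark to filter out. A minimal sketch, written in Java like the Java answer above; the class name `StripLimits` and the method `strip` are made up for the illustration, and the literals use `\u` escapes only to keep the source ASCII:

```java
import java.text.Normalizer;

class StripLimits {
    // Normalize to NFD, then drop every nonspacing mark (Unicode category Mn).
    static String strip(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // e-grave, u-circumflex and e-acute decompose, so the accents go away.
        System.out.println(strip("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee
        // o-acute and z-acute decompose; L-with-stroke (U+0141) does not and is kept.
        System.out.println(strip("\u0141\u00F3d\u017A"));
        // o-with-stroke (U+00F8) and sharp s (U+00DF) come back unchanged.
        System.out.println(strip("s\u00F8ster stra\u00DFe"));
    }
}
```

So if you need ø to become o or ß to become ss, you need a table (or a switch) for those on top of the normalization step.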

The Greek (ISO) code page can do it

Information about this code page is available through System.Text.Encoding.GetEncodings(). See this URL for details: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has code page 28597 and the name iso-8859-7.

On to the code... \o/

    string text = "Você está numa situação lamentável";

    string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
    //result: "Voce+esta+numa+situacao+lamentavel"

    string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
    //result: "Voce esta numa situacao lamentavel"

So, write this function...

    public string RemoveAcentuation(string text)
    {
        return
            System.Web.HttpUtility.UrlDecode(
                System.Web.HttpUtility.UrlEncode(
                    text, Encoding.GetEncoding("iso-8859-7")));
    }

Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second is the code page of the Encoding.

This works fine in Java.

It basically converts every accented character into its de-accented counterpart followed by its combining diacritic. You can then use a regex to strip the diacritics.

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public String deAccent(String str) {
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        return pattern.matcher(nfdNormalizedString).replaceAll("");
    }
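If it helps to see what the normalization step actually produces before the regex runs, here is a small sketch that prints the code points of é before and after `Normalizer.normalize`. The class name `NfdDemo` and the helper `codePoints` are hypothetical names used just for this illustration:

```java
import java.text.Normalizer;

class NfdDemo {
    // Formats the code points of s as space-separated "U+XXXX" tokens.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00E9"; // e-acute as one precomposed code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codePoints(composed));   // U+00E9
        System.out.println(codePoints(decomposed)); // U+0065 U+0301
    }
}
```

U+0065 is the plain e and U+0301 is the combining acute accent, which falls in the Combining Diacritical Marks block that the pattern removes.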

Here is the VB version (works with Greek):

    Imports System.Text
    Imports System.Globalization

    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char
        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString()
    End Function

This is how I replace diacritic characters with non-diacritic ones in all my .NET programs

C#:

    //Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
    public string RemoveDiacritics(string s)
    {
        string normalizedString = null;
        StringBuilder stringBuilder = new StringBuilder();
        normalizedString = s.Normalize(NormalizationForm.FormD);
        int i = 0;
        char c = '\0';

        for (i = 0; i <= normalizedString.Length - 1; i++)
        {
            c = normalizedString[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().ToLower();
    }

VB .NET:

    'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"'
    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char

        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString().ToLower()
    End Function

You can use the string extension from the MMLib.Extensions NuGet package:

    using MMLib.RapidPrototyping.Generators;

    public void ExtensionsExample()
    {
        string target = "aácčeéií";
        Assert.AreEqual("aacceeii", target.RemoveDiacritics());
    }

NuGet page: https://www.nuget.org/packages/MMLib.Extensions/ Codeplex project site: https://mmlib.codeplex.com/

Try the HelperSharp package.

It has a method called RemoveAccents:

    public static string RemoveAccents(this string source)
    {
        //8 bit characters
        byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

        // 7 bit characters
        string t = Encoding.ASCII.GetString(b);
        Regex re = new Regex("[^a-zA-Z0-9]=-_/");
        string c = re.Replace(t, " ");
        return c;
    }

What this person said:

Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

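It actually splits the likes of å, which is a single character (character code 00E5, not 0061 plus the modifier 030A, which would look the same), into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving only the a.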
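The point about 00E5 versus 0061 plus 030A is easy to check for yourself. The sketch below is in Java rather than .NET and says nothing about what code page 1251 does internally; it only shows that the two spellings of å are different strings that become equal once normalized. `ComposedVsCombining` and `canonicallyEqual` are names invented for this example:

```java
import java.text.Normalizer;

class ComposedVsCombining {
    // True if a and b are equal once both are normalized to form C.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E5"; // a-ring as one code point
        String combining = "a\u030A";  // plain a followed by COMBINING RING ABOVE

        System.out.println(precomposed.equals(combining));            // false
        System.out.println(canonicallyEqual(precomposed, combining)); // true
    }
}
```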

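It's funny that a question like this can get so many answers and yet none of them fits my requirements :) There are so many languages around that a completely language-agnostic solution is AFAIK not really possible; as others have mentioned, FormC and FormD both cause problems.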

Since the original question was about French, the simplest working answer is indeed

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the encoding code of the input language.

However, this only replaces one character with one character. Since I also work with German input, I wrote a manual conversion

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a no-go here: it is much slower than any char/string handling.

I also have a very simple method to remove spaces:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

In the end, I use a combination of all 3 extensions above:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
        return keepSpace ? str : str.RemoveSpace();
    }

And a small unit test for it (not exhaustive), which passes.

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }

Popping this library in here in case you haven't already considered it. It looks like it comes with a full range of unit tests.

https://github.com/thomasgalliker/Diacritics.NET

    Imports System.Linq
    Imports System.Text
    Imports System.Globalization

    Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        ' Decompose to form D, then keep everything that is not a nonspacing mark.
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function

I really like the concise and functional code provided by azrafe7. So I changed it a little to turn it into an extension method:

    public static class StringExtensions
    {
        public static string RemoveDiacritics(this string text)
        {
            const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

            if (string.IsNullOrEmpty(text))
            {
                return string.Empty;
            }

            return Encoding.ASCII.GetString(
                Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
        }
    }