I'm trying to translate the following slugify method from PHP to C#: http://snipplr.com/view/22741/slugify-a-string-in-php/
Edit: For the sake of convenience, here is the code from the link above:
/**
 * Modifies a string to remove all non-ASCII characters and spaces.
 */
static public function slugify($text)
{
    // replace non letter or digits by -
    $text = preg_replace('~[^\\pL\d]+~u', '-', $text);

    // trim
    $text = trim($text, '-');

    // transliterate
    if (function_exists('iconv'))
    {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }

    // lowercase
    $text = strtolower($text);

    // remove unwanted characters
    $text = preg_replace('~[^-\w]+~', '', $text);

    if (empty($text))
    {
        return 'n-a';
    }

    return $text;
}
I had no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
Edit: The purpose of this is to transliterate text containing non-ASCII characters, such as Reformáció Genfi Emlékműve Előtt, into reformacio-genfi-emlekmuve-elott
I would also like to add that //TRANSLIT removes the apostrophes and that @jxac's solution doesn't address that. I'm not sure why, but by first encoding the string to Cyrillic and then decoding it as ASCII you get behavior similar to //TRANSLIT.
var str = "éåäöíØ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
// => "eaaoiO"
There is a .NET library for transliteration on CodePlex: unidecode. It generally does the trick, using Unidecode tables ported from Python.
conversion to string:
byte[] unicodeBytes = Encoding.Unicode.GetBytes(str);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string asciiString = Encoding.ASCII.GetString(asciiBytes);
conversion to bytes:
byte[] ascii = Encoding.ASCII.GetBytes(str);
@Thomas Levesque is right, will get encoded by the output stream...
to remove the diacritics (accent marks), you can use the String.Normalize function, as detailed here:
http://www.siao2.com/2007/05/14/2629747.aspx
that should take care of most of the cases (where the glyph is really a character plus an accent mark). for an even more aggressive char matching (to take care of cases like the Scandinavian slashed o [Ø], digraphs, and other exotic glyphs), there's the table approach:
http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx
this includes around 1,000 symbol mappings in addition to the normalization.
(note, all punctuation is removed by the regex replace in your example)
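The normalization approach can be put together with the rest of the PHP method roughly as follows. This is only a sketch, not a drop-in replacement for iconv's //TRANSLIT: the names SlugHelper, RemoveDiacritics and Slugify are made up for illustration, and it assumes that stripping combining marks after String.Normalize(NormalizationForm.FormD) is good enough for your input. Characters with no canonical decomposition (such as Ø) are not mapped to ASCII here; they are simply dropped by the final regex, so the table approach is still needed for those. The final pattern spells out a-z0-9_ because \w in .NET also matches non-ASCII letters.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class SlugHelper
{
    // Decomposes each character (FormD) and drops the combining marks,
    // so "é" becomes "e". Characters without a decomposition are kept as-is.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Mirrors the steps of the PHP slugify method from the question.
    public static string Slugify(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return "n-a";
        }

        // replace runs of non-letters/non-digits with -
        text = Regex.Replace(text, @"[^\p{L}\d]+", "-");

        // trim
        text = text.Trim('-');

        // approximate transliteration (stands in for iconv //TRANSLIT)
        text = RemoveDiacritics(text);

        // lowercase
        text = text.ToLowerInvariant();

        // remove anything that is still not ASCII letter, digit, _ or -
        text = Regex.Replace(text, @"[^a-z0-9_-]+", "");

        return text.Length == 0 ? "n-a" : text;
    }
}
```

With the example from the question, SlugHelper.Slugify("Reformáció Genfi Emlékműve Előtt") should give reformacio-genfi-emlekmuve-elott, and an input made up only of punctuation falls back to n-a, as in the PHP version.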