Removing Accents With Reg Ex

Discuss practical ways to rearrange URLs using mod_rewrite.

Removing Accents With Reg Ex

Post by david64 » Mon Jun 02, 2008 5:59 am

Hi. This is a little off topic. I am not actually using mod_rewrite, but I am using reg ex to make pretty URLs.

At the moment I am trying to write a bulletproof string-to-URL function to create nice URLs. All was well until I realised that any accented chars are going to be removed. I have found some instructions on how to handle this; however, they are in Perl and I don't know if it's possible to port them over to PHP.

The instructions are:

1. we take some data with diacritics;
2. convert it to Unicode;
3. put it through Canonical Decomposition, also known as Normalization Form D;
4. remove all characters that belong to the Unicode General Category “Mark” (non-spacing, spacing combining, enclosing) — thus removing the diacritics (accent marks);
5. prepare the data for output to an ASCII stream.


I have completed them up to step 3, but am stuck on step 4. The following Perl is given to do step 4 but I am not sure if this can be done in PHP.

use Encode;              # for Encode::decode
use Unicode::Normalize;  # for NFD

for ( $str ) {  # the variable we work on
    ## convert to Unicode first
    ## if your data comes in Latin-1, then uncomment:
    #$_ = Encode::decode( 'iso-8859-1', $_ );
    $_ = NFD( $_ );        ## decompose
    s/\pM//g;              ## strip combining characters
    s/[^\0-\x7F]//g;       ## clear everything else (non-ASCII)
}
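
For comparison, here is a rough PHP sketch of steps 4 and 5. PCRE in PHP understands the same Unicode property as Perl when the pattern carries the /u modifier, so \p{M} can do the job of \pM. The function name strip_marks is made up for illustration, and the sketch assumes PHP 7 or later and input that is already valid UTF-8 in NFD form (for step 3, the intl extension's Normalizer::normalize with Normalizer::FORM_D is one option, if it is installed).

```php
<?php
/**
 * Hypothetical helper: remove combining marks from an NFD string,
 * then drop whatever non-ASCII characters are left.
 */
function strip_marks(string $decomposed): string
{
    // \p{M} matches the Unicode general category "Mark"; the /u modifier
    // makes PCRE treat both pattern and subject as UTF-8.
    $noMarks = preg_replace('/\p{M}/u', '', $decomposed);
    if ($noMarks === null) {
        // preg_replace returns null on failure, e.g. malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Counterpart of the Perl "clear everything else" step.
    return preg_replace('/[^\x00-\x7F]/u', '', $noMarks) ?? '';
}

// "é" in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
$decomposed = "cafe\u{0301}";
echo strip_marks($decomposed), "\n"; // cafe
```

Matching on \p{M} works on whole code points, so it does not depend on how the marks happen to be encoded as bytes.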


Any ideas?
david64
 
Posts: 15
Joined: Wed Mar 26, 2008 2:16 pm

Update

Post by david64 » Mon Jun 02, 2008 6:09 am

Just a quick update.

I have found that running the string through NFD normalisation and then doing this:

$normalized = $normalizer->normalize('åæçèéêëìíîïðñ', 'NFD', 'UTF-8' );

$new = preg_replace( '/[\x80-\xff]|[\x00-\x07][\x00-\xff]/', '', $normalized );

will give me the string: aceeeeiiiin

That is pretty close to the required output, which is: aaeceeeeiiion

Any Unicode brutes know how to make this work better?
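
A likely reason æ and ð go missing: neither has a canonical decomposition, so NFD leaves them as single non-ASCII letters and the byte-range pattern then deletes them outright. One possible workaround is to map such letters by hand before stripping the marks. The sketch below is only that. The function name fold_to_ascii and the replacement table are assumptions for illustration (mapping ð to "d" in particular is a choice, not a rule), and the input is assumed to be valid UTF-8 already in NFD form, on PHP 7 or later.

```php
<?php
/**
 * Hypothetical helper: transliterate a few letters that NFD cannot
 * decompose, then strip combining marks and any remaining non-ASCII.
 */
function fold_to_ascii(string $decomposed): string
{
    // Assumed replacement table; extend or change it as the URLs require.
    $special = [
        "\u{00E6}" => 'ae', // æ
        "\u{00C6}" => 'AE', // Æ
        "\u{00F0}" => 'd',  // ð
        "\u{00D0}" => 'D',  // Ð
        "\u{00F8}" => 'o',  // ø
        "\u{00D8}" => 'O',  // Ø
        "\u{00DF}" => 'ss', // ß
    ];
    $mapped = strtr($decomposed, $special);

    // Remove combining marks (Unicode category "Mark"), then other non-ASCII.
    $noMarks = preg_replace('/\p{M}/u', '', $mapped) ?? '';
    return preg_replace('/[^\x00-\x7F]/u', '', $noMarks) ?? '';
}

// å æ ç ð ñ, with å, ç and ñ written in decomposed (NFD) form.
$input = "a\u{030A}\u{00E6}c\u{0327}\u{00F0}n\u{0303}";
echo fold_to_ascii($input), "\n"; // aaecdn
```

Doing the hand mapping first matters: once the non-ASCII filter has run, the unmapped letters are gone and there is nothing left to transliterate.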

