

Character Encoding Filters

Submitted by support on Thu, 2006-05-04 13:12

Some users building sites in non-English languages run into a problem because some of their feeds are UTF-8 encoded while others are ISO-8859-1 encoded. Since you must choose a single character set for the entire site, the solution is to apply PHP's utf8_encode() or utf8_decode() function, as required, to the product name and description fields of any feed that is not in the same character encoding as your site. The following code provides two additional filters that run any selected field through these functions.

To implement this, choose the character encoding used by the majority of your feeds and set it as the character encoding for the entire site in config.php, for example:

  $config_charset = "UTF-8";

Then, for any feeds that are ISO-8859-1 encoded, use the new "UTF8 Encode" filter, which can be installed by adding the following code to includes/filter.php:

Note: These filters are now included as part of the distribution

<?php
  /*************************************************/
  /* UTF8 Encode                                   */
  /*************************************************/
  $filter_names["utf8Encode"] = "UTF8 Encode";

  function filter_utf8EncodeConfigure($filter_data)
  {
    print "<p>There are no additional configuration parameters for this filter.</p>";
  }

  function filter_utf8EncodeValidate($filter_data)
  {
    // Nothing to validate: this filter takes no parameters.
  }

  function filter_utf8EncodeExec($filter_data,$text)
  {
    // Convert ISO-8859-1 text to UTF-8.
    return utf8_encode($text);
  }

  /*************************************************/
  /* UTF8 Decode                                   */
  /*************************************************/
  $filter_names["utf8Decode"] = "UTF8 Decode";

  function filter_utf8DecodeConfigure($filter_data)
  {
    print "<p>There are no additional configuration parameters for this filter.</p>";
  }

  function filter_utf8DecodeValidate($filter_data)
  {
    // Nothing to validate: this filter takes no parameters.
  }

  function filter_utf8DecodeExec($filter_data,$text)
  {
    // Convert UTF-8 text to ISO-8859-1.
    return utf8_decode($text);
  }
?>

Alternatively, if the majority of your feeds are ISO-8859-1 encoded, choose that as the main character set for the site and use the "UTF8 Decode" filter against the UTF-8 encoded feeds.
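
To make the effect of the two functions concrete, here is a minimal sketch of what they do at the byte level. The sample strings are made up for illustration, and the error_reporting() line is only there because utf8_encode() and utf8_decode() are deprecated as of PHP 8.2 and would otherwise print a notice.

```php
<?php
// utf8_encode()/utf8_decode() are deprecated as of PHP 8.2; hide that notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

// "café" stored as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Encoding turns that byte into the two-byte UTF-8 sequence 0xC3 0xA9.
$utf8 = utf8_encode($latin1);
echo bin2hex($utf8), "\n";               // 636166c3a9

// Decoding reverses it.
echo bin2hex(utf8_decode($utf8)), "\n";  // 636166e9

// Lossy case: the euro sign (U+20AC, bytes E2 82 AC) does not exist in
// ISO-8859-1, so utf8_decode() substitutes a question mark.
$euro = "\xE2\x82\xAC";
echo utf8_decode($euro), "\n";           // ?
```

The last case is worth keeping in mind when choosing ISO-8859-1 as the site character set: any character outside that set in a UTF-8 feed is replaced with "?" by the decode filter.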

Submitted by affolable on Sun, 2006-05-21 14:54

In my case I have both UTF-8 and ISO-8859-1 feeds, so I modified tapestry.php to add this function:

function is_utf8($string) {
  // From http://w3.org/International/questions/qa-forms-utf-8.html
  return preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
  )*$%xs', $string);
}

and in the tapestry_normalise function I added:
if (!is_utf8($text)) $text = utf8_encode($text);
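
The check matters because running utf8_encode() over text that is already UTF-8 encodes it a second time and garbles accented characters. The sketch below shows that failure and a guarded conversion. The helper names looks_like_utf8() and to_utf8_once() are hypothetical, and the validity test uses PCRE's /u modifier as a shorter stand-in for the regular expression above, not the method from this comment. Like any such check it is a heuristic, since a few ISO-8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// utf8_encode() is deprecated as of PHP 8.2; hide that notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

// Hypothetical helper: with the /u modifier, preg_match() returns false when
// the subject is not valid UTF-8, so an empty pattern doubles as a validity check.
function looks_like_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Hypothetical helper: encode only when the text is not already UTF-8.
function to_utf8_once(string $text): string
{
    return looks_like_utf8($text) ? $text : utf8_encode($text);
}

$latin1 = "caf\xE9";      // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9";  // "café" in UTF-8

echo bin2hex(utf8_encode($utf8)), "\n";     // 636166c383c2a9 (double-encoded)
echo bin2hex(to_utf8_once($utf8)), "\n";    // 636166c3a9 (left alone)
echo bin2hex(to_utf8_once($latin1)), "\n";  // 636166c3a9 (converted)
```

With a guard like this, the same code path can handle both kinds of feed without a per-feed filter, at the cost of one extra pass over each string.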