Changes
Title
unchanged
How to set up Unicode
Category
unchanged
How-tos
Yii version
unchanged
Tags
unchanged
i18n, unicode
Content
changed
To fix issues with the display of special language characters once and for all,
there's a solution: use Unicode UTF-8 everywhere. Other Unicode encodings exist, like UTF-16, but they are far less used on the web. If everything is set up to use Unicode, you can use almost every language in your application.
> Info: Strictly speaking, *Unicode* is a *character set*.
> It lists and names characters from every main language around the world.
> *UTF-8* is an *encoding*.
> It defines a mapping between Unicode characters and a sequence of bytes.
> UTF-8 has a main advantage over other Unicode encodings: it is backward compatible with ASCII.
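
To make the difference between a character and its encoding concrete, here is a minimal sketch that prints the UTF-8 bytes of three sample characters. It assumes PHP 7.0 or later (for the `"\u{...}"` escape syntax); the sample characters and array keys are chosen only for illustration.

```php
<?php
// Each Unicode character (code point) is stored as 1 to 4 bytes in UTF-8.
$samples = array(
    'A'       => "A",         // U+0041, plain ASCII
    'e-acute' => "\u{00E9}",  // U+00E9
    'euro'    => "\u{20AC}",  // U+20AC
);

foreach ($samples as $name => $char) {
    // bin2hex() shows the raw bytes; strlen() counts bytes, not characters.
    printf("%-8s bytes=%s length=%d\n", $name, bin2hex($char), strlen($char));
}
```

The ASCII letter keeps its single ASCII byte, which is the backward compatibility mentioned above, while the other two characters take two and three bytes.
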
There are several places that all may need some configuration tuning to use Unicode.
## 0. Yii Application ##
By default, Yii applications already assume that the character set is UTF-8. See [CApplication::charset](http://www.yiiframework.com/doc/api/1.1/CApplication#charset-detail). This is used for encoding text in HTML pages, e.g. by [CHtml::encode()](http://www.yiiframework.com/doc/api/CHtml/#encode-detail).
## 1. PHP script files ##
Make sure that you use an editor which is capable of using UTF-8 and save all your files UTF-8 encoded without [BOM](http://en.wikipedia.org/wiki/Byte_order_mark). If you have some older non-Unicode files in your project, open them with your editor and save them again UTF-8 encoded.
On Linux you can also use command line tools like `recode` or `iconv` to convert a whole bunch of files.
For example:
~~~
[bash]
$ cd /var/www/myproject/
$ sudo su
# for i in $(find . -name '*.php'); do encoding=$(file -bi "$i" | sed -e 's/.*[ ]charset=//'); iconv -f "$encoding" -t UTF-8 "$i" > "$i.tmp" && mv "$i.tmp" "$i"; done
# exit
~~~
On Windows you can use an application like [`Notepad++`](http://notepad-plus-plus.org/ "Notepad++"), which has an `Encoding` menu from where you can change the encoding of your files.
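
To check whether a file still carries a BOM after conversion, something like the following sketch could be used. The function names `hasUtf8Bom()` and `stripUtf8Bom()` are hypothetical helpers, not part of PHP or Yii; the only fact relied on is that the UTF-8 BOM is the byte sequence `EF BB BF`.

```php
<?php
// The UTF-8 BOM is the three-byte sequence EF BB BF at the start of a file.
const UTF8_BOM = "\xEF\xBB\xBF";

/** Returns true if the given string starts with a UTF-8 BOM. */
function hasUtf8Bom($contents)
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

/** Returns the string without a leading UTF-8 BOM, if one is present. */
function stripUtf8Bom($contents)
{
    return hasUtf8Bom($contents) ? substr($contents, 3) : $contents;
}
```

A typical use would be to pass the result of `file_get_contents()` for each script to `hasUtf8Bom()` and re-save the files that are flagged.
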
## 2. Database tables ##
You need to set the encoding of your connection to the SQL server to UTF-8. It's also recommended that every table in your database uses the same charset for its content; if that's not the case, the SQL server will convert the text on the fly. So **this step isn't mandatory, but it's highly recommended**.
The configuration for that might differ between database systems.
### MySQL
To find out if a table uses utf8 charset you have to look at the `CREATE`
statement for that table. You can use phpMyAdmin's export feature and look
at the `CREATE` statement.
>Info: Don't confuse the *encoding* of characters in a table with its *collation*. The
latter is used for sorting in queries and can be changed easily with e.g. phpMyAdmin
or even for a single query.
You could also issue this SQL statement:
~~~
[sql]
SHOW CREATE TABLE your_tablename;
~~~
You'll see a `CREATE` statement with the `CHARSET` information at the end. It
should look like this:
~~~
[sql]
CREATE TABLE IF NOT EXISTS `your_tablename` (
.... your field definitions ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
~~~
If your table doesn't use UTF-8 charset yet the easiest way to change this is
to export your table, adapt the `CREATE` statement's `CHARSET` parameter and
re-import your table again into the database.
Be very careful when doing this conversion and make sure you save the file with the changed
SQL statement in UTF-8 and convert it if necessary. If not performed carefully
you can easily end up with messed up encodings, e.g. having ISO-8859-1 encoded
characters in a table with utf8 `CHARSET`.
>Tip: To have MySQL create all of your tables with utf8
>CHARSET by default, you can add this to your MySQL
>configuration (e.g. `my.cnf` file):
>
>~~~
>[mysqld]
>character-set-server = utf8
># for older versions:
>default-character-set = utf8
>~~~
### MySQL indexes
utf8 is efficient if the data is mostly English (which is often true for web apps) because its variable-length encoding uses one byte for each letter of the English alphabet. For accented Latin characters and other alphabets it uses multiple bytes per character. For indexes, however, MySQL uses a fixed-length encoding and reserves 3 bytes for every character regardless. So converting an indexed latin1 table to utf8 will triple the index size, which slows it down. This also explains why the maximum width of indexed columns is smaller with utf8. In MyISAM an indexed latin1 column can be up to VARCHAR(1000), but utf8 is limited to 333. InnoDB can index latin1 up to VARCHAR(767) and utf8 only up to 255.
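
The column limits follow from simple division. The sketch below assumes maximum index key lengths of 1000 bytes for MyISAM and 767 bytes for InnoDB, and 3 bytes reserved per utf8 character; these numbers are assumptions stated here for the arithmetic and may differ for your MySQL version or row format. `intdiv()` requires PHP 7.0 or later.

```php
<?php
// Assumed maximum index key length in bytes per storage engine.
$maxKeyBytes = array('MyISAM' => 1000, 'InnoDB' => 767);
// Bytes reserved per character in an index.
$bytesPerChar = array('latin1' => 1, 'utf8' => 3);

foreach ($maxKeyBytes as $engine => $limit) {
    foreach ($bytesPerChar as $charset => $bytes) {
        // intdiv() rounds down to a whole number of characters.
        printf("%s %s: VARCHAR(%d)\n", $engine, $charset, intdiv($limit, $bytes));
    }
}
```
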
## 3. Database connection ##
When connecting to a database a client like PHP has to use a specific charset encoding.
To specify the charset to use for a connection in Yii, configure it like this:
```php
return array(
......
'components'=>array(
......
'db'=>array(
'connectionString'=>'sqlite:protected/data/source.db',
'charset'=>'utf8',
),
),
......
);
```
The connection encoding can also be set with a SQL command. In MySQL, for example:
~~~
[SQL]
-- Beware, it's utf8, not utf-8!
SET NAMES utf8 ;
~~~
Such a command can be put in the `initSQLs` attribute of the `db` component.
The `charset` attribute introduced above should be sufficient, though.
## 4. Webserver/HTTP-Header ##
We also need to let the browser know that we use UTF-8 for our pages. There are three levels for this, in decreasing order of priority:
* in PHP, with `header('Content-Type: text/html; charset=utf-8');`
* in the webserver (Apache, etc)
* in the HTML with `<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />`
The best place to do this is in the header of an HTTP response.
Configuring this varies between different server software.
>Tip: If you use this approach, there's no need to add additional header information
about encoding to your pages. You just have to overwrite the HTTP header when your page is not in HTML or in UTF-8, like `header('Content-Type: text/plain; charset=iso-8859-1');`.
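
As a rough sketch of the PHP level, the header can be built and sent like this. `contentTypeHeader()` is a hypothetical helper introduced only for this example; `header()` and `headers_sent()` are the standard PHP functions.

```php
<?php
/**
 * Builds a Content-Type header line with an explicit charset.
 * (Hypothetical helper, not part of PHP or Yii.)
 */
function contentTypeHeader($mimeType = 'text/html', $charset = 'utf-8')
{
    return sprintf('Content-Type: %s; charset=%s', $mimeType, $charset);
}

// header() only works before any output has been sent, so guard the call.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

For a page that is not HTML or not UTF-8, the same helper could be called as `contentTypeHeader('text/plain', 'iso-8859-1')`.
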
### Apache
You can configure UTF-8 charset either in a `VirtualHost` section of your server
configuration or by adding this line into a `.htaccess` file in your `DocumentRoot`:
~~~
AddDefaultCharset UTF-8
~~~
## 5. PHP string functions ##
PHP needs to use UTF-8 internally in order for e.g. string length validation to work correctly.
### mbstring
One way to achieve this is to use [mbstring functions](http://de.php.net/manual/en/ref.mbstring.php) instead of their non-multibyte aware counterparts. Since mbstring is a non-default extension, it might not be available on every host. That's one of the reasons why Yii uses the non-multibyte functions like `strlen()` instead of `mb_strlen()` by default.
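
The difference is easy to see on a short string. This sketch assumes PHP 7.0 or later (for the `"\u{...}"` escape) and that the mbstring extension is loaded; the sample word is arbitrary.

```php
<?php
// "naïve" has 5 characters, but the "ï" takes 2 bytes in UTF-8.
$word = "na\u{00EF}ve";

$byteCount = strlen($word);              // counts bytes
$charCount = mb_strlen($word, 'UTF-8');  // counts characters

printf("strlen: %d, mb_strlen: %d\n", $byteCount, $charCount);

// substr() can cut a multibyte character in half; mb_substr() works on characters.
$firstThree = mb_substr($word, 0, 3, 'UTF-8');
```

A length validator based on `strlen()` would therefore reject this word for a maximum length of 5, although it has only five characters.
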
#### Using mbstring with Yii > 1.1.1
Since version 1.1.1 you can use the [encoding](http://www.yiiframework.com/doc/api/CStringValidator#encoding-detail) parameter of CStringValidator. If you set it to `utf-8` it will use the mbstring functions for different string validation operations.
#### Using mbstring with older versions of Yii
A workaround for older releases is to use mbstring's [function overloading feature](http://de.php.net/manual/en/mbstring.overload.php). This will override the non-multibyte aware functions with their mbstring counterparts. Note that this feature was deprecated in PHP 7.2 and removed in PHP 8.0. To set this up, add this to your php.ini:
~~~
mbstring.func_overload = 7
mbstring.internal_encoding = "UTF-8"
~~~
As an alternative, you can also enable it for a single `VirtualHost` in Apache in the corresponding configuration section:
~~~
php_admin_value mbstring.func_overload "7"
php_admin_value mbstring.internal_encoding "UTF-8"
~~~
>Note: Unfortunately it's not recommended to set this in an `.htaccess` file as this may lead to undefined behavior.
When mbstring function overloading is turned on, the built-in PHP function `strlen()` counts Unicode characters, not bytes, and the change can break existing code. Use `mb_strlen($str, 'ISO-8859-1')` to find the byte length of `$str`.
### Links
[Chinese version](http://projects.ourplanet.tk/node/84)

If everything is set up to use Unicode, you can use almost every language in your application.
> Info: Strictly speaking, *Unicode* is a *character set*. It lists and names characters
> from every main language around the world. *UTF-8* is an *encoding*. It defines a mapping
> between Unicode characters and a sequence of bytes. Other Unicode encodings exist,
> like UTF-16, but they are far less used on the web. UTF-8 has a main advantage over other
> Unicode encodings: it is backward compatible with ASCII.
There are several places that all may need some configuration tuning to use Unicode.
## 1. PHP script files ##
Every text file is stored in a specific character set on disk. For your PHP files this must be UTF-8 charset **without [BOM](http://en.wikipedia.org/wiki/Byte_order_mark)**. Make sure to use an editor which is capable of Unicode. If you have some older non-unicode files in your project open them with your editor and save them again UTF-8 encoded.
> Tip: On Windows you can for example use [`Notepad++`](http://notepad-plus-plus.org/ "Notepad++"), which has an `Encoding` menu from where you can change encodings of your files.
On Linux you can also use command line tools like `recode` or `iconv` to convert a whole bunch of files. Here's a script that converts every php file in the directory `myproject/` and its sub-directories:
~~~
[sh]
$ cd myproject/
$ for i in $(find . -name '*.php'); do encoding=$(file -bi "$i" | sed -e 's/.*[ ]charset=//'); iconv -f "$encoding" -t UTF-8 "$i" > "$i.tmp" && mv "$i.tmp" "$i"; done
~~~
## 2. PHP-Code and Yii Application ##
PHP needs to use UTF-8 internally in order for e.g. string length validation to work correctly. Scripts should use [mbstring functions](http://de.php.net/manual/en/ref.mbstring.php) instead of the non-multibyte aware counterparts.
By default, Yii applications already assume that your character set is UTF-8. See [CApplication::charset](http://www.yiiframework.com/doc/api/1.1/CApplication#charset-detail). This is used for encoding text in HTML pages, e.g. by [CHtml::encode()](http://www.yiiframework.com/doc/api/CHtml/#encode-detail).
### Yii > 1.1.1
Yii will try to use mbstring functions if they are available. For the string validator you should set the [`encoding`](http://www.yiiframework.com/doc/api/CStringValidator#encoding-detail) parameter to `utf-8`.
### Older versions of Yii
A workaround for older releases is to use mbstring's [function overloading feature](http://de.php.net/manual/en/mbstring.overload.php). This will override the non-multibyte aware functions with their mbstring counterparts.
To set this up add this in your php.ini:
~~~
mbstring.func_overload = 7
mbstring.internal_encoding = "UTF-8"
~~~
or configure it in a `VirtualHost` section in Apache:
~~~
php_admin_value mbstring.func_overload "7"
php_admin_value mbstring.internal_encoding "UTF-8"
~~~
>Note: Unfortunately it's not recommended to set this in an `.htaccess` file as this may lead to undefined behavior.
## 3. Database ##
Your database needs to know that it should store data in utf-8. The configuration for that might differ between database systems.
### MySQL
The charset can be defined per database and per table. Use the following SQL to find out the charset for an existing database or table:
~~~
[sql]
SHOW CREATE DATABASE mydatabase;
SHOW CREATE TABLE mydatabase.mytable;
~~~
> Info: Don't confuse the encoding of characters in a table with its collation. The latter is used for sorting in queries and can be changed easily with e.g. phpMyAdmin or even for a single query.
If your table doesn't use UTF-8 charset yet the most reliable way to change this is
to export your table, modify the `CREATE` statement's `CHARSET` parameter and
re-import your table again into the database.
Be very careful when doing this conversion. You need to make sure you use the
correct connection charset and save the file in UTF-8. If not performed carefully
you can easily end up with messed up encodings, e.g. having `ISO-8859-1` encoded
characters in a table with `utf8` `CHARSET`.
> Tip: To have MySQL create all of your tables with `utf8` charset and collation
> by default, you can add this to your MySQL configuration (e.g. `my.cnf` file):
>
>~~~
>[mysqld]
>character_set_server = utf8
>collation_server = utf8_general_ci
># for older versions:
>default-character-set = utf8
>~~~
## 4. Database connection ##
When connecting to a database a client like PHP also has to use a specific charset encoding.
To specify the charset to use for a connection in Yii, configure it like this:
```php
return array(
// ...
'components' => array(
// ...
'db' => array(
// ..
'charset' => 'utf8',
),
),
);
```
If you have problems with the `charset` configuration above you can also try to set the charset with a SQL command. You can use the `initSQLs` configuration:
```php
'db'=>array(
'connectionString'=>'sqlite:protected/data/source.db',
'initSQLs'=>array('SET NAMES utf8'),
),
```
## 5. HTTP Content-Type ##
We also need to let the browser know that we use UTF-8 for our pages. There are 2 options for this:
* **HTTP `Content-Type` header**. This is configured in the webserver but can also be set from PHP (see below).
* **`Content-Type` meta tag**. You could add a meta tag to your HTML pages like `<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />`.
The recommended way is to use the HTTP header as it overrides what you have set in the meta tag.
>Tip: If you let the webserver set the header, there's no need to add additional header information about encoding to your pages. In this case you would only have to overwrite the HTTP header if your page were *not* in UTF-8.
### Apache
You can configure the `Content-Type` header either in a `VirtualHost` section of your server
or in a `.htaccess` file in your `DocumentRoot`. Add this line:
~~~
AddDefaultCharset UTF-8
~~~
### Nginx
The right `Content-Type` header is set with this directive:
~~~
server {
charset UTF-8;
...
}
~~~
### PHP alternative
If you don't have access to or don't want to modify your server configuration you can also set the content type from PHP. Again you have different options:
* Set `default_charset` to `UTF-8` in your `php.ini`
* Add the following PHP command to Yii's `index.php`: `header('Content-Type: text/html; charset=utf-8');`.
The drawback of this method is that it sets the header only for PHP files. So if you also serve some static content, it will not have the right `Content-Type` header set.