To fix issues with the display of special language characters once and for all, there is one solution: use Unicode UTF-8 everywhere. Other Unicode encodings exist, such as UTF-16, but they are far less used on the web. If everything is set up to use Unicode, you can use almost any language in your application.
>Info: Strictly speaking, Unicode is a character set. It lists and names the characters of every major language around the world. UTF-8 is an encoding. It defines a mapping between Unicode characters and sequences of bytes. UTF-8 has one main advantage over other Unicode encodings: it is backward compatible with ASCII.
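To make the mapping between characters and bytes concrete, here is a minimal PHP sketch. The helper name `utf8Bytes()` is an illustrative assumption, not something Yii or PHP provides. The strings are written with explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper (not part of Yii): returns the bytes of a string
// as space-separated hexadecimal pairs.
function utf8Bytes($str)
{
    return implode(' ', str_split(bin2hex($str), 2));
}

// "A" (U+0041) is a single byte in UTF-8, identical to its ASCII code.
echo utf8Bytes("A"), "\n";            // 41
// "é" (U+00E9) is encoded as two bytes.
echo utf8Bytes("\xC3\xA9"), "\n";     // c3 a9
// "€" (U+20AC) is encoded as three bytes.
echo utf8Bytes("\xE2\x82\xAC"), "\n"; // e2 82 ac
```

Plain ASCII text is therefore already valid UTF-8, while other characters take two or more bytes each.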
There are several places that may each need some configuration to use Unicode.
## 0. Yii Application ##
By default, Yii applications already assume that the character set is UTF-8. See CApplication::charset. This setting is used when encoding text in HTML pages, e.g. by CHtml::encode().
## 1. PHP script files ##
Make sure that you use an editor that can handle UTF-8, and save all your files UTF-8 encoded without a BOM. If you have older non-Unicode files in your project, open them with your editor and save them again UTF-8 encoded. On Linux you can also use command line tools like `recode` or `iconv` to convert a whole bunch of files.
For example:
~~~
[bash]
$ cd /var/www/myproject/
$ sudo su
# for i in $(find -name '*.php');do encoding=$(file -bi $i | sed -e 's/.*[ ]charset=//'); iconv -f $encoding -t UTF-8 -o $i $i; done
# exit
~~~
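Before converting anything, it can help to check which files actually need attention. The following is a minimal PHP sketch under stated assumptions: the function names `hasUtf8Bom()` and `isValidUtf8()` are hypothetical, and the validity check relies on PCRE's `u` modifier rather than on mbstring, which may not be installed.

```php
<?php
// Hypothetical helpers (names are illustrative, not part of Yii or PHP).

// True if the data starts with the three-byte UTF-8 byte order mark.
function hasUtf8Bom($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// True if the data is a valid UTF-8 byte sequence. The "u" modifier makes
// PCRE validate the subject, so preg_match() fails on invalid input.
function isValidUtf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}
```

Both functions take raw file contents, for example the result of `file_get_contents()` on each PHP file of the project. A file that is not valid UTF-8 needs conversion; a file with a BOM needs to be saved again without it.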
## 2. Database tables ##
The connection to the SQL server must use UTF-8 (see section 3, "Database connection"). It is also recommended that every table in your database use the same charset for its content. If that is not the case, the SQL server converts the text on the fly, so **this step isn't mandatory, but it's highly recommended**.
The configuration for that might differ between database systems.
### MySQL
To find out if a table uses utf8 charset you have to look at the `CREATE`
statement for that table. You can use phpMyAdmin's export feature and look
at the `CREATE` statement.
>Info: Don't confuse the *encoding* of characters in a table with its *collation*. The
latter is used for sorting in queries and can be changed easily with e.g. phpMyAdmin
or even for a single query.
You could also issue this SQL statement:
~~~
[sql]
SHOW CREATE TABLE your_tablename;
~~~
You'll see a `CREATE` statement with the `CHARSET` information at the end. It should look like this:
~~~
[sql]
CREATE TABLE IF NOT EXISTS `your_tablename` (
.... your field definitions ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
~~~
If your table doesn't use the UTF-8 charset yet, the easiest way to change this is to export your table, adapt the `CREATE` statement's `CHARSET` parameter and re-import your table into the database.
Be very careful when doing this conversion: make sure you save the file with the changed SQL statement in UTF-8, and convert it if necessary. If not performed carefully, you can easily end up with messed-up encodings, e.g. ISO-8859-1 encoded characters in a table with `utf8` `CHARSET`.
>Tip: To have MySQL create all of your tables with `utf8`
>`CHARSET` by default, you can add this to your MySQL
>configuration (e.g. the `my.cnf` file):
>
>~~~
>[mysqld]
>character-set-server = utf8
># for older versions:
>default-character-set = utf8
>~~~
#### MySQL indexes
utf8 is efficient if the data is mostly English (which is often true for web apps) because its variable-length encoding uses one byte for each character of the English alphabet. For accented Latin characters and other alphabets it uses multiple bytes per character. For indexes, however, MySQL uses a fixed-length encoding and reserves 3 bytes for every character regardless. So converting an indexed latin1 table to utf8 can triple the index size, which slows it down. This also explains why the maximum width of indexed columns is smaller with utf8. In MyISAM an indexed latin1 column can be up to VARCHAR(1000), but utf8 is limited to 333. InnoDB can index latin1 up to VARCHAR(767) and utf8 up to only 255.
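The column limits follow from simple division, which the short PHP sketch below works through. The helper name `maxIndexedChars()` is hypothetical. The key limits of 1000 bytes (MyISAM) and 767 bytes (InnoDB) are assumptions based on common MySQL defaults; they depend on the server version and row format, so check your own installation.

```php
<?php
// Hypothetical helper: how many characters fit in an index key of
// $maxKeyBytes bytes when every character reserves $bytesPerChar bytes.
function maxIndexedChars($maxKeyBytes, $bytesPerChar)
{
    return (int) floor($maxKeyBytes / $bytesPerChar);
}

// Assumed key limits: 1000 bytes for MyISAM, 767 bytes for InnoDB.
echo maxIndexedChars(1000, 1), "\n"; // 1000 (MyISAM, latin1)
echo maxIndexedChars(1000, 3), "\n"; // 333  (MyISAM, utf8)
echo maxIndexedChars(767, 1), "\n";  // 767  (InnoDB, latin1)
echo maxIndexedChars(767, 3), "\n";  // 255  (InnoDB, utf8)
```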
## 3. Database connection ##
When connecting to a database a client like PHP has to use a specific charset encoding. To specify the charset to use for a connection in Yii, configure it like this:
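~~~
[php]
return array(
    ......
    'components'=>array(
        ......
        'db'=>array(
            // Placeholder DSN; Yii applies 'charset' only to MySQL and PostgreSQL connections.
            'connectionString'=>'mysql:host=localhost;dbname=testdb',
            'charset'=>'utf8',
        ),
    ),
    ......
);
~~~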
The connection encoding can also be set with an SQL command. In MySQL:
~~~
[SQL]
-- Beware, it's utf8, not utf-8!
SET NAMES utf8 ;
~~~
Such a command can be put in the `initSQLs` attribute of the `db` component. The `charset` attribute introduced above should be sufficient, though.
## 4. Webserver/HTTP-Header ##
We also need to let the browser know that our pages use UTF-8. There are three places to declare this, in decreasing order of priority:
- in PHP, with `header('Content-Type: text/html; charset=utf-8');`
- in the webserver (Apache, etc.)
- in the HTML, with `<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />`
The best place to do this is in the header of an HTTP response. Configuring this varies between different server software.
>Tip: If you use this approach, there's no need to add additional header information about encoding to your pages. You just have to override the HTTP header when your page is not in HTML or not in UTF-8, like `header('Content-Type: text/plain; charset=iso-8859-1');`.
### Apache
You can configure the UTF-8 charset either in a `VirtualHost` section of your server configuration or by adding this line to a `.htaccess` file in your `DocumentRoot`:
~~~
AddDefaultCharset UTF-8
~~~
## 5. PHP string functions ##
PHP needs to use UTF-8 internally in order for e.g. string length validation to work correctly.
### mbstring
One way to achieve this is to use the mbstring functions instead of their non-multibyte-aware counterparts. Since mbstring is a non-default extension, it might not be available on every host. That's one of the reasons why Yii uses the non-multibyte functions like strlen() instead of mb_strlen() by default.
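The difference between counting bytes and counting characters can be seen in a minimal PHP sketch. The helper name `utf8Length()` is hypothetical, and the PCRE fallback is just one possible approach for hosts without mbstring, not something Yii does itself.

```php
<?php
// Hypothetical helper: counts the characters of a UTF-8 string, using
// mbstring when it is available and PCRE otherwise.
function utf8Length($str)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($str, 'UTF-8');
    }
    // Fallback: with the "u" modifier, "." matches one whole character.
    return preg_match_all('/./us', $str, $matches);
}

// "héllo" written with explicit UTF-8 bytes: 5 characters, 6 bytes.
$str = "h\xC3\xA9llo";

echo strlen($str), "\n";     // 6 (bytes, as long as function overloading is off)
echo utf8Length($str), "\n"; // 5 (characters)
```

A length validation based on strlen() would therefore reject some strings that are within the limit when counted in characters.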
#### Using mbstring with Yii >= 1.1.1
Since version 1.1.1 you can use the `encoding` parameter of CStringValidator. If you set it to `utf-8`, the validator uses the mbstring functions for its string validation operations.
#### Using mbstring with older versions of Yii
A workaround for older releases is to use mbstring's function overloading feature. This overrides the non-multibyte-aware functions with their mbstring counterparts. Note that PHP deprecated this feature in 7.2 and removed it in 8.0. To set it up, add this to your php.ini:
~~~
mbstring.func_overload = 7
mbstring.internal_encoding = UTF-8
~~~
As an alternative, you can also enable it for a single `VirtualHost` in Apache, in the corresponding configuration section:
~~~
php_admin_value mbstring.func_overload "7"
php_admin_value mbstring.internal_encoding "UTF-8"
~~~
>Note: Unfortunately, it's not recommended to set this in an `.htaccess` file, as this may lead to undefined behavior.
When mbstring function overloading is turned on, the built-in PHP function `strlen()` counts Unicode characters, not bytes, and the change can break existing code. Use `mb_strlen($str, 'ISO-8859-1')` to find the byte length of `$str`.
Shell script
Here's the shell script I use in cygwin to remove the ... BOM:
~~~
[bash]
#!/bin/bash
for i in $(grep -rli $'\xEF\xBB\xBF' --include=*.php /cygdrive/c/PHP-projects/toto); do
    echo Processing $i
    cp $i $i.bak
    cat $i | perl -pe 's/\xEF\xBB\xBF//i' > $i.new
    mv $i.new $i
done
~~~
Individual fields of the table
I was struggling to get utf8 to work. My problem was that even though DEFAULT CHARSET=utf8 was set on all of the tables, individual fields had a latin COLLATION and who knows what CHARACTER SET...
I had to do something like this with all of the individual fields in the tables:
~~~
[sql]
ALTER TABLE tbl_example DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
~~~
I hope this helps someone.
php.ini
If your php.ini has a default_charset set, then everything else might just get ignored (as in my case).
Just put this at the beginning of the entry script (index.php):
~~~
[php]
ini_set('default_charset','utf-8');
~~~
SETTING THE CHARSET - CORRECT
Just set the correct charset of your app at the top level of the array returned by 'rootApp/protected/config/main.php', like this:
~~~
[php]
'charset'=>'iso-8859-1',
~~~
Then change the meta tag in the header section of your main layout to use your app's charset:
~~~
[php]
<meta http-equiv="Content-Type" content="text/html; charset=<?= Yii::app()->charset ?>" />
~~~
And your problems are solved, without any need for conversions.
Bye!
Remove UTF-8 BOM from output
You can remove the UTF-8 BOM from the output using the ob_start function. This way you can leave the UTF-8 BOM in your source files so your editor understands it is really UTF-8.
In /protected/config/main.php you have to add this before returning the config array:
~~~
[php]
ob_start('My_OB');
function My_OB($str, $flags)
{
    //remove UTF-8 BOM
    $str = preg_replace("/\xef\xbb\xbf/","",$str);
    return $str;
}
return array( ... yii config array ...);
~~~
P.S. You don't have to call ob_end_flush(); PHP will do this automatically at the end of the script.
set names
In my installation I even have to set initSQLs so that the database and the application agree on the data:
~~~
[php]
return array(
    ......
    'components'=>array(
        ......
        'db'=>array(
            'connectionString'=>'sqlite:protected/data/source.db',
            'charset'=>'utf8',
            'initSQLs'=>array('set names utf8'),
        ),
    ),
    ......
);
~~~
Really helpful
It's important to remember to set the encoding parameter when using CStringValidator with non-Latin characters.
Great article
Very useful stuff. Some of this knowledge I learned the hard way... over the years.
Cleanup
To the other editors: I've cleaned up and reorganized the article. I think some content was not really part of the HOWTO (e.g. the section about DB indexes). If you still think that's useful information, please add it as a comment here.
Unicode routing
How do I set up Unicode routing? Is there an option in the hosting, or is it in the urlManager in main.php in the config dir?