UTF-8 Validation and PHP: Do You Validate?

Why do none of the major PHP frameworks have validators for UTF-8 encoding? You may be asking why you would need to validate incoming text as UTF-8 at all. UTF-8 is the preferred character encoding for the web if you want to display languages other than the Latin-derived ones. All of the browsers support it, and so do all the major databases. I won’t get into the basics of what UTF-8 is and why you should use it. There are plenty of other resources for that. I’m also going to assume you are sending the correct charset header (in HTTP, not relying on a meta tag), and that your DB tables and connections are declared to use UTF-8. That stuff is well covered elsewhere too.

I still haven’t really told you why you would want to validate incoming text as UTF-8. I’ve only told you why you should use it. The simple answer is the old security mantras: “all input is evil” and “Filter Input, Escape Output”. If this text is input, I want to be validating it, no? The W3C recommends that you validate UTF-8 text. WACT does too. It’s been proven that you can launch an XSS attack on a site using “incorrectly” encoded text. Chris’ example used GBK encoding, but I think you can do the same thing with UTF-8. Or is UTF-8 somehow immune?

I’ve been reading through a lot of example code looking for answers. The “major” PHP frameworks I looked at were Zend Framework, CakePHP, Symfony, Solar, CodeIgniter, and Kohana. None of them include validators for UTF-8 or for any other text encoding. I looked at some smaller frameworks too, but they were lucky to have validators at all.

I also looked at a few of the big mainstream PHP projects: Joomla, MediaWiki, phpBB, and WordPress. Of these, Joomla contains the PHP UTF-8 library by Harry Fuecks. The purpose of this library is to provide “native” PHP multibyte string functions when the mbstring extension isn’t loaded on your server (and presumably you can’t do anything about it because you are on shared hosting). Tucked away at the back of this library are three functions. The first actually validates UTF-8 and returns true or false. The second converts UTF-8 to its Unicode code points, returned as an array. And the last converts that array back to UTF-8. The last two are used by the library’s utf8_strtolower() and utf8_strtoupper() functions, but are otherwise unused by Joomla. The first function, called utf8_is_valid(), is the one I am most interested in, and it is not used at all. Interestingly, Kohana includes this same library, but they rearranged the files and function names and stripped out the utf8_is_valid() function altogether.

MediaWiki and phpBB both use a set of functions to normalize UTF-8 strings. This goes beyond just validating the byte stream. Normalizing UTF-8 is also recommended so that it sorts properly and consistently, and that is probably why these two packages do it. Both, especially MediaWiki, count on being able to search strings well. But normalization also seems to be compute intensive. For what it’s worth, it appears phpBB borrowed MediaWiki’s code and refactored it.

The last of the four, WordPress, contains a function that does a basic check of whether a stream is UTF-8. In practice, though, it is more UTF-8 detection than validation. The function, called seems_utf8(), allows 5 and 6 byte sequences, which were apparently part of the early UTF-8 versions, before the code point range was limited to U+10FFFF, making anything over 4 bytes unnecessary. It also does not check for the disallowed UTF-16 surrogate code points, or for byte order marks. Those last two points are important because Windows, Java, and Oracle store text internally in UTF-16, so a botched conversion from one of these sources into the browser could send invalid text to your PHP application. I don’t know the ins and outs of copying and pasting from one of these sources into a browser. I assume a conversion happens, but I don’t know where.
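To make the gaps above concrete, here is a minimal sketch of a stricter check, written against the byte ranges in RFC 3629. The function name strict_utf8_is_valid() is my own and hypothetical. It is not code from WordPress, PHP UTF-8, or any other library mentioned here. It rejects 5 and 6 byte forms, overlong encodings, surrogates, and anything above U+10FFFF. It does not treat a byte order mark specially, since EF BB BF is itself well-formed UTF-8. Whether to strip one is a separate decision.

```php
<?php
/**
 * Hypothetical helper (not from any library named in this post).
 * Returns true only if $s is well-formed UTF-8 per RFC 3629.
 */
function strict_utf8_is_valid(string $s): bool
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b <= 0x7F) {                       // single-byte ASCII
            $i++;
            continue;
        }
        // Choose how many continuation bytes follow, and the range
        // allowed for the FIRST continuation byte ($lo..$hi).
        if ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1; $lo = 0x80; $hi = 0xBF;
        } elseif ($b === 0xE0) {
            $need = 2; $lo = 0xA0; $hi = 0xBF;  // excludes overlong 3-byte forms
        } elseif ($b === 0xED) {
            $need = 2; $lo = 0x80; $hi = 0x9F;  // excludes surrogates U+D800..U+DFFF
        } elseif ($b >= 0xE1 && $b <= 0xEF) {
            $need = 2; $lo = 0x80; $hi = 0xBF;
        } elseif ($b === 0xF0) {
            $need = 3; $lo = 0x90; $hi = 0xBF;  // excludes overlong 4-byte forms
        } elseif ($b >= 0xF1 && $b <= 0xF3) {
            $need = 3; $lo = 0x80; $hi = 0xBF;
        } elseif ($b === 0xF4) {
            $need = 3; $lo = 0x80; $hi = 0x8F;  // nothing above U+10FFFF
        } else {
            // Stray continuation byte, C0/C1, or F5..FF
            // (which covers the old 5 and 6 byte lead bytes).
            return false;
        }
        if ($i + $need >= $len) {
            return false;                       // sequence is cut off
        }
        for ($k = 1; $k <= $need; $k++) {
            $c = ord($s[$i + $k]);
            $min = ($k === 1) ? $lo : 0x80;
            $max = ($k === 1) ? $hi : 0xBF;
            if ($c < $min || $c > $max) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}

var_dump(strict_utf8_is_valid("caf\xC3\xA9"));              // true: "café"
var_dump(strict_utf8_is_valid("\xFC\x84\x80\x80\x80\x80")); // false: 6 byte form
var_dump(strict_utf8_is_valid("\xED\xA0\x80"));             // false: surrogate U+D800
var_dump(strict_utf8_is_valid("\xC0\xAF"));                 // false: overlong "/"
```

A plain byte loop is used here rather than one big regular expression, on the assumption that it behaves more predictably on long input.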

Getting back to the utf8_is_valid() function in PHP UTF-8, this is essentially what I have in mind for my CMS. The code in that function comes from another small library written by Henri Sivonen of validator.nu fame. He provides a function to convert a UTF-8 string to Unicode code points, returned in an array, and another one to convert the array back to a UTF-8 string. Sound familiar? This is where the PHP UTF-8 library got its code that does the same thing. I stumbled onto this library through the WACT site, and ended up coding essentially the same thing Harry Fuecks did. I also made a “sanitizer” version that “deconstructs” the byte stream and throws out all the bad byte sequences without the intermediate step of the array. It just concatenates the good byte sequences into a new string.
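The sanitizer idea can be sketched in a few lines. This is not my actual code, nor Harry Fuecks’ or Henri Sivonen’s. It swaps the byte-by-byte deconstruction for a regular expression, and the function name utf8_strip_invalid() is hypothetical. The pattern lists one alternative per well-formed sequence shape from RFC 3629 and matches one character at a time, so anything that does not fit is simply never collected.

```php
<?php
/**
 * Hypothetical sanitizer sketch: keeps every well-formed UTF-8
 * sequence in $s and silently drops all other bytes.
 */
function utf8_strip_invalid(string $s): string
{
    // No /u modifier on purpose: the pattern must see raw bytes.
    $pattern = '/[\x00-\x7F]'                   // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'             // 2 bytes, no overlongs
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'         // 3 bytes, no overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'  // ordinary 3 bytes
        . '|\xED[\x80-\x9F][\x80-\xBF]'         // 3 bytes, no surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'      // 4 bytes, no overlongs
        . '|[\xF1-\xF3][\x80-\xBF]{3}'          // ordinary 4 bytes
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';    // 4 bytes, up to U+10FFFF

    $count = preg_match_all($pattern, $s, $matches);
    if (!$count) {
        return '';                              // no valid characters, or a PCRE error
    }
    // Concatenate the good sequences into a new string.
    return implode('', $matches[0]);
}

// The overlong pair C0 AF is dropped; the surrounding ASCII survives.
var_dump(utf8_strip_invalid("a\xC0\xAFb") === "ab");
```

One design caveat: silently dropping bytes changes the input, and the characters on either side of a removed sequence end up adjacent. Depending on the application, rejecting the whole string may be the safer choice.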

I haven’t gotten into the functions PHP natively provides for checking and converting character encodings, but this post is long enough, so that will have to wait for another day. So, for anyone in the PHP community lucky or unlucky enough to read this post: should we be validating strings to make sure they are in the encoding we think they are in?