The Slow Lane

A blog about autocrossing, some geeky stuff & Philadelphia.

Browsing Posts Made in: August 2009

Performance of static vs. instantiated method calls

As you saw in my last post, I am working on filtering user input into my PHP application. I don’t want to get too much into the boring details; I started to write a post explaining them all and could see it getting long, drawn out, and unfocused. But I was experimenting with calling the filter functions as static methods of a class. Then I thought about creating objects of the class and calling the method on the instance. I wanted to know if there was a performance difference between the two, so I created a test. And there is.

For the test I am using my Filter_UTF8 class, discussed a little in my last blog post. I am calling the validate method. This is not a “hello world” type of test. The method does some heavy lifting and/or calculating. For all the tests I would call the method 10,000 times, to validate a ~1,200 kB text file. The same file would be validated over and over again.

The first test was to use call_user_func_array to call the method. This took 10 to 11 seconds to run.

$iteration = 10000;
$i = 0;
while ($i < $iteration) {
    $ret = call_user_func_array(array('Filter_UTF8', 'validate'), 
        array($text, 4096));
    $i++;
}

Next was creating an object, calling the method, and then destroying the object. I did it this way because it simulates one of the common design patterns for filters: a collection object holds a bunch of filter instances for each value of the form. When “validation” is run, each one is called to do its thing, then the form is processed, and they are all destroyed once the form data is saved or the form is re-rendered. So each instance is created, run once (or twice if you have a getMessage() type function), and then destroyed. I feel this test is close to how that design pattern would work on a large scale.

$iteration = 10000;
$i = 0;
while ($i < $iteration) {
    $my = new Filter_UTF8();
    $ret = $my->validate($text, 4096);
    $i++;
    $my = null;
}

The results were surprising. It took 33 seconds for this to run. Eeek!

I thought maybe the act of creating and destroying all those objects was causing the slowdown. So I wrote a third test that creates one instance and calls the validate() method 10,000 times.

$iteration = 10000;
$i = 0;
$my = new Filter_UTF8();
while ($i < $iteration) {
    $ret = $my->validate($text, 4096);
    $i++;
}

I was really surprised when this took the same 33 seconds as creating 10,000 instances did. The crappy thing is, creating a bunch of instances is easier than trying to manage calling them statically, unless you want to type out each filter call in a bunch of if/else statements (I’m trying to do an automated form type of thing). I just can’t believe the performance difference. You wouldn’t notice it on a single page hit, even one that made 100 of these calls. But if 100 people were all hitting your script at once, it ends up being a big difference.
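If you want to repeat this kind of comparison yourself, here is a minimal sketch of a timing harness. It is not the code I ran above: Sketch_Filter is a hypothetical stand-in for Filter_UTF8 (which does far more work per call), sketch_time_calls() is a made-up helper, and the closures add a little overhead of their own to every variant. It also times a direct static call, which the three tests above did not measure.

```php
<?php
// Hypothetical stand-in for Filter_UTF8; the real class is not shown in this post.
class Sketch_Filter
{
    public static function validateStatic($text, $limit)
    {
        return strlen($text) <= $limit;
    }

    public function validate($text, $limit)
    {
        return strlen($text) <= $limit;
    }
}

// Hypothetical helper: run $fn $iterations times, return elapsed wall-clock seconds.
function sketch_time_calls($fn, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$text = str_repeat('a', 1024);   // placeholder input, not the ~1,200 kB file
$iterations = 10000;
$shared = new Sketch_Filter();

$timings = array(
    // Static method reached through call_user_func_array, as in the first test.
    'call_user_func_array' => sketch_time_calls(function () use ($text) {
        return call_user_func_array(array('Sketch_Filter', 'validateStatic'), array($text, 4096));
    }, $iterations),
    // Static method called directly.
    'direct static call' => sketch_time_calls(function () use ($text) {
        return Sketch_Filter::validateStatic($text, 4096);
    }, $iterations),
    // A fresh object for every call, as in the second test.
    'new instance per call' => sketch_time_calls(function () use ($text) {
        $my = new Sketch_Filter();
        return $my->validate($text, 4096);
    }, $iterations),
    // One object reused for every call, as in the third test.
    'one shared instance' => sketch_time_calls(function () use ($text, $shared) {
        return $shared->validate($text, 4096);
    }, $iterations),
);

foreach ($timings as $label => $seconds) {
    printf("%-22s %.4f s\n", $label, $seconds);
}
```

Because the stand-in method does almost nothing, the numbers from this sketch mostly reflect call overhead. Swap in your own class and input before drawing any conclusions.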

UTF-8 Validation and PHP, do you?

I wonder why none of the major PHP frameworks have validators for UTF-8 encoding? You may be asking why do I need to validate incoming text as UTF-8? UTF-8 is the preferred character encoding for the web, if you want to display languages other than the Latin derived ones. All of the browsers support it. And so do all the major databases. I won’t get into the basics about what UTF-8 is and why you should use it. There are plenty of other resources for that. I’m also going to assume you are sending the correct charset header (in HTTP, not relying on a meta tag). And that your DB tables and connections are declared to use UTF-8. That stuff is well covered elsewhere also.

I still haven’t really told you why you would want to validate incoming text as UTF-8, I’ve only told you why you should use it. And the simple answer is the old security mantras of all input is evil and Filter Input and Escape Output. If this text is input I want to be validating it, no? The W3C recommends that you validate UTF-8 text. WACT does too. It’s been proven that you can launch an XSS attack on a site using “incorrectly” encoded text. Chris’ example used GBK encoding, but I think you can do the same thing with UTF-8. Is it immune?

I’ve been looking at a lot of example code looking for answers. The “major” PHP frameworks I looked at were Zend Framework, CakePHP, Symfony, Solar, CodeIgniter, and Kohana. None of them include any validators for UTF-8, or any other text encoding. I looked at some other smaller frameworks, but they were lucky to have validators at all.

I also looked at a few of the big mainstream PHP projects: Joomla, MediaWiki, phpBB, and WordPress. Of them, Joomla contains the library PHP UTF-8 by Harry Fuecks. The purpose of this library is to provide “native” PHP multibyte string functions when mbstring isn’t loaded on your server (and presumably you can’t do anything about it because you are on shared hosting). In the back of this library are three functions. The first actually validates UTF-8 and returns true or false. The second converts UTF-8 to its Unicode code points, returned as an array. And the last converts that array back to UTF-8. The last two are used in the library’s utf8_strtolower() and utf8_strtoupper() functions, but are otherwise unused by Joomla. The first function, called utf8_is_valid(), is the one I am most interested in, and it is not used at all. Interestingly, Kohana includes this same library, but they rearranged the files and function names, and stripped out utf8_is_valid().

MediaWiki and phpBB both use a set of functions to normalize UTF-8 data strings. This goes beyond just validating the byte stream. It is also recommended to normalize UTF-8 so it sorts properly and consistently, and that is probably why these two packages do it. Both, especially MediaWiki, count on being able to search strings well. But it also seems to be compute intensive. For what it’s worth, it appears phpBB borrowed MediaWiki’s code and refactored it.

The last of our quartet, WordPress, contains a function to do a basic check of whether a stream is UTF-8. In practice, however, it seems to be more UTF-8 detection than validation. The function, called seems_utf8(), allows 5 and 6 byte sequences, which were apparently in the early UTF-8 versions, before the Unicode consortium decided to limit the code point range to U+10FFFF, making anything over 4 bytes unnecessary. It also does not check for the disallowed UTF-16 surrogate code points, or byte order marks. Those last two points are important because Windows, Java, and Oracle store text internally in UTF-16. So a botched conversion from one of these sources into the browser could send invalid text to your PHP application. I don’t know the ins and outs of copying and pasting from one of these sources to a browser. I assume a conversion happens but don’t know where.
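To make the difference between detection and validation concrete, here is a rough sketch of a stricter byte-level check. It is not WordPress’s seems_utf8() and not the utf8_is_valid() from PHP UTF-8; sketch_is_valid_utf8() is a name I made up, and the byte ranges are the well-formed sequence ranges from the Unicode standard as I understand them. It rejects 5 and 6 byte sequences, overlong forms, encoded surrogates, and anything above U+10FFFF. It does not treat a byte order mark specially, since EF BB BF is itself a well-formed sequence.

```php
<?php
// Hypothetical name; a sketch, not the code from PHP UTF-8 or WordPress.
function sketch_is_valid_utf8($bytes)
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {                 // plain ASCII, one byte
            $i++;
            continue;
        }
        // $need = number of continuation bytes after the lead byte.
        // $min/$max bound the FIRST continuation byte only.
        if ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xE0) {
            $need = 2; $min = 0xA0; $max = 0xBF;   // excludes overlong 3-byte forms
        } elseif ($b === 0xED) {
            $need = 2; $min = 0x80; $max = 0x9F;   // excludes surrogates U+D800..U+DFFF
        } elseif ($b >= 0xE1 && $b <= 0xEF) {
            $need = 2; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xF0) {
            $need = 3; $min = 0x90; $max = 0xBF;   // excludes overlong 4-byte forms
        } elseif ($b >= 0xF1 && $b <= 0xF3) {
            $need = 3; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xF4) {
            $need = 3; $min = 0x80; $max = 0x8F;   // nothing above U+10FFFF
        } else {
            // Stray continuation byte, overlong lead (C0, C1),
            // or a lead byte of F5 and up (includes 5 and 6 byte forms).
            return false;
        }
        if ($i + $need >= $len) {
            return false;                // sequence is cut off at end of string
        }
        $c = ord($bytes[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        for ($k = 2; $k <= $need; $k++) {
            $c = ord($bytes[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}
```

For example, sketch_is_valid_utf8("caf\xC3\xA9") accepts the two-byte é, while sketch_is_valid_utf8("\xED\xA0\x80") rejects an encoded surrogate.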

Getting back to the utf8_is_valid() function in PHP UTF-8, this is essentially what I have in mind for my CMS. The code in that function comes from another small library written by Henri Sivonen of validator.nu fame. He provides a function to convert a UTF-8 string to Unicode code points, returned in an array, and another one to convert the array back to a UTF-8 string. Sound familiar? This is where the PHP UTF-8 library got its code that does the same thing. I stumbled onto this library through the WACT site, and ended up coding essentially the same thing Harry Fuecks did. I also made a “sanitizer” version that “deconstructs” the byte stream and throws out all the bad byte sequences without the intermediate step of the array. It just concatenates the good byte sequences into a new string.
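To give a rough idea of what I mean by a sanitizer, here is a minimal sketch. It is an illustration only, not my actual code and not anything from PHP UTF-8 or Henri Sivonen’s library; both function names are hypothetical. It walks the byte stream, copies each well-formed sequence straight to the output string, and drops any byte that does not start one.

```php
<?php
// Hypothetical helper: byte length of the well-formed UTF-8 sequence that
// starts at offset $i, or 0 if the bytes there are not well formed.
function sketch_utf8_sequence_length($bytes, $i, $len)
{
    $b = ord($bytes[$i]);
    if ($b < 0x80) {
        return 1;                        // plain ASCII
    }
    // $need = continuation bytes; $min/$max bound the first one.
    if ($b >= 0xC2 && $b <= 0xDF) {
        $need = 1; $min = 0x80; $max = 0xBF;
    } elseif ($b === 0xE0) {
        $need = 2; $min = 0xA0; $max = 0xBF;   // no overlong 3-byte forms
    } elseif ($b === 0xED) {
        $need = 2; $min = 0x80; $max = 0x9F;   // no surrogates
    } elseif ($b >= 0xE1 && $b <= 0xEF) {
        $need = 2; $min = 0x80; $max = 0xBF;
    } elseif ($b === 0xF0) {
        $need = 3; $min = 0x90; $max = 0xBF;   // no overlong 4-byte forms
    } elseif ($b >= 0xF1 && $b <= 0xF3) {
        $need = 3; $min = 0x80; $max = 0xBF;
    } elseif ($b === 0xF4) {
        $need = 3; $min = 0x80; $max = 0x8F;   // nothing above U+10FFFF
    } else {
        return 0;                        // not a legal lead byte
    }
    if ($i + $need >= $len) {
        return 0;                        // cut off at end of string
    }
    $c = ord($bytes[$i + 1]);
    if ($c < $min || $c > $max) {
        return 0;
    }
    for ($k = 2; $k <= $need; $k++) {
        $c = ord($bytes[$i + $k]);
        if ($c < 0x80 || $c > 0xBF) {
            return 0;
        }
    }
    return $need + 1;
}

// Hypothetical sanitizer: keep good sequences, skip bad bytes one at a time.
function sketch_utf8_sanitize($bytes)
{
    $len = strlen($bytes);
    $out = '';
    $i = 0;
    while ($i < $len) {
        $seqLen = sketch_utf8_sequence_length($bytes, $i, $len);
        if ($seqLen > 0) {
            $out .= substr($bytes, $i, $seqLen);
            $i += $seqLen;
        } else {
            $i++;                        // drop one bad byte and resynchronize
        }
    }
    return $out;
}
```

Dropping one byte at a time is a design choice for the sketch: it keeps any valid character that happens to follow a broken lead byte. Whether silently dropping bytes is safer than rejecting the whole input is a separate question.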

I haven’t gotten into the functions PHP natively provides for checking and converting character encodings. But this post is long enough so that will have to wait until another day. So for anyone in the PHP community lucky or unlucky enough to read this post, should we be validating strings to make sure they are in the encoding we think they are in?