Posts in Category: PHP

Performance of static vs. instantiated method calls

As you saw in my last post, I am working on filtering user input in my PHP application. I don't want to get too much into the boring details; I started writing a post explaining all of them and could see it getting very long, drawn out, and unfocused. The short version is that I was experimenting with calling the filter functions as static methods of a class, then thought about instantiating the class and calling the method on the object instead. I wanted to know whether there was a performance difference between the two, so I created a test. And there is.

For the test I am using my Filter_UTF8 class, discussed a little in my last blog post, and calling its validate() method. This is not a “hello world” type of test; the method does some heavy lifting and calculating. For each test I called the method 10,000 times, validating the same ~1,200 kB text file over and over again.

The first test was to use call_user_func_array() to call the method statically. This took 10 to 11 seconds to run.

$iteration = 10000;
$i = 0;
while ($i < $iteration) {
    $ret = call_user_func_array(array('Filter_UTF8', 'validate'), 
        array($text, 4096));
    $i++;
}

Next was creating an object, calling the method, and then destroying the object. I did it this way because it simulates one of the common design patterns for filters: a collection object holds a filter instance for each value of the form. When validation runs, each one is called to do its thing, then the form is processed and they are all destroyed once the form data is saved or the form is re-rendered. So each instance is created, run once (or twice, if you have a getMessage()-type function), and then destroyed. I feel this test is close to how that design pattern would work on a large scale.

$iteration = 10000;
$i = 0;
while ($i < $iteration) {
    $my = new Filter_UTF8();
    $ret = $my->validate($text, 4096);
    $i++;
    $my = null;
}

The results were surprising. It took 33 seconds for this to run. Eeek!

I thought maybe the act of creating and destroying all those objects was causing the slowdown. So I created a third test that created one instance and calls the validate() method 10,000 times.

$iteration = 10000;
$i = 0;
$my = new Filter_UTF8();
while ($i < $iteration) {
    $ret = $my->validate($text, 4096);
    $i++;
}

I was really surprised when this took the same 33 seconds as creating 10,000 instances did. The crappy thing is, creating a bunch of instances is easier than managing the static calls, unless you want to type out each filter call in a bunch of if/else statements (I’m trying to do an automated form type of thing). I just can’t believe the performance difference. You wouldn’t notice it on a single page hit with 100 of these calls, but if 100 people hit your script at once, that ends up being a big difference.
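For anyone curious how I timed these, a harness along these lines is all it takes. This is just a sketch: the placeholder workload in the closure stands in for a call to Filter_UTF8::validate(), which isn’t shown here.

```php
<?php
// Minimal timing harness (sketch). The callable passed in stands in
// for Filter_UTF8::validate(); microtime(true) returns a float Unix
// timestamp with microsecond precision.
function benchmark($fn, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start; // elapsed seconds
}

// Placeholder workload standing in for validating the ~1,200 kB file.
$elapsed = benchmark(function () {
    md5(str_repeat('x', 1024));
}, 1000);

printf("1000 iterations took %.4f seconds\n", $elapsed);
```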

UTF-8 Validation and PHP, do you?

I wonder why none of the major PHP frameworks have validators for UTF-8 encoding. You may be asking why I need to validate incoming text as UTF-8 at all. UTF-8 is the preferred character encoding for the web if you want to display languages other than the Latin-derived ones. All of the browsers support it, and so do all the major databases. I won’t get into the basics of what UTF-8 is and why you should use it; there are plenty of other resources for that. I’m also going to assume you are sending the correct charset header (in HTTP, not relying on a meta tag), and that your DB tables and connections are declared to use UTF-8. That stuff is well covered elsewhere too.

I still haven’t really told you why you would want to validate incoming text as UTF-8; I’ve only told you why you should use the encoding. The simple answer is the old security mantras: all input is evil, and Filter Input, Escape Output. If this text is input, I should be validating it, no? The W3C recommends that you validate UTF-8 text. WACT does too. It’s been proven that you can launch an XSS attack on a site using “incorrectly” encoded text. Chris’ example used GBK encoding, but I think you can do the same thing with UTF-8. Is it immune?
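As a concrete starting point, the W3C publishes a regular expression that matches only well-formed UTF-8: no overlong sequences, no UTF-16 surrogate code points, nothing above U+10FFFF. A validator sketched around that pattern looks like this (this is the published W3C pattern, not the code from my Filter_UTF8 class):

```php
<?php
// Strict UTF-8 validation using the W3C's published byte-pattern regex.
// Runs byte-wise (no /u modifier), so PCRE itself does no decoding.
function is_valid_utf8($str)
{
    return preg_match('/\A(?:
          [\x00-\x7F]                       # ASCII
        | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]       # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]       # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
    )*\z/x', $str) === 1;
}

var_dump(is_valid_utf8("abc\xC3\xA9")); // "abcé" -- well-formed
var_dump(is_valid_utf8("\xC0\xAF"));    // overlong "/" -- rejected
```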

I’ve been looking at a lot of example code for answers. The “major” PHP frameworks I looked at were Zend Framework, CakePHP, Symfony, Solar, CodeIgniter, and Kohana. None of them include any validators for UTF-8, or for any other text encoding. I looked at some smaller frameworks too, but they were lucky to have validators at all.

I also looked at a few of the big mainstream PHP projects: Joomla, MediaWiki, phpBB, and WordPress. Of them, Joomla contains the library PHP UTF-8 by Harry Fuecks. The purpose of this library is to provide “native” PHP multibyte string functions when mb_string isn’t loaded on your server (and presumably you can’t do anything about it because you are on shared hosting). In the back of this library are a couple of functions: one actually validates UTF-8 and returns true or false, the second converts UTF-8 to its Unicode code points, returned as an array, and the last converts that array back to UTF-8. The last two are used in the library’s utf8_strtolower() and utf8_strtoupper() functions, but are otherwise unused by Joomla. The first function, called utf8_is_valid(), is the one I am most interested in, and it is not used at all. Interestingly, Kohana includes this same library, but they rearranged the files and function names and stripped out the utf8_is_valid() function altogether. MediaWiki and phpBB both use a set of functions to normalize UTF-8 strings. This goes beyond just validating the byte stream: normalizing UTF-8 is also recommended so it sorts properly and consistently, and that is probably why these two packages do it. Both, especially MediaWiki, count on being able to search strings well. But it also seems to be compute intensive. For what it’s worth, it appears phpBB borrowed MediaWiki’s code and refactored it.

The last of our quartet, WordPress, contains a function to do a basic check of whether a stream is UTF-8. In practice, though, it is more UTF-8 detection than validation. The function, called seems_utf8(), allows 5- and 6-byte sequences, which were apparently in the early UTF-8 definitions before the Unicode consortium decided to limit the code point range to U+10FFFF, making anything over 4 bytes unnecessary. It also does not check for the disallowed UTF-16 surrogate code points, or byte order marks. Those last two points are important because Windows, Java, and Oracle store text internally in UTF-16, so a botched conversion from one of these sources into the browser could send invalid text to your PHP application. I don’t know the ins and outs of copying and pasting from one of these sources into a browser. I assume a conversion happens, but I don’t know where.

Getting back to the utf8_is_valid() function in PHP UTF-8, this is essentially what I have in mind for my CMS. The code in that function comes from another small library written by Henri Sivonen of validator.nu fame. He provides a function to convert a UTF-8 string to Unicode code points, returned in an array, and another one to convert the array back to a UTF-8 string. Sound familiar? This is where the PHP UTF-8 library got its code that does the same thing. I stumbled onto this library through the WACT site and ended up coding essentially the same thing Harry Fuecks did. I also made a “sanitizer” version that deconstructs the byte stream and throws out all the bad byte sequences without the intermediate array; it just concatenates the good byte sequences into a new string.
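To illustrate the sanitizer idea, here is a sketch that keeps only runs of well-formed UTF-8 byte sequences (matched with a strict byte-pattern regex) and concatenates them. This is not my actual implementation, which walks the bytes directly rather than leaning on PCRE.

```php
<?php
// Sanitizer sketch: drop any bytes that are not part of a well-formed
// UTF-8 sequence, keeping everything else in order.
function sanitize_utf8($str)
{
    preg_match_all('/(?:
          [\x00-\x7F]                       # ASCII
        | [\xC2-\xDF][\x80-\xBF]            # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]       # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]       # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
    )+/x', $str, $matches);
    return implode('', $matches[0]);
}

echo sanitize_utf8("ok\xE9bad"), "\n"; // stray 0xE9 byte is dropped
```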

I haven’t gotten into the functions PHP natively provides for checking and converting character encodings, but this post is long enough, so that will have to wait for another day. So for anyone in the PHP community lucky or unlucky enough to read this post: should we be validating strings to make sure they are in the encoding we think they are in?
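As a quick teaser for that future post, the obvious native candidate is mb_check_encoding() from the mbstring extension. A sketch, assuming the extension is loaded:

```php
<?php
// mb_check_encoding() validates a string against a named encoding
// (requires the mbstring extension).
$good = "caf\xC3\xA9"; // "café" correctly encoded as UTF-8
$bad  = "caf\xE9";     // "café" in Latin-1; 0xE9 is a truncated
                       // sequence when read as UTF-8

var_dump(mb_check_encoding($good, 'UTF-8')); // true
var_dump(mb_check_encoding($bad, 'UTF-8'));  // false
```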

An idea and a setback

I’ve had an idea. I’ve been having lots of ideas lately, but this one has been in my head for a while. There are lots of Content Management Systems, or CMSes, out there: software you use to build and maintain a website. They come in all shapes, sizes, and complexities. I have observed there is a missing niche among all of them: something that is simple to use for non-computer people and offers the basics without going overboard on complexity or features. This WordPress blog I’m typing on now is a good example. In fact, WordPress has become very popular as a CMS for small websites where you just need to make a few pages.

That is where my inspiration came from. Our SCCA webpage could use a CMS like that. And a friend of mine runs a videography business where he wants to be able to update his site frequently. You could just use WordPress for these sites, but why drag along all the blog-oriented code when you aren’t going to use it? Plus I believe WordPress could use a good tune-up; it is still written for the no-longer-supported PHP 4.

So I started writing my own little CMS in my spare time. It’s very slow going; it’s hard to get something really moving when you are doing it two hours at a time. And of course I’ve been running into snags and doing a lot of learning. One of those snags turned out to be database access. I saw that I would be writing a bunch of similar queries, so I wrote a lightweight database abstraction layer to automate some of that SQL creation. This database layer is based on PHP5’s PDO, meaning it has an object-oriented interface and I can use it to connect to many different databases, assuming the SQL I write is compatible.

I don’t want to re-invent the wheel, so I looked at two frameworks that do something similar, Zend_Db and Solar_Sql, for ideas. They each take a different approach to handling prepared statements and passing data into them. I tried to take the middle road and support both. The solution I came up with, I recently found out, won’t work, so I’ve got to give it a big re-think. It’s things like this that are slowing the project down, plus the stop-starting from lack of time. I didn’t want to talk too much about this project until it was more together, but based on this “little” setback I realized it’s going to take a lot longer than I hoped for this thing to see the light of day. So I might as well talk about it online. I certainly haven’t been autocrossing this summer. :-(
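For context, the two binding styles I’m trying to reconcile map onto the two placeholder flavors PDO itself supports: positional (?) and named (:name). A sketch of both; the pages table and the in-memory SQLite database here are just for illustration, not my CMS schema.

```php
<?php
// PDO supports two placeholder styles in prepared statements.
// Using SQLite's in-memory database so the example is self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)');

// Positional placeholders: values bound by order.
$stmt = $db->prepare('INSERT INTO pages (title) VALUES (?)');
$stmt->execute(array('Home'));

// Named placeholders: values bound by name.
$stmt = $db->prepare('INSERT INTO pages (title) VALUES (:title)');
$stmt->execute(array(':title' => 'About'));

$count = $db->query('SELECT COUNT(*) FROM pages')->fetchColumn();
echo $count, "\n"; // 2
```

Supporting both in one abstraction layer means normalizing whichever style the caller hands you before the statement is prepared, which is exactly the part I got wrong the first time.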

Back on the Horse

September. Damn, that seems like a long time ago, because, well, it was. And a lot has happened since then. For one, I’m married now. There is not much to blog about autocrossing over the winter b/c there really isn’t much autocrossing over the winter. Philly region does have a winter series, but I didn’t attend it b/c my Miata got rear-ended and subsequently totaled out by the insurance co. That was obviously a big bummer b/c I was looking forward to really tuning the car this year and bridging the gap to the CRXs.

And Irene and I decided to get rid of the Prelude. It was just getting too high in mileage. So that is two cars gone over the winter. We did get something cool to replace the Prelude though, a Mazdaspeed3. We both like the car a lot. S-plan pricing really made it a good deal.

Initially I was going to find another Miata shell this spring and swap everything over. But I decided to put that on hold till next year so we can focus on finding a house. The government is holding this $8,000 tax credit carrot in front of our noses and we’d be stupid to not try and take advantage of it. So there hasn’t been much to report on the car front. I have done a few autocrosses in the Mazdaspeed3. But I’ve also missed a bunch.

So because of all that stuff, plus some crappy medical stuff going on over the winter, I haven’t been posting. I’ve realized of late that there is a lot of other stuff I could be posting about besides cars and autocrossing. I don’t like putting too much personal-life stuff up on a public blog. But I have been getting more interested in what is happening in politics, or more like what the current President and Congress are doing to “fix the economy”. For the record, I don’t like it, which is why I feel I should be writing about it. In addition there is the real possibility of “carbon cap & trade” coming to this country now. I have some strong opinions on that since I work in the utility industry. Specifically, I don’t like it.

I have also been messing around with computers more lately. I got really fed up with Vista and tried Linux again; there is at least one future blog post in that story. I have also been messing around with writing a small CMS for websites. It is going very slowly since I can only dedicate an hour or two each day to it. It is very rewarding though, and I could see myself doing that for a living if I could find a way to change careers w/o a drop in pay (yeah right!). And I have been doing more unique stuff with the 3D design software we use at work, Autodesk Inventor. There are some potential blog posts waiting to come out of that experience.

So now that I’ve made the excuse post and gotten back on the horse, let’s hope I can be more active on this thing again.