
Edd Mann:
Reversing a Unicode String in PHP using UTF-16BE/LE
May 12, 2014 @ 15:55:00

In his latest post Edd Mann looks at an issue that caused him problems in a recent project: reversing a Unicode string, which he handles using UTF-16BE/LE.

Last week I was bit by the Unicode encoding issue when trying to naively manipulate a user's input using PHP's built-in string functions. PHP simply assumes that all characters are a single byte (octet) and the provided functions use this assumption when processing a string. [...] You should be aware that in 'Western Europe' we commonly only use the basic ASCII character-set (consisting of 7 bits). This makes the transition to the popular 'UTF-8' Unicode representation almost seamless, as the two map one-to-one. I wish to however, discuss how to reverse a Unicode string (UTF-8) using a combination of endianness magic and the 'strrev' function.

He provides two different approaches to the problem. The first he calls the "naive" approach because it corrupts characters that need more than a two-byte representation. His second solution, the "endianness" method, converts the string to big-endian UTF-16 first and then back to UTF-8 for more correct handling.
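
As a rough illustration of the endianness idea (a minimal sketch, not the code from the linked post), the string can be converted to UTF-16BE, byte-reversed with strrev and then read back as UTF-16LE. The function name utf8_strrev_utf16 is hypothetical. The sketch assumes the iconv extension is available and that the input contains only characters from the Basic Multilingual Plane. Characters stored as surrogate pairs are not handled.

```php
<?php
// Hypothetical helper name; a sketch under the assumptions stated above.
function utf8_strrev_utf16(string $str): string
{
    // UTF-8 -> UTF-16BE: every BMP character becomes one two-byte unit.
    $utf16be = iconv('UTF-8', 'UTF-16BE', $str);
    if ($utf16be === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Reversing the bytes reverses the order of the units and also swaps
    // the two bytes inside each unit, so the result reads as UTF-16LE.
    $utf8 = iconv('UTF-16LE', 'UTF-8', strrev($utf16be));
    if ($utf8 === false) {
        // Reversed surrogate pairs are not valid UTF-16.
        throw new RuntimeException('Input contains characters outside the BMP.');
    }

    return $utf8;
}

echo utf8_strrev_utf16('héllo'), PHP_EOL; // olléh
```

Plain strrev on the same UTF-8 input would reverse the two bytes of "é" as well, leaving an invalid sequence.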

tagged: unicode string utf8 utf16 bigendian endian convert reverse string

Link: http://eddmann.com/posts/reversing-a-unicode-string-in-php-using-utf-16-be-le

Ahmed Shreef's Blog:
iconv misunderstands UTF-16 strings with no BOM
Aug 27, 2010 @ 18:36:56

Ahmed Shreef has a recent post to his blog about an issue he had converting UTF-16 strings over to UTF-8 with the iconv functionality in PHP. Specifically, he ended up with "rubbish unreadable characters" after the conversion.

I had a problem last week with converting UTF-16 encoded strings to UTF-8 using PHP's iconv library on a Linux server. My code worked fine on my machine but the same code resulted in rubbish unreadable characters on our production server.

In his example (a basic "Hello World" in Arabic) he notes that there's no byte order mark on the string and, because of this, iconv tries to guess whether it's big-endian or little-endian. This guessing varies from machine to machine, resulting in the inconsistencies he saw. The solution is to define the "to" and "from" encodings for the conversion manually rather than letting it guess.
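
A minimal sketch of that fix, assuming the input is known to be little-endian: the "from" encoding is given as UTF-16LE instead of plain UTF-16, so iconv has nothing to guess. The sample bytes are an illustrative Arabic word built by hand, not the string from the linked post.

```php
<?php
// Illustrative input: the Arabic word "مرحبا" as UTF-16LE bytes, with no BOM.
$utf16le = "\x45\x06\x31\x06\x2D\x06\x28\x06\x27\x06";

// Naming the byte order in the "from" encoding leaves nothing to guess.
$utf8 = iconv('UTF-16LE', 'UTF-8', $utf16le);

echo $utf8, PHP_EOL; // مرحبا
```

If the data were big-endian instead, the same call with 'UTF-16BE' as the source encoding would apply.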

tagged: byteordermark bom iconv utf16 utf8 convert

Link:

Danne Lundqvist's Blog:
Detecting UTF BOM - byte order mark
Apr 29, 2010 @ 16:47:03

In a new post to his blog Danne Lundqvist looks at a common pitfall that could trip you up if you're not careful with your UTF-8 data - not looking for the UTF byte order mark that tells the application if it needs to be handled as UTF content.

One such thing is the occurrence of the UTF byte order mark, or BOM. [...] For UTF-8, especially on Windows, it has become more and more common to use it to indicate that the file is indeed UTF. Most text editors handle this well and you won’t ever see these bytes. As it should be.

He points out where this can cause an issue, such as using strcmp or substr on the data, and notes that it can be prevented by checking for those first three bytes and removing them if needed. He includes a snippet of code that does just that.
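
For reference, a check of this kind could look roughly like the sketch below. It is not the snippet from the linked post, and the function name strip_utf8_bom is hypothetical. It only handles the three-byte UTF-8 BOM (EF BB BF).

```php
<?php
// Hypothetical helper; removes a leading UTF-8 byte order mark if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // UTF-8 encoding of U+FEFF

    // Compare only the first three bytes against the BOM.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }

    return $text;
}

$raw = "\xEF\xBB\xBFname,email";

var_dump(strcmp($raw, 'name,email') === 0);                 // bool(false)
var_dump(strcmp(strip_utf8_bom($raw), 'name,email') === 0); // bool(true)
```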

tagged: byteordermark utf utf8 utf16 detect

Link:

Lukas Smith's Blog:
One thumb up and two down (Zend_Http_Client)
Jun 16, 2008 @ 14:32:24

Coming back from some previous comments about the Zend_Http_Client in the Zend Framework, Lukas Smith admits that a certain feature has come in handy with their development, but another bug has come up that has gotten under his skin - a problem with the component's cookie handling.

We ran into a really hard to find bug in the cookie handling of Zend_Http_Client, which has been filed as a bug back in August 2007 against version 1.0.1 (today we are at 1.5.2). Moreover this is a bug that other similar packages have gotten over in 2004.

He had to use Wireshark to finally track down the culprit - a call to urlencode on the contents of the cookie before sending it. He also includes some code to overcome a problem he had with UTF-16 in one of his feeds (a custom function that takes in and returns a string translated correctly).
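
To see why such a call can matter, here is a small sketch using only PHP's built-in urlencode. The cookie value is made up for illustration and is not taken from the linked post, and no Zend_Http_Client code is shown. A server that expects the raw value back would receive a different string.

```php
<?php
// Illustrative cookie value; not taken from the linked post.
$value = 'user=42/abc+def==';

// urlencode replaces "=", "/" and "+" with percent escapes.
$encoded = urlencode($value);

echo $encoded, PHP_EOL;             // user%3D42%2Fabc%2Bdef%3D%3D
var_dump($encoded === $value);      // bool(false)
var_dump(urldecode($encoded) === $value); // bool(true)
```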

tagged: zendframework zendhttpclient cookie handling urlencode utf16 encode

Link:

