Looking for more information on how to do PHP the right way? Check out PHP: The Right Way

PHP Roundtable:
046: Character Encoding and UTF-8 in PHP
Jun 03, 2016 @ 09:13:13

The PHP Roundtable podcast, hosted by PHP community member Sammy K Powers, has posted their latest episode today: Episode #46 - Character Encoding and UTF-8 in PHP.

If you've ever gotten a number of weird looking characters in your database or on your website like, "?" and didn't know why, then this episode is for you. Those bizarre characters called "mojibake", rear their ugly heads when we don't account for a consistent character encoding. Today we discuss what character encoding is, how to accommodate for it in HTML, PHP & your database, and how we can ensure we'll never encounter an unexpected alien character in our web apps again.

For this episode Sammy is joined by Andreas Heigl and Evert Pot two developers more than familiar with Unicode woes. You can watch this latest episode through either the in-page audio or video player or by grabbing the audio file. If you enjoy the show, be sure to subscribe to their feed and follow them on Twitter for updates when future shows are released.

tagged: phproundtable podcast character encoding utf8 andreasheigl evertpot

Link: https://www.phproundtable.com/episode/character-encoding-and-utf-8-in-php

Data Encoding: A Guide to UTF-8 for PHP and MySQL
Jan 28, 2016 @ 13:22:56

The Toptal.com blog has posted a guide to data encoding in PHP and MySQL looking specifically at the use of UTF-8 and related handling. They talk about some of the updates you'll need to make to configurations, code and the MySQL settings to fully support this character set.

As a MySQL or PHP developer, once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8.

[...] Indeed, navigating through UTF-8 related data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned (and with thanks, in part, to information discovered here and here along the way).

They start with the changes on the PHP side, updating the INI settings to make UTF-8 the default character set and which functions you'll need to update and replace. With those changes out of the way they move to the MySQL side, changing up settings in the my.cnf file and a few other things to consider on the database side (including that the MySQL support for UTF-8 is only a partial character set).

tagged: toptal data encoding mysql utf8 update configuration code

Link: http://www.toptal.com/php/a-utf-8-primer-for-php-and-mysql

David Sklar:
Fixing Broken UTF-8
Aug 27, 2015 @ 10:48:29

David Sklar has a post to his site showing you how to fix broken UTF-8 characters in content being passed through the normal string functions.

When working on the i18n bits of Learning PHP 7, I had a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools are expecing nicely formatted, valid, UTF-8 encoded HTMLBook files. It didn’t like the mangled invalid UTF-8 characters in my example output.

To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD).

He includes the code for this method that walks through the string, character by character, and checks the bytes it contains to see how it needs to be translated. There's plenty of comments in it too, explaining what it's doing as it goes along.

tagged: fix broken utf8 character function example unicode replacement

Link: http://www.sklar.com/php/2015/08/25/fixing-broken-utf8/

Three Devs & A Maybe Podcast:
Understanding Character Sets and Encodings
May 14, 2014 @ 13:12:06

The Three Devs & A Maybe podcast (with hosts Michael Budd, Fraser Hart, Lewis Cains and Edd Mann) has posted their latest episode (#24) talking about character sets and encodings.

Having only just recently been bit by the character encoding issue again, we thought it would be a good time to bring it up on the podcast. Starting from the beginning with ASCII, we move on to discuss how 8-bit compatible machines brought way to the ISO-8859-* standards. This leads us on to Unicode, with the goal to develop a single character-set encoding standard that could support all of the world's scripts. Finally, we discuss the de-factor character encoding implementation used on the web today 'UTF-8', and reasons why this is the case.

Lots of different topics are mentioned including reversing a Unicode String in PHP using UTF-16BE/LE, portable UTF-8 and a YouTube video covering Pragmatic Unicode. You can listen to this new episode though the in-page player, by downloading the mp3 or subscribing to their feed.

tagged: threedevsandamaybe podcast ep24 unicode character set encoding utf8

Link: http://threedevsandamaybe.com/posts/understanding-character-sets-and-encodings/

Edd Mann:
Reversing a Unicode String in PHP using UTF-16BE/LE
May 12, 2014 @ 10:55:00

Edd Mann looks at an issue in his latest post that caused him problems in a recent project, reversing a Unicode string with UTF-16BE/LE.

Last week I was bit by the Unicode encoding issue when trying to naively manipulate a user's input using PHP's built-in string functions. PHP simply assumes that all characters are a single byte (octet) and the provided functions use this assumption when processing a string. [...] You should be aware that in 'Western Europe' we commonly only use the basic ASCII character-set (consisting of 7 bytes). This makes the transition to the popular 'UTF-8' Unicode representation almost seamless, as the two map one-to-one. I wish to however, discuss how to reverse a Unicode string (UTF-8) using a combination of endianness magic and the 'strrev' function.

He provides two different approaches to the problem. The first he calls the "naive" approach because it corrupts characters needing more than the two-byte representation. His second solution, the "endianness" method, converts the string to big-endian first (UTF-16) and then back to UTF-8 for more correct handling.

tagged: unicode string utf8 utf16 bigendian endian convert reverse string

Link: http://eddmann.com/posts/reversing-a-unicode-string-in-php-using-utf-16-be-le

SitePoint PHP Blog:
Bringing Unicode to PHP with Portable UTF-8
Sep 10, 2013 @ 11:19:05

On the SitePoint PHP blog there's a new tutorial showing you how to bring portable UT-8 support to PHP with the Portable-UTF8 library. UTF-8 handling has long been one thing desired in the core of PHP, but hasn't been introduced quite yet.

PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation of PHP affects almost all aspects of string manipulation, including (but not limited to) substring extraction, determining string lengths, string splitting, shuffling etc.

The article mentions some of the efforts in the past that have been made to try to introduce this functionality into the core, but was shelved at the time. Instead of waiting on this feature to be introduced, they show you how to use the library to do things like check for UTF-8 strings, "cleaning" the UTF-8 strings and do some validation on the string's contents.

tagged: unicode portable utf8 library tutorial

Link: http://www.sitepoint.com/bringing-unicode-to-php-with-portable-utf8/

Let's talk Character Encoding
Mar 15, 2012 @ 11:07:07

On Reddit.com there's a recent post with a growing discussion about character encodings in PHP applications (with some various recommendations).

I would rather not have to convert these weird characters to the HTML character entities, if possible. I'd rather be able to use these characters directly on the web page. If this is for some reason a bad idea, let me know. This might be more of a general web design question (i already posted it there), but I figured it is still appropriate to post here as well since PHP is used to pull an entry from the database, and I figured a lot of you here would know the answer to the question.

The general consensus is to use UTF8 in this case, but there's a few reminders for the poster too:

  • Don't forget to make the database UTF8 too
  • Be sure you're sending the right Content-Type for the UTF8 data
  • an link to an article about what "developers must know about unicode/charactersets"
tagged: character encoding advice reddit utf8 contenttype unicode


Nikita Popov's Blog:
htmlspecialchars() improvements in PHP 5.4
Jan 30, 2012 @ 09:55:24

In this new post to his blog Nikita Popov looks at an update that might have gotten lost in the shuffle of new features coming in PHP 5.4 - some updates to htmlspecialchars.

One set of changes that I think is particularly important was largely overlooked: For PHP 5.4 cataphract (Artefacto on StackOverflow) heroically rewrote large parts of htmlspecialchars thus fixing various quirks and adding some really nice new features. Here a quick summary of the most important changes: UTF-8 as the default charset, improved error handling (ENT_SUBSTITUTE) and Doctype handling (ENT_HTML401,...).

He goes into each of these three main features in a bit more detail, providing code to illustrate the improved error handling and the new flags for Doctype handling (covering HTML 4.01, HTML 5, XML 1 and XHTML).

tagged: htmlspecialchars improvement release doctype error utf8


Patchwork-UTF8 - UTF8 Support for PHP
Jan 27, 2012 @ 11:38:40

Nicolas Grekas has shared another tool that he's pulled out of his "Patchwork" framework to make it a stand-alone tool: the Patchwork-UTF8 helper that provides matching functions to those PHP already has for regular strings, but a little smarter to work with UTF8 correctly.

The PatchworkUtf8 class implements the quasi complete set of string functions that need UTF-8 grapheme clusters awareness. These functions are all static methods of the PatchworkUtf8 class. The best way to use them is to add a use PatchworkUtf8 as u; at the beginning of your files, then when UTF-8 awareness is required, prefix by u:: when calling them.

In the README for the tool he talks about the functions included in the current release that match PHP's string functions as well as some additional methods like "isUtf8", "bestFit" and "strtocasefold". It relies on the mbstring, iconv and intl extensions being installed, and if they aren't, it falls back to other functionality (list of those methods included).

tagged: utf8 support string patchwork framework helper mbstring iconv intl


Ahmed Shreef's Blog:
iconv misunderstands UTF-16 strings with no BOM
Aug 27, 2010 @ 13:36:56

Ahmed Shreef has a recent post to his blog about an issue he had converting UTF-16 strings over to UTF-8 with the iconv functionality in PHP. Specifically, he ended up with "rubbish unreadable characters" after the conversion.

I had a problem last week with converting UTF-16 encoded strings to UTF-8 using PHP's iconv library on a Linux server. my code worked fine on my machine but the same code resulted in a rubbish unreadable characters on our production server.

In his example (a basic "Hello World" in Arabic) he notes that there's no byte order mark on the string and, because of this, the iconv feature tries to guess if it's big-endian or little-endian. This guessing varies from machine to machine resulting in the inconsistencies he saw. The solution is to define the "to" and "from" for the conversion manually rather than letting it just guess.

tagged: byteordermark bom iconv utf16 utf8 convert