On Newsstands Now…

My “Solving the Unicode Puzzle” article was just published in the May 2005 issue of php|architect. Although the magazine is subscription only, they chose my article as this month’s free sample. Aside from introducing a few typos, it looks like they didn’t do any editing, so the published version is almost exactly the same as my original.

Article for PHP Architect

Remember that long post on UTF-8 from a few weeks ago – the one that gave you a sudden urge to take a nap? I ran it by the editors of PHP Architect, and they’ve commissioned an article. I’m in the process of pumping it up to the required 4,000 words. I’ll be sending it off to them in a couple of days. Commissioning the article only obliges them to pay me a small (very small) sum for it – it doesn’t mean they’ll necessarily publish it. But hopefully they will! I’ll let you know as soon as I find out – probably in a few weeks.

Converting Web Applications to UTF-8

UPDATE: I expanded this to a full length article, which was published in the May 2005 issue of php|architect. They had it available for several years as a free download, but it’s no longer available there, so you can download it from me as a PDF. My apologies for not responding to earlier comments – I had a newborn baby at the time.

An Overview of UTF-8 in PHP, Smarty, Oracle, and Apache, with data exports to PDF, RTF, email, and text

Here at the Penn Med School we recently switched our web and database applications from Western/ISO encoding to Unicode/UTF-8. We did this so we can provide better support for international character sets (Greek, Japanese, etc.). As sometimes happens with projects that involve computers, it grew into a big, hairy beast that was way beyond anything we initially anticipated. I was partly responsible for managing the transition, and since I found no comprehensive guide to help us through it, I thought I’d write one now that we’re done. We’re using two-thirds of the open source PHP-Apache-MySQL trinity, with Oracle instead of MySQL. Even if you have a different mix of applications, the concepts I’ll describe are probably applicable to your situation, though the details will differ.

Getting Started

First, if you need some orientation in understanding character sets, start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It’s actually quite readable, even if you’re not a techie.

Second, you need to read the Oracle document An Overview on Globalizing Oracle PHP Applications. It’s an excellent starting point, but unfortunately it doesn’t always explain the reasons behind its recommendations, which means you’ll get stuck if things don’t happen to work after you follow their instructions. I’ll try to fill those gaps here.

Persuading Apache and Oracle to talk to each other in UTF-8

PHP web applications are run under the Apache web server, which itself is running in a user account (assuming you’re in a Unix environment). So the first step is to set the environment of that account correctly, so it will know how to “speak” UTF-8 to Oracle. You do this by setting the NLS_LANG environment variable in the Apache configuration. The Oracle Overview document says to set it to .AL32UTF8, but doesn’t explain why. So when this didn’t do the trick for me, I had to do some more research. I found the Oracle Character Set descriptions, and found that .AL32UTF8 corresponds to Unicode 3.1. After talking with our DBA I learned that our Oracle database is set to Unicode 3.0, which meant I needed to set NLS_LANG=.UTF8 (we ultimately switched to .AL32UTF8, since it is Oracle’s recommended standard). The key point here is that NLS_LANG must exactly match the character set you’re using in Oracle.

Serving your web pages to users in UTF-8

There are a few different aspects to this:

  1. If you want all the documents on your server to default to UTF-8, then set the AddDefaultCharset directive in the Apache configuration to UTF-8. You should do either #2 or #3 below in addition to this (see the Apache documentation for the reason).
  2. If you want all your PHP documents served in UTF-8, but not necessarily other document types, set default_charset=UTF-8 in your php.ini file. It’s OK if the PHP charset is different from the Apache charset: the PHP charset will apply to PHP files, and the Apache charset will apply to all other types (this goes for #3 below as well).
  3. If you only want certain PHP documents in UTF-8, specify UTF-8 in the Content-type header of those documents. It’s important to point out here that, if you haven’t done #1 or #2 above, then you must set this header with the PHP header() function. If you try to set it with an HTML Meta tag, the charset defined in Apache will override your Meta tag.
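Here is a minimal sketch of option 3, setting the charset for a single PHP page. The helper name content_type_header() is made up for illustration; header() and headers_sent() are standard PHP functions.

```php
<?php
// Hypothetical helper: builds the value for a Content-Type header.
function content_type_header($charset, $mime = 'text/html')
{
    return 'Content-Type: ' . $mime . '; charset=' . $charset;
}

// This has to run before any output is sent, or PHP can no longer
// change the response headers.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}
```

The call has to come before any HTML or whitespace is output, so it usually belongs at the very top of the script.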

UTF-8 in form submissions

In Windows 95 and 98, Microsoft used the Windows ANSI character set. If you ever copied and pasted text from Microsoft Word into a web form under Windows 9x, chances are any upper ASCII characters that Word likes to insert, such as curly quotes or ™, turned into something else in the web form. This is because the web page was probably Western ISO 8859-1 encoded, and that character set organizes part of the upper ASCII range (positions 128 to 159) differently from Windows ANSI. So the web page thought it was receiving a different character than what you intended. Windows NT, 2000, and XP use Unicode, so you won’t have this problem under the newer versions of Windows. Macs and most other modern OSs use either Western ISO 8859-1 or Unicode. The 256 characters of Western ISO 8859-1 have the same code points in Unicode. So your Unicode encoded web form should correctly interpret upper ASCII text provided by anyone not using Windows 9x (or a completely foreign, non-Unicode character set).

Additional PHP and Oracle configurations

You will want to enable multi-byte character support in PHP. Compile PHP with the --enable-mbstring option, and set mbstring.internal_encoding=UTF-8 in your php.ini file. Also, you should definitely look over the PHP documentation for multi-byte string functions. Note that if you haven’t upgraded to PHP 5 yet, the html_entity_decode() function will fail hard if you pass it a UTF-8 string. This was the only UTF-8 incompatibility we found in PHP 4.3.

You may want to implement PHP’s function overloading. An example will illustrate why this is important: in UTF-8, a string that is 4 characters long could occupy anywhere from 4 to 16 bytes depending on the multi-byte characters in it. The mb_strlen() function will correctly tell you the number of characters in such a string, but the regular strlen() function won’t (it’ll tell you the number of bytes). Enabling function overloading will cause PHP to automatically assume it’s handling multi-byte strings, so, in this example, it will execute mb_strlen() when you call strlen(). If you’re making a wholesale conversion to UTF-8, and you don’t want to tweak all your existing code, implementing function overloading makes sense. But there is one exception: you may not want to do function overloading on mail() – I’ll get to that in a minute.
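To make the byte-versus-character difference concrete, here is a small sketch. It assumes the mbstring extension is loaded and that function overloading is not enabled, so strlen() still counts bytes. The sample word is just an illustration.

```php
<?php
// "naïve" written with an explicit UTF-8 byte sequence for the ï,
// so the example does not depend on how this file is saved.
$word = "na\xC3\xAFve";

$bytes = strlen($word);              // counts bytes
$chars = mb_strlen($word, 'UTF-8');  // counts characters

echo "bytes: $bytes, characters: $chars\n"; // prints "bytes: 6, characters: 5"
```

Any code that truncates, pads, or validates strings by length will be off by that difference unless it uses the mb_ version, either directly or through overloading.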

Related to this, in Oracle 9, you can set NLS_LENGTH_SEMANTICS to use either character length or byte length semantics for the tables you create. That is, you can use it to indicate whether, for example, a varchar(10) column is 10 characters, or 10 bytes.

Smarty

If you’re using Smarty with PHP, you’ll need to override the escape() function. It calls the PHP htmlentities() and htmlspecialchars() functions, but it doesn’t provide them with the necessary charset argument so they’ll work with UTF-8. Make a copy of the escape() modifier and tweak it to pass along a charset argument to PHP, and then use it to override the original.
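As a rough sketch of what the tweaked modifier might look like: the function name below is hypothetical, and it only covers two escape types, where the real Smarty modifier handles several more. The point is just the third argument to htmlentities() and htmlspecialchars().

```php
<?php
// Hypothetical UTF-8 aware stand-in for Smarty's escape modifier.
// Only the 'html' and 'htmlall' cases are shown here.
function smarty_modifier_escape_utf8($string, $esc_type = 'html')
{
    switch ($esc_type) {
        case 'htmlall':
            // Converts every character that has a named entity.
            return htmlentities($string, ENT_QUOTES, 'UTF-8');
        case 'html':
        default:
            // Converts only &, <, >, and quotes.
            return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
    }
}

// "café" with the é as an explicit UTF-8 byte sequence.
echo smarty_modifier_escape_utf8("<b>caf\xC3\xA9</b>"), "\n";
```

With the charset passed along, the multi-byte é goes through untouched in 'html' mode and becomes a single entity in 'htmlall' mode, instead of being treated as two separate Latin characters.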

Exporting to other formats

As you’ll see below, it may not always be wise to do data exports in UTF-8. Sometimes you need to change the character set before performing the export. Take a look at PHP’s utf8_decode() and iconv functions to learn about converting UTF-8 to single-byte encoding. Note that utf8_decode(), while easy to use, is limited to the Latin character set (see the user contributed notes on the PHP utf8_decode() page for tips on dealing with other character sets).
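For a sense of what the conversion looks like with iconv, here is a minimal sketch. The helper name to_latin1() is made up for illustration, and it assumes the iconv extension is available and that the text only contains characters that exist in ISO 8859-1.

```php
<?php
// Hypothetical helper: convert UTF-8 text to ISO 8859-1 for an export.
// iconv() returns false if a character has no ISO 8859-1 equivalent.
function to_latin1($utf8)
{
    return iconv('UTF-8', 'ISO-8859-1', $utf8);
}

$utf8   = "caf\xC3\xA9";     // "café": 5 bytes in UTF-8
$latin1 = to_latin1($utf8);  // 4 bytes in ISO 8859-1

echo strlen($utf8) . ' -> ' . strlen($latin1) . "\n"; // prints "5 -> 4"
```

If non-Latin characters can show up in the data, check the return value for false before writing the export.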

  • PDF: we use PDFlib on our web server to create PDF documents on the fly. For it to work with UTF-8 data, you need to use it with a UTF-8 compatible font. The standard Arial font supports Greek and Cyrillic in UTF-8, which is generally sufficient (don’t confuse standard Arial with Microsoft’s Arial Unicode MS font – while it can print just about any UTF-8 character, it’s 32MB, so you probably don’t want to load it on your web server!). Also, Gentium is a very nice UTF-8 compatible serif font that supports Greek and Cyrillic.
  • RTF: we are moving away from RTF, but we still have some applications that generate RTF files. RTF does not provide good UTF-8 support. Our solution is to do a utf8_decode() on our data before generating RTF files (we can get away with this since none of the data going into our RTF files contain non-Latin characters – hopefully we’ll get rid of RTF before non-Latin characters start showing up).
  • Text: we also do data exports to text files, mainly in .csv format for use in spreadsheets. Surprisingly, Microsoft Excel does not support importing UTF-8 encoded text files. Again, our solution is to perform a utf8_decode() before generating these text files.
  • Email: I recommend not doing function overloading on PHP mail(). The reason has to do with line breaks. In Unix, a line break is represented by a line feed (LF) character. On Macs, it’s represented by a carriage return (CR) character. And on Windows, by a CR+LF. For email to work between platforms, an email standard was agreed upon in the early days of the Internet, which is CR+LF. So, for example, on Unix, sendmail will add a CR as needed to each LF it finds in the body of an email message. But when an email is UTF-8, mailers don’t try to wade through the multi-byte encoding, and they don’t “fix” the line breaks. We found that the line breaks in UTF-8 emails (generated on Unix) were interpreted as desired in Mac and Unix mail readers, and by Microsoft Outlook on Windows, but not by Eudora 6.2 (and previous versions) on Windows. In Eudora, the messages displayed with no line breaks. You can’t say it’s a Eudora bug, since the line breaks weren’t meeting the standard. At this time, the emails we generate only contain basic Latin characters, so sticking with the standard mail() function meets our needs for now.
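If you do need to send UTF-8 email, one possible workaround for the line break issue is to normalize the message body to CR+LF yourself before handing it to the mailer. This is only a sketch of the idea, not something we deployed, and the function name is hypothetical. Whether your mail setup wants this (or will double up the CRs) is something you would need to test.

```php
<?php
// Hypothetical helper: rewrite every line break (LF, CR, or CR+LF)
// as CR+LF, the line ending the email standard expects.
function normalize_crlf($body)
{
    // CR+LF is listed first so an existing pair is not split in two.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $body);
}

$body = normalize_crlf("Line one\nLine two\nLine three");
```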

Thoughts on Coppermine, and Integrating It with WordPress

I mentioned in an earlier post that I installed Gallery for managing my photos. Gallery turned out to be a train wreck: the features are nice, but the programming behind it reminds me of 80s style spaghetti code, making it almost impossible to customize. For example, I burned a few hours trying to figure out how to display a random image on a page other than “the random image page” before I concluded it wouldn’t be possible without a massive rewrite. So I dumped it in favor of Coppermine. Here are the photo albums I’ve created so far.

Coppermine allows you to create a custom theme, which consists of a style sheet and a couple of template files (they contain most of the HTML widgets that are used in building the pages). The implementation is hardly perfect though: there is a fair amount of hardcoded HTML inside the PHP functions, so you have to do some detective work if you want to change certain things (e.g. the code it has for embedding video files only works in IE, so I had to track it down and tweak it to support Firefox/Mozilla). Also, it’s filled with a sloppy mish-mash of HTML and XHTML tags, so coaxing it to generate a valid document has required me to touch a lot of code. But those are my only complaints: the features suit my needs and the management interface is nice. A real plus is that you can integrate it with the Windows XP “publish to web” feature, so you can publish images with just a few mouse clicks – no more FTP’ing!

Integrating Coppermine with WordPress has been an adventure. I’m using psnGallery2, which gives you custom WordPress tags and PHP functions for embedding Coppermine photos in WordPress. First I installed the latest stable release, but couldn’t get it working at all. So I installed the current alpha release, and with some hacking, got it to work. I described the problems and my solutions in this WordPress forum post.

I also wanted an easy way to link Coppermine pictures to their related blog entries. I did this by creating a “burl” tag in Coppermine. In the title or description of a photo I can type, for example, [burl=29]some link text[/burl] and it’ll link to WordPress entry number 29. It was easy to do. In bb_decode() – located in include/functions.inc.php – I added:

$text = preg_replace("/\[burl=([0-9]+)\]/", '<a href="/blog/index.php?p=$1">', $text);
$text = str_replace("[/burl]", '</a>', $text);
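To see what those two lines do on their own, here they are wrapped in a standalone function. The name burl_decode() is just for this illustration; in Coppermine the lines sit inside bb_decode().

```php
<?php
// Hypothetical standalone wrapper around the two "burl" replacements.
function burl_decode($text)
{
    // Opening tag: [burl=29] becomes a link to WordPress entry 29.
    $text = preg_replace('/\[burl=([0-9]+)\]/', '<a href="/blog/index.php?p=$1">', $text);
    // Closing tag: [/burl] becomes the end of the link.
    $text = str_replace('[/burl]', '</a>', $text);
    return $text;
}

echo burl_decode('See [burl=29]the trip report[/burl].'), "\n";
// prints: See <a href="/blog/index.php?p=29">the trip report</a>.
```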

It’s all set up, so now I just have to slog through all my photos and get them in Coppermine 🙁 . Don’t expect it to happen overnight, but I will try to at least fix all the images that are now broken in my old blog entries as soon as I can 😉 .

More WordPress Hacking

The clever little URL rewrite I mentioned earlier is now out the window. I noticed the URLs it generated weren’t technically correct (with a slash between index.php and the URL argument) but they worked, so I let it be. That was a few weeks ago, and in that time, not unexpectedly, all my old Movable Type URLs disappeared from Google. But they haven’t been replaced by the new ones – except for the top page, my blog is not in Google anymore. I’m guessing Googlebot didn’t like how I was doing the URL rewriting. So I’ve shuffled things around, and now my blog really is in the “blog” directory, and hopefully Googlebot will like that better. Unfortunately, that means I can’t use WordPress to manage pages outside the blog directory. But I’ve realized that doesn’t matter much, as I’ll eventually fold the Kai and wedding pages into Coppermine, and the Route 50 pages into the blog, so there won’t be many static pages left.

I’ve also become active in the WordPress forums – I explained how to get past the bugs that have been plaguing a lot of folks trying to use psnGallery (it’s a plugin that gives you easy access to Coppermine photos from within WordPress) and how to fix a bug with sorting the WordPress archive listings (the WP folks have since released their own bug fix).

Hacking WordPress

So this is my first WordPress post. But if I did a good job matching it to my old Movable Type stylesheet, things should look pretty much the same (although I haven’t updated the templates for any of the pages besides the main page, so the archives, etc. are still a bit ugly).

Something I was looking forward to with WordPress was using it for managing all the pages on toppa.com. But I immediately ran into trouble when I discovered that, in order to do this, you have to install WordPress at the top of your web docs directory (which makes sense), and you can’t have the main page be named anything other than index.php. This was a potential show-stopper, since I’m not about to give up my uber-cool custom homepage for a same-as-everyone-else’s blog page.

I found a discussion thread on this, but the hidden option to change the filename they mentioned has been removed in the latest release of WordPress. It’s no longer a variable in the database, and they didn’t even have the decency to make it a variable in the code. Instead, the code either uses hard-coded references to index.php or just assumes the default index file of the install directory. Lame.

But I like all the other aspects of WordPress enough that I decided it was worth a bit more investigation. First I thought of making a symlink, but I don’t have shell access to my site, and the FTP server is configured with a restricted set of SITE command options that doesn’t include symlink. So this is what I ended up with:

  • My home page is at /index.html
  • My blog page is at /index.php (fortunately the server is configured to give precedence to index.html)
  • I added the following to my root .htaccess file:
    RewriteEngine On
    RewriteRule ^blog(.*) /index.php$1 [R=301,L]

    Since WordPress builds all its dynamic pages by relying on arguments to index.php, it’s important that this redirect doesn’t drop query string arguments (and it doesn’t). I just have to be careful not to create any directories that start with the word “blog.”

Goodbye Movable Type, Hello WordPress

I have to thank Pat W for his comment on my post yesterday – he suggested I try WordPress and TextPattern. I’ve done a test installation of WordPress, and it’s a dream. I had been resisting checking out other blog applications, as I’ve become so familiar with Movable Type I didn’t want to have to learn a new system. But there’s no denying the superiority of WordPress: the installation was fast and easy, and after even just a quick run through the administrative interface, it’s obvious that the features for creating entries, theme management, etc. (even including overall site management), are superior.

I’ve realized the problem with Movable Type is that it’s based on a web programming paradigm that’s at least 8 years old, when a “dynamic” page was something you generally only saw after filling out a form, and you needed to stick with static pages whenever possible anyway, for the sake of conserving server resources (it’s the kind of stuff I used to teach in my CGI/Perl class in 2000). The web has moved beyond that, but only with the latest version of Movable Type does it seem that Six Apart has even started to catch on. What’s ironic is that the early success of Movable Type has allowed Six Apart to become a real company (now with at least 50 employees), but they’re clearly not keeping up with what the smaller competitors are doing. Instead they’ve let themselves be distracted with developing TypeKey, a project they would have known was doomed from the start if they had just asked somebody (not even Microsoft has the market leverage to get everyone to sign on to a central registration system – how many of you use Microsoft Passport?). A slightly out-of-date but still useful overview of Movable Type and its competitors is at the Unbounded blog entry Goodbye, Movable Type.

The time-consuming part of the transition will be migrating my stylesheets and templates away from the custom Movable Type tags and into PHP. But since PHP is the main language I program in now, it’ll ultimately be a good thing.

Building a PVR with Mike & Chris

My blogging has been sparse recently because my usual blogging time has been taken over by my Personal Video Recorder (PVR) assembly project. A PVR is a do-it-yourself TiVo. The main advantage over TiVo is that you don’t have to pay anyone a monthly subscription fee. The disadvantage, especially if you want to use your PVR with a TV instead of a computer monitor, is that you’re dealing with bleeding edge technology. That means you’ll find lots of debates about the “right way” to configure the system and you’ll inevitably hit a few snags while setting things up. But if you’re a geek, that’s also what makes it fun. And besides, according to the New York Times, everybody’s doing it.

My friend and co-worker Chris wanted to build one too, so we decided to pool our expertise and save on shipping costs by buying our components together. We built our systems from scratch. Where things got tricky was deciding what to do for the TV tuner card, video card, and PVR software. We did lots of Googling and browsed through the SageTV forums to assess our options (the Build Your Own PVR site was also helpful). The only thing everyone agreed on is that you need a TV tuner card with hardware-based encoding, so that writing your favorite TV shows to files doesn’t slow your PC to a crawl.

Where folks disagreed was on how to get the best picture when decoding the files back to your screen. If you use an ordinary TV tuner card (like the Hauppauge PVR-150) with an ordinary video card, and run the output via S-video to your TV, the picture quality will, at best, be about the same as a VHS tape. I started with a configuration like that, and was disappointed with the results. Cartoons, with their limited use of color and detail, looked fine, but live action scenes, especially if they involved hard-to-digitize video elements like smoke, looked lousy.

There are two ways to a better picture. One is to go with a higher-end video card that’s designed for gaming (specifically, an ATI or NVIDIA card). Most of the folks in the forums who were watching on a TV instead of a computer monitor used the S-video out on these cards. Some of the new cards apparently have a component out as well. Some used a VGA-to-component adapter (but you have to be careful not to blow up your TV!).

The other approach is to go with the Hauppauge PVR-350, which is a tuner card that has a hardware-based decoder and S-video out built-in. This is supposed to give the best possible picture, but there’s a major drawback: it will only output data that was processed through the tuner. That is, you can’t use it as a substitute for a video card (e.g. it won’t show your Windows desktop). The big breakthrough for this card came last year when the SageTV folks figured out how to run their TV scheduling video overlays through it, so you could at least see that much through your TV.

I currently have the PVR-150, which I’m going to give to Chris (as he hasn’t bought his TV tuner card yet, and that’s the one he wants). I’m going to get the PVR-350. Since I don’t want to have a computer monitor sitting next to the TV, I’m also going to get an S-video selector box. My TV only has one S-video input, so I can use the selector box to switch between the output from the TV tuner card and the output from the video card. I’ll probably only need access to the desktop every once in a while, so it should work out fine.

My main motivation for doing all this was to provide time-shifting and commercial-removal for the shows Kai watches. SageTV is really cool: you can easily search the TV schedule for a show (say, Sesame Street) and then tell it to record every new instance (i.e. so it won’t record a re-run if you’ve already recorded it), and then you can tell it to keep, say, the 10 most recent episodes on the hard drive, and to just delete older ones. And I’ll finally be able to keep up with The Daily Show! I’m putting Kai to bed when it’s on at 7, and I’m asleep when it’s on at 11. Since The Daily Show seems to have more than your average number of commercials, I bet I can watch an entire episode in 15 minutes after skipping them.

The other cool thing is that you can hook up your VCR to it, so you can transfer your videotapes to DVD. We have some infant videos of Kai I’d like to digitize! And I guess it’s the only way I’ll ever see the non-Special Edition of Star Wars on DVD…

Keeping up with P2P

Back in the day (all of 2 years ago), I found good stuff on Napster. The concerns over copyright weren’t significant in regard to my interests, as I was mostly looking for obscure tracks (B-sides, concert bootlegs, etc.) from obscure bands (NoMeansNo, Ed’s Redeeming Qualities, Steroid Maximus, etc.) – not the kind of stuff that’s going to hurt anybody’s record sales. (Rather than fighting new technology like most of the music industry, insound.com has embraced it, and they’re making a bundle, but that’s another topic…)

There were two reasons Napster had such a huge library: 1. it was the only significant online P2P system around at the time, and 2. it got a huge amount of free publicity from the news media. Now we have a number of different P2P networks and a variety of client software packages to choose from. Venturing into this world, I stumbled around for a while before figuring out the best approach. There are a lot of client software options: Morpheus, LimeWire, BearShare, Xolox, Phex, neoNapster, Shareaza, and more. There are also different networks you can connect to: Gnutella1, Gnutella2, the eDonkey network, BitTorrent, and probably more.

I started out with Morpheus but it was a huge resource hog. It also came with spyware. I then tried LimeWire, which was much nicer to my PC and did not contain spyware. But my searches would not persist. What I mean is this: I’d enter a search, and it would chug away for 10-15 minutes, and then it would essentially forget about it. My search would continue to display, but if I didn’t get any results right away, then I would never get any at all. I’d have to keep re-running the search to keep up with changes on the network. I’ve now settled on using Shareaza, which runs nicely, has no spyware, connects to all the major networks, and diligently runs my searches continuously.

The eDonkey network is better at finding what you’re looking for and has sophisticated handling for large files, which means a lot of the action, especially for movies, has moved there. The downside is that the queues on eDonkey can be mighty long (I often get a queue position over 1,000) so you have to be willing to leave your PC on continuously (and hope your network connection doesn’t go down, which will force you out of your queue position if you can’t get back online quickly). Out of curiosity I downloaded a couple of movies. I found the quality to be poor: really major compression artifacts were always a problem, and some of the movies were just recorded by someone in a theater with a camera, so you sometimes get people walking in front of the camera, ambient noise, etc. But given the ever-increasing bandwidth capacity of networks, and the ever-improving compression technologies, I imagine Hollywood is soon going to fully join up with the music industry in its war against filesharing.

AIM Users Beware

If you use AOL instant messenger, it has probably installed a program called wildtangent on your system. This is an online gaming plugin. According to the company that makes it, it’s not doing anything pernicious. But spyany.com says that it will share your name, address, phone number and email address (if it can get them, presumably from your AOL profile) and track your software product usage. They also tell you how to uninstall it.

What’s interesting is that AOL didn’t modify their EULA to cover the inclusion of wildtangent until after they began distributing it with AIM. So even if you actually bothered to read the fine print, you wouldn’t have known about it.

I discovered wildtangent installing itself on my system when my spybot system monitor caught it trying to make changes to my system registry. I’ve been refusing my AIM client’s recent attempts to upgrade itself, but that apparently doesn’t stop the wildtangent installation from happening. If you’re using Windows, I highly recommend spybot – you can download it for free.