A comment on a previous post deserves a followup:
If you’re interested in writing it up, I would certainly be interested in reading about the details of the utf-8 data issues you experienced (and how you fixed it).
It’s a fair question, and easy to answer once you know what to look for, but not entirely obvious. The symptom I had was that my copyrights, which have the © symbol in them, were showing spurious characters in them; it was clearly a weird UTF-8 issue (I love the “I’ve dealt with this before, now I just have to remember how” problems).
My first thought was that I just needed to convert the character into an HTML entity. I loaded up “HTML::Entities” and ran the string through its encode_entities(); that’s the right thing to do in general, but, well, it didn’t fix the problem.
The not quite so obvious answer: Perl’s internals predate Unicode support, so there’s been a lot of whacking it with a stick to make it work with international character sets. One side effect of that is that unless it knows you’re using UTF-8, or you tell it you are, it assumes everything is 8-bit data, one byte per character (effectively Latin-1 rather than plain ASCII). If you’re doing Unicode type things within the code itself, Perl will figure it out and it’s (mostly) transparent to the programmer.
Not so with external data; typically, this is a problem when reading in from a database, but EXIF data loaded from an image is handled the same way. Unless you tell Perl that the data is UTF-8, it treats it as plain 8-bit data, so the two bytes that encode © show up as two separate characters.
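To make the byte-versus-character distinction concrete, here is a minimal sketch using only the core Encode module. The sample copyright string is made up for illustration; it is not taken from my actual EXIF data.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical sample value: the copyright sign (U+00A9) as it arrives
# from outside the program, i.e. as two raw UTF-8 bytes.
my $raw = "\xC2\xA9 2010 Example Photographer";

# Without decoding, Perl counts each byte as its own character.
my $raw_length = length $raw;

# After decoding, the two bytes collapse into a single character.
my $text        = decode_utf8($raw);
my $text_length = length $text;

printf "raw length: %d, decoded length: %d\n", $raw_length, $text_length;
```

The decoded string comes out one character shorter than the raw one, which is the whole problem in miniature.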
There are a couple of ways of doing this. What I ended up doing was loading in the Encode library (“use Encode;”) and then running the string through decode_utf8(). That tells Perl to interpret the bytes as UTF-8 and convert them into its internal Unicode character representation. After that, it’ll handle things behind the scenes for you (mostly).
$s .= '<div class="piccopy">' . encode_entities(decode_utf8($$picinfo{'Copyright'})) . '</div>' . "\n";
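For anyone who wants to see the before and after side by side, the sketch below wraps the same two calls in a small helper. The helper name html_from_utf8_bytes and the sample hash are hypothetical stand-ins, not part of my script; encode_entities() and decode_utf8() are the real functions from HTML::Entities and Encode.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: decode raw UTF-8 bytes first, then escape for HTML.
sub html_from_utf8_bytes {
    my ($bytes) = @_;
    return '' unless defined $bytes;
    return encode_entities(decode_utf8($bytes));
}

# Stand-in for the EXIF hash; the value holds raw UTF-8 bytes.
my $picinfo = { Copyright => "\xC2\xA9 Example Photographer" };

# Entities alone: each of the two bytes becomes its own entity.
my $broken = encode_entities($picinfo->{Copyright});

# Decode, then entities: the bytes become one character, then one entity.
my $fixed = html_from_utf8_bytes($picinfo->{Copyright});

print "$broken\n$fixed\n";
```

The first line printed carries a stray &Acirc; in front of the &copy;, which matches the spurious characters I was seeing; the second has only the &copy;.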
You can also tell Perl that any data coming from an incoming stream is UTF-8 when using open() and the like. Google is your friend here.
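As a rough sketch of the stream approach, an encoding layer on open() has Perl decode the data as it reads. The scratch file here exists only to make the example self-contained; it is an assumption of the example, not something my script does.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the copyright sign as two raw UTF-8 bytes to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                               # raw bytes, no encoding layer
print {$out} "\xC2\xA9 Example\n";
close $out or die "Cannot close $path: $!";

# Reopen with an encoding layer so Perl decodes while reading.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Encode on the way out as well, so the character is written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print length($line), " characters read\n";
```

With the layer in place there is no separate decode_utf8() call; the string is already characters by the time it reaches the rest of the code.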
So the answer is fairly simple, the causes somewhat baroque, and frankly, I’m probably being a bad person by not building unicode support into my scripts automatically (but I’ve been coding Perl a long, long time, and habits die hard). This is a place where I need to update my best practices, probably.
And I still need to clean up this script so that all of the incoming EXIF data is properly decoded. I solved this problem, but I haven’t yet updated the script to solve this issue generally for all of the data. And yes, that is in the TODO list…
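The general fix would probably look something like the sketch below: decode every string value once, right after the EXIF data is loaded. The helper name decode_all_tags and the sample hash are hypothetical, and this assumes every value really is raw UTF-8 bytes, which I haven’t verified for all tags.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: return a copy of a tag hash with every plain
# string value decoded from UTF-8 bytes into characters.
sub decode_all_tags {
    my ($info) = @_;
    my %decoded;
    for my $tag (keys %$info) {
        my $value = $info->{$tag};
        # Leave undefined values and references (nested data) untouched.
        $decoded{$tag} = (defined $value && !ref $value)
            ? decode_utf8($value)
            : $value;
    }
    return \%decoded;
}

# Stand-in for the loaded EXIF data; values are raw UTF-8 bytes.
my $picinfo = {
    Copyright => "\xC2\xA9 Example",
    Artist    => "Zo\xC3\xAB",
};

my $clean = decode_all_tags($picinfo);
printf "Artist has %d characters\n", length $clean->{Artist};
```

This should run exactly once per value, at the point where the data comes in; decoding a string that has already been decoded is a good way to reintroduce the garbage.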

