perl, UTF-8, and photo EXIF data…

A comment on a previous post deserves a followup:

If you’re interested in writing it up, I would certainly be interested in reading about the details of the utf-8 data issues you experienced (and how you fixed it).

It’s a fair question, and easy to answer once you know what to look for, but not entirely obvious. The symptom I had was that my copyrights, which have the © symbol in them, were showing spurious characters in them; it was clearly a weird UTF-8 issue (I love the “I’ve dealt with this before, now I just have to remember how” problems).

My first thought was that I just needed to convert the character into an HTML entity. I loaded up “HTML::Entities” and ran the string through its encode_entities(); that’s the right thing to do in general, but, well, it didn’t fix the problem.

The not quite so obvious answer: Perl’s internals predate UTF-8, so there’s been a lot of whacking it with a stick to make it work with international character sets. One side effect of that is that unless it knows you’re using UTF-8, or you tell it you are, it assumes everything is plain 8-bit data (one byte per character, effectively Latin-1). If you’re doing Unicode-type things within the code itself, Perl will figure it out and it’s (mostly) transparent to the programmer.

Not so with external data; typically, this is a problem when reading in from a database, but EXIF data loaded from an image is handled the same way. Unless you tell Perl that data may have UTF-8 data in it, it treats it as 8bit data.

There are a couple of ways of doing this. What I ended up doing was loading in the Encode library (“use Encode;”) and then running the string through decode_utf8(). That tells Perl to treat the string as unicode and does the necessary internal conversions. After that — it’ll handle things behind the scenes for you (mostly).

$s .= '<div class="piccopy">' . encode_entities(decode_utf8($$picinfo{'Copyright'})) . '</div>' . "\n";
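To see what decode_utf8() actually changes, here is a minimal sketch using only the core Encode module. The copyright string is made up for illustration; it is not taken from a real image.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical raw EXIF value: the copyright sign arrives as the two
# UTF-8 bytes 0xC2 0xA9, followed by plain ASCII text.
my $raw = "\xC2\xA9 2012 Example Photographer";

# Before decoding, Perl sees one character per byte, so the copyright
# sign counts as two characters (0xC2 and 0xA9).
# After decoding, those two bytes become the single character U+00A9.
my $text = decode_utf8($raw);

printf "raw:     %d characters, first is U+%04X\n", length($raw),  ord($raw);
printf "decoded: %d characters, first is U+%04X\n", length($text), ord($text);
```

The undecoded string is one character longer, and that extra leading character is the spurious one that showed up in front of the © in my output.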

You can also tell Perl that any data coming from an incoming stream is Unicode when using open() and the like. Google is your friend here.
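One way to do that, sketched under the assumption that the input really is UTF-8, is to put an :encoding(UTF-8) layer on the filehandle. The temporary file below exists only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 file to read back (temporary, for illustration only).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} "\xC2\xA9 2012\n";
close $tmp_fh or die "close failed: $!";

# The :encoding(UTF-8) layer decodes bytes into characters as they are
# read, so no explicit decode_utf8() call is needed afterwards.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Output handles want a layer as well, so characters are encoded back
# into UTF-8 bytes on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print "$line\n";
```

The same layer syntax works with binmode() on a handle that is already open, which is handy for STDIN and STDOUT.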

So the answer is fairly simple, the causes somewhat baroque, and frankly, I’m probably being a bad person by not building unicode support into my scripts automatically (but I’ve been coding Perl a long, long time, and habits die hard). This is a place where I need to update my best practices, probably.

And I still need to clean up this script so that all of the incoming EXIF data is properly decoded. I solved this problem, but I haven’t yet updated the script to solve this issue generally for all of the data. And yes, that is in the TODO list…
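For what it’s worth, the general fix might look something like the sketch below. The %picinfo hash and its contents are hypothetical stand-ins for whatever the EXIF reader returns, and the sketch assumes every string value is still raw UTF-8 bytes (decoding an already-decoded string would mangle it).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the EXIF fields read from an image.
my %picinfo = (
    Copyright => "\xC2\xA9 2012 Example Photographer",
    Artist    => "Example Photographer",
);

# Decode every plain string value in one pass. References and undefined
# values are copied through untouched. With no CHECK argument, decode()
# substitutes U+FFFD for malformed sequences instead of dying.
my %decoded;
for my $field (keys %picinfo) {
    my $value = $picinfo{$field};
    $decoded{$field} = (defined $value && !ref $value)
        ? decode('UTF-8', $value)
        : $value;
}
```

After that, everything downstream can use %decoded without having to remember which fields might contain non-ASCII characters.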

This entry was posted in Community Management, Photography.
  • Daniel J. Luke

    I’m no python expert, but aren’t the problems (and solutions) largely the same with both perl and python?

    Sticking just with utf-8:
    - If you have a source file that has utf-8 characters, you need to put something special in the script so that the interpreter does the right thing
    - If you are reading in/writing out data, you probably have to tell the interpreter the encoding of the data you are reading/writing (or you have to encode/decode the bytes you read)
    - If you are using an ‘external’ library (to parse xml, pull data from a db, something else) it may or may not handle things nicely for you

  • http://twitter.com/kevinmarks Kevin Marks

    Python’s well-thought-out Unicode handling is worth the switch from perl, IMO.

    • http://www.chuqui.com chuqui

      I agree completely. I had an ulterior motive for staying in perl, in that I needed to brush the rust off my perl for a work project, and this was a convenient way to do it. If I were starting any new, significant projects, I’d do them in python. And don’t be surprised if the next version of this beast is in python… but it was a really useful way to dive back into the perl pool for me, as well as a practical one.

      I would, in fact, say that 99% of the time, people should consider Python or node for a project any time they find themselves thinking “I wonder if CPAN has a module for that….” — much as I love perl, I find it hard to justify starting new projects in it. Even though I actually did do one…