Saturday, February 14, 2009

Unicode in Perl

Sometimes it feels that Perl's power in string manipulation comes at the cost of an awkward syntax.

When you open a file for reading and do not care what encoding its contents are in, you write:


open FILE, "<".$filename or die $!;


But if you do care about the encoding, you should open the file like this:


open ENC_FILE, "<:encoding(cp1251)", $enc_filename or die $!;


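The key point is the comma after the encoding instruction. If you put "." there instead (which concatenates the mode string "<:encoding(cp1251)" and the filename into a single argument), the file fails to open. With only one argument after the filehandle, Perl treats everything following "<" as the filename, so it looks for a file whose name begins with ":encoding(cp1251)".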
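The difference between the comma and the dot can be checked with a small script. This is only a sketch: the temporary file and the variable names ($three_arg_ok, $two_arg_ok) are made up for the illustration, and the sample bytes are assumed to be the word "Привет" in cp1251.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: a few cp1251 bytes written in raw mode.
my ($tmp_fh, $enc_filename) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xCF\xF0\xE8\xE2\xE5\xF2\n";   # "Привет" encoded as cp1251
close $tmp_fh or die "close failed: $!";

# Comma: mode and layer are one argument, the filename is the next one.
my $three_arg_ok = open(my $in, "<:encoding(cp1251)", $enc_filename) ? 1 : 0;
close $in if $three_arg_ok;

# Dot: a single string, so everything after "<" is taken as the filename
# and Perl looks for a file named ":encoding(cp1251)" followed by the path.
my $two_arg_ok = open(my $bad, "<:encoding(cp1251)" . $enc_filename) ? 1 : 0;
close $bad if $two_arg_ok;

print "open with comma: ", ($three_arg_ok ? "ok" : "failed"), "\n";
print "open with dot:   ", ($two_arg_ok   ? "ok" : "failed"), "\n";
```

The sketch uses lexical filehandles (my $in) rather than barewords like ENC_FILE; that is a style choice and does not change how the arguments are parsed.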

Another important point: if you know in advance which encoding the file contents are in, specify it with the encoding instruction above. Perl then decodes everything you read into its internal string representation, which is based on UTF-8, so the rest of the program works with characters rather than raw bytes.
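To see what the decoding step changes, here is a minimal sketch that reads the same file twice, once as raw bytes and once through the encoding layer. The temporary file and the variable names ($raw_code, $first_code, $char_count) are assumptions for the example, and the sample bytes again stand for "Привет" in cp1251.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: the word "Привет" stored as six cp1251 bytes.
my ($tmp_fh, $enc_filename) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xCF\xF0\xE8\xE2\xE5\xF2";
close $tmp_fh or die "close failed: $!";

# Without a layer for the encoding we only see the raw byte values.
open(my $raw, "<:raw", $enc_filename) or die "open failed: $!";
my $bytes = <$raw>;
close $raw;
my $raw_code = sprintf "0x%02X", ord $bytes;    # value of the first byte

# With the encoding layer the bytes are decoded into characters.
open(my $in, "<:encoding(cp1251)", $enc_filename) or die "open failed: $!";
my $text = <$in>;
close $in;
my $char_count = length $text;                  # counts characters
my $first_code = sprintf "U+%04X", ord $text;   # code point of the first character

# Decoded strings have to be encoded again on output, here as UTF-8.
binmode STDOUT, ":encoding(UTF-8)";
print "raw first byte: $raw_code\n";
print "decoded: $char_count characters, first is $first_code ($text)\n";
```

Because cp1251 is a single-byte encoding, the length is the same either way; the visible difference is that the first element is a Cyrillic code point instead of a byte value.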
