Converting files to UTF-8

Here’s a problem you often face as a Java ME programmer:

You’re internationalizing your game or application, so you send all of your files full of text and labels to the translators, and you get back a bunch of files saved in some standard character encoding scheme, but not utf-8. You can’t bundle these files directly into the resource directory of your game’s jar file because some devices won’t be able to read them. Ideally, the solution is to tell the translation service that you need the files in utf-8 format, but often it isn’t the developer who is in charge of this, and sometimes that information gets lost in the shuffle. So your product manager hands you a pile of files and leaves you to figure out how to make them work.

Many standard text editing programs (emacs, for example) are capable of reading in a text file in one encoding and saving it in another. But if you’re a professional software engineer, you don’t want to waste your time opening up fifty files one by one, changing the encoding, and resaving them — especially if you’re likely to get more files and updates later.

What to do?

There are command-line tools that can change the character encoding of a file (iconv on Unix-like systems, for example, or the JDK’s native2ascii). But looking at my options, I’d say the simplest and most portable solution is to just write a trivial little Java SE file converter, like this:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

/**
 * A utility to convert text files to utf-8.
 */
public class FileEncoder {

    /**
     * args[0] is the input file name and args[1] is the output file name.
     */
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            // for a regular file, available() gives its length in bytes;
            // readFully() keeps reading until the buffer is filled, since
            // a single read() call isn't guaranteed to do that:
            byte[] contents = new byte[fis.available()];
            new DataInputStream(fis).readFully(contents);
            fis.close();
            // decode the raw bytes as ISO Latin-1 text...
            String asString = new String(contents, "ISO8859_1");
            // ...and re-encode the same characters as utf-8:
            byte[] newBytes = asString.getBytes("UTF8");
            FileOutputStream fos = new FileOutputStream(args[1]);
            fos.write(newBytes);
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
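To use it, compile the class and pass the input and output file names on the command line. For example (the file names here are just placeholders):

javac FileEncoder.java
java FileEncoder labels_fr.txt labels_fr_utf8.txt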

Because it’s written in Java, you can call this directly from your Ant build script (see this post for an example of calling an arbitrary Java program from an Ant script). That way you can actually leave the originals as they are and create the corrected files on the fly while building the rest of the resources for each target device.
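As a rough sketch of what that Ant call could look like (the target name, paths, and classpath below are made up for illustration, not taken from a real build file):

<target name="convert-labels" depends="compile">
    <!-- run the converter in its own JVM and fail the build on error -->
    <java classname="FileEncoder" classpath="build/classes"
          fork="true" failonerror="true">
        <arg value="translations/labels_fr.txt"/>
        <arg value="res/labels_fr.txt"/>
    </java>
</target>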

In the above example, I’ve hard-coded “ISO8859_1” as the encoding of the source file. That’s ISO Latin-1, a character encoding I see a lot of here in France. For a list of other encodings supported by Java (and their names for use in Java), look here. Note that the names of the encodings are a little different in Java SE (formerly J2SE) than they are in Java ME (J2ME). So in the above Java SE program I write the output file in “UTF8”, but once I’ve read the resource file into a byte array in the Java ME program on the device, I convert it to a String as follows:

String contentsOfMyDataFile = new String(dataFileByteArray, 0, dataFileByteArray.length, "utf-8");
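For context, here’s a minimal sketch of how that byte array might be read from a resource bundled in the jar on the device (the resource name, e.g. “/labels.txt”, is just an illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Reads a resource bundled in the jar and returns its contents
 * as a String, assuming the file was saved as utf-8.
 */
String readUtf8Resource(String resourceName) throws IOException {
    InputStream is = getClass().getResourceAsStream(resourceName);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    int bytesRead;
    // read() may return fewer bytes than requested, so loop until EOF:
    while ((bytesRead = is.read(buffer)) != -1) {
        baos.write(buffer, 0, bytesRead);
    }
    is.close();
    byte[] dataFileByteArray = baos.toByteArray();
    return new String(dataFileByteArray, 0, dataFileByteArray.length, "utf-8");
}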

Now if you want to hard-code a string with non-ASCII characters directly into your Java ME application, what you do is completely different from reading resource files from the jar. In the code, you use escape sequences: “\u” signals that what follows is a Unicode character code. A standard example is printing a price in euros: to put the euro symbol in a String in your code, you would write “\u20ac”.
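For instance (the variable name is just for illustration):

// "\u20ac" is the Unicode escape for the euro sign:
String price = "2,50 \u20ac"; // displays as "2,50 €"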

For a list of character code charts, look here.

6 comments so far

  1. Aline Diab

    If you want to convert to UTF-8 format you must reverse the two formats, as in the following code:
    String asString = new String(contents, "UTF-8");
    byte[] newBytes = asString.getBytes("ISO-8859-1");

    Thanks
    Aline

    • Manly

      The original code is definitely correct. When something exists within a Java String object, it is already in Unicode; there are no two ways about that. So a String is a String: it is impossible for a “readable” Java String object to hold UTF-8, ASCII, ISO-8859-3, or any encoding other than the one ALL Strings use. In fact a String should be thought of as pure and absolute Unicode, i.e. built from code points (it actually uses UTF-16 internally, but this is almost irrelevant to the programmer).

      Instead, you only need to consider the actual character encoding when the data is written to a file, converted back to a byte array/stream, or when parsing XML; usually a String holding an XML document will have a line such as <?xml version="1.0" encoding="UTF-8"?>. When you read a FileInputStream (which is a binary stream) into a byte[] array, it retains the original binary encoding, so it remains in that source encoding. To convert that byte[] array into a String, you obviously need to specify the encoding. But once you have generated the String object, it no longer exists in that original encoding (a String is a String, whereas an InputStream does have a specific encoding that you need to be aware of).

  2. Rolf Thunbo

    What’s wrong with the Ant Copy task? On this task you can set an outputencoding.

    /Rolf

  3. carolhamer

    Good point. The above solution is one quick way to do it if you don’t already have Ant installed, but obviously I have Ant installed. 😉

  4. kcobuntu

    I have a good question: what happens if you don’t know the encoding of the input file?

