Ruby Encoding
I still remember the frustration in the migration from Ruby 1.8 to 1.9 for Performance Analysis Suite. Most of the pain came from encoding problems.
Encoding in Ruby could drive you crazy if you have not dealt with it before or if you are coming from other languages like Java and Python. I found Yehuda Katz has the most comprehensive explanation on the encoding topic. You should read his Encodings, Unabridged and Ruby 1.9 Encodings: A Primer and the Solution for Rails first. If you ever want to know more background on encoding and Unicode, Joel Spolsky has the best treatment: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
It actually is not all that hard.
In Ruby (1.9+), each String carries its own Encoding. This is different from languages like Java and Python, which transcode every String to the same Unicode representation. When all Strings use the same encoding, they can be freely mixed together by various operations (like concatenation), and developers don’t need to worry about encoding when dealing with Strings. Do you remember ever seeing an encoding exception thrown from a String operation in Java? In Ruby it can happen:
incompatible character encodings: ISO-8859-1 and UTF-8 (Encoding::CompatibilityError)
The reason for this error is dead simple: the ISO-8859-1 and UTF-8 encodings are not compatible, and we just can’t mix Strings in them together (unless at least one of the Strings contains only ASCII characters).
Note that this is a runtime error, raised only when two Strings with incompatible encodings come together. It is quite likely that things work perfectly fine until some day it blows up.
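The error is easy to reproduce. The following is a minimal sketch, assuming a UTF-8 source file (the default on Ruby 2.0+); the variable names are made up for illustration.

```ruby
# Build the same word in two different encodings.
utf8   = "caf\u00E9"                # "café" tagged as UTF-8
latin1 = utf8.encode("ISO-8859-1")  # "café" transcoded to ISO-8859-1

# Both Strings contain a non-ASCII character, so concatenation fails at runtime.
error_message = nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  error_message = e.message
end

# An ASCII-only String can still be mixed with either one.
mixed_encoding = (latin1 + "abc").encoding
```

Nothing about the two Strings looks different when they are printed, which is why this kind of bug tends to surface late.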
To prevent this error, we have to make sure we never mix Strings of incompatible encodings. One way is to manually transcode either one to the other’s encoding. String#encode makes it really easy to do that (more on this method’s usage later in this article).
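As a small sketch of manual transcoding (the sample Strings are invented, and a UTF-8 source file is assumed):

```ruby
latin1 = "caf\u00E9".encode("ISO-8859-1")  # pretend this came from a legacy source
utf8   = "na\u00EFve"                       # "naïve" tagged as UTF-8

# Transcode one side so both Strings share an encoding before mixing them.
combined = utf8 + " " + latin1.encode(utf8.encoding)
```

encode returns a new String and leaves latin1 untouched; encode! would change it in place.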
An even easier way is to make Ruby behave like Java and Python and transcode all Strings to the same encoding by default:
Encoding.default_internal = 'UTF-8'
(If you are still using 1.9, you also need to add the magic comment # encoding: utf-8 as the first or second line of every source file. On 2.0 or later, the default source encoding is already UTF-8.)
Ruby, the Ruby standard libraries, and most major libraries should already respect this option. But that’s just half of the story. If you are responsible for bringing Strings into Ruby, you have to make sure they are transcoded from the correct encoding to the intended encoding:
# assumes Encoding.default_internal = 'UTF-8' has been set
str = File.binread("…") # str.encoding is ASCII-8BIT
str.force_encoding("Shift_JIS").encode! # Ruby's name for this encoding is Shift_JIS
force_encoding tells Ruby to start using the given encoding for the String, and encode! then re-encodes it from Shift_JIS to Encoding.default_internal. Note that force_encoding does not cause any re-encoding; it merely re-tags the String’s encoding. What this code snippet does is read a file in, decode it as Shift_JIS, and then transcode the resulting String to UTF-8. You could achieve the same effect with the following code:
# assumes Encoding.default_internal = 'UTF-8' has been set
str = File.open("…", "r:Shift_JIS") { |f| f.read } # the block closes the file
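To see the difference between re-tagging and transcoding, here is a self-contained sketch. It writes a throwaway Shift_JIS sample with Tempfile (an assumption for the demo, not part of the original setup) and names UTF-8 explicitly as the target instead of relying on Encoding.default_internal.

```ruby
require "tempfile"

original = "\u65E5\u672C\u8A9E"  # "日本語" as UTF-8

# Write a small Shift_JIS sample file to read back (demo setup only).
file = Tempfile.new("sjis-sample")
file.binmode
file.write(original.encode("Shift_JIS"))
file.close

# Approach 1: read raw bytes, re-tag them, then transcode.
raw = File.binread(file.path)
bytes_before = raw.bytes
raw.force_encoding("Shift_JIS")
retag_kept_bytes = (raw.bytes == bytes_before)  # re-tagging leaves the bytes alone
via_retag = raw.encode("UTF-8")                 # transcoding rewrites them

# Approach 2: let the IO layer decode Shift_JIS and transcode to UTF-8.
via_mode = File.open(file.path, "r:Shift_JIS:UTF-8") { |f| f.read }

file.unlink
```

Both approaches should end with the same UTF-8 String; the second just pushes the work into the file mode.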
That’s probably all you need to know about Ruby encoding. You also need to be careful of the source encoding when reading data in (via IO, File, etc.), but that is really not specific to Ruby.
Now, on to my experience dealing with a few encoding-related problems.
Dealing with Unknown Encoding
Before you read in a file or receive some data from the internet, you have to know what encoding it uses in order to decode it. What if we just don’t know its encoding?
I’m not sure if there is a complete solution for auto-detecting encoding. At least not that I can find easily.
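Short of real detection, one rough heuristic is to try a few candidate encodings and keep the first one under which the bytes are at least valid. The sketch below is only that: guess_encoding and its candidate list are hypothetical, and valid bytes do not prove the guess is right (ISO-8859-1, for instance, accepts any byte sequence, so it only makes sense as a last resort).

```ruby
# Hypothetical helper: return the name of the first candidate encoding under
# which the raw bytes form a valid String, or nil if none fits.
def guess_encoding(bytes, candidates = ["UTF-8", "Shift_JIS", "ISO-8859-1"])
  candidates.find { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

utf8_bytes   = "\u65E5\u672C\u8A9E".b                      # "日本語" in UTF-8
sjis_bytes   = "\u65E5\u672C\u8A9E".encode("Shift_JIS").b  # same text in Shift_JIS
latin1_bytes = "caf\u00E9".encode("ISO-8859-1").b          # "café" in ISO-8859-1

guesses = [utf8_bytes, sjis_bytes, latin1_bytes].map { |b| guess_encoding(b) }
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because permissive ones will match almost anything.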
The first problem in Performance Analysis Suite was with DB2 snapshot text output, which is plain text without any encoding information. Most user databases now use the UTF-8 codeset, so UTF-8 was chosen to read in the snapshots. The only problem is with non-UTF-8 databases, where the SQL queries in snapshots might contain literals incompatible with UTF-8: either the literal contains a byte sequence that is illegal in UTF-8, or it contains non-ASCII characters that look weird when interpreted as UTF-8.
Since the DB2 snapshot statement tab has a simulated statement concentration function, the actual literal values are not really critical to the performance analysis. It is therefore okay for our application to accept some data loss. So we remove all invalid characters from the SQL queries:
str = File.binread("…") # str.encoding is ASCII-8BIT
str.encode!('UTF-8', 'ASCII-8BIT', invalid: :replace, undef: :replace, replace: '')
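Any byte from the source encoding ASCII-8BIT that cannot be transcoded to UTF-8 is removed (replaced by an empty String). Because ASCII-8BIT defines no mapping to UTF-8 for bytes above 0x7F, this drops every non-ASCII byte, including ones that were part of valid UTF-8 characters. In the end, you get the snapshot text output read in as a UTF-8 String.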
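A small sketch makes the data loss concrete, and also shows a gentler alternative that is not what Performance Analysis Suite used: String#scrub (Ruby 2.1+), which removes only the byte sequences that are illegal in UTF-8. The sample data is invented.

```ruby
valid_part = "name = '\u65E5\u672C'"  # valid UTF-8 containing two Japanese characters
raw = valid_part.b + "\xFF".b          # plus one byte that is illegal in UTF-8

# Transcoding from ASCII-8BIT: every byte above 0x7F is dropped.
ascii_only = raw.encode("UTF-8", "ASCII-8BIT", invalid: :replace, undef: :replace, replace: "")
# ascii_only is "name = ''"

# Alternative: tag the bytes as UTF-8 and scrub only the illegal sequences.
scrubbed = raw.dup.force_encoding("UTF-8").scrub("")
# scrubbed still contains the two Japanese characters
```

Which one is appropriate depends on how much of the non-ASCII content is worth keeping; for the snapshot literals, losing all of it was acceptable.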
Using Binary when You Need Binary
Another problem was related to dumping possibly binary data to a YAML file. A bitwise operation was applied to each character, and the resulting Strings were then persisted by YAML.dump. We used to apply the bitwise operation directly to the String with UTF-8 encoding. This caused a lot of pain, as there’s no reliable way to read it back.
We should have explicitly called force_encoding("ASCII-8BIT") before applying the bitwise operation:
# change to binary before applying bitwise operation
str.force_encoding('ASCII-8BIT')
result = ''.force_encoding('ASCII-8BIT')
# bitwise operation
str.size.times {|i| result << bitwise(str[i].ord, decoding)}
# change back to UTF-8 if decoding
result.force_encoding('UTF-8') if decoding
When decoding, after the bitwise operation we just tag the encoding back to UTF-8.
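Here is a round-trip sketch of the same idea. The real bitwise operation is not shown in this article, so the bitwise below is a stand-in (an XOR with a made-up key), and transform is a hypothetical wrapper around the snippet above.

```ruby
KEY = 0x5A  # hypothetical key, for illustration only

# Stand-in for the real operation: XOR is its own inverse, so the same
# function serves both directions and ignores the decoding flag.
def bitwise(byte, _decoding)
  byte ^ KEY
end

def transform(str, decoding)
  input = str.dup.force_encoding("ASCII-8BIT")  # work on raw bytes
  result = "".b                                  # empty binary String
  input.each_byte { |b| result << bitwise(b, decoding) }
  result.force_encoding("UTF-8") if decoding     # re-tag only when decoding
  result
end

original  = "caf\u00E9"
scrambled = transform(original, false)  # binary data, safe to persist as such
restored  = transform(scrambled, true)  # back to the original UTF-8 String
```

Working byte by byte on a binary String means the scrambled data never has to be a valid UTF-8 String, which is what made the original approach unreliable.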