Lenny Domnitser’s
domnit.org

This is a static archive of the domnit.org blog,
which Lenny Domnitser wrote between 2006 and 2009.

Javascript Fix for Mixed Character Set Errors

Unicode is good, but my knowledge basically stopped at higher-level encode/decode functions, so I decided to learn UTF-8. Turns out it’s easy, and the little pedagogic project I came up with turned out to be useful. I wrote a bookmarklet that converts broken stuff like “Ãœníçøde” to “Üníçøde” in Firefox. That is, it fixes UTF-8 that has been interpreted as ASCII, ISO-8859-1 (Latin-1), or Windows-1252.

Firefox already lets you switch encoding (View → Character Encoding) if the author of a web page incorrectly declared what character set to use, or didn’t declare one at all. My code is different from the built-in stuff in a few ways:

  1. It does not reload the page.
  2. It does not see the raw bytes. It works from the Unicode text that Firefox builds.
  3. It converts broken UTF-8 while preserving correct ISO-8859-1 text.

Characters of different encodings can get mixed together in naïve cut-and-paste scenarios. Here’s an example broken page. If you try to use Firefox to reload the page as UTF-8, some text will be fixed, but the previously good text will lose its “ó” for a big honking “�”.

UTF-8 can take anywhere from 1 to 4 bytes per character. One-byte UTF-8 characters are just 7-bit ASCII, a subset of the ISO and Windows sets, so they are left alone. The rest of the characters match byte patterns in certain ranges which are unlikely to appear in regular text. They can be described in a compact and machine-readable way by a regular expression:

var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g

Broken down, the regular expression matches one of:

  1. A byte between 0xC2 and 0xDF, followed by a byte between 0x80 and 0xBF, or
  2. A byte between 0xE0 and 0xEF, followed by 2 bytes between 0x80 and 0xBF, or
  3. A byte between 0xF0 and 0xF4, followed by 3 bytes between 0x80 and 0xBF.

These correspond to 2-, 3-, and 4-byte UTF-8 characters.
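To see the pattern at work, here is a small sketch run against a made-up sample string (the `sample` and `hits` names are mine, not part of the bookmarklet). It assumes the text was decoded as pure ISO-8859-1, so each original byte shows up as the character with the same code:

```javascript
var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g;

// Hypothetical sample: "café" and "€" arrived as UTF-8 bytes read as
// ISO-8859-1, while "señor" was genuine ISO-8859-1 all along.
var sample = 'caf\u00C3\u00A9 cost \u00E2\u0082\u00AC5, se\u00F1or';

// match() with the g flag returns every matching substring.
var hits = sample.match(pattern);

// Two sequences are found: the 2-byte form of "é" and the 3-byte form of "€".
// The lone \xF1 ("ñ") is not followed by continuation bytes, so it is skipped.
console.log(hits.length);
```

The good “ñ” is left alone because a lead byte only counts when the right number of continuation bytes follows it.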

Given a matching byte sequence, I can use a Javascript trick to make Unicode. First, URI-encode using the Unicode-ignorant escape function. Then, URI-decode using the Unicode-aware decodeURIComponent function. So, using the regular expression from above, this fixes UTF-8 text labeled as ISO-8859-1:

text.replace(pattern, function(s) {
    return decodeURIComponent(escape(s));
});
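Spelled out step by step, the trick looks roughly like this. The variable names are only for illustration, and the input is a mis-decoded “ü” written with escapes so the source file’s own encoding does not matter:

```javascript
// "ü" is the UTF-8 byte pair C3 BC; read as ISO-8859-1 it becomes "Ã¼".
var broken = '\u00C3\u00BC';

// escape() knows nothing about UTF-8: each character below 256
// simply becomes %XX with its code in hex.
var percent = escape(broken);            // "%C3%BC"

// decodeURIComponent() reads the %XX run as UTF-8 bytes.
var fixed = decodeURIComponent(percent); // "ü" (U+00FC)

console.log(percent, fixed);
```

Note that escape is a legacy function, so this is a sketch of the idea rather than something to build new code on.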

Except it doesn’t. Some characters failed to be converted properly. They were all in a certain range (0x80 to 0x9F), so I looked up ISO-8859-1 and saw that those were undefined codes. Not seeing any pattern, I built a translation table of undefined bytes and the Unicode characters they produced. The result looked familiar: Sam Ruby’s iñtërnâtiônàlizætiøn survival guide, which I obviously had not looked at closely enough before, listed the same data as a mapping from Windows-1252 to Unicode. Apparently, Firefox has some voodoo that takes text labeled as ISO-8859-1 and treats it as Windows-1252. Instead of keeping the byte intact, the string available to my script contains the Unicode equivalent of the Windows character, which has no correlation to the original byte, and therefore to the original UTF-8 text.

So, besides matching UTF-8 byte sequences, I must also match all the two-character sequences of [\xC2-\xDF] followed by the Unicode equivalent of some Windows-1252 characters. This led to somewhat less elegant code, partly because of the lookup table I’ve included, partly because escape produces some '...%uXXXX...' gobbledygook, which has to be fixed up before passing to decodeURIComponent.
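Here is a minimal sketch of the lookup-table idea. It is not the code in fixencoding.js: instead of patching up the '%uXXXX' output of escape, it maps each Windows-1252 character back to its original byte before escaping, and it widens every continuation position, not just the two-character case. The names `cp1252`, `toByte`, `mixedPattern`, and `fixMixed` are made up for this example.

```javascript
// Unicode code points that Windows-1252 assigns to bytes 0x80-0x9F.
// null marks the five bytes the code page leaves undefined.
var cp1252 = [
  0x20AC, null,   0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, null,   0x017D, null,
  null,   0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, null,   0x017E, 0x0178
];

// Reverse table (character -> original byte) plus a string of all the
// stand-in characters, for use inside a regex character class.
var toByte = {};
var specials = '';
for (var i = 0; i < cp1252.length; i++) {
  if (cp1252[i] !== null) {
    var ch = String.fromCharCode(cp1252[i]);
    toByte[ch] = String.fromCharCode(0x80 + i);
    specials += ch;
  }
}

// A continuation "byte" is either a real 0x80-0xBF character or a stand-in.
var cont = '[\\x80-\\xBF' + specials + ']';
var mixedPattern = new RegExp(
  '[\\xC2-\\xDF]' + cont +
  '|[\\xE0-\\xEF]' + cont + '{2}' +
  '|[\\xF0-\\xF4]' + cont + '{3}', 'g');

function fixMixed(text) {
  return text.replace(mixedPattern, function (s) {
    // Put the original bytes back, then apply the escape/decode trick.
    var bytes = s.replace(/[^\x00-\xFF]/g, function (c) { return toByte[c]; });
    try {
      return decodeURIComponent(escape(bytes));
    } catch (e) {
      return s; // not valid UTF-8 after all, so leave it untouched
    }
  });
}

console.log(fixMixed('\u00C3\u0153n\u00ED\u00E7\u00F8de')); // the "Ãœníçøde" example
```

The try/catch is there because the pattern is looser than real UTF-8 validation, so decodeURIComponent can still reject a match.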

Note that there is no way for the code to know when text is actually “broken”. All it does is detect the byte patterns that can be UTF-8, which, luckily (by design), are rarely anything but mislabeled UTF-8.
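A contrived example of the rare case where this goes wrong: correct ISO-8859-1 text that happens to contain an uppercase accented letter followed by a symbol. The `fix` helper and the input string below are hypothetical, and the sketch uses the simple pattern without the Windows-1252 table:

```javascript
var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g;

function fix(text) {
  return text.replace(pattern, function (s) {
    return decodeURIComponent(escape(s));
  });
}

// "CAFÉ»" is perfectly good text, but É (0xC9) followed by » (0xBB)
// looks exactly like a 2-byte UTF-8 sequence, so it gets "fixed" anyway.
var mangled = fix('CAF\u00C9\u00BB');

console.log(mangled === 'CAF\u027B'); // true: the pair was merged into U+027B
```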

The final Javascript code is fixencoding.js. I think the only browser it currently works in is Firefox 2, because it uses Javascript 1.7 array comprehensions.

The bookmarklet is available on my bookmarklets page.