Lenny Domnitser’s
domnit.org


This is a static archive of the domnit.org blog,
which Lenny Domnitser wrote between 2006 and 2009.

Pretty Swirly Color Thing

Look at the colors go!

Javascript Fix for Mixed Character Set Errors

Unicode is good, but my knowledge basically stopped at higher level encode/decode functions, so I decided to learn UTF-8. Turns out it’s easy, and the little pedagogic project I came up with came out useful. I wrote a bookmarklet that converts broken stuff like “Ãœníçøde” to “Üníçøde” in Firefox. That is, it fixes UTF-8 that has been interpreted as ASCII, ISO-8859-1 (Latin-1), or Windows-1252.

Firefox already lets you switch encoding (View → Character Encoding) if the author of a web page incorrectly declared what character set to use, or didn’t declare one at all. My code is different from the built-in stuff in a few ways:

  1. It does not reload the page.
  2. It does not see the raw bytes. It works from the Unicode text that Firefox builds.
  3. It converts broken UTF-8 while preserving correct ISO-8859-1 text.

Characters of different encodings can get mixed together in naïve cut-and-paste scenarios. Here’s an example broken page. If you try to use Firefox to reload the page as UTF-8, some text will be fixed, but the previously good text will lose its “ó” for a big honking “�”.

UTF-8 takes anywhere from 1 to 4 bytes per character. One-byte UTF-8 characters are just 7-bit ASCII, a subset of the ISO and Windows sets, so they are left alone. The rest of the characters match byte patterns in certain ranges which are unlikely to appear in regular text. They can be described in a compact and machine-readable way by a regular expression:

var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g

Broken down, the regular expression matches one of:

  1. A byte between 0xC2 and 0xDF, followed by a byte between 0x80 and 0xBF, or
  2. A byte between 0xE0 and 0xEF, followed by 2 bytes between 0x80 and 0xBF, or
  3. A byte between 0xF0 and 0xF4, followed by 3 bytes between 0x80 and 0xBF.

These correspond to 2-, 3-, and 4-byte UTF-8 characters.
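For instance, the two-byte branch matches the byte pair that mojibake like “Ã©” is made of. Here is a quick check (a sketch; the variable name is mine, and I’ve dropped the g flag so test() is stateless):

```javascript
// "é" encoded as UTF-8 is the byte pair 0xC3 0xA9; read as Latin-1 it
// shows up as "Ã©". In the string the script sees, those are the code
// points U+00C3 and U+00A9.
var utf8Pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/;

console.log(utf8Pattern.test('\u00C3\u00A9'));      // true: looks like a 2-byte UTF-8 sequence
console.log(utf8Pattern.test('plain ASCII text'));  // false: ASCII is left alone
```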

Given a matching byte sequence, I can use a Javascript trick to make Unicode. First, URI-encode using the Unicode-ignorant escape function. Then, URI-decode using the Unicode-aware decodeURIComponent function. So, using the regular expression from above, this fixes UTF-8 text labeled as ISO-8859-1:

text.replace(pattern, function(s) {
    return decodeURIComponent(escape(s));
});
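Wrapped up as a function and applied to a sample string, the trick looks like this (a minimal sketch; fixEncoding is my name for illustration, not the bookmarklet’s):

```javascript
var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g;

function fixEncoding(text) {
    return text.replace(pattern, function(s) {
        // escape() percent-encodes each character as a raw byte;
        // decodeURIComponent() then reads those bytes back as UTF-8.
        return decodeURIComponent(escape(s));
    });
}

console.log(fixEncoding('Caf\u00C3\u00A9')); // "Café" — \xC3\xA9 decoded as UTF-8 "é"
```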

Except it doesn’t. Some characters failed to be encoded properly. They were all in a certain range, so I looked up ISO-8859-1 and saw that those were undefined codes. Not seeing any pattern, I built a translation table of undefined bytes and the Unicode characters they produced. The result looked familiar: Sam Ruby’s iñtërnâtiônàlizætiøn survival guide, which I obviously had not looked at closely enough before, listed the same data as a mapping from Windows-1252 to Unicode. Apparently, Firefox has some voodoo that takes UTF-8 in ISO-8859-1 and treats it as Windows-1252. Instead of keeping the byte intact, the string available to my script contains the Unicode equivalent of the Windows character, which has no relation to the original byte, and therefore none to the original UTF-8 text.

So, besides matching UTF-8 byte sequences, I must also match all the two-character sequences of [\xC2-\xDF] followed by the Unicode equivalent of some Windows-1252 characters. This led to somewhat less elegant code, partly because of the lookup table I’ve included, partly because escape produces some '...%uXXXX...' gobbledygook, which has to be fixed up before passing to decodeURIComponent.
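The workaround can be sketched like this (only a few entries of the Windows-1252 table are shown here for illustration; the real table and the %uXXXX fix-up live in fixencoding.js):

```javascript
// Firefox runs undefined Latin-1 bytes through Windows-1252, so the script
// sees e.g. U+0153 "œ" where the original byte was 0x9C. Map those Unicode
// characters back to their bytes before decoding.
var cp1252 = {
    '\u20AC': '\x80', // €
    '\u2019': '\x92', // ’
    '\u201C': '\x93', // “
    '\u0153': '\x9C'  // œ
};

function toBytes(s) {
    return s.replace(/[\u20AC\u2019\u201C\u0153]/g, function(c) {
        return cp1252[c];
    });
}

var pattern = /[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/g;

// "Ü" is UTF-8 0xC3 0x9C; Firefox hands the script "Ãœ" (0x9C became U+0153).
var fixed = toBytes('\u00C3\u0153').replace(pattern, function(s) {
    return decodeURIComponent(escape(s));
});
console.log(fixed); // "Ü"
```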

Note that there is no way for the code to know when text is actually “broken”. All it does is detect the byte patterns that can be UTF-8, which, luckily (by design), are rarely anything but mislabeled UTF-8.

The final Javascript code is fixencoding.js. I think the only browser it currently works in is Firefox 2, because it uses Javascript 1.7 array comprehensions.

The bookmarklet is available on my bookmarklets page.

Social Software

A cousin I haven’t spoken with since early childhood (“cousin” is as specific as I know offhand) added me as a Facebook friend and posted this message:

so i google for something that can clear my facebook mini feed without spending 3 hours deleting every entry, and lo and behold…

There it is, near the top. It’s nice to see that some of my software is useful. I imagine more of the surprise was on his side.

My Shitty Techno

I installed Hydrogen and started playing, and here are a few short, bad bits of techno.

Garbage 1 (MP3, 0:44)
Garbage 2 (MP3, 0:38)

Party time!

File Under Cultural Facts that Contradict Puritanical Laws

They’re running freshman orientation twice a week here. At the Dell table, they hand out Microsoft-Intel-Dell–branded bottle openers.

Siren, Noisettes, Surveillance

Siren fest was a bunch of fun. We got there a bit late and left a bit early, and spent too much time on trains and buses, but it was worth the $80ish I dropped to come to this free festival, considering the travel costs also bought me a few hours at my technically permanent but effectively rare residence and a trip to the beach.

I only knew Dr. Dog and Matt & Kim (even them not well), and had just listened to a little bit of the other stuff after reading the line-up announcement. Dr. Dog met my high expectations, and M&K were good, except that I only heard the last few songs and was pretty far away.

The Twilight Sad and Voxtrot were OK, not something I’d listen to. We Are Scientists were rather good. The Black Lips have a nice sound, though I enjoyed the set for reasons unrelated to the music.*

The big discovery of the day was The Noisettes. They bring their guitar-wailing, drum-jumping, stage-climbing energy from London with beautiful instruments, excellent tunes, and Shingai’s powerful voice.

We had a great time hanging out at Coney Island. Supposedly this was the last Siren, because of impending condos. Damn.

* Siren 07 Mosh @ YouTube

I’m the one with the backpack at 00:01 and 00:20 (not the one with a backpack who is shirtless, annoyingly drunk, and mostly responsible for starting the mess).

Misc. Goodies I’ve Been Sitting On

Since I didn’t write about these on their own, here they are all at once.

Yannick Murphy’s Here They Come doesn’t have a blurb on the back, but it’s short enough that it doesn’t need one. It’s about a poor family in New York in the 1970s, and it’s very good. A child tells the story, but it’s neither childish nor fake. You can read some of it, then get it from the library or buy it. The author talked about it on the radio last year.

The Bones of Davey Jones is quiet, haunting pop music. (Sometimes it’s loud and haunting.) Too bad he doesn’t have an album; I think I’d listen every day.

I heard TBoDJ on an episode of Fair Game, which is a great radio show and podcast I first heard a few weeks ago. The show doesn’t keep a good archive of show descriptions, but I got to the site in time to catch these MP3 links: untitled, Lost To You, Sue, I Remedy, Bedtime Monsters.

David Dondero’s Simple Love comes out in August, and a couple tracks are available now. On first listen, I thought they weren’t as tight as his last album, South of the South. Then I found myself listening to Rothko Chapel five times in a row so I could think along to the lyrics.

And, wow, Project Jenny, Project Jan made an album. Xoxoxoxoxo is coming out in August, and two of the new songs are on their website. (Some of the songs were previously available as unreleased tracks.) I saw them at an ultimate frisbee party in Albany for free, which may have been lucky, depending on whether they get as popular as they deserve to.

Now enjoy these.

Ten

Back in the twentieth century, we didn’t say the year was one thousand nine hundred . . ., we used simpler numbers—an optional 19, then nothing bigger than 99—but in the Two Thousands, we say two thousand . . . . Before the new millennium, media loudly worried about software that could only handle two-digit years. But they failed to see the bigger problem: a bug in the people that taught the computers. Since a lifetime is less than a hundred years, humans evolved only to understand two-digit years.

Y2K came without any major computer failure, and we quickly forgot about it. We went on with our lives, ignorant of the social metamorphosis we had just made. We had changed from a thinking people to one whose mind was monopolized by the memorization of a year that’s nearly 2,000 bigger than any year we had ever had to say before. We were confused and afraid, and taken advantage of.

Congress, so as not to confuse the year with a page number, didn’t read the (rather long) USA PATRIOT Act before passing it into law. The public, which already had a big year to keep track of, certainly could not risk losing it in year-filled history books before agreeing to the Iraq war. And what if the year gets bigger, as it consistently has since Jesus’ time? We need a strong executive to keep us from anarchy we would surely devolve into.

We were made to fear by powerful men telling us times were different, the good liberal says. Good liberal, the 2004 election showed that we really did change into a fearful mass. Have hope for 2008 if it will keep you from depression, but we will be scared sheep until we return to a simpler time, in twenty ten. Or as I will call it, democrat that I am, Ten.