2007-11-30 01:10

Backwards Unicode and Shell Support

A colleague at work gave me another challenge recently, or at least they showed me something that Perl can do, which as usual caused me to seek out a more elegant or efficient implementation in another language. A previous challenge had been to write a speedy script that found common rows between two text files, and I had resorted to Python for my implementation, just to make it difficult for myself. After a few rounds of optimisation we found that, when run once, the Perl version was faster, but when run 10 times in a row the Python version took the lead. In this case, though, the challenge was to write a one-liner which would take a string as input and reverse the letters in it. As I enjoy replacing my colleague’s unreadable Perl syntax with a handful of [...]

Flipping strings

So, I couldn’t find any obvious tools for reversing text on the command line, but fortunately if you set your “hackishness” filter to allow solving a problem with BASH, you find yourself not too far away from just about any crazy idea that enters your head.

With a portion of my mind always worrying about someone typing in some bizarre combination of characters and breaking one of my websites, I was soon inspired to think about Right To Left (RTL) languages. I can’t be the only person who’s seen a chat client get confused (or at least make the most of a difficult situation) while trying to render a line which contains a mix of Arabic and ASCII or Hebrew and Hangul. All I needed, I thought, was a single character from an RTL language, and to put the rest of the string after that. I even dared hope that there would be some bug in some terminal emulator where removing the RTL character after the flip would keep the string reversed.
Somewhere along the line I remembered the amusing upside-down text trick, but I also learnt about the non-printing Unicode character U+202E (RIGHT-TO-LEFT OVERRIDE), which makes text go backwards and which I suppose wasn’t difficult to stumble upon with Google. Everything that needs to be said about it has probably been said already, except that Google does not filter for this character in search results, leading to an amusing result (if the Google dance hasn’t interfered with that page).

Unicode Support

The first obstacle in using this trick was getting a console program to output the desired character.
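One way to get the character out of a console program, as a minimal sketch rather than the exact command used at the time, is to write the UTF-8 encoding of U+202E (the bytes E2 80 AE) with printf's octal escapes, which avoids depending on \u support in the shell. The function name rlo_prefix is made up for illustration, and a UTF-8 terminal is assumed.

```shell
# Hypothetical helper: emit U+202E (RIGHT-TO-LEFT OVERRIDE) as raw UTF-8
# bytes (octal 342 200 256 = hex E2 80 AE), followed by the given text.
rlo_prefix() {
    printf '\342\200\256%s\n' "$1"
}

# A terminal that honours the override should draw this text reversed.
rlo_prefix foo
```

Note that the bytes of foo still leave the shell in their original order. Whether anything appears reversed on screen depends entirely on the terminal honouring the override.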
The idea was to print the character followed by foo and expect to see oof as the result. It’s here that things start to get a bit murky, though. My first attempt was in Konsole, but it seemed to just ignore the Unicode character. This was disappointing, as it managed to print Arabic and Hebrew text correctly, albeit in a left-to-right manner. For the purposes of proving the abilities of BASH, I thought it was fair to use any terminal emulator at my disposal, so I installed mlterm, which prides itself on supporting [...]

This was good enough to get started, though, and with help from another colleague, I came up with a version which relied on subshells to capture [...]
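The subshell idea can be sketched roughly as follows. This is an assumption about the general shape, not the one-liner itself: command substitution runs printf in a subshell and captures the override character in a variable, which is then prefixed to each line read from standard input. The names rlo and flip are hypothetical.

```shell
# Capture U+202E (RIGHT-TO-LEFT OVERRIDE) once; $(...) runs printf in a
# subshell and stores its output (the three UTF-8 bytes) in a variable.
rlo=$(printf '\342\200\256')

# Hypothetical filter: prefix every input line with the override character.
flip() {
    while IFS= read -r line; do
        printf '%s%s\n' "$rlo" "$line"
    done
}

echo foo | flip
```

Reading from standard input keeps it usable in a pipeline, which is closer to the "take a string as input" wording of the challenge than a hard-coded argument would be.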
The version we arrived at produced a left-aligned line containing simply easy and a nasty hollow square at the beginning representing the non-printing character. The BASH trick used here is the [...]

Playing with mlterm, and testing its much-vaunted support for Unicode, I tried pasting some strings into it to see how it would react. Disappointingly, I found that when a Hebrew string was pasted, it appeared with an unnecessary @ sign at the end and prompted a console beep due to an error. The string could be [...]

Perhaps some of these bugs could be fixed by fighting with the myriad font settings, but it seems a shame that a program which prides itself on multi-lingual support behaves worse than one which lists multi-lingual support as just another great feature. I did try to fix the problem, even though I had no particular interest in Greek and Arabic text. I installed fonts and an extra configuration program, and specified options on the command line before invoking the terminal, but nothing worked. All that happened was that I got more confused by all the permutations. Did I need to “Process received strings via Unicode”? Hebrew seemed to be working, so I was sure I was sending the strings correctly. Was my “XIM locale” correct? Again, I had to assume it was.

Conclusion

Unicode support is difficult for all sorts of applications, and I accept that; but when dealing with a console-based program, it feels like the straw that breaks the camel’s back, and it highlights how badly the whole console software stack has kept up with developments in technology compared to modern complementary environments. More on this next time…

Did I prove that BASH could do whatever Perl could do, for people who don’t mind ugly hacks? That depends: are you running the script 10 times in a row?