Monday, August 01, 2005

Unicode->EWTS at alpha, but only for data exchange

This weekend I worked a bit on a Unicode->EWTS converter (a reverter, if you will). It generates EWTS for computers, though, not for humans. That's useful for data exchange but you wouldn't want to publish it in a book. Humans like 'brtags ', not 'bar+tagasa '.

There's a bug with the unlikely combination of U+0F68 and U+0F39.

Saturday, July 16, 2005

EWTS->Unicode, EWTS->Tibetan converters at Alpha

I hereby declare an Alpha release of my EWTS->TibetanMachineWeb and EWTS->Unicode converters. Grab them in Jskad's nightly build -- Last Night's Build. Go to the 'Tools' menu and select 'Launch Converter...'.

Note that Jskad's EWTS input method doesn't use the new converter. It should, but it doesn't. But you can use 'Tools/Convert Selection>/Convert Wylie*' to get a correct conversion.

Monday, June 20, 2005

Tiny progress on EWTS->Unicode/TibetanMachineWeb converters

I'm not done with EWTS->Tibetan conversion, but at least I started work again. (I have ugly cvs commits to prove it.) Right now I've been away from the code for so long that I know I'll never have a warm, bug-free feeling about the code unless I reimplement from scratch. (Jython calls me.) But then, even when the programmer feels like there are no bugs, there are bugs. My test cases, though, are now strangers to me too.

I'll push some documents through end-to-end soon now. When they look good to me, I'll see if I can't get real people to try it and report bugs.

Because of the long, slow development process and the very fuzzy spec, this is going to be very different from my earlier, bulletproof converters. Maybe it'll be useful, though.

Sunday, June 19, 2005

Minutia

I created a cvs tag before_big_ewts2tmw_and_unicode_checkin. If my coming 3000-line-diff check-in breaks stuff, hopefully I'll remember what to revert to thanks to this post.

Saturday, March 05, 2005

Should Roman transliteration allow implicit a-chen?

EWTS's decision to allow an implicit a-chen (e.g., to allow "u" to mean \u0f68\u0f74) means that a+yauna refers to \u0f68\u0fb1\u0f7d\u0f53 instead of \u0f68\u0fb1\u0f68\u0f74\u0f53. You should transliterate \u0f68\u0fb1\u0f68\u0f74\u0f53 as a+ya.una (though a+ya.un etc. are legal too).
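To see the difference code point by code point, here is a minimal Java sketch. The class and method names (AChenDemo, codePoints) are made up for illustration and are not part of Jskad; the two strings are copied from the paragraph above.

```java
final class AChenDemo {
    // The two readings discussed above, copied from the post.
    static final String A_YAUNA = "\u0f68\u0fb1\u0f7d\u0f53";          // a+yauna
    static final String A_YA_UNA = "\u0f68\u0fb1\u0f68\u0f74\u0f53";   // a+ya.una

    /** Formats each code point of s as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("a+yauna  -> " + codePoints(A_YAUNA));
        System.out.println("a+ya.una -> " + codePoints(A_YA_UNA));
    }
}
```

The first string has a single a-chen carrying the whole stack; the second has a second a-chen in the middle, which is exactly what the period in a+ya.una is there to force.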

ACIP requires a-chen to be explicit, making programmers' lives easier. And, arguably, nontechnical human users' lives, too.

Tuesday, March 01, 2005

Mimer SQL Tibetan Collation Chart

Need a database that can collate Tibetan? Mimer claims to be one: Mimer SQL Tibetan Collation Chart

Sunday, February 27, 2005

KDE supports Tibetan OpenType fonts now?

I've heard that KDE now supports shaping of Tibetan OpenType fonts such as Tibetan Machine Uni. I haven't had a chance to try it out. OK, I had a little chance but I gave up because at that point Debian didn't have the packages available. :)

GNOME supports it too, I think, thanks to Pango, but I haven't tried it.

Tibetan Line Breaking in Word 2003 SP1

Tibetan Line Breaking in Word 2003 SP1 seems to work really well with the right customization. E-mail me about this if you're interested.

EWTS (THDL Wylie) to Unicode conversion

Right now I'm working on EWTS (THDL Wylie) to Unicode conversion. The tricky thing is that EWTS is not all that precise a standard on paper, so my converter will convert EWTS-as-I-understand-it to Unicode. I do have a very clear idea of EWTS after writing an ACIP->Unicode/TMW converter, though. It might not be the same idea everyone else has, but that really won't matter the way I'm building the converter -- it spits out as many warning messages as you can tolerate.

The funny thing is that such a converter is not all that useful because there are tons of documents in "Wylie", whatever that means, but very few in true EWTS-approved Wylie. (Zero in the sense that EWTS is a moving target.) So the converter will probably have to accept a ton of customization before I can say, "We solved that Wylie to Unicode problem."

For example, does your "Wylie" require the plus sign for stacking or not? Is padma the same as pad+ma to you? What character does it use for disambiguation, as in g.ya rather than gya? How about wazur, is it 'v' or 'w' or what? How do you transliterate U+0F71?
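One way to picture that customization is a small settings object that a converter could consult. This is only a sketch: WylieDialect, its fields, and its methods are hypothetical names, not anything in the actual converter, and the two sample dialects in main are illustrative guesses rather than a statement of what EWTS or any other scheme mandates.

```java
final class WylieDialect {
    final boolean plusRequiredForStacking; // must pad+ma be written with '+'?
    final char disambiguator;              // e.g. '.' in g.ya
    final char wazur;                      // 'v', 'w', or something else
    final String longVowelA;               // how this dialect writes U+0F71

    WylieDialect(boolean plusRequiredForStacking, char disambiguator,
                 char wazur, String longVowelA) {
        this.plusRequiredForStacking = plusRequiredForStacking;
        this.disambiguator = disambiguator;
        this.wazur = wazur;
        this.longVowelA = longVowelA;
    }

    /** Joins the members of one stack the way this dialect writes them. */
    String stack(String... members) {
        return String.join(plusRequiredForStacking ? "+" : "", members);
    }

    /** Writes a prefix followed by a root that would otherwise read as a stack. */
    String prefixThenRoot(String prefix, String root) {
        return prefix + disambiguator + root;
    }

    public static void main(String[] args) {
        // Both dialects below are assumptions made up for this example.
        WylieDialect strict = new WylieDialect(true, '.', 'w', "A");
        WylieDialect loose = new WylieDialect(false, '.', 'v', "A");
        System.out.println("pa" + strict.stack("d", "ma"));
        System.out.println("pa" + loose.stack("d", "ma"));
        System.out.println(strict.prefixThenRoot("g", "ya"));
    }
}
```

A real converter would need the reverse direction (parsing input under a given dialect), which is much harder than rendering; the point here is only that each question above becomes one knob.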

My first effort will convert EWTS. And as people ask for help with converting their piles of documents to Unicode (or Tibetan Machine Web), these features will have to be added.

Sorting Tibetan strings requires near-omniscience

To sort two Tibetan Unicode strings the way popular dictionaries do (I think the tsig mdzod chen mo does it this way), you have to know an awful lot about the nature of a Tibetan tsheg bar (roughly a syllable).

You have to know which root stacks accept which prefixes, for one thing. Scholars could dispute the exact rules of that at length.

But you even have to know which stack is the root stack in those cases that someone just learning Tibetan -- which is about the level of intelligence of good software -- would call ambiguous. Is that mngas or mangs I'm looking at when I see "\u0F58\u0F44\u0F66" (མངས་)? The distinction matters because dictionaries care about whether a letter is a prefix or not. There's no way to know which is the root stack without a perfect Tibetan dictionary. And that's assuming that there's never an actual ambiguity like there'd be if both mngas and mangs were distinct Tibetan words. If there is an ambiguity, then the perfect sorting algorithm would have to do perfect natural language processing to figure out which word it was from context. And of course that's not always going to do away with the ambiguity -- what do I mean in English when I say "I'm going to the bank"?
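Here is a rough Java sketch of why software can't decide. It handles only a tsheg bar of exactly three unstacked consonants with no vowel signs, and it uses just the textbook letter lists (five prefixes, ten suffixes, two post-suffixes). It deliberately ignores which roots accept which prefixes, so it is a simplification, not a real parser. RootCandidates and rootIndexes are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class RootCandidates {
    // ga, da, ba, ma, 'a
    static final String PREFIXES = "\u0f42\u0f51\u0f56\u0f58\u0f60";
    // ga, nga, da, na, ba, ma, 'a, ra, la, sa
    static final String SUFFIXES =
            "\u0f42\u0f44\u0f51\u0f53\u0f56\u0f58\u0f60\u0f62\u0f63\u0f66";
    // sa, da
    static final String POST_SUFFIXES = "\u0f66\u0f51";

    /**
     * Returns every index (0 or 1) that could hold the root letter of a
     * three-consonant tsheg bar. Pass the letters without the trailing tsheg.
     */
    static List<Integer> rootIndexes(String letters) {
        if (letters.length() != 3) {
            throw new IllegalArgumentException("expected exactly three letters");
        }
        char first = letters.charAt(0);
        char second = letters.charAt(1);
        char third = letters.charAt(2);
        List<Integer> roots = new ArrayList<>();
        // Reading 1: root + suffix + post-suffix (ma-nga-sa read as mangs).
        if (SUFFIXES.indexOf(second) >= 0 && POST_SUFFIXES.indexOf(third) >= 0) {
            roots.add(0);
        }
        // Reading 2: prefix + root + suffix (ma-nga-sa read as mngas).
        if (PREFIXES.indexOf(first) >= 0 && SUFFIXES.indexOf(third) >= 0) {
            roots.add(1);
        }
        return roots;
    }

    public static void main(String[] args) {
        // Two candidates come back for the example above; the letter rules
        // alone cannot pick between them.
        System.out.println(rootIndexes("\u0f58\u0f44\u0f66"));
    }
}
```

Whenever the list comes back with more than one entry, the collation key depends on knowledge the letters themselves don't carry, which is where the dictionary (or the omniscience) comes in.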

I don't see why a sorting algorithm would have to be aware of the suffix and post-suffix rules, though, unless it wanted to sort native Tibetan differently than transliteration of non-Tibetan in Tibetan script (e.g., Sanskrit in Tibetan).

ICU holds promise for Tibetan in Java

ICU, started by IBM, has C++ and Java libraries for Unicode. It may already support Tibetan, but I doubt it. If it doesn't, someone should make it do so because I think that's the most economical way to make Java support Tibetan Unicode. And writing software in Java is an economical way of having it work on Linux (already big in China), Windows, Macintosh, and the next big thing.

I wonder how well Blogger supports Tibetan

Here's a traditional Tibetan greeting to test Blogger's support of Tibetan Unicode (in the UTF-8 encoding):

བཀྲ་ཤིས་བདེ་ལེགས།

If you can't see this properly (in Wylie transliteration, it's "bkra shis bde legs/") it could be your browser's problem -- try changing its character encoding to UTF-8. Or that you don't have a good Tibetan Unicode font like Tibetan Machine Uni. Or good shaping software for that font like a recent version of Microsoft's Uniscribe.

Blogger claims to support UTF-8. Blogger is owned by Google, who employs Rob Pike, who was, with Ken Thompson, co-creator of UTF-8. So I guess that I'm not surprised that I can see the above greeting properly. But searching is another matter. You can search for the entire greeting and find it, but searching for just one word of it doesn't work.

Wondering how a person comes into possession of some Tibetan Unicode? I used Jskad, but I could've used my ACIP->Unicode converter, my Tibetan Machine Web->Unicode converter, a Keyman keyboard, or a hex editor. OK, maybe not a hex editor -- ;)

32-bit Unicode character support in Java

This paper by Sun's Lindenberg and Okutsu describes Java's support for Unicode characters that need more than 16 bits. Java's built-in data type 'char' is 16-bit, so it's not obvious how Java supports them.

The skinny as far as EWTS->Unicode conversions are concerned: use JDK 1.5 or later's Character.toChars(int). This seems a bit dirty to me because I think of the integer corresponding to U+F0000000 as being too large to represent in a Java 'int'. It requires a 'long' in my mind. But maybe U+Fxxxxxxx are not valid? The article says that only a little over 1 million valid code points exist out of the 32-bit space that Unicode provides.
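A small sketch of what that looks like in practice, using only java.lang.Character methods that have been there since JDK 1.5 (toChars, isValidCodePoint, MAX_CODE_POINT). The class name SupplementaryDemo and the helper fromCodePoint are made up for the example.

```java
final class SupplementaryDemo {
    /** Encodes one code point as a Java String of one or two chars. */
    static String fromCodePoint(int codePoint) {
        // Throws IllegalArgumentException if codePoint is not a valid code point.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String ka = fromCodePoint(0x0F40);       // fits in one 16-bit char
        String beyond = fromCodePoint(0x10000);  // needs a surrogate pair

        System.out.println("chars for U+0F40:  " + ka.length());
        System.out.println("chars for U+10000: " + beyond.length());
        System.out.println("code points in the second string: "
                + beyond.codePointCount(0, beyond.length()));

        // The largest code point Java accepts, printed in hex.
        System.out.println("max code point: "
                + Integer.toHexString(Character.MAX_CODE_POINT));

        // A value such as 0xF0000000 is a legal (negative) int literal in Java,
        // but it is not a valid code point.
        System.out.println("0xF0000000 valid? "
                + Character.isValidCodePoint(0xF0000000));
    }
}
```

So the 'int' worry goes away: Character.MAX_CODE_POINT is 0x10FFFF, which fits comfortably in an int, and anything outside the valid range is rejected rather than encoded.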