A key research theme of mine is attempting to expand the base of people who are able to write their language. That began as a suspicion last trip, not least because my attempt to hire someone to transcribe my data ended in resounding failure. Which is funny, really, because that in itself you could call a finding. I didn't have a real goal for the material I recorded. A few days ago I had a long chat with a local academic, engineer and visionary, Liu Yuyang (劉宇陽), who has built no end of impressive tools as part of klokah.com.tw, tools which get adopted and populated with data because he spends a great deal of time connecting with the indigenous communities. He was a great sanity check on my hypothesis. People don't like to write because they know they don't have perfect command of the writing standard. The people who do write are generally teachers, largely because people put in the work if they are interested in that career path, and teaching indigenous languages most certainly is a career path in Taiwan.
As an outsider fascinated by the sinosphere for years, this felt familiar. While these are indigenous communities, there's no point pretending that the culture here isn't predominantly Chinese. This carries a number of relevant implications: the elevated status of a teacher, for example, and the necessity of passing through an endless series of examinations from high school upwards. Liu Yuyang recently introduced a writing composition tool on Klokah, aimed at teachers. The idea is that the teacher composes a text and records audio to go with each sentence, while the tool performs lexical lookups as they compose. Selecting word forms and checking words against a dictionary and cross-corpus searches is really part of the whole big ideal that I share with one of my supervisors, Steven Bird. I wasn't at all surprised to see Liu Yuyang roll up his sleeves and just quietly get on with building it.
We need this in Aikuma, our software tool (I'll write about that separately later), but we have the luxury, and perhaps a touch of requisite academic arrogance, to think that we might be able to make something a bit more broadly applicable. In essence I'd like our tool to be able to perform precisely these kinds of lexical lookups, but with a smattering of modern machine learning trained on corpora, and to make all of that happen we need to sit down and work out a common way (or API, if you like) to connect annotation tools with corpora and archives. By a stroke of luck, there's a tools symposium at my university in Melbourne in June, not long after I get back.
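Nothing about that common API has been pinned down yet, so purely as an illustration, here is the rough shape such an interface might take. Every name here (`LexiconService`, `lookup`, `concordance`, `InMemoryLexicon`) is my own invention for this sketch and reflects nothing in Aikuma or Klokah.

```python
from abc import ABC, abstractmethod


class LexiconService(ABC):
    """Hypothetical interface an annotation tool could use to query a
    corpus or archive backend. All names are illustrative only."""

    @abstractmethod
    def lookup(self, form: str, lang: str) -> list[dict]:
        """Return dictionary entries whose headword matches a surface form."""

    @abstractmethod
    def concordance(self, form: str, lang: str, limit: int = 10) -> list[str]:
        """Return example sentences containing the form, drawn from corpora."""


class InMemoryLexicon(LexiconService):
    """Toy backend so the interface can be exercised without a real archive."""

    def __init__(self, entries: dict[str, dict], sentences: list[str]):
        self.entries = entries
        self.sentences = sentences

    def lookup(self, form: str, lang: str) -> list[dict]:
        entry = self.entries.get(form)
        return [entry] if entry else []

    def concordance(self, form: str, lang: str, limit: int = 10) -> list[str]:
        # Whole-word match over a flat sentence list; a real archive
        # would use an index and proper tokenisation.
        return [s for s in self.sentences if form in s.split()][:limit]
```

The point of the abstraction is that an annotation tool written against `LexiconService` alone could later swap the toy backend for a networked archive without changing its own code.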
I may have strayed off the point there a little. Suppose that we had such a tool. Imagine that your command of your mother tongue is not bad: you can read the local signs in your language, but you aren't very good at writing it. You don't know, for example, that you have to type the glottal stops in front of words, because some dusty academic decided that when they came up with the writing system. So you type roughly, in a somewhat phonetic way, and then, by the miracle of technology, collaboration and syndicated data, you can set about correcting the words to the right form, which you can confirm by looking at the dictionary and other sentences. The final text is standard, or 標準 (biaozhun) as they love to say in Chinese. You learn the regular forms as you go without making too many face-losing errors, and ultimately produce valuable materials that help cement your credentials as a teacher of this endangered language.
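A minimal sketch of what that correction step might look like, using plain edit-distance matching from Python's standard library against a toy lexicon. The word forms below are invented for illustration, and a real system would draw its lexicon from an actual dictionary and rank suggestions with corpus statistics rather than raw string similarity.

```python
import difflib

# Invented toy lexicon in the "standard" orthography, where words
# carry an initial glottal stop written as an apostrophe.
LEXICON = ["'alang", "'ima", "balay", "qhuni", "yaku"]


def suggest(rough: str, n: int = 3) -> list[str]:
    """Suggest standard spellings for a roughly-typed word.

    Tries the word as typed, then with a leading glottal stop
    prepended, since writers often omit it.
    """
    as_typed = difflib.get_close_matches(rough, LEXICON, n=n, cutoff=0.7)
    with_glottal = difflib.get_close_matches("'" + rough, LEXICON, n=n, cutoff=0.7)
    # Merge the two candidate lists, preserving order, dropping duplicates.
    seen, merged = set(), []
    for word in as_typed + with_glottal:
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


print(suggest("alang"))  # → ["'alang"]: the omitted glottal stop is recovered
```

A writer who types `alang` phonetically would be offered the standard form `'alang` to confirm against the dictionary, which is exactly the correct-as-you-go loop described above.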
I'd love to say that the tool we built has this, but it doesn't yet. I will say that half of the battle is knowing clearly that you need something, that there is a clear use case. I know we need this, it's out of scope for what we have today, but next time I come back, we'll have something. This is a particularly exciting aspect of this kind of research.