Inside the design of the US Army's Arabic translation app, Jibbigo
When researchers at Carnegie Mellon University were first approached by the United States government to develop automatic translation software for military personnel deployed in Iraq, they feared the task was too delicate.
It was not just that they were being asked to devise a way of providing rapid, smooth voice-to-voice translation in a combat environment, where an error or poor interpretation could be fatal for those using it.
Nor was it even that they would need to do this intensive task without access to grid electricity or the internet.
The trickiest part, they soon realised, was that their system would have to cope with an Arabic dialect so different from Modern Standard Arabic (MSA) that even the popular greeting, "shlonak" – or "what is your colour" – could be misinterpreted.
They set out at first to build the application from an existing pool of Modern Standard Arabic terms.
The program would use two databases: one of recorded dialogue and its text equivalent, and another that matched those texts with their translations. The developers added a sprinkling of Iraqi vocabulary and pronunciations, but the model still made far too many mistakes.
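The two-database design described above can be pictured as a pair of lookup tables: one linking recorded speech to transcripts, the other linking transcripts to translations. A minimal sketch, with the speech-recognition step assumed already done and a tiny illustrative phrase table (not the app's real data):

```python
# Table 2 of the design: transcript -> English translation.
# Entries here are illustrative stand-ins, not Jibbigo's actual database.
PHRASE_TABLE = {
    "shlonak": "how are you?",   # the Iraqi greeting mentioned above
    "shukran": "thank you",
}

def translate(transcript: str) -> str:
    """Look a recognised transcript up in the phrase table, falling back
    to an 'unknown' marker when coverage is missing -- the kind of gap
    the MSA-trained first version kept running into."""
    return PHRASE_TABLE.get(transcript.lower().strip(), "[no translation found]")

print(translate("Shlonak"))   # -> how are you?
print(translate("shlonich"))  # feminine form absent from the table -> gap
```

In the real system each table holds tens of thousands of entries gathered from recorded dialogue, but the structure is the same: recognition produces text, and text indexes the translation.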
"The speech recognition was of no help for an Iraqi speaker," said Alex Waibel, who led the project.
"In those situations, miscommunications or misunderstandings can potentially be lethal when all parties are well-meaning. How does a US soldier know when someone is looking for his little child or has ill intentions?
"At checkpoints that is always a dangerous situation, and problems arise because people simply do not understand one another."
The problem of making machines that can translate Arabic has long been recognised. The language is so widely spoken by so many people – more than 200 million of them, from Morocco to Oman – that extremely distinct dialects have developed. Arabian Gulf Arabs often have difficulty understanding Iraqis, Iraqis have trouble communicating with Egyptians and almost nobody understands Moroccans.
The variations in word choice, sentence structure and phonology – how sounds are spoken – are big enough that many linguists consider Arabic as a cluster of separate languages, as different to each other as German to Dutch or Spanish to French.
That makes voice recognition systems based on MSA, which is taught in schools and used by government, academia and the media, all but useless.
So Jibbigo, the app developed by Dr Waibel's team at Carnegie Mellon, has been fed a database of Iraqi Arabic collected and translated by soldiers on the ground over the course of the war, giving it a 40,000-word vocabulary for voice-to-voice translation to English and voice-to-text translation to more than a dozen other languages.
Though Iraqi is its only Arabic dialect so far, the team is beginning to collect samples of Algerian Maghribi Arabic.
Building that kind of database from scratch is arduous, requiring hundreds of thousands of hours of speech to be collected and processed. Google alone deals with more than 1,000 hours of spoken Arabic, about half of it from the Gulf, each day.
Instead, researchers are increasingly looking to crowdsourcing, online and on the ground. Jibbigo pays users to correct their mother-tongue translations, and hires people locally to collect audio.
Google, meanwhile, relies heavily on comparisons of audio tracks of everything from news reports to user-created content on YouTube. It also compares sentences spoken into its software by participating Android mobile users, which has helped to reduce errors by more than 20 per cent.
And there is a huge new resource for this crowdsourcing, thanks to the stunning rise of the internet since the start of the Arab Spring. Blogs and social media now represent a rich pool of written and spoken dialects.
Twitter has put to work thousands of volunteers translating and localising tweets in real time. Microsoft, meanwhile, has developed translation tools to work in tandem with Wikipedia and the search engine Bing, in collaboration with users.
"Localisation and customising for individual countries is key," says Hussein Salama, the director of a Microsoft research centre in Cairo. "We need to look at how to provide Arabic speakers with better tools to communicate, because right now they have a lot to say but are underserved."
The biggest challenge in correlating speech with written text for translation is diacritics, the dots and dashes that represent short vowels – or the lack thereof – in formal Arabic text.
But while they represent one of the biggest differences between the region's dialects, they are often missing from written Arabic.
Meanwhile, dissecting online text is layered in other complications, as words are often muddied with numbers and symbols, or mix Arabic and Latin text, to account for sounds that do not transliterate easily.
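One concrete form of that muddying is the chat convention, often called "Arabizi", of using digits for Arabic sounds that Latin letters cannot capture. A hedged sketch of a single normalisation step, mapping the most widely used digit stand-ins back to their Arabic letters so downstream analysis sees consistent tokens (real systems need context to disambiguate, and the mapping varies by region):

```python
# Common Arabizi digit -> Arabic letter stand-ins.
ARABIZI_DIGITS = {
    "2": "\u0621",  # hamza (glottal stop)
    "3": "\u0639",  # ayn
    "5": "\u062e",  # khaa
    "7": "\u062d",  # haa (the emphatic 'h')
}

def normalise_arabizi(text: str) -> str:
    """Replace chat-digit stand-ins with the Arabic letters they encode."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)

print(normalise_arabizi("3arabi"))  # the '3' becomes the letter ayn
```

This only handles the digit convention; mixed Arabic-and-Latin words and inconsistent spellings need further, messier rules.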
That leaves researchers looking for a formula that takes all of those factors into account, building modules that detect patterns of words and intonation to discern context and, in turn, diacritics and dialect. A module could, for instance, recognise a higher frequency of accented key words in Egyptian Arabic compared with that of other countries.
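The kind of module described above can be sketched as a simple frequency count over dialect-marker words: score a snippet against each dialect's marker list and pick the dialect with the most hits. The marker lists below are small illustrative placeholders, not a real lexicon, and a production system would weigh intonation and phonetic features as well:

```python
from collections import Counter

# Toy lists of dialect-marker words (illustrative, not exhaustive).
DIALECT_MARKERS = {
    "egyptian": {"ezzayak", "awi", "delwa2ti"},
    "iraqi": {"shlonak", "aku", "maku"},
    "levantine": {"kifak", "halla2", "shu"},
}

def guess_dialect(text: str) -> str:
    """Return the dialect whose marker words appear most often,
    or 'unknown' when no markers are found."""
    tokens = text.lower().split()
    scores = Counter()
    for dialect, markers in DIALECT_MARKERS.items():
        scores[dialect] = sum(tok in markers for tok in tokens)
    best, hits = scores.most_common(1)[0]
    return best if hits else "unknown"

print(guess_dialect("shlonak aku khubuz"))  # -> iraqi
```

In practice the scoring would be probabilistic and combined with acoustic cues, but the principle is the one described: dialect-specific words occur at tell-tale frequencies.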
At Google, speech recognition experts have broken Arabic down to four major dialects: Egyptian; other North African; Arabian Gulf; and Levantine, the dialect spoken by Syrians, Lebanese and Jordanians. Iraqi is sometimes grouped with the Gulf dialect, although those involved in the project agree it is unique enough to warrant its own category.
And they have reached the same conclusion as their counterparts at Carnegie Mellon: the various dialects are so acoustically different that a single voice recognition system cannot work for them all. Each needs its own recognition and translation system.
"Because the dialectical forms are in some cases very extreme, you can have a person from Morocco not understanding someone from the Gulf," said Pedro Moreno, a New York-based senior researcher leading the speech engineering group for Google's Android division.
"Rough sounds such as the 'ga' in Egyptian do not exist in Lebanese, while the 'ja' sound does not exist in Cairo dialect but does in nearly every other dialect.
"When mapping the phonetic structure of a word and the acoustic distribution of each phoneme, it is just too diverse."
Recognising those differences in phonetic habits can also help in identifying a speaker's regional dialect, even when they are speaking Modern Standard Arabic, classical Arabic, or even English.
"In Egyptian Arabic, almost every word is accented, so when they speak English, you can tell they are Egyptian," said Fadi Biadsy, another senior researcher on Google's team in New York.
It is an important tool to have. Google's module for Gulf Arabic contains at least 60,000 regularly used English words.
"Arabic speakers tend to change between dialects and languages very freely, and we are forced to model that. In North Africa, there is a lot of code-switching, which has caused some delay there."
There is a long road ahead in perfecting databases of dialectal speech, researchers agree. And for some, time may be running out.
"In remote provinces where there is not a local paper or published form of the language they speak, or if the local people do not know how to read and write, we literally have to establish a written language for them to build a dialect system," said Dr Waibel.
"That's obviously a laborious effort, but without it, their dialect could die out. And without it, people in the Arabic world cannot communicate with one another, and that really is a shame."
Updated: July 6, 2013 04:00 AM