The present input methods for Tamil provide reasonable support for using the language on computers and other devices, but they have several shortcomings. We briefly explore the current input methods and their shortcomings before proposing a new Input Method for Tamil that can be used on both keyboard-based and touch devices.
Current Input Methods:
Presently, three input methods are in wide use:
- Tamil99
- Murasu Anjal
- Google Keyboard (GBoard)
The inadequacy of the existing input methods for Tamil, and the idea of a new input method for the language, have been around for several years [2]. More recently, Elango Cheran published a detailed blog post [1] that not only explains the shortcomings of the current input methods but also expounds his idea of exploiting the phonemic nature of the Tamil alphabet for a new input method. Here is a brief summary of the key shortcomings of the current input methods.
Unnatural Design: In Tamil, consonant vowels (CVs) such as 'க' and 'தை' are formed by combining the pure consonants 'க்' and 'த்' with the vowels 'அ' and 'ஐ' respectively. Thus the vowels and the consonants, which form the basic units of sound (phonemes) in Tamil, should be the basis for designing a good input method. Several input methods, including Tamil99, instead follow the unnatural design of using vowels and consonant vowels such as 'க', 'ங' and 'ச' as the basic units of the keyboard. To be fair, this design came to be used because these CV characters
Coming back to the input methods, one unintended consequence of this design approach is that it can produce illegal character sequences in Tamil, for example dangling consonant vowel modifiers such as '்', '
Transliteration Dependency: Murasu Anjal and other transliteration keyboards rely on users being familiar with the English alphabet and typing Tamil words in transliterated English letters, which are then converted back to Tamil. This method is hugely popular due to the high English literacy among Tamil speakers around the world. However, we believe it does more harm than good, because speakers no longer have to learn the script but only the sounds of the language. Secondly, there are multiple ways to represent a Tamil character in English, because of the differences between the sounds of the two languages.
Non-adherence to QWERTY layout: Most Tamil keyboards do not adhere to the widely used QWERTY layout in the key positions reserved for punctuation marks and other symbols. These input methods assign Tamil characters to those positions. Consequently, bilingual users of QWERTY find it difficult to switch back and forth between English and Tamil typing, and are forced to learn different key positions for punctuation marks and symbols while typing in Tamil.
GBoard Design Incongruity: The Google Keyboard, or GBoard, for Tamil is a soft key layout launched for touch devices. It lays out the vowels and consonant vowels ('அ' வரிசை such as 'க', 'ங', 'ச', ...) in a 9x4 matrix. The vowel panel on the left changes every time a consonant vowel is pressed, to show that character's other CV variations. The layout simply places the characters in their alphabetical sequence, without any concern for optimizing finger movement or for character frequency. Combined with the incongruity of the ever-changing vowel panel, GBoard's design is probably the least efficient Tamil key layout in use.
A New Input Method - Design Principles:
We want to design a new Input Method for Tamil that addresses the shortcomings of the existing ones and has a short learning curve. Based on our research, we decided on the following design goals for the new input method.
- A design that adheres to and exploits the Phonemic nature of Tamil, taking the phonemes as the basic unit
- Frequency analysis of the base phonemes and consonant vowel combinations, in order to achieve an optimal design that speeds up touch typing on computers and, equivalently, reduces finger movement on touch devices
- An intuitive arrangement of keys that is consistent across platforms and devices, to make learning easier
- Prevent any illegal character sequences in the output text
- Maximize compatibility with the QWERTY keyboard to make the transition between English and Tamil typing seamless
- Eliminate the requirement for the user to know another script or language, and instead facilitate typing in the Tamil script
Note that the same design principles could be used to design better input methods for other languages written in Abugida scripts as well.
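To make the first and fourth goals concrete, here is a minimal sketch (not the actual implementation of any input method) of how a sequence of phonemes, each either a vowel or a pure consonant, could be composed into Tamil Unicode text. The function names `compose` and `has_dangling_sign` are hypothetical, and Grantha conjuncts and the aytham are not handled. Because a vowel sign is only ever produced by attaching a vowel to a preceding pure consonant, the output cannot contain a dangling modifier.

```python
PULLI = "\u0bcd"  # the pulli mark that turns a consonant letter into a pure consonant

# Independent vowel -> dependent vowel sign written after a consonant.
VOWEL_SIGNS = {
    "\u0b85": "",        # அ: inherent vowel, written with no sign
    "\u0b86": "\u0bbe",  # ஆ
    "\u0b87": "\u0bbf",  # இ
    "\u0b88": "\u0bc0",  # ஈ
    "\u0b89": "\u0bc1",  # உ
    "\u0b8a": "\u0bc2",  # ஊ
    "\u0b8e": "\u0bc6",  # எ
    "\u0b8f": "\u0bc7",  # ஏ
    "\u0b90": "\u0bc8",  # ஐ
    "\u0b92": "\u0bca",  # ஒ
    "\u0b93": "\u0bcb",  # ஓ
    "\u0b94": "\u0bcc",  # ஔ
}


def compose(phonemes):
    """Join phonemes (vowels or pure consonants such as 'க்') into Tamil text."""
    out = []
    for p in phonemes:
        if p in VOWEL_SIGNS and out and out[-1].endswith(PULLI):
            # A vowel after a pure consonant: drop the pulli, attach the sign.
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[p]
        else:
            # A pure consonant, or a vowel that stands on its own.
            out.append(p)
    return "".join(out)


def has_dangling_sign(text):
    """Return True if a vowel sign or pulli does not follow a consonant letter."""
    prev = ""
    for ch in text:
        is_sign = "\u0bbe" <= ch <= "\u0bcd"
        if is_sign and not ("\u0b95" <= prev <= "\u0bb9"):
            return True
        prev = ch
    return False
```

For example, `compose(["க்", "அ"])` gives 'க' and `compose(["த்", "ஐ"])` gives 'தை', matching the combinations described earlier.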
Designing the Input Method:
We started the design by identifying Tamil corpus data for analyzing the usage frequency of the characters in the language. We identified sufficiently large corpora (approx. 537M words) from two different sources, listed below:
- Kaggle - Tamil Language Corpus for NLP
- Tamil Articles Corpus
- Tamil New Corpus
- Tamil Language Corpus
- Github - Opensource Tamil Corpus
- Wikipedia, TheHindu - 58M words in total
Frequency Analysis:
Our goal is to understand the usage frequency of the basic phonemes as well as of the full set of Tamil characters, including the consonant vowel characters. Once we understand these frequencies, we can exploit the information to design the keyboard layout. It should be noted that we have omitted the Sanskritized characters (வடமொழி எழுத்துக்கள்) 'ஜ்', 'ஶ்', 'ஷ்',
Among the top-10 most frequent characters, we have 5 pure consonants, 4 'அ' ending CVs and 1 'உ' ending CV:
- Consonants: ம், ர், ல், க் and ன் - 349.13M
- 'அ' ending CVs: க, த, ப and வ - 306.19M
- 'உ' ending CV: து - 64.63M
Thus, using the base phonemes (vowels and pure consonants) as the keys of our keyboard layout saves one keystroke on every pure consonant and costs one extra keystroke on every 'அ' ending CV, a net saving of nearly 43M keystrokes (349.13M - 306.19M) for this dataset. Now consider two more heatmaps: i) by characters ending with each vowel sound (column-wise sum of the above heatmap) and ii) by the characters of each consonant-vowel series (row-wise sum).
Notice that the pure consonants (right-most cell) tend to be more frequent than any consonant vowel series. Here again, by using the basic phonemes as the keys instead of the 'அ' ending consonant vowels, these pure consonants can be typed with a single key press as opposed to two, saving about 37M keystrokes on this dataset. Also notice that the vowels and CVs ending in a short vowel sound are much more frequent than their long counterparts.
Using the frequency statistics of the vowels (first two in the first heatmap) and the consonant vowels above, we can design an optimal keyboard layout that minimizes finger movement and uses the dominant fingers for the high-frequency phonemes. The next section discusses the design decisions and explains our Tamil Phonemic keyboard layout.
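The kind of counting behind these heatmaps can be sketched as follows. This is an illustrative approximation, not the script used for the analysis: the names `UNIT`, `classify` and `frequencies` are hypothetical, and the regular expression only recognizes vowels, pure consonants and simple consonant vowels in the Tamil Unicode block.

```python
import re
from collections import Counter

PULLI = "\u0bcd"

# One written unit: a consonant letter with an optional vowel sign or pulli,
# or an independent vowel. Everything else (spaces, digits, Latin) is skipped.
UNIT = re.compile(
    "[\u0b95-\u0bb9][\u0bbe-\u0bcd]?"
    "|[\u0b85-\u0b94]"
)


def classify(unit):
    """Label a unit as vowel, pure_consonant, a_cv ('அ' ending) or other_cv."""
    if len(unit) == 1:
        return "vowel" if unit <= "\u0b94" else "a_cv"
    return "pure_consonant" if unit[1] == PULLI else "other_cv"


def frequencies(text):
    """Return (per-unit counts, per-class counts) for a piece of Tamil text."""
    units = UNIT.findall(text)
    return Counter(units), Counter(classify(u) for u in units)


# Net saving for the top-10 characters, in millions of keystrokes,
# using the totals reported above.
saving_m = 349.13 - 306.19
```

Running `frequencies` over a whole corpus and grouping the per-unit counts by consonant or by vowel ending would give the row-wise and column-wise sums discussed above.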
Phonemic Keyboard Layout Design for Tamil:
As we mentioned earlier in our design goals, we want the new keyboard layout to be easy for users to learn and use across different devices. Given the constraints of the available keys and the total number of keys required, we had to make certain design decisions in assigning characters to key positions.
- We want to use Home row of the keyboard for the Vowels and some high-frequency Consonants in the language
- We also believe that the dominant index and middle finger keys in two non-Home rows should take precedence over the Home row keys with weaker fingers. We'll be using this later in optimizing the character assignment to the keys.
- Given the high frequency of characters ending in short vowels, we have retained the short vowels in the Home row and assigned the corresponding long vowels to the same keys in the Shift layout. Following the wider convention, we have assigned the vowels to the left side of the keyboard.
- Most of the consonants are assigned to the right-hand side of the keyboard to exploit the dominant hand of the majority of the population. Note that this arrangement allows the CVs to be typed efficiently with a mix of both hands, without making the same hand or finger move to a different position to type a single character.
- Based on our observation #2 above, the frequent consonants, starting with 'க்' and 'த்', are assigned to the dominant finger positions in the Home row and the rows above and below it. We assign the rest of the consonants to the weaker key positions in decreasing order of frequency.
- We now specifically consider the case of மெல்லினம் (nasal consonants), which are typically followed by the corresponding வல்லினம் (plosive/ stop consonant) in Tamil text. Thus, it made sense for us to place these nasal consonants on the left side of the keyboard (above and below the Home row) so that the following வல்லினம் can be typed with the right hand. We made an exception for 'ம்' and assigned it to a dominant key position on the right side, due to its high frequency in both consonant and CV forms.
- On the Shift key layout, we assigned the Tamil numerals right below the corresponding Western (Arabic) numerals to make typing them intuitive. Additionally, the Sanskritized consonants and other Tamil symbols are assigned to this Shift layout.
The screenshots of the new keyboard layout for the regular and Shift keys are below.
Input Method Analysis:
Based on the keyboard layouts of the three input methods, viz. Anjal, Phonemic and Tamil99, we analyzed their efficiency and ease of typing in two ways. We first calculated the absolute number of keystrokes required to type the Tamil words in the above corpora of 537M words. To keep the analysis simple, we ignored punctuation marks and any non-Tamil words or characters. We also ignored the shift key here, because it is pressed simultaneously with the key that follows it. Here is the absolute number of keystrokes required by each of the 3 input methods to type the above Tamil corpora.
- Anjal : 4,470,795,879
- Phonemic : 4,045,040,635
- Tamil99 : 4,124,838,873
The Phonemic method requires the fewest keystrokes of the 3 methods; specifically, it requires about 80M fewer keystrokes than Tamil99. This is because the pure consonants are usually more frequent than their CV combinations, and Tamil99 requires an additional keystroke, '்', to type a pure consonant. For Anjal, we used the standard transliteration mapping suggested in the Sellinam app, which requires two keystrokes for each long vowel as well as for long CV combinations.
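The difference between the two keyboard-based methods can be illustrated with a simplified keystroke model. This is a sketch under stated assumptions, not the counting code used for the figures above: the function `keystrokes` is hypothetical, it charges Tamil99 one extra key for every pure consonant and Phonemic one extra key for every 'அ' ending CV, it ignores the shift key as described, and it does not model Anjal or any context-dependent shortcuts a real implementation may have.

```python
import re

PULLI = "\u0bcd"
# Consonant with optional vowel sign or pulli, or an independent vowel.
UNIT = re.compile("[\u0b95-\u0bb9][\u0bbe-\u0bcd]?|[\u0b85-\u0b94]")


def keystrokes(text, method):
    """Count key presses for 'tamil99' or 'phonemic' under the simplified model."""
    if method not in ("tamil99", "phonemic"):
        raise ValueError("method must be 'tamil99' or 'phonemic'")
    total = 0
    for unit in UNIT.findall(text):
        if "\u0b85" <= unit[0] <= "\u0b94":
            total += 1                                  # independent vowel
        elif len(unit) == 1:
            total += 1 if method == "tamil99" else 2    # 'அ' ending CV
        elif unit[1] == PULLI:
            total += 2 if method == "tamil99" else 1    # pure consonant
        else:
            total += 2                                  # consonant key + vowel key
    return total


# Totals reported above, and each method's excess over Phonemic.
REPORTED = {
    "Anjal": 4_470_795_879,
    "Phonemic": 4_045_040_635,
    "Tamil99": 4_124_838_873,
}
excess = {name: n - REPORTED["Phonemic"] for name, n in REPORTED.items()}
```

With the reported totals, `excess["Tamil99"]` is 79,798,238 (the roughly 80M mentioned above) and `excess["Anjal"]` is 425,755,244.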
We then analysed heatmaps of the keyboard layouts of the three input methods to see which keys are typed most frequently and where they sit on the keyboard. We plotted a separate heatmap for each of the 3 keyboard layouts for this analysis. As above, we ignored punctuation marks and non-Tamil words to keep the analysis simple. However, we did consider the shift key in this analysis.
A layout is easier to type on if the frequently typed keys are in the positions of the dominant fingers of both hands or in the Home row of the keyboard. The Anjal keyboard is clearly the least efficient option, as most of its frequently used keys are outside the dominant finger positions.
In both the Phonemic and Tamil99 keyboards, the frequent keys are mostly located in the dominant finger positions, which makes typing easier. The left and right index finger positions (in all 3 rows) alone account for 61.22% and 46.32% of overall typing in the Phonemic and Tamil99 keyboards respectively. This difference of about 15 percentage points is significant and makes the Phonemic layout a better option (in terms of ease of use) than the Tamil99 keyboard.
We then look at the percentage of typing done on the Home row, which is the usual resting position of the hands when not typing. Home row keys have the advantage that the user does not have to move their hands from the resting position. In the Phonemic layout the Home row accounts for 58.33% of overall typing, while Tamil99 is slightly better at 61.83%. We believe this small difference in Home row typing is far outweighed by the advantage the Phonemic layout gains from the dominant index fingers across all the rows. In addition to typing efficiency, the new Phonemic keyboard layout offers other advantages over the Tamil99 keyboard, as discussed earlier in this post.
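The two percentages used in this comparison are simple shares of key presses. The sketch below shows how they could be computed from per-key press counts. The key sets and the helper `share` are assumptions for illustration: the index-finger keys follow the conventional touch-typing assignment on the three letter rows of a QWERTY keyboard, and the Home row is taken to be `asdfghjkl;`.

```python
from collections import Counter

# Conventional touch-typing assignment on QWERTY (assumed here).
INDEX_KEYS = set("rtfgvb" "yuhjnm")  # left and right index fingers, all 3 rows
HOME_ROW = set("asdfghjkl;")


def share(counts, keys):
    """Percentage of all key presses in `counts` that fall on `keys`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(n for k, n in counts.items() if k in keys) / total


# Gaps between the two layouts, in percentage points, from the figures above.
index_gap = 61.22 - 46.32   # Phonemic ahead on index-finger typing
home_gap = 61.83 - 58.33    # Tamil99 ahead on Home row typing
```

For a toy count such as `Counter({"f": 40, "j": 30, "a": 20, "q": 10})`, the index-finger share is 70% and the Home row share is 90%. The reported gaps work out to about 14.9 points in favour of Phonemic on the index fingers and 3.5 points in favour of Tamil99 on the Home row.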
References:
- [1] Redesigning an Input Method for an Abugida Script. Elango Cheran's Blog.
- [2] Optimization of Tamil Phonetic Keyboard. Sendhil Kumar Cheran, Thuraiappah Vaseeharan and Elango Cheran. Tamil Internet Conference. 2004.