Character Frequency Analysis for Malayalam

31.05.2025 01:22 AM By Baskaran Sankaran

As part of our efforts to build optimized, intuitive keyboard layouts for Indian languages, we’ve conducted a detailed character frequency analysis for Malayalam. This post presents our findings, showing which characters, vowels, and consonants occur most frequently in real-world usage based on a large Malayalam dataset, and briefly discusses how this data influenced the keyboard design.


For an explanation of our overall keyboard design methodology, including the rationale behind layout decisions, please refer to our Tamil keyboard design post: Designing a New Input Method for Tamil


Dataset Used for Frequency Analysis

To ensure broad linguistic coverage and reliability, we built our Malayalam corpus by combining text data from multiple sources, including:

  • IndicCorp
  • Swathantra Malayalam corpus
  • Wikipedia Malayalam corpus
  • News articles, blogs, etc.

Here are the high-level statistics of the corpora used in our analysis:

  • Total words: 1.699 billion
  • Unique words: 25.345 million

This large and diverse dataset gives us a realistic view of character usage in modern written Malayalam.
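As a rough illustration of how the two statistics above could be computed, here is a minimal sketch. It assumes simple whitespace tokenization, which may differ from the tokenization actually used for the corpus, and the function name `corpus_stats` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_words, unique_words) for whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # assumption: words are whitespace-separated
    return sum(counts.values()), len(counts)


# Tiny sample for illustration only; the real corpus is read from files.
sample = ["മലയാളം ഒരു ഭാഷ", "മലയാളം കേരളം"]
total_words, unique_words = corpus_stats(sample)
print(total_words, unique_words)
```

Streaming line by line keeps memory proportional to the number of unique words rather than to the size of the corpus.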


Overall Character Frequency Heatmap

This heatmap illustrates the relative frequency of Malayalam characters based on their usage across our dataset. The top row represents the frequency of vowels, while the last column shows the frequency of pure consonants. The remaining cells capture the usage of consonant-vowel combinations. Color intensity indicates frequency, with darker shades signifying higher usage and lighter shades indicating lower frequency.

Heatmap of overall character frequency in Malayalam, based on a 1.7B-word corpus. Darker colours indicate more frequent characters; the frequency of each character is shown in millions.

Characters such as അ (a), മ (ma), ന (na), and ക (ka) appear with consistently dark shades across their rows, indicating they are among the most frequently used. These are foundational phonemes in Malayalam and are common across both spoken and written forms.

Characters like പ (pa), ത (tha), and ല (la) show moderate frequency across their respective rows, suggesting they are contextually important but not as dominant. Their usage may vary depending on the domain (e.g., literary vs. conversational text).
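The grid behind such a heatmap can be thought of as a table of (consonant, vowel) counts. The sketch below shows one way to tally those cells from raw text. It is an assumption-laden simplification, not the exact procedure used for this analysis: it handles only the basic consonant range ക to ഹ, treats a consonant followed by the virama as a pure consonant (so the first half of a conjunct lands in that column), and ignores chillus, anusvara, and visarga. The function and constant names are hypothetical.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u0d4d"      # chandrakkala, marks a pure consonant
INHERENT_A = "\u0d05"  # അ, the vowel implied by a bare consonant

# Independent vowels whose dependent sign sits exactly 0x38 code points higher.
_VOWELS_WITH_SIGNS = [0x0D06, 0x0D07, 0x0D08, 0x0D09, 0x0D0A, 0x0D0B,
                      0x0D0E, 0x0D0F, 0x0D10, 0x0D12, 0x0D13, 0x0D14]
SIGN_TO_VOWEL = {chr(cp + 0x38): chr(cp) for cp in _VOWELS_WITH_SIGNS}
SIGN_TO_VOWEL["\u0d57"] = "\u0d14"  # au length mark, also written as the sign for ഔ


def is_consonant(ch: str) -> bool:
    return "\u0d15" <= ch <= "\u0d39"  # ക .. ഹ


def is_independent_vowel(ch: str) -> bool:
    return "\u0d05" <= ch <= "\u0d14"  # അ .. ഔ


def count_cells(text: str) -> Counter:
    """Count (consonant, vowel) cells; standalone vowels use "" as the row key."""
    text = unicodedata.normalize("NFC", text)  # compose two-part vowel signs
    cells = Counter()
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_independent_vowel(ch):
            cells[("", ch)] += 1
        elif is_consonant(ch):
            if nxt == VIRAMA:
                cells[(ch, VIRAMA)] += 1  # pure-consonant column
            else:
                # A vowel sign selects its vowel; anything else means inherent അ.
                cells[(ch, SIGN_TO_VOWEL.get(nxt, INHERENT_A))] += 1
    return cells


# "കാക്ക അമ്മ", written with escapes so the code points are unambiguous.
sample = "\u0d15\u0d3e\u0d15\u0d4d\u0d15 \u0d05\u0d2e\u0d4d\u0d2e"
cells = count_cells(sample)
```

Keying the counter by a (row, column) pair keeps the structure sparse, which is convenient because many consonant-vowel combinations are rare or absent.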


Vowel Frequency Heatmap

To gain deeper insights, we examined the frequency of Malayalam vowels both in their standalone and combined consonant-vowel forms. This analysis was derived by summing the columns of our overall character frequency chart, offering a clear view of vowel usage patterns. The results are visualized in the chart below.

Heatmap showing frequency distribution of vowels in Malayalam.

Malayalam has a rich vowel system, but a handful of vowels occur far more frequently than the rest. Further, as expected, the shorter vowels are more frequent than their longer counterparts. Note that these counts include both standalone vowels and the vowel signs in consonant-vowel forms. The last cell in the heatmap refers to the frequency of the Chandrakala (Virama) character.
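The column-summing step described in this section can be sketched as follows. The cell layout (a mapping from (row, column) pairs to counts, with "" as the row key for standalone vowels) is an assumption made for illustration, and the numbers are placeholders, not values from our corpus.

```python
from collections import Counter


def column_totals(cells: dict) -> Counter:
    """Sum counts per column, i.e. per vowel (or virama), across all rows."""
    totals = Counter()
    for (_row, column), count in cells.items():
        totals[column] += count
    return totals


# Placeholder counts for illustration only; NOT corpus values.
toy_cells = {
    ("", "\u0d05"): 3,        # standalone അ
    ("\u0d15", "\u0d05"): 5,  # ക with inherent അ
    ("\u0d15", "\u0d07"): 2,  # കി
    ("\u0d2e", "\u0d07"): 1,  # മി
    ("\u0d2e", "\u0d4d"): 4,  # മ് (virama column)
}
vowel_totals = column_totals(toy_cells)
```

Because the standalone-vowel row is included in the sum, each total covers both the independent vowel and its vowel-sign occurrences, matching the note above.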


Consonant Frequency Heatmap

In addition to vowels, we also analyzed the frequency of Malayalam consonants. This was done by summing the rows of the overall character frequency chart, which highlights how often each consonant appears across different vowel combinations. The resulting data provides a clearer picture of consonant usage patterns in the language.

Heatmap of consonant frequencies in Malayalam.

Among consonants, we can see a clear dominance of ക്, ന്, ത്, യ്, and ര്. These findings informed how we distributed consonants across the keyboard layout to minimize finger movement during typing. In line with the InScript keyboard layout convention, each unaspirated consonant was assigned to its own key, while its aspirated counterpart was placed in the corresponding shift position.
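For illustration, the row-summing step and the resulting ranking might look like the sketch below. It assumes the same hypothetical (row, column) cell layout, with "" as the row key for standalone vowels, and uses placeholder numbers rather than corpus values.

```python
from collections import Counter


def rank_consonants(cells: dict) -> list:
    """Sum counts per consonant row and return (consonant, total), most frequent first."""
    totals = Counter()
    for (row, _column), count in cells.items():
        if row:  # skip the standalone-vowel row, keyed by ""
            totals[row] += count
    return totals.most_common()


# Placeholder counts for illustration only; NOT corpus values.
toy_cells = {
    ("", "\u0d05"): 3,        # standalone അ (ignored here)
    ("\u0d15", "\u0d05"): 5,  # ക
    ("\u0d15", "\u0d07"): 2,  # കി
    ("\u0d2e", "\u0d4d"): 4,  # മ്
    ("\u0d2e", "\u0d07"): 1,  # മി
}
ranking = rank_consonants(toy_cells)
```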


Chillaksharam Frequency Heatmap

We also analyzed the frequency of chillaksharam (ചില്ലക്ഷരം)—the special consonant forms used at the end of syllables in Malayalam, such as ൽ, ൻ, ൾ, and ൿ. These characters are essential for accurate representation of the language and are frequently used in written text.

Frequency of Malayalam chillaksharams in our dataset.

Following our keyboard design principles, each chillaksharam is placed in the Alt + Shift position of the corresponding base consonant key. For example, the character ൿ is mapped to the Alt + Shift position of the ക് key. Since mobile keyboards typically do not include an Alt key, these chillaksharam characters are made accessible through long-press gestures on their respective base keys. So, long-pressing the ക് key on a mobile keyboard would reveal ൿ, ensuring consistency and ease of access across platforms.
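The lookup described above could be modelled roughly as follows. Only the ക to ൿ pairing is stated in this post; the other pairs follow the standard Unicode chillu letters and are assumed here to map in the same way. The function name and its parameters are hypothetical, not part of the actual keyboard implementation.

```python
from typing import Optional

# Base consonant -> chillu form. Only the first pair is stated in the post;
# the remaining pairs are assumptions based on the Unicode chillu letters.
CHILLU_FOR_BASE = {
    "\u0d15": "\u0d7f",  # ക -> ൿ
    "\u0d28": "\u0d7b",  # ന -> ൻ
    "\u0d32": "\u0d7d",  # ല -> ൽ
    "\u0d33": "\u0d7e",  # ള -> ൾ
    "\u0d23": "\u0d7a",  # ണ -> ൺ
    "\u0d30": "\u0d7c",  # ര -> ർ
}


def chillu_for(base: str, *, alt: bool = False, shift: bool = False,
               long_press: bool = False) -> Optional[str]:
    """Return the chillu for a base consonant key, or None if not requested/available.

    Desktop uses Alt + Shift; mobile uses a long press on the same key.
    """
    if (alt and shift) or long_press:
        return CHILLU_FOR_BASE.get(base)
    return None


desktop = chillu_for("\u0d15", alt=True, shift=True)  # Alt + Shift on the ക് key
mobile = chillu_for("\u0d15", long_press=True)        # long press on the ക് key
```

Routing both input paths through one table is what keeps the desktop and mobile behaviour consistent.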


Impact on Keyboard Layout Design

Whether you're using a Malayalam phonetic keyboard or a standard Malayalam keyboard, these insights can guide better input method development. As part of our design, we excluded individual vowel signs and instead generate them dynamically from consonant-vowel combinations. This approach reduces complexity and prevents invalid character sequences.
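To make the idea of generating vowel signs dynamically more concrete, here is a minimal sketch. It assumes the user types a pure consonant (consonant plus virama) followed by an independent vowel, and that the input method replaces the pair with the consonant plus the matching vowel sign. This input model and the function name `compose` are assumptions for illustration; the actual keyboard logic may differ.

```python
VIRAMA = "\u0d4d"      # chandrakkala
INHERENT_A = "\u0d05"  # അ has no sign; it is implied by the bare consonant

# Independent vowel -> dependent vowel sign (Unicode Malayalam block).
VOWEL_TO_SIGN = {
    "\u0d06": "\u0d3e",  # ആ -> ാ
    "\u0d07": "\u0d3f",  # ഇ -> ി
    "\u0d08": "\u0d40",  # ഈ -> ീ
    "\u0d09": "\u0d41",  # ഉ -> ു
    "\u0d0a": "\u0d42",  # ഊ -> ൂ
    "\u0d0b": "\u0d43",  # ഋ -> ൃ
    "\u0d0e": "\u0d46",  # എ -> െ
    "\u0d0f": "\u0d47",  # ഏ -> േ
    "\u0d10": "\u0d48",  # ഐ -> ൈ
    "\u0d12": "\u0d4a",  # ഒ -> ൊ
    "\u0d13": "\u0d4b",  # ഓ -> ോ
    "\u0d14": "\u0d4c",  # ഔ -> ൌ
}


def compose(pure_consonant: str, vowel: str) -> str:
    """Combine a pure consonant (e.g. ക്) with an independent vowel (e.g. ഇ)."""
    if len(pure_consonant) != 2 or pure_consonant[1] != VIRAMA:
        raise ValueError("expected a consonant followed by the virama")
    base = pure_consonant[0]
    if vowel == INHERENT_A:
        return base  # dropping the virama yields the inherent-vowel form
    if vowel not in VOWEL_TO_SIGN:
        raise ValueError("not a supported independent vowel")
    return base + VOWEL_TO_SIGN[vowel]


ki = compose("\u0d15\u0d4d", "\u0d07")  # ക് + ഇ
ka = compose("\u0d15\u0d4d", "\u0d05")  # ക് + അ
```

Since a vowel sign is only ever emitted directly after a consonant, sequences such as a dangling sign or two signs in a row cannot be produced by this path.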


These frequency trends guided the layout of our Malayalam keyboard. High-frequency characters were assigned to the primary home row or easy index-finger positions, reducing typing effort and speeding up input. Low-frequency characters were positioned in secondary or long-press locations.
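The general principle of giving the most frequent characters the easiest positions can be sketched as a simple greedy assignment. This is only an illustration of the idea: the slot names, their ordering, and the counts below are hypothetical placeholders, and the real layout also reflects other design constraints.

```python
def assign_slots(freqs: dict, slots: list) -> dict:
    """Map characters to slots, most frequent character to the easiest slot.

    `slots` must be ordered from easiest to hardest to reach. Ties in
    frequency are broken by code point so the result is deterministic.
    """
    ranked = sorted(freqs, key=lambda ch: (-freqs[ch], ch))
    return dict(zip(ranked, slots))


# Placeholder counts and slot names for illustration only; NOT corpus values
# and NOT the actual key positions of the layout.
toy_freqs = {"\u0d18": 1, "\u0d15": 9, "\u0d2e": 7}  # ഘ, ക, മ
toy_slots = ["home row, index finger", "home row, middle finger", "secondary position"]
layout = assign_slots(toy_freqs, toy_slots)
```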


You can explore our optimized keyboard layout through the Varta Keyboard apps, available on Android and iOS, as well as through browser extensions for Chrome, Edge, and Safari.


Explore Frequency Analyses in Other Languages

We’ve performed similar analyses for other Indian languages as well. Explore them below:

  • Character Frequency Analysis for Hindi
  • Character Frequency Analysis for Kannada
  • Character Frequency Analysis for Tamil (includes design principles)
  • Character Frequency Analysis for Telugu
