Character Frequency Analysis for Kannada

30.05.2025 04:40 PM, by Baskaran Sankaran

Continuing our initiative to develop optimized, user-friendly keyboard layouts for Indic languages, we conducted a detailed character frequency analysis for Kannada. Using large-scale monolingual datasets collected from diverse web sources, the study aimed to uncover real-world usage patterns of Kannada characters and so inform better design decisions for digital tools. This post presents the findings of that analysis and explains how they guided the keyboard layout design for Kannada.

Dataset Overview

To ensure statistical robustness, we collected a large Kannada dataset from multiple sources on the web, including:

IndicCorp
Leipzig Kannada Corpus
Wikipedia articles
Samanantar v0.3 (En-Indic; Indic-Indic)
Kannada News Dataset

After pre-processing and cleaning the data, we arrived at the following statistics:

Total tokens: 1.567 billion
Unique tokens: 14.825 million
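
As a rough illustration of how these two figures can be computed, the sketch below counts total and unique tokens. It assumes plain whitespace tokenization and uses a hypothetical helper name, token_stats; the post does not state how tokens were actually defined or counted.

```python
from collections import Counter


def token_stats(lines):
    """Return (total_tokens, unique_tokens) for an iterable of text lines.

    Assumption: a token is any whitespace-separated string.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


# Two short sample sentences; the second repeats two words of the first.
sample = [
    "ನಾನು ಮನೆಗೆ ಹೋಗುತ್ತೇನೆ",
    "ನಾನು ಶಾಲೆಗೆ ಹೋಗುತ್ತೇನೆ",
]
total, unique = token_stats(sample)
```

For a corpus of this size the lines would be streamed from disk rather than held in a list, but the counting logic stays the same.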

This large and diverse dataset provided a solid foundation for analyzing character usage trends in Kannada as it is used today.

Overall Character Frequency Analysis

The core of our analysis is a character frequency heatmap, where vowels are represented in columns and consonants in rows. Each cell in the matrix reflects the frequency of a specific vowel, consonant or consonant-vowel (CV) combination, with color intensity ranging from light yellow (low frequency) to dark purple (high frequency).

High-frequency vowels like ಅ (a), ಇ (i), and ಉ (u) appear prominently across many consonants, indicating their central role in Kannada phonology. At the same time, characters such as ನ (n), ಮ (m), and ರ (r) show consistently high usage across vowel combinations, suggesting their foundational presence in the language.

An interesting pattern observed in the heatmap is that hard consonants (such as ಗ (ga), ಡ (dda), and ದ (da)) generally appear more frequently than their soft counterparts (such as ಕ (ka), ಟ (tta), and ತ (ta)). This trend suggests a phonetic preference unique to Kannada, differing from other closely related languages like Telugu and Malayalam. These insights are particularly valuable for keyboard layout optimization, as they help prioritize the placement of more frequently used characters for improved typing efficiency.
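
The consonant-by-vowel matrix behind such a heatmap can be tallied along the following lines. This is a minimal sketch, not the code used in the study: the function name cv_counts, the "V" row label for standalone vowels, and the decision not to count signs such as the anusvara separately are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u0CCD"  # Kannada sign virama (ardhakshara)
A = "\u0C85"       # Kannada letter a, the inherent vowel

# Dependent vowel sign -> independent vowel used as the column label.
SIGN_TO_VOWEL = {
    "\u0CBE": "\u0C86",  # aa
    "\u0CBF": "\u0C87",  # i
    "\u0CC0": "\u0C88",  # ii
    "\u0CC1": "\u0C89",  # u
    "\u0CC2": "\u0C8A",  # uu
    "\u0CC3": "\u0C8B",  # vocalic r
    "\u0CC6": "\u0C8E",  # e
    "\u0CC7": "\u0C8F",  # ee
    "\u0CC8": "\u0C90",  # ai
    "\u0CCA": "\u0C92",  # o
    "\u0CCB": "\u0C93",  # oo
    "\u0CCC": "\u0C94",  # au
}
INDEPENDENT_VOWELS = {A} | set(SIGN_TO_VOWEL.values())


def is_consonant(ch):
    """True for assigned letters in the Kannada consonant range (ka..ha)."""
    return "\u0C95" <= ch <= "\u0CB9" and unicodedata.category(ch) == "Lo"


def cv_counts(text):
    """Count (row, column) cells of the consonant-by-vowel matrix.

    Rows are consonants, or "V" for standalone vowels. Columns are vowels,
    plus the virama for pure consonants.
    """
    # NFC composes two-part vowel signs into single code points.
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    for i, ch in enumerate(text):
        if ch in INDEPENDENT_VOWELS:
            counts[("V", ch)] += 1
        elif is_consonant(ch):
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == VIRAMA:
                column = VIRAMA
            else:
                # No vowel sign after the consonant means the inherent "a".
                column = SIGN_TO_VOWEL.get(nxt, A)
            counts[(ch, column)] += 1
    return counts


# The word "Kannada": ka, na + virama, na, dda.
word_counts = cv_counts("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1")
```

Keeping the virama as its own column matches the heatmap description in the next section, where the last cell of the vowel heatmap holds the virama count.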

Vowel Frequency Analysis

To better understand the distribution of vowel usage in Kannada, we analyzed the frequency of individual vowel sounds using a dedicated heatmap by summing the frequencies across the rows from the above heatmap. The results reveal clear patterns in how vowels are used across the language:

Heatmap showing the usage frequency of Kannada vowel sounds. Frequencies range from 0M to 1.5B, with colors from purple to yellow indicating high to low frequencies respectively.

As in other Indic languages, the short vowels are more frequent than their long counterparts. The vowels ಅ (a), ಇ (i), and ಆ (aa) are the most frequently used, either as standalone vowels or as vowel signs in a consonant-vowel (CV) combination. Vowels like ಉ (u) and ಎ (e) also show significant overall usage. The ardhakshara (virama) sign occurs more than a billion times in this dataset (last cell in the heatmap), underscoring the frequent use of pure consonants in Kannada. This pattern aligns with other Dravidian languages such as Telugu and Malayalam, where the virama plays a similarly prominent role in forming consonant clusters and suppressing inherent vowels.

However, a notable divergence emerges in Kannada: the akāra CV form (first cell) exhibits a higher frequency than the virama, a trend not observed in other Dravidian languages, where virama usage often surpasses or is comparable to akāra CV combinations. This suggests a greater reliance on vowel-led syllables in Kannada, possibly reflecting phonotactic or orthographic preferences unique to the language.
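
The per-vowel totals discussed in this section amount to summing the frequency matrix down each vowel column. A minimal sketch follows, assuming the matrix is stored as a dictionary keyed by (consonant row, vowel column); the function name, the romanized labels, and the toy counts are invented for illustration and are not the study's figures.

```python
from collections import defaultdict


def vowel_totals(matrix):
    """Sum a {(row, column): count} matrix down each vowel column."""
    totals = defaultdict(int)
    for (_row, column), count in matrix.items():
        totals[column] += count
    return dict(totals)


# Toy counts, invented for illustration only.
toy_matrix = {
    ("ka", "a"): 5,
    ("ka", "i"): 2,
    ("ka", "virama"): 3,
    ("na", "a"): 4,
    ("na", "virama"): 1,
}

totals = vowel_totals(toy_matrix)
# Compare the akara column with the virama column, as the text does.
akara_exceeds_virama = totals["a"] > totals["virama"]
```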

Consonant Frequency Analysis

We then analyzed the frequency of each consonant by performing a row-wise sum over the frequency matrix. This gives the frequency of each consonant across all its CV forms, as captured in the heatmap below. The first cell captures the frequency of all the standalone vowel forms and can be ignored in this analysis.

Heatmap titled 'Kannada Consonant Sounds Usage Frequency Heatmap' showing the frequency of various Kannada consonants in millions. Darker colors indicate higher usage.

The consonants ರ್ (r), ದ್ (d), and ತ್ (t) are the most frequently used, highlighting their central role in Kannada phonology and word formation. Characters like ಕ್ (k) and ಗ್ (g) also show high usage, each exceeding 360M occurrences, indicating their importance in everyday vocabulary. Aspirated and nasal consonants such as ಘ್ (gh), ಙ್ (ng), ಝ್ (jh), and ಞ್ (ny) appear very infrequently, with some registering near-zero usage. These are typically found in Sanskrit-derived or less common words.

As noted earlier, the hard consonants are slightly more frequent than their soft counterparts (except for ಕ್ (k) and ಪ್ (p)). However, in keeping with the conventions followed in both the Kannada InScript layout and the Varta keyboard layouts for other Indic languages, we chose to place the soft consonants on the home or bottom row and to assign the hard consonants to the top row. This arrangement maintains consistency across layouts and supports more intuitive typing patterns.
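
The row-wise sum and the pairwise comparison described in this section can be sketched as follows. The dictionary layout, the "V" label for the standalone-vowel row, the function names, and the toy counts are assumptions made for illustration; they are not the study's data or code.

```python
from collections import defaultdict


def consonant_totals(matrix, skip_rows=("V",)):
    """Sum a {(row, column): count} matrix across each consonant row.

    Rows listed in skip_rows (here the standalone-vowel row) are left out.
    """
    totals = defaultdict(int)
    for (row, _column), count in matrix.items():
        if row not in skip_rows:
            totals[row] += count
    return dict(totals)


def more_frequent(totals, pairs):
    """For each (first, second) pair, return the member with the larger total.

    Ties go to the second member.
    """
    return {
        (first, second): first
        if totals.get(first, 0) > totals.get(second, 0)
        else second
        for first, second in pairs
    }


# Toy counts, invented for illustration only.
toy_matrix = {
    ("V", "a"): 10,
    ("ga", "a"): 3,
    ("ga", "i"): 1,
    ("ka", "a"): 6,
    ("da", "a"): 7,
    ("ta", "a"): 5,
}

row_totals = consonant_totals(toy_matrix)
winners = more_frequent(row_totals, [("ga", "ka"), ("da", "ta")])
```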

Special Characters and Ligatures

The Kannada script, like many Indic scripts, includes a variety of ligatures and conjunct characters. To support the most commonly used consonant conjuncts, we've assigned ತ್ರ್ (tr), ಕ್ಷ್ (kss), ಶ್ರ್ (shr), and ಜ್ಞ್ (jny) to the Alt + Shift positions of specific keys on desktop keyboards. On mobile keyboards, where the Alt key is not available, these conjuncts can be accessed by long-pressing the keys Y, U, I, and O, respectively. This design ensures that these frequently used clusters remain easily accessible across both desktop and mobile platforms.
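
The mobile long-press assignments listed above could be represented as a simple lookup table. The sketch below is illustrative only: the names conjunct, LONG_PRESS, and long_press are hypothetical, and the desktop Alt + Shift keys are omitted because the post does not name them.

```python
VIRAMA = "\u0CCD"  # Kannada sign virama


def conjunct(*consonants):
    """Join consonants into a cluster, with a virama after each one."""
    return "".join(consonant + VIRAMA for consonant in consonants)


# Long-press key -> conjunct, following the Y, U, I, O order in the text.
LONG_PRESS = {
    "Y": conjunct("\u0CA4", "\u0CB0"),  # tr  (ta + ra)
    "U": conjunct("\u0C95", "\u0CB7"),  # kss (ka + ssa)
    "I": conjunct("\u0CB6", "\u0CB0"),  # shr (sha + ra)
    "O": conjunct("\u0C9C", "\u0C9E"),  # jny (ja + nya)
}


def long_press(key):
    """Return the conjunct for a long-pressed key, or None if unassigned."""
    return LONG_PRESS.get(key.upper())
```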

Keyboard Design Implications

Our character frequency analysis for Kannada offers valuable insights for anyone working with the Kannada keyboard layout. From native speakers to developers building Kannada phonetic keyboards, understanding which characters are most frequently used can enhance typing efficiency and user experience. This data-driven look at the Kannada keyboard supports smarter design and localization.

The insights from this frequency analysis play a key role in the design of the Varta Kannada keyboard. Frequently used characters are placed in more accessible positions (home row or index finger positions), while less common ones are assigned to secondary layers (e.g., Shift or long-press positions). This ensures a balance between comprehensive script coverage and ease of use. To improve usability, we opted not to include standalone vowel signs. Instead, vowel signs are generated automatically from consonant-vowel sequences, which simplifies the layout and minimizes input errors.
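
One way such automatic vowel-sign generation can work is sketched below. It assumes that consonant keys emit a pure consonant (consonant plus virama) and that vowel keys emit independent vowels; the post does not describe Varta's actual mechanism, and the compose function is a hypothetical illustration.

```python
VIRAMA = "\u0CCD"  # Kannada sign virama
A = "\u0C85"       # Kannada letter a, the inherent vowel

# Independent vowel -> dependent vowel sign.
VOWEL_TO_SIGN = {
    "\u0C86": "\u0CBE",  # aa
    "\u0C87": "\u0CBF",  # i
    "\u0C88": "\u0CC0",  # ii
    "\u0C89": "\u0CC1",  # u
    "\u0C8A": "\u0CC2",  # uu
    "\u0C8B": "\u0CC3",  # vocalic r
    "\u0C8E": "\u0CC6",  # e
    "\u0C8F": "\u0CC7",  # ee
    "\u0C90": "\u0CC8",  # ai
    "\u0C92": "\u0CCA",  # o
    "\u0C93": "\u0CCB",  # oo
    "\u0C94": "\u0CCC",  # au
}


def compose(keystrokes):
    """Turn a sequence of key outputs into Kannada text.

    A vowel typed right after a pure consonant replaces the trailing virama
    with the matching vowel sign; "a" simply removes the virama.
    """
    out = ""
    for unit in keystrokes:
        if out.endswith(VIRAMA) and unit == A:
            out = out[:-1]
        elif out.endswith(VIRAMA) and unit in VOWEL_TO_SIGN:
            out = out[:-1] + VOWEL_TO_SIGN[unit]
        else:
            out += unit
    return out


# ka-virama followed by the vowel i composes to ka + vowel sign i.
ki = compose(["\u0C95\u0CCD", "\u0C87"])
```

Under these assumptions a vowel typed at the start of a word, or after another vowel, stays in its independent form, so no separate keys for vowel signs are needed.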

You can explore our optimized keyboard layout through the Varta Keyboard apps, available on Android and iOS, as well as through browser extensions for Chrome, Edge, and Safari.

Explore Frequency Analyses in Other Languages

We’ve performed similar analyses for other Indian languages as well. Explore them below:

Character Frequency Analysis for Hindi
Character Frequency Analysis for Malayalam
Character Frequency Analysis for Tamil (includes design principles)
Character Frequency Analysis for Telugu