Character Frequency Analysis for Telugu

30.05.2025 04:39 PM, by Baskaran Sankaran

As part of our ongoing effort to create optimized and accessible keyboard layouts for Indic languages, we carried out an in-depth character frequency analysis for Telugu. Drawing on large-scale monolingual datasets sourced from across the web, the study identifies real-world usage patterns of Telugu characters. This post presents the findings of that analysis and explains how they guided our keyboard layout design for the language.


Dataset Summary

Our Telugu analysis is based on a large and diverse dataset compiled from multiple sources on the web, including the following key resources:

  • IndicCorp
  • Leipzig Telugu Corpus
  • Samanantar v0.3 (En-Indic; Indic-Indic)
  • Telugu News Articles Dataset
  • Telugu Books Dataset

After pre-processing, this dataset contained over 1.45 billion total words and 12.3 million unique words. Its size and diversity provide a strong statistical foundation for understanding character usage patterns in Telugu, helping us design an optimal keyboard layout.


Overall Character Frequency Heatmap

We created a heatmap to visualize the frequency of Telugu characters, with vowels in the columns and consonants in the rows. Each cell represents the frequency of a basic vowel, a consonant, or a consonant-vowel (CV) combination, with darker shades indicating higher usage.

A heatmap showing the usage frequency of Telugu letters, with vowels as columns and consonants as rows. Darker colors represent higher frequencies, highlighting commonly used characters.

Vowels such as అ (a) dominate the frequency spectrum, similar to how 'e' and 'a' are dominant in English. Consonants like క (ka), న (na), and ల (la) also show high usage, reflecting their foundational role in the Telugu script. A number of characters, particularly aspirated or Sanskrit-derived forms, show very low or near-zero usage, indicating their limited role in everyday Telugu. We also note a distinctive pattern in which certain long vowels, such as ఓ (oo), are more frequent than their short counterparts, a trend not commonly seen in other Indic scripts.
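The matrix behind a heatmap like this can be tallied in a single pass over the text. The following is a minimal Python sketch, not the pipeline used for this study. It assumes plain Unicode Telugu input, counts a bare consonant as its అ (a) form, counts a consonant followed by the pollu as a pure consonant, and ignores every other character. The function name `cv_counts` and the empty-string row used for standalone vowels are illustrative choices.

```python
from collections import Counter

VIRAMA = "\u0C4D"   # pollu: marks a pure consonant
SHORT_A = "\u0C05"  # independent vowel a, the inherent vowel of a bare consonant

# Dependent vowel sign -> independent vowel used as the column label.
SIGN_TO_VOWEL = {
    "\u0C3E": "\u0C06",  # aa
    "\u0C3F": "\u0C07",  # i
    "\u0C40": "\u0C08",  # ii
    "\u0C41": "\u0C09",  # u
    "\u0C42": "\u0C0A",  # uu
    "\u0C43": "\u0C0B",  # vocalic r
    "\u0C46": "\u0C0E",  # e
    "\u0C47": "\u0C0F",  # ee
    "\u0C48": "\u0C10",  # ai
    "\u0C4A": "\u0C12",  # o
    "\u0C4B": "\u0C13",  # oo
    "\u0C4C": "\u0C14",  # au
}
INDEPENDENT_VOWELS = {SHORT_A, *SIGN_TO_VOWEL.values()}


def is_consonant(ch):
    """True for characters in the Telugu consonant range (ka .. ha)."""
    return "\u0C15" <= ch <= "\u0C39"


def cv_counts(text):
    """Count (consonant, vowel) cells; consonant is "" for standalone vowels."""
    counts = Counter()
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_consonant(ch):
            if nxt == VIRAMA:              # pure consonant (pollu column)
                counts[(ch, VIRAMA)] += 1
                i += 2
            elif nxt in SIGN_TO_VOWEL:     # consonant + vowel sign
                counts[(ch, SIGN_TO_VOWEL[nxt])] += 1
                i += 2
            else:                          # bare consonant carries inherent a
                counts[(ch, SHORT_A)] += 1
                i += 1
        elif ch in INDEPENDENT_VOWELS:     # vowel in its base form
            counts[("", ch)] += 1
            i += 1
        else:                              # anything else is skipped
            i += 1
    return counts
```

Each key of the returned counter corresponds to one cell of the matrix, so the heatmap can be drawn directly from it with any plotting library.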


Vowel Frequency Analysis

By calculating the column-wise sum of the character frequency matrix above, we obtain the frequency of each vowel, whether it appears in its base form or as a vowel sign in a consonant-vowel (CV) combination or a consonant conjunct.

Heatmap showing character frequencies of Telugu vowels in their base as well as CV forms. Colors range from purple (high frequency) to yellow (low frequency), highlighting the relative usage of each vowel sound.

As in many other languages, the short vowel forms అ (a), ఇ (i), and ఎ (e) are generally more frequent than their long counterparts. However, Telugu exhibits a unique pattern with the long vowels ఏ (ee) and ఓ (oo), which occur significantly more often than their corresponding short forms. Notably, ఓ (oo) is used nearly five times more frequently than ఒ (o), highlighting a distinct phonological preference in Telugu. Vowels like ఊ (U) and ఋ (vR) appear far less frequently, indicating limited usage in modern Telugu text. These insights are crucial for prioritizing vowel placement in keyboard layouts.


The final cell in the heatmap represents the frequency of the Telugu pollu sign, which denotes a pure consonant (i.e., without an inherent vowel). Interestingly, its distribution across consonants is comparable to that of the CV forms with the vowel అ (a), found in the first column, indicating a similar usage pattern.
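The column-wise sum used in this section amounts to collapsing the consonant axis of the matrix. Below is a small Python sketch of that step. It assumes the matrix is stored as a mapping from (consonant, vowel) pairs to counts, with an empty string as the consonant for standalone vowels. The helper names are hypothetical, and the numbers in `toy` are placeholders for illustration, not figures from our dataset.

```python
from collections import Counter


def vowel_totals(cell_counts):
    """Sum each column: total occurrences per vowel across all rows."""
    totals = Counter()
    for (_consonant, vowel), n in cell_counts.items():
        totals[vowel] += n
    return totals


def long_to_short_ratio(totals, long_vowel, short_vowel):
    """How many times more frequent the long vowel is than the short one."""
    return totals[long_vowel] / totals[short_vowel]


# Placeholder counts for illustration only.
O_SHORT, O_LONG = "\u0C12", "\u0C13"  # o, oo
KA, NA = "\u0C15", "\u0C28"           # ka, na
toy = {
    ("", O_SHORT): 2, (KA, O_SHORT): 1, (NA, O_SHORT): 1,
    ("", O_LONG): 1, (KA, O_LONG): 4, (NA, O_LONG): 3,
}

totals = vowel_totals(toy)
print(totals[O_SHORT], totals[O_LONG])               # 4 8
print(long_to_short_ratio(totals, O_LONG, O_SHORT))  # 2.0
```

The pollu column can be handled the same way by using the pollu character as the vowel key.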


Consonant Frequency Analysis

We then analyzed the Telugu consonants. By summing the rows of the character frequency chart, we obtained the frequency of each consonant across its different vowel combinations. This row-wise analysis reveals the most commonly used consonants in Telugu.

Heatmap titled 'Telugu Consonant Sounds Usage Frequency Heatmap' showing the frequency of various Telugu consonants in millions. Darker colors indicate higher usage.

The consonant forms of ర్ (r), న్ (na), and ల్ (la) emerge as the most frequently used in Telugu, underscoring their central role in the phonetic structure of the language. Following closely are క్ (ka), త్ (ta), and ప్ (pa), each with approximately 300 million occurrences. In contrast, aspirated and less commonly used consonants such as ఙ్ (nga), ఝ్ (jha), and ఢ్ (ḍha) appear only rarely, reflecting their limited presence in contemporary Telugu usage.
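The row-wise sum and the resulting ranking can be sketched in a few lines of Python. This is an illustration under the same assumption as before (the matrix is a mapping from (consonant, vowel) pairs to counts), with hypothetical helper names and placeholder numbers rather than values from our dataset.

```python
from collections import Counter


def consonant_totals(cell_counts):
    """Sum each row: total occurrences per consonant across all vowels."""
    totals = Counter()
    for (consonant, _vowel), n in cell_counts.items():
        if consonant:  # skip the standalone-vowel row (empty consonant)
            totals[consonant] += n
    return totals


def ranked_with_share(totals):
    """Consonants from most to least frequent, with their share of the total."""
    grand_total = sum(totals.values())
    return [(c, n, n / grand_total) for c, n in totals.most_common()]


# Placeholder counts for illustration only.
RA, NA, JHA = "\u0C30", "\u0C28", "\u0C1D"  # ra, na, jha
A, I = "\u0C05", "\u0C07"                   # a, i
toy = {
    (RA, A): 5, (RA, I): 3,
    (NA, A): 4, (NA, I): 2,
    (JHA, A): 1,
    ("", A): 9,  # standalone vowel, excluded from consonant totals
}

ranking = ranked_with_share(consonant_totals(toy))
for consonant, count, share in ranking:
    print(consonant, count, f"{share:.1%}")
```

A ranking of this kind is what lets frequent consonants be matched to the most reachable key positions and rare ones to secondary layers.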

Special Characters and Ligatures

The Telugu script, like many Indic scripts, includes a variety of ligatures and conjunct characters. To support the most commonly used consonant conjuncts in Telugu, we’ve assigned త్ర్ (tr), క్ష్ (kss), శ్ర్ (shr), and జ్ఞ్ (jny) to the Alt + Shift positions of specific keys on desktop keyboards. On mobile keyboards, where the Alt key is not available, these conjuncts can be accessed by long-pressing the keys Y, U, I, and O, respectively. This design ensures that these frequently used clusters remain easily accessible across both desktop and mobile platforms.
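The mobile long-press assignment described above can be pictured as a simple lookup table. The Python sketch below is only an illustration of that mapping. The names `LONG_PRESS_CONJUNCTS` and `resolve_long_press` are hypothetical, and the desktop Alt + Shift keys are left out because this post does not name them.

```python
# Mobile long-press layer: key -> conjunct, as listed in the text.
# Each conjunct is written as consonant + pollu + consonant + pollu.
LONG_PRESS_CONJUNCTS = {
    "Y": "\u0C24\u0C4D\u0C30\u0C4D",  # tr
    "U": "\u0C15\u0C4D\u0C37\u0C4D",  # kss
    "I": "\u0C36\u0C4D\u0C30\u0C4D",  # shr
    "O": "\u0C1C\u0C4D\u0C1E\u0C4D",  # jny
}


def resolve_long_press(key):
    """Return the conjunct for a long-pressed key, or None if it has none."""
    return LONG_PRESS_CONJUNCTS.get(key.upper())


print(resolve_long_press("y"))  # the tr conjunct
print(resolve_long_press("a"))  # None
```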

Keyboard Design Implications

This character frequency analysis is a useful reference for anyone refining a Telugu keyboard layout. With growing demand for intuitive Telugu phonetic keyboards and mobile-friendly Telugu keyboard solutions, the data helps developers validate their designs and align them with actual language usage patterns.

The insights from this frequency analysis directly inform our design strategy for the Varta Telugu phonetic keyboard. Frequently used characters are placed in more accessible positions (home row or index finger positions), while less common ones are assigned to secondary layers (e.g., Shift or long-press positions). This ensures a balance between comprehensive script coverage and ease of use. To streamline the typing experience, we decided not to include separate vowel signs. Instead, the correct sign is automatically generated from the consonant and vowel input, reducing errors and simplifying the keyboard design.
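One way to generate vowel signs automatically is sketched below in Python. This is not the Varta implementation. It assumes that each consonant key emits a pure consonant (consonant plus pollu) and each vowel key emits an independent vowel. When a vowel follows a pollu, the pollu is replaced by the matching vowel sign, or simply dropped for అ (a), which is inherent in the bare consonant. The function name `compose` is hypothetical.

```python
VIRAMA = "\u0C4D"   # pollu
SHORT_A = "\u0C05"  # a: inherent vowel, has no separate sign

# Independent vowel -> dependent vowel sign.
VOWEL_TO_SIGN = {
    "\u0C06": "\u0C3E",  # aa
    "\u0C07": "\u0C3F",  # i
    "\u0C08": "\u0C40",  # ii
    "\u0C09": "\u0C41",  # u
    "\u0C0A": "\u0C42",  # uu
    "\u0C0B": "\u0C43",  # vocalic r
    "\u0C0E": "\u0C46",  # e
    "\u0C0F": "\u0C47",  # ee
    "\u0C10": "\u0C48",  # ai
    "\u0C12": "\u0C4A",  # o
    "\u0C13": "\u0C4B",  # oo
    "\u0C14": "\u0C4C",  # au
}


def compose(key_outputs):
    """Join key outputs, turning vowel-after-pollu into the right vowel sign."""
    buffer = []
    for out in key_outputs:
        is_vowel = out == SHORT_A or out in VOWEL_TO_SIGN
        if is_vowel and buffer and buffer[-1] == VIRAMA:
            buffer.pop()                      # remove the pollu
            if out != SHORT_A:                # a needs no sign
                buffer.append(VOWEL_TO_SIGN[out])
        else:
            buffer.extend(out)                # keep the key output as typed
    return "".join(buffer)


# t + e, l + u, g + u
keys = ["\u0C24\u0C4D", "\u0C0E", "\u0C32\u0C4D", "\u0C09", "\u0C17\u0C4D", "\u0C09"]
print(compose(keys))  # తెలుగు
```

Under these assumptions a vowel typed at the start of a word, or after another vowel, stays in its independent form, so no separate vowel-sign keys are needed.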

You can explore our optimized keyboard layout through the Varta Keyboard apps, available on Android and iOS, as well as through browser extensions for Chrome, Edge, and Safari.

Explore Frequency Analyses in Other Languages

We’ve performed similar analyses for other Indian languages as well. Explore them below:

  • Character Frequency Analysis for Hindi
  • Character Frequency Analysis for Kannada
  • Character Frequency Analysis for Malayalam
  • Character Frequency Analysis for Tamil (includes design principles)