Telugu Text classification — Part 1

Challenges

Pradeep Miriyala
One epoch at a time


In this series of articles, I discuss my experiments on Telugu text classification using the Annamayya Lyrical dataset (https://dx.doi.org/10.21227/qnxd-yv43).

Dataset Overview

The dataset consists of lyrics for 14283 songs written by Annamayya during the 15th century. Each song has two labels, Raga and Genre. Genre is the label of interest. As there are only two genres, the task is binary classification of a lyric into one of the two genres.

The dataset has approximately 80% Romantic and 20% Devotional lyrics, written by three different authors (Annamayya, Peda Tirumalacharya, China Tirumalacharya).
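The 80/20 split matters for how results are read. The minimal sketch below uses a synthetic label list that only mirrors the approximate proportions above (it is not the real dataset) and shows two things: the accuracy of always predicting the majority genre, and inverse-frequency class weights. The function names are my own, and the weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), not something taken from the experiments in this series.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Synthetic labels that only mimic the approximate 80/20 split (assumption).
labels = ["Romantic"] * 80 + ["Devotional"] * 20

baseline = majority_baseline_accuracy(labels)
weights = balanced_class_weights(labels)
print(f"majority baseline accuracy: {baseline:.2f}")
print(f"class weights: {weights}")
```

Any classifier trained on this data should therefore be judged against roughly 80% accuracy rather than 50%, and per-class metrics such as F1 are likely to be more informative than accuracy alone.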

Source code for creating the dataset can be downloaded from https://github.com/pradeep-miriyala/ttd-selenium-crawler

A few classical challenges

The most fundamental challenge is that Telugu is a morphologically rich language with few ML resources.

Agglutination (సంధి)

Several Indic languages, including Telugu, support agglutination, which allows new words to be created by joining existing words according to a predefined set of Sandhis. Telugu has around 28 Sandhis, each with several rules. When combining words, a Sandhi can

  • Convert a vowel to another vowel
  • Convert a consonant to another consonant
  • Convert a vowel to consonant
  • Remove a letter
  • Introduce a letter
  • Introduce a word

Currently, there are not enough resources to address the agglutination challenge for Telugu. Stemming is a common technique for attacking this problem, but it is not very effective on Sandhi words, because a Sandhi can change or drop letters at the point where two words join rather than simply append a suffix.
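To make the problem concrete, here is a minimal sketch of a single, heavily simplified Sandhi rule (a final ు followed by a word-initial vowel, as in the textbook example రాముడు + అతడు = రాముడతడు). The function names, the vowel table, and the toy suffix stripper are all my own illustrations; real Sandhi rules have many more conditions and exceptions, and this is not a general Sandhi joiner.

```python
# Independent vowel -> dependent vowel sign. The vowel "a" maps to an empty
# string because a bare consonant already carries the inherent "a" sound.
VOWEL_TO_SIGN = {
    "\u0c05": "",        # అ
    "\u0c06": "\u0c3e",  # ఆ
    "\u0c07": "\u0c3f",  # ఇ
    "\u0c08": "\u0c40",  # ఈ
    "\u0c09": "\u0c41",  # ఉ
    "\u0c0a": "\u0c42",  # ఊ
    "\u0c0e": "\u0c46",  # ఎ
    "\u0c0f": "\u0c47",  # ఏ
    "\u0c10": "\u0c48",  # ఐ
    "\u0c12": "\u0c4a",  # ఒ
    "\u0c13": "\u0c4b",  # ఓ
    "\u0c14": "\u0c4c",  # ఔ
}
U_SIGN = "\u0c41"  # dependent vowel sign "u"


def join_u_sandhi(first, second):
    """Simplified rule: drop a final 'u' sign when the next word starts with a vowel."""
    if first.endswith(U_SIGN) and second[:1] in VOWEL_TO_SIGN:
        return first[:-1] + VOWEL_TO_SIGN[second[0]] + second[1:]
    return first + " " + second  # rule does not apply: keep two tokens


def strip_suffix(word, suffix="\u0c21\u0c41"):
    """Hypothetical one-rule stemmer that strips a trailing 'డు'."""
    if word.endswith(suffix) and len(word) > len(suffix):
        return word[: -len(suffix)]
    return word


first, second = "రాముడు", "అతడు"
vocabulary = {first, second}

joined = join_u_sandhi(first, second)
stemmed = strip_suffix(joined)

print(joined, "in vocabulary:", joined in vocabulary)
print(stemmed, "in vocabulary:", stemmed in vocabulary)
```

The joined form is a single token that appears in neither word list, and stripping a suffix from it does not recover either source word. A whitespace tokenizer plus a stemmer therefore treats it as an unrelated, rare word.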

Dialects, Prakruti and Vikruti

Language usage differs by region and location, and several dialects have emerged over time. Dialects derive new forms for the same words. Refer to https://archive.org/embed/mandalikapadakos021234mbp for a dictionary of dialect words.

Prakruti words are words derived from Sanskrit, while Vikruti words are altered forms of Prakruti words. A Prakruti word and its Vikruti form have the same meaning. However, there are no digital resources that can identify which word is Prakruti and which is Vikruti.

E.g.,
అగ్ని and అగ్గి are a Prakruti and Vikruti pair, both meaning “Fire”.
ఉడికించు means “to provoke” in one dialect, while its dictionary meaning is “to boil”.
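In the absence of a digital resource, one possible workaround is a small hand-curated lookup table that maps each Vikruti form to its Prakruti form before feature extraction, so that both count as the same token. The sketch below is only an illustration: the table contains just the single pair from the example above, the names are hypothetical, and such a table would have to be built manually.

```python
# Hand-curated Vikruti -> Prakruti table (hypothetical; one known pair only).
VIKRUTI_TO_PRAKRUTI = {
    "అగ్గి": "అగ్ని",  # both mean "fire"
}


def normalize_tokens(tokens, table=VIKRUTI_TO_PRAKRUTI):
    """Replace known Vikruti forms with their Prakruti form; keep other tokens."""
    return [table.get(token, token) for token in tokens]


tokens = ["అగ్గి", "అగ్ని", "నీరు"]
normalized = normalize_tokens(tokens)
print(normalized)
```

A plain lookup like this cannot handle dialect-specific meanings such as ఉడికించు, because the surface form is identical and only the sense differs.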

Language Era

The Telugu language has kept evolving over the centuries. This evolution brought in new words, dropped some words, made some letters obsolete, and absorbed words from foreign languages such as English.

The dataset I have selected has Telugu lyrics from the 15th century, which means it contains letters and words that are no longer in common usage. So, can state-of-the-art models such as MuRIL, XLM-R, and IndicBERT produce decent results on this dataset? We will discuss that in upcoming articles.
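One quick way to get a feel for how much older orthography a text contains is to count letters that are rare in present-day writing. The sketch below is a rough illustration under my own assumptions: the set of "archaic" characters (arasunna, bandi ra, and the letters tsa and dza) is a guess at a reasonable starting list rather than an authoritative inventory, and the sample string is made up for the example, not taken from the dataset.

```python
from collections import Counter

# Assumed set of characters that are uncommon in modern Telugu writing.
ARCHAIC_CHARS = {
    "\u0c01": "arasunna (candrabindu sign)",
    "\u0c31": "bandi ra (letter RRA)",
    "\u0c58": "letter TSA",
    "\u0c59": "letter DZA",
}


def archaic_char_counts(text):
    """Count occurrences of each assumed-archaic character in the text."""
    return Counter(ch for ch in text if ch in ARCHAIC_CHARS)


def archaic_ratio(text):
    """Share of Telugu-block characters that belong to the assumed-archaic set."""
    telugu_chars = [ch for ch in text if "\u0c00" <= ch <= "\u0c7f"]
    if not telugu_chars:
        return 0.0
    archaic = sum(1 for ch in telugu_chars if ch in ARCHAIC_CHARS)
    return archaic / len(telugu_chars)


sample = "వాఁడు గుఱ్ఱము"  # made-up sample with an arasunna and two bandi ra
counts = archaic_char_counts(sample)
ratio = archaic_ratio(sample)

for ch, n in counts.items():
    print(f"U+{ord(ch):04X} {ARCHAIC_CHARS[ch]}: {n}")
print(f"archaic ratio: {ratio:.2f}")
```

A high ratio on the lyrics compared with modern text would be one hint that pretrained tokenizers may fragment these words heavily, though it says nothing about archaic vocabulary written with ordinary letters.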

POS Tagging

The POS tagger authored by Sivareddy is by far the most popular Telugu POS tagger, but it was developed in Python 2. To use it in experiments that run on Python 3, the tagger has been ported to Python 3 and is hosted at https://github.com/pradeep-miriyala/Telugu-POS-Python3.

Typos

The raw content copied from the TTD website to create this dataset has several typos. Detecting these typos manually is not feasible at this scale. At the same time, automatic correction tools are not reliable either, mainly because the words in the dataset come from a different era than the text most spell checkers are built on.
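A middle path between manual checking and automatic correction is to only flag candidates for review. The sketch below is one such heuristic, not the approach used in this series: it treats a rare token as a possible typo when it is very similar to a frequent token, using `difflib.get_close_matches` from the Python standard library. The thresholds, the function name, and the toy token list are assumptions for illustration.

```python
import difflib
from collections import Counter


def typo_candidates(tokens, min_freq=3, cutoff=0.8):
    """Map each rare token to its most similar frequent token, if one is close enough."""
    counts = Counter(tokens)
    frequent = [word for word, c in counts.items() if c >= min_freq]
    suggestions = {}
    for word, c in counts.items():
        if c >= min_freq:
            continue  # frequent tokens are assumed to be spelled correctly
        match = difflib.get_close_matches(word, frequent, n=1, cutoff=cutoff)
        if match:
            suggestions[word] = match[0]
    return suggestions


# Toy corpus: one frequent word, one near-duplicate spelling, two unrelated words.
tokens = ["గోవింద"] * 5 + ["గోవింధ"] + ["హరి"] * 4 + ["కృష్ణ"]
suggestions = typo_candidates(tokens)

for rare, likely in suggestions.items():
    print(f"{rare} -> possibly {likely}")
```

Because many rare tokens in 15th-century lyrics are legitimate words or Sandhi forms, the output should be treated as a review list rather than applied as automatic corrections.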

In the next part, we will discuss preprocessing and the development of a simple model for the binary classification task.

Pradeep Miriyala
One epoch at a time

Engineer by profession, Programmer by interest, Poet/Writer by hobby. Poetry blog: http://pradeepblog.miriyala.in