Content Identification in Hindi and Bangla

Subhabrata Banerjee

Abstract

This poster focuses on content identification in the two most widely used languages of the Indian subcontinent, Hindi and Bangla. The emergence of substantial online content in Indian languages poses a forensic-linguistics challenge: detecting the content of text written in these languages. We developed an online content detection system that identifies content in both Hindi and Bangla. To overcome the scarcity of data in Indian languages, the system was built on gold-standard data. It identifies content using trigram and Named Entity (NE) data with a standard classifier. The two systems, each built from one monolingual corpus, both score above 90.0 on the F1 measure. The work also presents a comparative study of the nature and distribution of trigram and NE data in the two languages.
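The abstract does not specify the feature extraction or the classifier, so the following is only a minimal sketch of how trigram and NE features might feed a standard classifier. It assumes character trigrams (the source says only "trigram"), a hypothetical NE gazetteer lookup in place of a real NE tagger, a multinomial Naive Bayes classifier as the "standard classifier", and that content identification means assigning a text to a content category. The training sentences, category labels, and all names are invented for illustration and are not the gold-standard data used in the work.

```python
import math
from collections import Counter, defaultdict

# Hypothetical NE gazetteer; a real system would use NE-annotated data.
GAZETTEER = {"कोहली", "भारत"}

def char_trigrams(text):
    """Character trigrams over the text, padded so word edges are captured."""
    padded = "  " + text + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def features(text, gazetteer):
    """Trigram features plus one feature per token found in the gazetteer."""
    feats = ["tri=" + t for t in char_trigrams(text)]
    feats += ["ne=" + tok for tok in text.split() if tok in gazetteer]
    return feats

class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (assumed classifier)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, docs, labels):
        for feats, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy Hindi training data, invented for this sketch.
TRAIN = [
    ("भारत ने क्रिकेट मैच जीता", "sports"),
    ("विराट कोहली ने शतक बनाया", "sports"),
    ("संसद में नया विधेयक पारित हुआ", "politics"),
    ("प्रधानमंत्री ने चुनाव रैली की", "politics"),
]

model = NaiveBayes().fit(
    [features(text, GAZETTEER) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)

if __name__ == "__main__":
    query = "कोहली ने मैच में शतक बनाया"
    print(model.predict(features(query, GAZETTEER)))
```

The same pipeline would be trained separately on a Bangla corpus to mirror the two monolingual systems described above; only the training data and gazetteer would change.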

Google Scholar citation report

Citations : 205

International Journal of Advanced Research in Electrical Electronics and Instrumentation Engineering received 205 citations as per Google Scholar report
