Content Identification in Hindi and Bangla

This poster focuses on content identification in Hindi and Bangla, the two most popular languages of the Indian subcontinent. The emergence of substantial online content in Indian languages poses a forensic linguistics challenge: detecting the content of text in these languages. We developed an online content detection system that identifies content in both Hindi and Bangla. To overcome the scarcity of data in Indian languages, the system was developed using gold-standard data. It uses trigram and Named Entity (NE) data with a standard classifier to identify content. Both systems, each built from a monolingual corpus, score above 90.0 on the F1 measure. The work also gives a comparative study of the nature and distribution of trigram and NE data in the two languages.
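As a rough illustration of how trigram features can feed a classifier, the sketch below trains a small multinomial Naive Bayes model on character trigrams. It is not the system described above: the abstract does not say whether the trigrams are character-level or word-level, which classifier was used, or how the NE data enters the model, so character trigrams and Naive Bayes are assumptions, and NE features are left out. The class name `TrigramNaiveBayes`, the category labels, and the four Hindi training sentences are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def char_trigrams(text):
    """Return the overlapping character trigrams of a string.

    The text is padded with one space on each side so that word
    boundaries at the start and end also produce trigrams.
    """
    padded = f" {text.strip()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character trigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                   # additive smoothing constant
        self.counts = defaultdict(Counter)   # label -> trigram counts
        self.docs = Counter()                # label -> number of documents
        self.vocab = set()                   # all trigrams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_trigrams(text)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_trigrams(text)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.docs:
            total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of each trigram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (self.counts[label][gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example (not from the gold-standard corpus).
train_texts = [
    "भारत ने क्रिकेट मैच जीता",
    "टीम ने फुटबॉल मैच खेला",
    "सरकार ने नया कानून पारित किया",
    "संसद में चुनाव पर बहस हुई",
]
train_labels = ["sports", "sports", "politics", "politics"]

model = TrigramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("क्रिकेट टीम ने मैच जीता"))
```

Because the features are raw character trigrams, the same code runs unchanged on Bangla text; a separate model would be trained per language, in line with the two monolingual corpora mentioned above.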

Author(s): Subhabrata Banerjee
