Similarity Analysis of Telugu Language Documents Using Different Similarity Measures

Jalaja Kumari Bygani, Dr. Yalla Venkateswarlu and Dr.K. Vramana

Computing similarity among documents plays an important role in data mining applications such as classification, clustering, and information retrieval. Large document collections often need to be partitioned, and such a partition can be done effectively by measuring similarities among the documents. Document similarity measurement has attracted a good number of researchers, but regional languages like Telugu have received little consideration. Hence, there is scope for research on similarity measurement among documents in regional languages like Telugu. In this paper, we take Telugu language documents from seven well-known newspaper websites, related to three news topics, and measure their similarity in three phases: first without any preprocessing, then after removal of stopwords, and finally after removal of stopwords and application of synset comparison. In all these phases we apply standard benchmark similarity measures. Finally, we express the percentage of similarity among all the document pairs related to the considered news topics.
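The three phases described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measures used, so cosine and Jaccard similarity are assumed here as typical examples. The function names, the whitespace tokenizer, and the representation of synsets as a word-to-representative dictionary are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def preprocess(text, stopwords=frozenset(), synsets=None):
    """Whitespace-tokenize, drop stopwords, and map each remaining word
    to its synset representative (assumed: a plain dict lookup)."""
    synsets = synsets or {}
    tokens = [t for t in text.split() if t not in stopwords]
    return [synsets.get(t, t) for t in tokens]


def cosine_similarity(a, b):
    """Cosine similarity between two token lists using term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity between the vocabularies of two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def three_phase_similarity(doc1, doc2, stopwords, synsets,
                           measure=cosine_similarity):
    """Return similarity percentages for the three phases:
    raw text, stopwords removed, stopwords removed plus synset mapping."""
    raw = measure(preprocess(doc1), preprocess(doc2))
    no_stop = measure(preprocess(doc1, stopwords),
                      preprocess(doc2, stopwords))
    with_syn = measure(preprocess(doc1, stopwords, synsets),
                       preprocess(doc2, stopwords, synsets))
    return tuple(round(100 * s, 2) for s in (raw, no_stop, with_syn))
```

In practice the stopword list and synset mapping would come from Telugu language resources, and a tokenizer suited to Telugu script would replace the whitespace split. Calling `three_phase_similarity` on each document pair yields one percentage per phase, which is the kind of per-pair comparison the abstract describes.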

Volume 13 | Issue 1

Pages: 17-56

DOI: 10.5373/JARDCS/V13I1/20211003