
Classification Models for Handling Incomplete Data Using Entropy


Jong Chan Lee
Abstract

Data containing missing values is defined as incomplete data. The problem of processing incomplete data arises frequently in ubiquitous environments that rely on multiple sensors and information sources, so research addressing it is essential for building classification models with high accuracy.

This paper introduces a method for resolving the missing part of incomplete data based on the data extension technique. Data extension consists of two parts: filling in the value of the missing variable and computing a weight that indicates the importance of each event record. Data extension has been proposed as a way to compensate for missing data and has produced good results across a wide range of fields. The conventional method assigns equal probability values to the missing variable, thereby maximizing its entropy; as a consequence, entropy-based algorithms such as C4.5 avoid selecting that variable at the root of the decision tree. However, this method does not use the information inherent in the training data: it ignores both the existing information and the records containing the missing variable, and assigns the same value uniformly according to the cardinality of the missing variable. In contrast, the basic idea of this paper is to extract the information inherent in the training data using entropy, and then to preserve that information by assigning values to the missing variable in the form of probabilities.

The experimental results show how much information can be recovered using the proposed missing-data processing technique. In the experiments, UCI data sets were corrupted at five levels of missingness, and the amount of information recovered from the corrupted data was measured. The results varied slightly with the number of variables in the training data, but were generally good.

Further research is needed on how to use, rather than ignore, the information inherent in training data that contains missing values. As a first step, various methods, such as assigning the average of the information inherent in the missing variable, should be compared experimentally to find the most efficient one. This line of research should be developed to compensate for the data corruption problem in ubiquitous environments.
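To make the data extension idea concrete, the following is a minimal sketch of expanding a record with a missing categorical attribute into several weighted copies, one per observed value, weighted by that value's relative frequency rather than the uniform 1/|values| assignment that maximizes entropy. The attribute names, the toy data, and the fill_missing helper are hypothetical illustrations, not the paper's actual implementation.

```python
from collections import Counter

def fill_missing(records, attr, missing=None):
    """Expand each record whose value for `attr` is missing into one weighted
    copy per observed value, weighted by that value's relative frequency
    (instead of assigning equal probability to every possible value)."""
    observed = [r[attr] for r in records if r[attr] is not missing]
    counts = Counter(observed)
    total = sum(counts.values())
    expanded = []
    for r in records:
        weight = r.get("weight", 1.0)
        if r[attr] is missing:
            for value, count in counts.items():
                copy = dict(r)
                copy[attr] = value
                copy["weight"] = weight * count / total  # frequency-based split
                expanded.append(copy)
        else:
            copy = dict(r)
            copy["weight"] = weight
            expanded.append(copy)
    return expanded

# Example: 'outlook' is missing in one record; it becomes two weighted copies
# ('sunny' with weight 2/3, 'rain' with weight 1/3) reflecting the observed data.
data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": None,    "play": "yes"},
]
for row in fill_missing(data, "outlook"):
    print(row)
```

A decision-tree learner such as C4.5 can then consume the expanded, weighted records, so the information carried by the incomplete record is preserved instead of being diluted by a uniform assignment.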

Volume 11 | 06-Special Issue

Pages: 1981-1985