Tuesday, 9 February 2016

Data Pre-Processing for Visual Data Mining

Most Visual Data Mining tools use interactive 2D visualization methods to deliver meaningful information from data, but when the number of features, attributes, and observations in a data set is large, these methods do not scale well and the data set becomes hard to interpret visually. This has led to cooperative approaches that combine a consensus-theory-based feature selection algorithm, clustering for sampling, and visualization for weight assignment in order to aggregate multivariate, multidimensional data sets. Such aggregation makes visualization, and therefore visual data mining, possible without loss of classifier accuracy, and has produced successful experiments on several high-dimensional data sets.

  • Feature Selection aims at choosing a subset of attributes that is sufficient to describe the data set. The process identifies and removes irrelevant and redundant information. It is very helpful for Visual Data Mining because it reduces the load of handling large amounts of data.
  • Numerosity Reduction reduces the number of objects by exploiting redundancy among them, replacing the data with a smaller representation while keeping the level of description the experiment requires.
  • Dimensionality Reduction reduces the number of attributes. Because many analysis algorithms scale worse than linearly (often exponentially) with the number of dimensions, removing even a few attributes can bring a large speed-up. As the name suggests, it removes the dimensions that are irrelevant to the particular visualization or analysis being performed.
  • Discretization and Quantization reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace the actual data values. Binning is the most commonly used form of quantization.
  • Generalization is the process of abstracting a large set of task-relevant data in a database from low conceptual levels to higher ones. It has some limitations: it handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values, it lacks intelligent analysis, and it cannot tell which dimensions should be generalized or what level the generalization should reach. It is nevertheless one of the simplest forms of reduction: generalization and specialization can be performed on a data cube by roll-up and drill-down.
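As a sketch of the feature-selection idea above (not the consensus-based algorithm the post refers to), a simple variance filter drops near-constant attributes that carry little information for a visualization. The `select_features` helper and its threshold are illustrative assumptions:

```python
import numpy as np

def select_features(X, threshold=0.1):
    """Keep only the columns whose variance exceeds a threshold.

    A near-constant attribute contributes almost nothing to a 2D
    visualization, so it can be removed before plotting.
    """
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], keep

# Toy data: the second column is constant, so it is dropped.
X = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 8.0],
              [3.0, 5.0, 4.0]])
reduced, mask = select_features(X)
```

Real feature-selection algorithms also look at redundancy between attributes (e.g. correlation), not just per-attribute variance; this sketch only shows the relevance side.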
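For dimensionality reduction, one common concrete technique (an assumption here, since the post names none) is principal component analysis: projecting the data onto the few directions of greatest variance so that a high-dimensional data set can be drawn in 2D. A minimal sketch via SVD:

```python
import numpy as np

def pca_project(X, k=2):
    """Project data onto its top-k principal components.

    Centering plus SVD gives the directions of greatest variance;
    keeping k=2 components yields coordinates suitable for a
    2D scatter plot.
    """
    Xc = X - X.mean(axis=0)               # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                  # coordinates in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))              # 10 observations, 5 attributes
Y = pca_project(X)                        # 10 observations, 2 attributes
```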
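The binning form of discretization mentioned above can be sketched as equal-width binning: split the attribute's range into intervals of equal size and replace each value with its interval label. The `equal_width_bins` helper is an illustrative assumption:

```python
import numpy as np

def equal_width_bins(values, n_bins=3):
    """Replace continuous values with equal-width interval labels.

    The range [min, max] is split into n_bins intervals; each value
    is replaced by the index of the interval it falls into.
    """
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Interior edges only; the maximum value lands in the last bin.
    labels = np.digitize(values, edges[1:-1])
    return labels, edges

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
labels, edges = equal_width_bins(values)
```

Equal-frequency binning (equal counts per interval) is the other common variant and is less sensitive to outliers than equal-width binning.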
Besides aggregation and filtering, which are applied over large data sets by database technologies, reduction can also be carried out using Data Mining techniques. These may involve database operations such as aggregation and sampling, or other methods whose output is a classification of the data or a set of clusters.
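As one concrete sampling technique (an assumption; the post does not name one), reservoir sampling draws a uniform random sample of fixed size from a data set in a single pass, which makes it practical for reducing data sets too large to hold in memory before visualization:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw a uniform sample of k items from a stream in one pass.

    The first k items fill the reservoir; each later item i replaces
    a random reservoir slot with probability k / (i + 1), which keeps
    every item equally likely to end up in the sample.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

picked = reservoir_sample(range(1000), 10)  # 10 of 1000 observations
```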
