Data Preprocessing (part 1)

on Friday 28 June 2013

Overview


Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as customer relationship management and in rule-based applications (like neural networks).

Why do we need Data Preprocessing?
  1. Data in the real world is dirty
  • incomplete : attribute values are missing, attributes that should exist are absent, or only aggregate data is available
  • noisy : contains errors or outliers
  • inconsistent : contains discrepancies in codes and values
  • redundant : contains duplicate data
  2. No quality data, no quality mining results (garbage in, garbage out)
  • quality decisions must be based on quality data
  • a data warehouse needs a consistent combination of data of a certain quality
  3. Data extraction, cleaning, and transformation are an important part of building a data warehouse
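To make the four kinds of dirty data above more concrete, here is a minimal Python sketch that flags each of them in a tiny, made-up set of customer records. The records, the field names, the valid age range (0 to 120), and the allowed country code are all assumptions chosen for illustration; real checks depend on the data at hand.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": None, "country": "USA"},  # incomplete, and inconsistent coding
    {"id": 3, "age": 29, "country": "US"},
    {"id": 4, "age": 310, "country": "US"},    # noisy: implausible age
    {"id": 3, "age": 29, "country": "US"},     # redundant: exact duplicate
]


def find_incomplete(rows):
    """Return ids of rows that have at least one missing (None) value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def find_noisy(rows, field, low, high):
    """Return ids of rows whose field lies outside an assumed valid range."""
    return [
        r["id"]
        for r in rows
        if r[field] is not None and not (low <= r[field] <= high)
    ]


def find_inconsistent(rows, field, allowed):
    """Return ids of rows whose field uses a code outside the allowed set."""
    return [r["id"] for r in rows if r[field] not in allowed]


def find_redundant(rows):
    """Return ids of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for r in rows:
        key = (r["id"], r["age"], r["country"])
        if key in seen:
            duplicates.append(r["id"])
        seen.add(key)
    return duplicates


incomplete = find_incomplete(records)
noisy = find_noisy(records, "age", 0, 120)
inconsistent = find_inconsistent(records, "country", {"US"})
redundant = find_redundant(records)
```

Each function only reports the suspicious rows; deciding whether to fix, fill, or drop them belongs to the cleaning step.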


Data Preprocessing Tasks


Data Cleaning
  • fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data Integration
  • integration of multiple databases or files
Data Transformation
  • normalization and aggregation
Data Reduction
  • obtains a reduced representation of the data that is smaller in volume but produces the same or similar analytical results
Data Discretization
  • part of data reduction but with particular importance, especially for numerical data
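
Three of the tasks above (cleaning, transformation, and discretization) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `ages` list is made up, missing values are filled with the mean (one of several possible strategies), normalization is min-max scaling to the range 0 to 1, and discretization uses equal-width bins. None of these choices is prescribed by the list above.

```python
from statistics import mean

# Hypothetical numeric attribute with one missing value (None).
ages = [34, None, 29, 41, 29, 55]


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal; avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Data discretization: map each value to an equal-width bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
bins = discretize(cleaned, 3)
```

The steps run in order: the missing age is filled first, because both normalization and binning need a complete numeric column to compute the minimum and maximum.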
