Data Preprocessing (part 3)

on Wednesday 3 July 2013

Data Integration


Data integration is one of the steps of data preprocessing. It involves combining data residing in different sources and providing users with a unified view of these data, i.e., merging data from multiple data stores (data sources).

How does it work?
Fundamentally, it follows the concatenation operation from mathematics and the theory of computation. The concatenation operation on strings is generalized to an operation on sets of strings as follows:

For two sets of strings S1 and S2, the concatenation S1S2 consists of all strings of the form vw, where v is a string from S1 and w is a string from S2; that is, S1S2 = { vw | v ∈ S1, w ∈ S2 }.
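
To make the definition concrete, here is a minimal Python sketch of set concatenation (the sets s1 and s2 are made-up examples, not from the definition above):

    # A minimal sketch of set concatenation: S1S2 = { vw | v in S1, w in S2 }.
    def concatenate(s1, s2):
        """Return the concatenation of two sets of strings."""
        return {v + w for v in s1 for w in s2}

    s1 = {"ab", "c"}   # made-up example sets
    s2 = {"d", "ef"}
    print(concatenate(s1, s2))  # {'abd', 'abef', 'cd', 'cef'}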

Entity identification problem
How can equivalent real-world entities from multiple data sources be matched up?
  • Identify real-world entities across multiple data sources, e.g., deciding whether the records "Frans Clark" and "David Clark" refer to the same person (see the sketch below).
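As an illustration, here is a heuristic name-matching sketch using the standard-library difflib; the 0.8 threshold is an assumption chosen for demonstration, not a value from the post:

    from difflib import SequenceMatcher

    def same_entity(name_a, name_b, threshold=0.8):
        """Heuristically decide whether two names refer to the same entity."""
        ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        return ratio >= threshold

    print(same_entity("Frans Clark", "Frans Clarke"))  # True: near-identical spellings
    print(same_entity("Frans Clark", "David Clark"))   # False: too dissimilar
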
Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may differ.
  • Possible reasons: different representations or different scales, e.g., metric vs. British (imperial) units (see the conversion sketch below).
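
A scale conflict of that kind can be resolved by converting every value to one canonical unit before merging. A minimal sketch, where the record layout is a made-up example:

    LB_PER_KG = 2.20462  # standard conversion factor

    def to_kilograms(value, unit):
        """Normalize a weight to kilograms before merging sources."""
        if unit == "kg":
            return value
        if unit == "lb":
            return value / LB_PER_KG
        raise ValueError("unknown unit: " + unit)

    # Two sources report the same person's weight on different scales:
    records = [(70.0, "kg"), (154.3, "lb")]
    print([round(to_kilograms(v, u), 1) for v, u in records])  # [70.0, 70.0]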
How to handle redundancy during data integration?
  1. Redundant data often occur when multiple databases are integrated:
    • Object identification: the same attribute or object may have different names in different databases.
    • Derivable data: one attribute may be "derived" from attributes in another table, e.g., annual revenue.
  2. Redundant attributes can often be detected by correlation analysis (see the sketch after this list).
  3. Careful integration of data from multiple sources can help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality.
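
Here is a minimal sketch of point 2, computing the Pearson correlation coefficient between two numeric attributes; a coefficient near +1 or -1 suggests one attribute is redundant. The sample values are made up:

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length attributes."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (std_x * std_y)

    monthly_revenue = [10.0, 12.0, 9.0, 15.0]
    annual_revenue = [120.0, 144.0, 108.0, 180.0]  # derivable: 12 x monthly
    print(pearson(monthly_revenue, annual_revenue))  # ~1.0 -> redundant attribute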

Data Transformation


In data mining preprocessing, and especially in metadata management and data warehousing, data transformation is used to convert data from the source data format into the destination data format.

We can divide data transformation into two steps:
  • Data mapping, which maps data elements from the source to the destination and captures any transformation that must occur
  • Code generation, which creates the actual transformation program
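
A minimal sketch of both steps, with the "generated" transformation stood in for by a plain Python function; the field names and transforms are hypothetical:

    # Mapping step: declare how each source field lands in the destination.
    MAPPING = {
        "cust_name": ("customer_name", str.strip),
        "rev_monthly": ("annual_revenue", lambda v: v * 12),  # derive annual figure
    }

    # Code-generation step, represented here by a function that applies
    # the declared mapping to one source record.
    def transform(source_record):
        return {dest: fn(source_record[src]) for src, (dest, fn) in MAPPING.items()}

    print(transform({"cust_name": "  Acme Corp ", "rev_monthly": 10.0}))
    # {'customer_name': 'Acme Corp', 'annual_revenue': 120.0}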

Data transformation can involve:
  1. Smoothing, which removes noise from the data. Techniques include binning, regression, and clustering. (This is a form of data cleaning.)
  2. Aggregation, where summary or aggregation operations are applied to the data, typically to construct a data cube for analysis at multiple levels of granularity. (This refers to data reduction.)
  3. Generalization of the data, where low-level data are replaced with higher-level concepts. It can be used for both categorical and numerical attributes. (This refers to data reduction.)
  4. Normalization, where the attribute data are scaled so as to fall within a small specified range, which is helpful for classification (a Python sketch of all three methods follows after this list):
    • Min-max normalization to a new range [new_min_A, new_max_A]:
      v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
    • Z-score normalization:
      v' = (v - mean_A) / stddev_A
    • Normalization by decimal scaling:
      v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

  5. Attribute construction, where new attributes are constructed from the given set of attributes and added to help the mining process.
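
Below is a minimal Python sketch of the three normalization methods from point 4, using a made-up list of attribute values:

    import math

    data = [200.0, 300.0, 400.0, 600.0, 1000.0]  # made-up attribute values

    # 1. Min-max normalization to [new_min_A, new_max_A] = [0.0, 1.0].
    min_a, max_a = min(data), max(data)
    minmax = [(v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0 for v in data]

    # 2. Z-score normalization (population mean and standard deviation).
    mean_a = sum(data) / len(data)
    std_a = math.sqrt(sum((v - mean_a) ** 2 for v in data) / len(data))
    zscore = [(v - mean_a) / std_a for v in data]

    # 3. Decimal scaling: divide by 10^j, the smallest power of ten
    #    that brings every absolute value below 1.
    j = math.floor(math.log10(max(abs(v) for v in data))) + 1
    decimal = [v / 10 ** j for v in data]

    print(minmax)   # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(zscore)   # roughly [-1.06, -0.71, -0.35, 0.35, 1.77]
    print(decimal)  # [0.02, 0.03, 0.04, 0.06, 0.1]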
