Data Preprocessing (part 3)

on Wednesday 3 July 2013

Data Integration


Data integration is one of the steps of data preprocessing. It involves combining data residing in different sources and providing users with a unified view of these data, i.e., merging data from multiple data stores (data sources).

How does it work?
Fundamentally, it follows the concatenation operation from mathematics and the theory of computation. The concatenation operation on strings is generalized to an operation on sets of strings as follows:

For two sets of strings S1 and S2, the concatenation S1S2 consists of all strings of the form vw, where v is a string from S1 and w is a string from S2; that is, S1S2 = { vw | v ∈ S1, w ∈ S2 }.
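
To make the definition concrete, here is a minimal Python sketch of set concatenation (the sets s1 and s2 are made-up examples, not from the definition above):

    # A minimal sketch of set concatenation: S1S2 = { vw | v in S1, w in S2 }.
    def concatenate(s1, s2):
        """Return the concatenation of two sets of strings."""
        return {v + w for v in s1 for w in s2}

    s1 = {"ab", "c"}   # made-up example sets
    s2 = {"d", "ef"}
    print(concatenate(s1, s2))  # {'abd', 'abef', 'cd', 'cef'}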

Entity identification problem
How can equivalent real-world entities from multiple data sources be matched up?
  • Identify real-world entities across multiple data sources, e.g., deciding whether the records "Frans Clark" and "David Clark" refer to the same person (see the sketch below).
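As an illustration, here is a heuristic name-matching sketch using the standard-library difflib; the 0.8 threshold is an assumption chosen for demonstration, not a value from the post:

    from difflib import SequenceMatcher

    def same_entity(name_a, name_b, threshold=0.8):
        """Heuristically decide whether two names refer to the same entity."""
        ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        return ratio >= threshold

    print(same_entity("Frans Clark", "Frans Clarke"))  # True: near-identical spellings
    print(same_entity("Frans Clark", "David Clark"))   # False: too dissimilar
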
Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may differ.
  • Possible reasons: different representations or different scales, e.g., metric vs. British (imperial) units (see the conversion sketch below).
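
A scale conflict of that kind can be resolved by converting every value to one canonical unit before merging. A minimal sketch, where the record layout is a made-up example:

    LB_PER_KG = 2.20462  # standard conversion factor

    def to_kilograms(value, unit):
        """Normalize a weight to kilograms before merging sources."""
        if unit == "kg":
            return value
        if unit == "lb":
            return value / LB_PER_KG
        raise ValueError("unknown unit: " + unit)

    # Two sources report the same person's weight on different scales:
    records = [(70.0, "kg"), (154.3, "lb")]
    print([round(to_kilograms(v, u), 1) for v, u in records])  # [70.0, 70.0]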
How to handle redundancy during data integration?
  1. Redundant data often occur when multiple databases are integrated:
    • Object identification: the same attribute or object may have different names in different databases.
    • Derivable data: one attribute may be "derived" from attributes in another table, e.g., annual revenue.
  2. Redundant attributes can often be detected by correlation analysis (see the sketch after this list).
  3. Careful integration of data from multiple sources can help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality.
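
Here is a minimal sketch of point 2, computing the Pearson correlation coefficient between two numeric attributes; a coefficient near +1 or -1 suggests one attribute is redundant. The sample values are made up:

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length attributes."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (std_x * std_y)

    monthly_revenue = [10.0, 12.0, 9.0, 15.0]
    annual_revenue = [120.0, 144.0, 108.0, 180.0]  # derivable: 12 x monthly
    print(pearson(monthly_revenue, annual_revenue))  # ~1.0 -> redundant attribute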

Data Transformation


In data mining preprocessing, and especially in metadata management and data warehousing, data transformation is used to convert data from the source data format into the destination data format.

We can divide data transformation into two steps:
  • Data mapping, which maps data elements from the source to the destination and captures any transformation that must occur
  • Code generation, which creates the actual transformation program
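
A minimal sketch of both steps, with the "generated" transformation stood in for by a plain Python function; the field names and transforms are hypothetical:

    # Mapping step: declare how each source field lands in the destination.
    MAPPING = {
        "cust_name": ("customer_name", str.strip),
        "rev_monthly": ("annual_revenue", lambda v: v * 12),  # derive annual figure
    }

    # Code-generation step, represented here by a function that applies
    # the declared mapping to one source record.
    def transform(source_record):
        return {dest: fn(source_record[src]) for src, (dest, fn) in MAPPING.items()}

    print(transform({"cust_name": "  Acme Corp ", "rev_monthly": 10.0}))
    # {'customer_name': 'Acme Corp', 'annual_revenue': 120.0}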

Data transformation can involve:
  1. Smoothing, which removes noise from the data. Techniques include binning, regression, and clustering. (This is a form of data cleaning.)
  2. Aggregation, where summary or aggregation operations are applied to the data, typically to construct a data cube for analysis at multiple levels of granularity. (This refers to data reduction.)
  3. Generalization of the data, where low-level data are replaced with higher-level concepts. It can be used for both categorical and numerical attributes. (This refers to data reduction.)
  4. Normalization, where the attribute data are scaled so as to fall within a small specified range, which is helpful for classification (a Python sketch of all three methods follows after this list):
    • Min-max normalization to a new range [new_min_A, new_max_A]:
      v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
    • Z-score normalization:
      v' = (v - mean_A) / stddev_A
    • Normalization by decimal scaling:
      v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

  5. Attribute construction, where new attributes are constructed from the given set of attributes and added to help the mining process.
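
Below is a minimal Python sketch of the three normalization methods from point 4, using a made-up list of attribute values:

    import math

    data = [200.0, 300.0, 400.0, 600.0, 1000.0]  # made-up attribute values

    # 1. Min-max normalization to [new_min_A, new_max_A] = [0.0, 1.0].
    min_a, max_a = min(data), max(data)
    minmax = [(v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0 for v in data]

    # 2. Z-score normalization (population mean and standard deviation).
    mean_a = sum(data) / len(data)
    std_a = math.sqrt(sum((v - mean_a) ** 2 for v in data) / len(data))
    zscore = [(v - mean_a) / std_a for v in data]

    # 3. Decimal scaling: divide by 10^j, the smallest power of ten
    #    that brings every absolute value below 1.
    j = math.floor(math.log10(max(abs(v) for v in data))) + 1
    decimal = [v / 10 ** j for v in data]

    print(minmax)   # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(zscore)   # roughly [-1.06, -0.71, -0.35, 0.35, 1.77]
    print(decimal)  # [0.02, 0.03, 0.04, 0.06, 0.1]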
