Data Cleaning
"Data cleaning is one of the three biggest problems in data warehousing" - Ralph Kimball
"Data cleaning is the number one problem in data warehousing" - DCI survey
Data cleansing, also called data cleaning or data scrubbing, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may originally have been caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleansing differs from data validation: validation almost invariably rejects bad data at entry time, whereas cleansing is performed on batches of data already in the system.
Tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Missing Values
- Data is not always available. For example, many tuples have no recorded value for several attributes, such as the customer account in bank data.
- Missing data may be caused by:
  - equipment malfunction
  - values that were inconsistent with other recorded data and were therefore deleted
  - data not entered because of a misunderstanding
  - data that was not considered important at the time of entry
- How to handle missing data:
  - ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
  - fill in the missing value manually: tedious and often infeasible
  - fill in the missing value automatically with the attribute mean, with the attribute mean of all samples belonging to the same class (smarter), or with the most probable value (for example, as predicted by a classifier)
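
The first and third strategies above can be sketched in a few lines of Python. This is only an illustration using the standard library: the records, the attribute (an income figure), and the function names are all made up for the example, and None is assumed to mark a missing value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, income). None marks a missing value.
records = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 90.0),
    ("high", None),
    ("high", 110.0),
    (None, 50.0),  # class label is missing
]

def drop_unlabelled(rows):
    """Ignore tuples whose class label is missing."""
    return [row for row in rows if row[0] is not None]

def fill_with_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class.

    Falls back to the overall mean when a class has no observed values.
    """
    overall = mean(v for _, v in rows if v is not None)
    by_class = defaultdict(list)
    for label, v in rows:
        if v is not None:
            by_class[label].append(v)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [
        (label, class_means.get(label, overall) if v is None else v)
        for label, v in rows
    ]

labelled = drop_unlabelled(records)
global_filled = fill_with_mean(labelled)
class_filled = fill_with_class_mean(labelled)

print(global_filled[1])  # ('low', 62.5)
print(class_filled[1])   # ('low', 25.0)
print(class_filled[4])   # ('high', 100.0)
```

In this toy data the overall mean (62.5) is a poor guess for either class, while the class mean stays close to the neighbouring values, which is why the class-based fill is described as the smarter option.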
Noisy Data
- Incorrect attribute values may be caused by:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data
- How to handle noisy data:
  - binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, by bin medians, or by bin boundaries
  - regression: smooth by fitting the data to a regression function
  - clustering: detect and remove outliers
  - combined computer and human inspection: detect suspicious values automatically and have a human check them (for example, to deal with possible outliers)
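
As an illustration of the binning approach, here is a minimal Python sketch using only the standard library. The price values, the bin size of 3, and the function names are assumptions chosen for the example; the last bin would simply be shorter if the number of values were not a multiple of the bin size.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means or medians flattens each bin to a single value, while smoothing by boundaries keeps the bin's extremes and only moves the values in between.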