Question

Provide a hypothetical situation involving “dirty data” and discuss how data pre-processing would address this issue.

Homework Answers

Answer #1

In a data warehouse, dirty data is a database record that contains errors. Dirty data can be caused by a number of factors including duplicate records, incomplete or outdated data, and the improper parsing of record fields from disparate systems.

The following data can be considered as dirty data:

  • Misleading data
  • Duplicate data
  • Incorrect data
  • Inaccurate data
  • Non-integrated data
  • Data that violates business rules
  • Data without standardized formatting
  • Incorrectly punctuated or spelled data

To make the best use of the data you have, it is important to get rid of dirty data. This is the job of data pre-processing. Incorrect data degrades the quality of any analysis, and performing data mining on it may give misleading results.
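As a hypothetical illustration, suppose a customer table was merged from two systems, so the same person appears twice with different capitalization and stray whitespace. The sketch below is only one possible approach: the records, field names, and the choice of the email address as the duplicate key are all assumptions made for the example.

```python
# Hypothetical customer records; every name and field here is invented for illustration.
records = [
    {"name": "Alice Smith ", "email": "ALICE@EXAMPLE.COM", "city": "new york"},
    {"name": "alice smith", "email": "alice@example.com", "city": "New York"},
    {"name": "Bob Jones", "email": "bob@example.com", "city": "Boston"},
]


def standardize(record):
    """Trim extra whitespace and apply one case convention per field."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "city": " ".join(record["city"].split()).title(),
    }


def deduplicate(rows, key="email"):
    """Keep the first record seen for each value of the key field."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Standardize first, so that the two spellings of the same customer compare equal.
cleaned = deduplicate([standardize(r) for r in records])
```

Standardizing before deduplicating matters here: without it, the two "Alice" rows would have different email strings and both would survive.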

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning deals with these problems, including missing data and noisy data.

  • (a). Missing Data:
    This situation arises when some values are missing from the dataset. It can be handled in various ways.
    Some of them are:
    1. Ignore the tuples:
      This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.
    2. Fill the Missing values:
      There are various ways to do this task. You can fill the missing values manually, with the attribute mean, or with the most probable value.
  • (b). Noisy Data:
    Noisy data is meaningless data that machines cannot interpret. It can result from faulty data collection, data entry errors, etc. It can be handled in the following ways:
    1. Binning Method:
      This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or by the nearest segment boundary value.
    2. Regression:
      Here the data is smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
    3. Clustering:
      This approach groups similar data into clusters. Outliers either fall outside the clusters or may go undetected.
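Two of the cleaning techniques above, filling missing values with the attribute mean and smoothing by bin means, can be sketched as follows. This is a minimal illustration rather than a complete cleaning routine; the function names, the use of `None` to mark a missing value, and the sample numbers are assumptions made for the example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value with the mean of its bin (the last bin may be smaller)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical attribute with one missing value.
ages = [25, None, 35, 30]
filled = fill_missing_with_mean(ages)

# Hypothetical noisy prices, smoothed in bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

With these inputs the missing age is filled with 30, and the three price bins are replaced by their means 9, 22, and 29.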

2. Data Transformation:
This step transforms the data into forms suitable for the mining process. It involves the following methods:

  1. Normalization:
    It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
  2. Attribute Construction:
    In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
  3. Discretization:
    This replaces the raw values of a numeric attribute with interval labels (for example, 0 to 10) or conceptual labels (for example, "young" or "senior").
  4. Concept Hierarchy Generation:
    Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country”.
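Three of the transformations above (normalization, discretization, and climbing a concept hierarchy) can be sketched in a few lines. Treat this as an illustrative sketch: the function names, the age cut-offs, and the small city-to-country table are assumptions chosen for the example, not fixed rules.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def discretize_age(age):
    """Replace a raw age with a conceptual label (the cut-offs are assumed)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}


def generalize_city(city):
    """Move one level up the hierarchy; unknown cities are reported as 'Unknown'."""
    return CITY_TO_COUNTRY.get(city, "Unknown")


incomes = [20000, 50000, 80000]
normalized = min_max_normalize(incomes)
age_labels = [discretize_age(a) for a in (10, 30, 70)]
countries = [generalize_city(c) for c in ("Paris", "Berlin", "Atlantis")]
```

Passing `new_min=-1.0` and `new_max=1.0` to `min_max_normalize` gives the other range mentioned above.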

3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data reduction techniques address this problem. They aim to increase storage efficiency and to reduce data storage and analysis costs.

The main data reduction techniques are:

  1. Data Cube Aggregation:
    Aggregation operation is applied to data for the construction of the data cube.
  2. Attribute Subset Selection:
    Only the highly relevant attributes should be used; the rest can be discarded. To perform attribute selection, one can compare the p-value of each attribute with a chosen significance level. An attribute whose p-value is greater than the significance level can be discarded.
  3. Numerosity Reduction:
    This makes it possible to store a model of the data instead of the whole data, for example a regression model.
  4. Dimensionality Reduction:
    This reduces the size of the data through encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is lossless; otherwise it is lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
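Two of the reduction techniques above can be illustrated briefly: aggregation (rolling quarterly figures up to yearly totals) and numerosity reduction (keeping only the parameters of a fitted line instead of the raw points). This is a rough sketch under assumed data; the sales figures, sample points, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales as (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 120), (2022, "Q4", 130),
    (2023, "Q1", 110), (2023, "Q2", 160), (2023, "Q3", 140), (2023, "Q4", 190),
]


def roll_up_to_year(rows):
    """Aggregate quarterly amounts into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.
    Assumes xs contains at least two distinct values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Eight rows reduce to two yearly totals.
yearly_sales = roll_up_to_year(quarterly_sales)

# Five points reduce to two stored parameters.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
```

The sample points lie exactly on a line, so the two stored parameters reproduce them without loss; with real, noisy data the stored model would only approximate the original values.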