Data cleaning is the process of ensuring that your data is
correct, consistent, and usable.
Here are several key benefits that come out of the data cleaning
process:
- It removes major errors and inconsistencies that are inevitable
when multiple sources of data are pulled into one dataset.
- Using tools to clean up data makes everyone more efficient,
since they'll be able to quickly get what they need from the
data.
- Fewer errors mean happier customers and fewer frustrated
employees.
- It lets you map your data's different functions: what the data
is intended to do and where it is coming from.
Data cleansing depends on thorough and continuous data profiling
to identify data quality issues that must be addressed.
Some of the data cleansing techniques involved in data
transformation during a data visualization project are described
below, with short code sketches for several of them after the list:
- Defining a data quality plan: Derived from the
business case, the quality plan may also entail some conversation
with business stakeholders to tease out answers to questions like
“What are our data extraction standards?”, “What opportunities do we
have to automate the data pipeline?”, “What data elements are key to
downstream products and processes?”, “Who is responsible for
ensuring data quality?”, and “How do we determine accuracy?”
- Validating accuracy: One aspect of accuracy is
taking steps to ensure data is correctly captured at the point of
collection – for example, catching cases where a website has changed
and the expected value is no longer there, or where a product's price
only appears once an item is placed in the shopping cart because of a
promotion.
- Deduplicating: No source data set is perfect,
and sometimes source systems send duplicate rows. The key here is
to know the “natural key” of each record, meaning the field or
fields that uniquely identify each row. If an inbound data set
includes records having the same natural key, all but one of the
rows could be removed.
- Handling blank values: Are blank values
represented as “NA,” “Null,” “-1,” or “TBD”? If so, deciding on a
single value for consistency’s sake will help eliminate stakeholder
confusion. A more advanced approach is imputing values.
This means using populated cells in a column to make a reasonable
guess at the missing values, such as finding the average of the
populated cells and assigning that to the blank cells.
- Reformatting values: If the source data’s date
fields are in the MM-DD-YYYY format, and your target date fields
are in the YYYY/MM/DD format, update the source date fields to
match the target format.
- Threshold checking: This is a more nuanced
data cleansing approach. It includes comparing a current data set
to historical values and record counts. For example, in the health
care world, let’s say a monthly claims data source averages total
allowed amounts of $2M and unique claim counts of 100K. If a
subsequent data load arrives with a total allowed amount of $10M
and 500K unique claims, those amounts exceed the normal expected
threshold of variance, and should trigger additional scrutiny.
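To make the accuracy validation step concrete, here is a minimal pandas sketch that flags scraped product rows whose price is missing or non-positive before they are loaded. The column names (sku, price) and the sample values are illustrative assumptions, not taken from any particular system.

```python
import pandas as pd

# Hypothetical scraped product data: None models a value that disappeared
# after a website change; -1 models a bad scrape.
products = pd.DataFrame({
    "sku":   ["A100", "A101", "A102"],
    "price": [19.99, None, -1.0],
})

# Flag rows that fail a basic accuracy check so they can be reviewed
# before loading into the target dataset.
invalid = products[products["price"].isna() | (products["price"] <= 0)]
if not invalid.empty:
    print(f"{len(invalid)} row(s) failed the price accuracy check:")
    print(invalid)
```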
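For deduplicating, the usual pattern is to drop rows that share the same natural key. The sketch below assumes a claims feed whose natural key is claim_id plus service_date; substitute whatever fields uniquely identify a row in your own data.

```python
import pandas as pd

# Inbound data with a duplicated row (same natural key: claim_id + service_date).
claims = pd.DataFrame({
    "claim_id":     ["C1", "C1", "C2"],
    "service_date": ["2024-01-05", "2024-01-05", "2024-01-07"],
    "allowed_amt":  [120.0, 120.0, 85.0],
})

# Keep the first row for each natural key and drop the rest.
deduped = claims.drop_duplicates(subset=["claim_id", "service_date"], keep="first")
print(f"Removed {len(claims) - len(deduped)} duplicate row(s)")
```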
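For handling blank values, the sketch below maps the assorted placeholder strings (“NA,” “Null,” “-1,” “TBD”) to a single missing marker and then imputes the gaps with the column mean. The units_sold column is a made-up example, and mean imputation is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"units_sold": ["10", "NA", "Null", "-1", "TBD", "25"]})

# Standardize every blank-value convention to a single marker (NaN),
# then convert the column to a numeric type.
df["units_sold"] = (
    df["units_sold"]
    .replace(["NA", "Null", "-1", "TBD"], np.nan)
    .astype(float)
)

# Simple imputation: fill the gaps with the average of the populated cells.
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].mean())
print(df)
```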
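For reformatting values, converting MM-DD-YYYY source dates to a YYYY/MM/DD target format can be done in one pass; order_date is a hypothetical field name.

```python
import pandas as pd

src = pd.DataFrame({"order_date": ["03-15-2024", "12-01-2023"]})

# Parse the source format, then render the target format.
src["order_date"] = (
    pd.to_datetime(src["order_date"], format="%m-%d-%Y").dt.strftime("%Y/%m/%d")
)
print(src)  # 2024/03/15 and 2023/12/01
```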
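For threshold checking, the health care example above boils down to comparing a new load against historical baselines. The 25% tolerance band below is an assumption; pick whatever variance is normal for your data.

```python
# Historical baselines from prior monthly claims loads (rounded figures).
history_avg_allowed = 2_000_000   # ~$2M average total allowed amount
history_avg_claims = 100_000      # ~100K average unique claim count
tolerance = 0.25                  # assumed acceptable relative variance

def outside_threshold(current: float, baseline: float, tol: float) -> bool:
    """Return True if current deviates from baseline by more than tol."""
    return abs(current - baseline) / baseline > tol

# The suspicious load from the example: $10M allowed, 500K unique claims.
new_total_allowed = 10_000_000
new_claim_count = 500_000

if (outside_threshold(new_total_allowed, history_avg_allowed, tolerance)
        or outside_threshold(new_claim_count, history_avg_claims, tolerance)):
    print("Load exceeds the expected variance threshold - hold it for review")
```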
Hope this answers your questions. Please leave an upvote if you
find this helpful.