As a data practitioner, you probably spend most of your time, by some estimates as much as 80%, preparing your data for analysis. This can be frustrating when you are eager to perform the analysis and uncover trends and insights.
However, data preparation is an integral component of data analysis. Without proper data preparation, subsequent data analysis will be flawed.
The preparation process includes reformatting and cleaning the data, correcting errors, deciding how to handle outliers, and combining data sets if applicable.
Effective data preparation helps organizations optimize the analysis process. It is a key step in ensuring that the available data is of good quality and that the insights derived from it are accurate and reliable.
Additionally, a thorough exploration of the data and possible methods during this phase is well worth the effort and can save a lot of time and aggravation down the line.
DATA PREPARATION SCENARIO
Imagine a mobile communications company wants to understand the characteristics of customers who churn, that is, leave the company for another provider. If you were the data analyst presented with this request, you’d want to use the data preparation phase to assess the feasibility of the request before diving into the data.
The first step is to determine how far back stakeholders want to look (e.g., 5 years) and then identify whether data is available on who churned during that period. Machine learning and artificial intelligence models can help determine whether the data set is available and applicable.
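One simple way to check whether churn data exists for the whole requested period is to count churn records per year and look for gaps. The sketch below is illustrative only: the churn dates, the 2019 to 2023 window, and the helper name `churn_counts_by_year` are all assumptions made up for this example, not data from a real company.

```python
from collections import Counter
from datetime import date

# Hypothetical churn dates, as they might be pulled from a customer table.
churn_dates = [
    date(2019, 3, 14),
    date(2020, 7, 2),
    date(2020, 11, 30),
    date(2022, 1, 9),
    date(2023, 5, 21),
]


def churn_counts_by_year(dates, first_year, last_year):
    """Count churn records for every year in the requested window."""
    counts = Counter(d.year for d in dates)
    # Include years with zero records so that gaps are visible.
    return {year: counts.get(year, 0) for year in range(first_year, last_year + 1)}


counts = churn_counts_by_year(churn_dates, 2019, 2023)
missing_years = [year for year, n in counts.items() if n == 0]
print(counts)
print("Years with no churn records:", missing_years)
```

A year with no records does not prove the data is missing, but it is a prompt to ask how the data was collected before committing to the full time window.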
Decisions will arise. Will you delete or ignore missing data? Or would you try to fill in missing values through imputation? If there are extreme values, will you keep or delete them?
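To make these choices concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical list of monthly charges in which `None` marks a missing value. Median imputation and the 1.5 × IQR rule are common conventions used here for illustration; they are not the only reasonable options.

```python
from statistics import median, quantiles

# Hypothetical monthly charges for eight customers; None marks a missing value.
monthly_charges = [42.0, 38.5, None, 45.0, 40.0, 950.0, None, 41.5]

# Option 1: delete (or ignore) the records with missing values.
observed = [x for x in monthly_charges if x is not None]

# Option 2: impute missing values with the median of the observed values.
# The median is less sensitive to extreme values than the mean.
fill_value = median(observed)
imputed = [fill_value if x is None else x for x in monthly_charges]

# Flag extreme values with the 1.5 * IQR rule instead of deleting them outright,
# so the decision to keep or drop them stays explicit.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extremes = [x for x in observed if x < lower or x > upper]

print("Imputed series:", imputed)
print("Flagged as extreme:", extremes)
```

Flagging rather than deleting keeps the original values available, which matters if an extreme value turns out to be a genuine high-spending customer rather than a data entry error.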
QUESTIONS TO KEEP IN MIND FOR DATA ANALYSIS
Here is a checklist with questions to help you ensure you cover all the important bases of the data preparation phase of a data project.
This checklist helps to ensure you have access to accurate data and identifies other key issues from the start. It begins with general overview questions and becomes more specific and action-oriented.
You will likely want to add relevant questions of your own pertaining to your industry and organization.
AT FIRST GLANCE:
1. Does the data you need exist?
2. Do you know how the data was generated and collected?
3. Is there enough data to reach reliable conclusions?
UPON FURTHER REFLECTION:
4. Does it measure what you need?
5. Are the variables the correct types or levels?
6. Do you understand the labels and codes used?
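Questions 5 and 6 can be partly automated by checking each record against a documented codebook. The following is a rough sketch under stated assumptions: the field names (`tenure_months`, `plan`), the plan codes, and the sample records are all hypothetical and would be replaced by your organization's own data dictionary.

```python
# Hypothetical codebook: the documented meaning of each code in the "plan" field.
PLAN_CODES = {"P": "prepaid", "M": "monthly contract", "A": "annual contract"}

# Hypothetical raw records as they might arrive from a CSV export (all strings).
raw_records = [
    {"customer_id": "1001", "tenure_months": "24", "plan": "M"},
    {"customer_id": "1002", "tenure_months": "n/a", "plan": "A"},
    {"customer_id": "1003", "tenure_months": "7", "plan": "X"},
]


def check_record(record):
    """Return a list of type or code problems found in one record."""
    problems = []
    # tenure_months should be a non-negative whole number.
    if not record["tenure_months"].isdigit():
        problems.append("tenure_months is not a whole number")
    # plan should be one of the codes documented in the codebook.
    if record["plan"] not in PLAN_CODES:
        problems.append(f"unknown plan code {record['plan']!r}")
    return problems


# Keep only the records that have at least one problem.
all_checks = {r["customer_id"]: check_record(r) for r in raw_records}
issues = {cid: problems for cid, problems in all_checks.items() if problems}

for customer_id, problems in issues.items():
    print(customer_id, problems)
```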
AFTER EXPLORING THE DATA:
7. Does the data include the required range and variability?
8. Are the distributions as you would expect?
9. Have you identified outliers or anomalies?
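A quick numeric profile of each variable is one way to start answering questions 7 through 9. The sketch below uses made-up customer tenure values and a hypothetical helper named `profile`. A mean well above the median, as in this example, suggests a right-skewed distribution worth a closer look.

```python
from statistics import mean, median, stdev


def profile(values):
    """Summarize the range, center, and spread of a numeric variable."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


# Hypothetical customer tenure in months.
tenure_months = [1, 2, 2, 3, 5, 8, 12, 24, 36, 60]

summary = profile(tenure_months)
for name, value in summary.items():
    print(f"{name}: {value}")

# A mean far above the median hints at a long right tail.
right_skewed = summary["mean"] > summary["median"]
```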
CONSIDER RETURN ON INVESTMENT (ROI):
10. Are you focusing on predictors you can control?
11. Have you identified the costs of manipulating the predictors?
12. Have you identified the potential benefits of conducting your analysis?
AT THE END OF YOUR PREPARATION:
13. Are you confident that your analysis will produce the desired insights?
14. Have you identified whether anything can be safely and usefully reduced?
15. Can you explain and justify your conclusions and recommendations?
Data is only as useful as its accuracy.
Organizations spend time and resources to ensure their data is accurate and reliable, yet a single error or issue in the data can significantly skew insights and affect decision-making. Asking the right questions when preparing your data is critical to getting accurate insights.