Data is like food for machine learning, and every machine learning project should begin with exploratory data analysis (EDA). EDA allows us, the practitioners, to understand the data and make informed decisions about the downstream components of the machine learning pipeline. Below I have listed some guidelines for completing a good EDA.
- Always know the source of your data: where it came from, what it means, what it should be telling you, and so on. Ask as many questions as possible.
- Calculate descriptive statistics such as mean, median, mode, standard deviation, min, max, quartiles, skewness, and others that you need. Understand these numbers.
- Analyze individual variables and the relationships between them. Use a correlation matrix and scatter plots for pairs of variables, histograms and box plots for single variables, and other plot types as needed.
- Clean the data if needed by identifying and removing redundant variables, outliers, and invalid entries.
- Perform statistical tests where they help confirm what the plots and summary numbers suggest.
- Always take extensive notes on your process and make your EDA reproducible.
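To make a few of the guidelines above concrete, here is a minimal sketch using only the Python standard library. It computes descriptive statistics, flags outliers, and measures the correlation between two variables. The data (`heights_cm`, `weights_kg`) is made up for illustration, the helper names `describe`, `iqr_outliers`, and `pearson` are hypothetical, and the 1.5 × IQR outlier rule is just one common convention that I am assuming here, not the only valid choice.

```python
import math
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up toy data; 250 is a deliberately suspicious entry.
heights_cm = [150, 160, 165, 170, 172, 175, 180, 250]
weights_kg = [50, 58, 62, 68, 70, 74, 80, 82]

summary = describe(heights_cm)
outliers = iqr_outliers(heights_cm)  # [250]
r = pearson(heights_cm, weights_kg)

print(summary)
print("outliers:", outliers)
print("correlation:", round(r, 3))
```

Whether a flagged value such as 250 should actually be removed depends on what you learned about the source of the data: it may be a typo, or it may be a real and important observation. Keeping the analysis in a script like this, rather than in ad hoc interactive commands, is also one simple way to make it reproducible.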
Finally, some tools that I use for EDA:
Enjoy your EDA!