Exploratory Data Analysis

As someone new to data science, I have recently been listening to data science podcasts in the hope of familiarizing myself with the jargon and learning a thing or two. In one episode of Not So Standard Deviations, hosts Roger Peng and Hilary Parker discuss data science processes and note that, most of the time, there is no formal process for data analysis; most data scientists just know what to do.

This got me thinking about how to approach a data set for a project. Unless you’re handed a hypothesis to test, intuition and creativity really come into play in the initial phase of a data investigation, a.k.a. Exploratory Data Analysis.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is the initial analysis of your data with the goal of bringing important aspects of a data set into focus. These are the aspects that ultimately drive your full data investigation, which you will communicate to others.

Every data analysis contains EDA whether or not one intends to do it. Even if you are handed one question to answer, this one question probably resulted from an EDA.

As mentioned, EDA is the initial analysis of your data and consists of you asking questions about your data without making any assumptions. As with any project, it is natural to approach a data set with many questions. When there is data everywhere, you have to ask questions!

To help you find the needle in the haystack that is your data set, EDA offers a variety of techniques for exploring your data efficiently. EDA tools include data visualization, transformation, and modeling. With these methods, you can examine different aspects of your data set and, guided by the questions you’ve asked, look for patterns that tell you something new.
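To make two of these tools concrete, here is a minimal sketch of how a transformation can change what a visualization shows. The data and the helper names (amounts, text_histogram, show) are made up for illustration, and the text histogram is only a stand-in for a real plotting library.

```python
import math
from collections import Counter

# Hypothetical toy data: right-skewed values such as purchase amounts.
amounts = [3, 4, 5, 5, 6, 8, 9, 12, 40, 95, 110, 900]

def text_histogram(values, bin_width):
    """Bucket values into fixed-width bins and return {bin_start: count}."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def show(histogram):
    """Print one row of '#' marks per bin, a stand-in for a real plot."""
    for start, count in histogram.items():
        print(f"{start:>6} | {'#' * count}")

# Visualization of the raw values: most of them pile into the first bin.
raw_hist = text_histogram(amounts, bin_width=100)

# Transformation (log base 10), then the same visualization again.
log_hist = text_histogram([math.log10(v) for v in amounts], bin_width=1)

show(raw_hist)
print()
show(log_hist)
```

On the raw scale, nearly every value lands in one bin, so the picture says little. After the log transformation the same values spread across several bins, which is the kind of pattern that can prompt a new question.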

Again, this is just the first step. Once you use these tools and find answers, you should use those answers to refine your original questions or create new ones, which you will then investigate further.

EDA does not have a set of rules for you to follow, and really should be an exploratory process in which you investigate all questions and ideas you have about your data. Eventually you will end up with that one thought-provoking question that provides an intriguing insight.

Beyond bringing focus to your data set, EDA also allows you to examine the quality of your data. Exploring your data allows you to find missing values, and visualizing your data can quickly highlight outliers, which can then be removed with data transformation. If your data does not meet your expectations, you will need to ask questions and employ the tools of EDA to clean your data set.
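As a rough illustration of a data quality check, the sketch below counts missing entries and flags possible outliers in a single column. The data is invented, the helper names are hypothetical, and the 1.5 x IQR rule used here is just one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical toy data: one numeric column with a gap (None) and one extreme value.
heights_cm = [162.0, 171.5, None, 168.0, 175.2, 159.8, 990.0, 166.4]

def count_missing(values):
    """Count entries recorded as None."""
    return sum(1 for v in values if v is None)

def iqr_outliers(values):
    """Return the non-missing values that fall more than 1.5 * IQR outside the quartiles."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]

missing = count_missing(heights_cm)
outliers = iqr_outliers(heights_cm)

print(f"missing values: {missing}")
print(f"possible outliers: {outliers}")
```

A flagged value is only a candidate for removal. Whether 990.0 is a typo for 99.0, a unit mix-up, or a real measurement is itself a question to ask of your data.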

Questions in Exploratory Data Analysis

Questions will be your main tool while investigating your data. Each question you ask should focus on a different part or aspect of your data set. Additionally, your questions should lead you to the EDA tools that you ultimately need to use to answer those questions.

You may have heard the saying “quality over quantity”, but when it comes to EDA, the more questions you start with, the higher the quality of the questions you will end up with. The questions you ask at the beginning of EDA may be general, but with each aspect of your data that you discover, you will drill down to the most important parts of your data set, and to that one thought-provoking question to examine and share.

Again, while there are no rules about what specific questions to ask, looking at the variation within a variable and the covariation between variables is a good place to start. Look for future posts about EDA with more information on variation, covariation and more!
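As a small preview, here is one way to put numbers on both ideas: a summary of the spread of a single variable for variation, and a Pearson correlation between two variables for covariation. The data and function names are invented for illustration, and Pearson correlation only captures linear relationships, so treat this as a starting point rather than a complete picture.

```python
import statistics

# Hypothetical toy data: hours studied and exam score for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]

def describe_variation(values):
    """Summarize how much a single variable varies."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def pearson_r(xs, ys):
    """Pearson correlation, a common measure of linear covariation."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

hours_summary = describe_variation(hours)
r = pearson_r(hours, scores)

print(hours_summary)
print(f"correlation between hours and scores: {r:.3f}")
```

A correlation close to 1 suggests the two variables rise together, which might lead to a sharper follow-up question, such as whether the relationship holds for every group of students.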