Last year I graduated from Udacity’s Data Analyst Nanodegree and learned a lot about the process of data analysis. I also talked about this process at the Tableau Fringe Festival on 8 March 2019. You can find the video recording here. Now I would like to share my knowledge with you.

The data analysis process is broken into 5 steps:

  • Asking questions
  • Wrangle
  • Explore
  • Draw conclusions
  • Communicate

Asking questions

The data analysis process always starts with asking questions. Sometimes you are already given a data set. Other times your questions come first, and they determine what kind of data you will gather later.

If your data is already given, then these could be your questions:

  • What is this data set about?
  • What is the metadata?
  • Which time period is covered?
  • Which dimensions/measures does the data set have?
  • Are there any trends in the data?
  • etc.

Tip: Ask as many questions about the context as you can. Make sure you understand the context of the data correctly!

Other times, you might simply be interested in a topic and have to look for the data yourself. For example, I was interested in the data of WeRateDogs on Twitter. WeRateDogs is a Twitter account that rates people’s dogs with humorous comments about the dogs. Today WeRateDogs has over 7 million followers and has received international media coverage.

My questions about WeRateDogs were:

  • Are breed dogs more popular than non-breed dogs?
  • What is the most popular name for breed dogs and for non-breed dogs?
  • What is the most popular breed on WeRateDogs?
  • Which tweet has the most likes?
  • etc.

In order to find the answers to these questions, I had to gather data from Twitter.

Data Wrangling

Sometimes data wrangling is used synonymously with EDA (Exploratory Data Analysis) because their purposes often overlap. Thus, I would like to explain the difference.

Data wrangling means gathering, assessing and cleaning data. But keep in mind that the modifications you make when cleaning the data set won’t make your analysis, visualizations or models better; they just make them work. In contrast, EDA means exploring and augmenting your data to maximize the potential of your analysis, visualizations, and models.

Data Gathering

There are many ways to gather data. Here are some options:

Advanced method

  • use an API
  • scrape data from a web page

Simple method

Very simple method

participate in data projects like:

For the WeRateDogs data, I used the Twitter API. I created a developer account with Twitter and received the following access credentials: consumer key, consumer secret, access token, etc. This is a small part of my code for gathering the data:

With the Twitter API, I could access the WeRateDogs data on Twitter directly from Python and download it.
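
To give an idea of what such a gathering step can look like, here is a minimal sketch in Python. It is not my original code. It assumes a hypothetical `fetch_tweet` function that returns one tweet as a dictionary; in a real project this would wrap the call of a Twitter client library, which is left out here. The function names and field names are illustrative assumptions as well.

```python
import json


def gather_tweets(tweet_ids, fetch_tweet):
    """Fetch each tweet once and remember the ids that could not be fetched."""
    tweets, failed = [], []
    for tweet_id in tweet_ids:
        try:
            tweets.append(fetch_tweet(tweet_id))
        except Exception as error:  # e.g. a deleted tweet or a rate limit
            failed.append((tweet_id, str(error)))
    return tweets, failed


def save_as_json_lines(tweets, path):
    """Write one JSON object per line so the file can be re-read later."""
    with open(path, "w", encoding="utf-8") as handle:
        for tweet in tweets:
            handle.write(json.dumps(tweet) + "\n")


# Hypothetical stand-in for a real API call (made-up data, not real tweets).
def fake_fetch(tweet_id):
    if tweet_id == "3":
        raise LookupError("tweet not found")
    return {"id_str": tweet_id, "favorite_count": 10, "retweet_count": 2}


tweets, failed = gather_tweets(["1", "2", "3"], fake_fetch)
```

Keeping the failed ids in a separate list makes it easy to retry them later or to document them as missing.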

For people who can code, Python is a great solution for data wrangling. Alternatively, you can use Tableau Prep, a lightweight ETL tool that requires no code. With this tool, I could transform my data and generate a Tableau extract.

After downloading the data from Twitter, I used Tableau Prep in order to assess and clean my data.

Data Assessment

Assessing data is the second step in the data wrangling process. When assessing, you’re like a detective at work, inspecting the data set for two things: data quality issues and lack of tidiness.

To assess the data, I used Tableau Prep with this flow. The assessment makes up the first and second steps of the flow.

I documented all my findings about the quality and tidiness of the data:

Quality issues in the WeRateDogs data:

  • Data set: 17 columns; 2300 rows
  • No duplicates in the data set
  • Not every dog is a breed dog
  • Not every tweet has an image
  • In the column “name”, I found 109 entries in lower case. I assume these are not actually names
  • Some names were entered into the column “text”
  • Contents of “text” are cut off
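
For readers who prefer code over Tableau Prep, checks like the ones above can be sketched in plain Python. This is only an illustration under assumptions: the sample rows are made up, and the column names `tweet_id`, `name` and `image_url` are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; the real data set has 17 columns.
rows = [
    {"tweet_id": "1", "name": "Charlie", "image_url": "a.jpg"},
    {"tweet_id": "2", "name": "a", "image_url": None},
    {"tweet_id": "3", "name": "Lucy", "image_url": "c.jpg"},
    {"tweet_id": "4", "name": "quite", "image_url": "d.jpg"},
]


def assess(rows):
    """Return a few quality indicators for a list of row dictionaries."""
    id_counts = Counter(row["tweet_id"] for row in rows)
    return {
        "rows": len(rows),
        "columns": len(rows[0]) if rows else 0,
        "duplicate_ids": [i for i, n in id_counts.items() if n > 1],
        # Lower-case "names" are probably ordinary words, not names.
        "lowercase_names": [r["name"] for r in rows
                            if r["name"] and r["name"][0].islower()],
        "tweets_without_image": sum(1 for r in rows if not r["image_url"]),
    }


report = assess(rows)
```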

Tidiness

  • The dimensions and measures must have the correct data type
  • All dogs’ names should be entered in the column “name”
  • Join “tweet_info” and “image_predictions” to “twitter_archive”
  • The data set should not contain duplicate data
  • All dogs’ names should be capitalized
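
The join mentioned in the list above can be sketched as a simple left join in Python. The table contents below are made up, and I assume here that `tweet_id` is the shared key and that each table holds at most one row per tweet.

```python
def left_join(left_rows, right_rows, key):
    """Attach the matching right row (if any) to every left row."""
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        merged = dict(row)  # copy, so the left table stays untouched
        merged.update(right_by_key.get(row[key], {}))
        joined.append(merged)
    return joined


# Hypothetical miniature versions of the three tables.
twitter_archive = [{"tweet_id": "1", "name": "Charlie"},
                   {"tweet_id": "2", "name": "Lucy"}]
tweet_info = [{"tweet_id": "1", "favorite_count": 120, "retweet_count": 30},
              {"tweet_id": "2", "favorite_count": 45, "retweet_count": 9}]
image_predictions = [{"tweet_id": "1", "breed": "golden_retriever"}]

master = left_join(left_join(twitter_archive, tweet_info, "tweet_id"),
                   image_predictions, "tweet_id")
```

A left join keeps every tweet from the archive, even if there is no image prediction for it. That fits the finding that not every tweet has an image.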

Assessing the data gave me a good overview of how to clean it.

Cleaning data

Before you start to clean your data, make sure you have made a copy.

I also used Tableau Prep to clean the data, and I took the following steps:

  • Formatting the measures and dimensions where it was necessary, e.g. tweet_id from float to string
  • Checking for missing data
    • Missing data is a common problem when working with data. It should be handled differently depending on several factors, such as the reason the values are missing and whether the occurrences seem random.
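
As a rough code equivalent of these cleaning steps, here is a small Python sketch. The rows and IDs are made up, and `None` is assumed to mark a missing value. It works on a copy, stores the tweet ID as text and counts missing values per column.

```python
import copy

# Hypothetical raw rows; None marks a missing value.
raw_rows = [
    {"tweet_id": 123456789012345679, "name": "Charlie", "breed": None},
    {"tweet_id": 123456789012345681, "name": None, "breed": "chihuahua"},
]


def clean(rows):
    """Return a cleaned copy; the original rows stay untouched."""
    cleaned = copy.deepcopy(rows)
    for row in cleaned:
        # IDs are labels, not numbers: store them as text.
        row["tweet_id"] = str(row["tweet_id"])
    return cleaned


def count_missing(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


clean_rows = clean(raw_rows)
missing = count_missing(clean_rows)

# Why text? A float cannot hold every 18-digit ID exactly.
id_survives_float = int(float(123456789012345679)) == 123456789012345679
```

The last line evaluates to `False`: long IDs lose digits when they pass through a float, which is one reason to keep them as strings.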

I worked through all the points I had documented during the assessment and cleaned up my data accordingly. After cleaning, I created a Hyper extract for Tableau Desktop. Now I can explore the data, i.e. I can find patterns in my data by creating plots.

Exploratory Data Analysis (EDA)

EDA is the examination of data and the relationships among variables through both numerical and graphical methods. It often takes place before more formal, more rigorous statistical analysis. EDA is often the first part of a larger process. It can lead to insights or new questions, or even feed into the process of building predictive models. It is an opportunity to check some of your assumptions and intuitions about a data set.

For the EDA, I used Tableau Desktop. Feel free to have a look at my workbook:

I answered some questions by exploring the data:

  • Which name is the most common?
  • Which breeds are the most common?
  • Which names are popular for a golden retriever?
  • Favorite Counts over time
  • When do people tweet the most?
  • On which days do people “like” and retweet posts?
  • At what time do people usually “like” and retweet posts?
  • Correlation between a tweet’s favorite count and its retweet count
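
Two of these questions, the most common name and the correlation between favorites and retweets, can also be answered numerically. The following Python sketch is only an illustration with made-up rows and hypothetical field names; it is not how I built the Tableau workbook.

```python
from collections import Counter
from math import sqrt

# Hypothetical cleaned rows.
tweets = [
    {"name": "Charlie", "favorite_count": 100, "retweet_count": 20},
    {"name": "Lucy", "favorite_count": 200, "retweet_count": 50},
    {"name": "Charlie", "favorite_count": 300, "retweet_count": 60},
    {"name": None, "favorite_count": 400, "retweet_count": 80},
]


def most_common_names(rows, top=3):
    """Rank dog names by frequency, ignoring tweets without a name."""
    return Counter(r["name"] for r in rows if r["name"]).most_common(top)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


names = most_common_names(tweets)
r = pearson([t["favorite_count"] for t in tweets],
            [t["retweet_count"] for t in tweets])
```

A coefficient close to 1 means that tweets with many favorites also tend to have many retweets. A scatter plot of the two measures shows the same relationship visually.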

Draw conclusions

Now try to draw conclusions from your visualizations. Try to understand the context of your data and, ideally, reach a conclusion. For example:

  • The data set I am dealing with contains data about breed dogs.
  • The time period we look at is November 17th, 2015 to August 1st, 2017. Also, data is not available for every day
  • etc.
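
A statement like the one about the time period can be checked with a few lines of Python. This is a sketch with made-up dates; it returns the first day, the last day and the days in between without any data.

```python
from datetime import date, timedelta

# Hypothetical tweet dates after cleaning.
tweet_dates = [date(2015, 11, 17), date(2015, 11, 18), date(2015, 11, 21)]


def date_coverage(dates):
    """Return first day, last day, and the days in between without data."""
    first, last = min(dates), max(dates)
    present = set(dates)
    span = (last - first).days + 1
    missing = [first + timedelta(days=i) for i in range(span)
               if first + timedelta(days=i) not in present]
    return first, last, missing


first_day, last_day, missing_days = date_coverage(tweet_dates)
```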

Communicate the message by visualizing the data

Once all steps are done, you can visualize your data by creating your data story. The dashboard I built is included in this workbook.

But first, what does a data story mean? For me, a data story focuses on a few points of context in your data. For example, I focused on the top 10 dog breeds and showed their most common names and how popular the tweets about these breeds are.

To create a story with your data, you must visualize the data by choosing the right chart type. A very helpful resource for this is this workbook by Andy Kriebel.

Don’t forget to add context and interaction (like a filter or other actions) if necessary. Also point out the call to action, as not everyone knows how to use your dashboard. It is important to cite the data source and, where applicable, the names of the people who inspired your viz.