Initializing & Cleaning a Dataset

This is part of the “Intro to Data Analysis with Python” series of posts, with content from the Enki app. If you stumbled upon this, you could start from the beginning.

Using the dataset from the previous insight[1], we will show you how to clean it up before we start the analysis.

First off, when we import a dataset, we can use the head() or tail() methods to check the top or bottom 5 rows, respectively.

You can also pass a number to head() and tail() to override the default value of 5.

Using importedRawData.head() we get:

[Output: the first 5 rows of the dataset]

Using importedRawData.tail() we get:

[Output: the last 5 rows of the dataset]

This is a quick way to confirm that your dataset has loaded correctly.
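As a minimal sketch of these calls, the snippet below builds a small stand-in DataFrame and reuses the name importedRawData. The column names and values here are hypothetical placeholders, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the imported dataset: 8 made-up rows.
importedRawData = pd.DataFrame({
    "show_id": [101, 102, 103, 104, 105, 106, 107, 108],
    "title": ["A", "B", "C", "D", "E", "F", "G", "H"],
})

first_rows = importedRawData.head()    # first 5 rows (the default)
last_rows = importedRawData.tail()     # last 5 rows (the default)
first_three = importedRawData.head(3)  # pass a number to override the default

print(first_rows)
print(last_rows)
print(first_three)
```

If head() prints the rows you expect, the import most likely worked.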

As you can see, there are a lot of columns in this dataset.

To check the total number of rows and columns in your dataset, use the .shape attribute of your DataFrame. It returns a (rows, columns) tuple.

This dataset has 6234 rows and 12 columns.

Row labels start from 0 instead of 1. This is why the index of the last row is 6233 instead of 6234.
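To illustrate, here is a small sketch of .shape and the zero-based index. It uses a hypothetical 4-row, 3-column stand-in for importedRawData, so the numbers differ from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in: 4 rows and 3 columns of made-up values.
importedRawData = pd.DataFrame({
    "show_id": [101, 102, 103, 104],
    "title": ["A", "B", "C", "D"],
    "release_year": [2015, 2016, 2017, 2018],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = importedRawData.shape
print(n_rows, n_cols)

# The default index starts at 0, so the last label is n_rows - 1.
last_label = importedRawData.index[-1]
print(last_label)
```

The same relationship holds for any DataFrame with the default index: the last label is always one less than the row count.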

We will remove the columns we don't need for our analysis and leave the ones we will use in this workout.

To determine which columns we will remove, let's first check which cells have missing data.

To check which data is missing, call the .isnull() method on your DataFrame.

This gives us a table of True / False values, where True marks a missing (empty) cell.
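The snippet below is a rough sketch of this check on a hypothetical stand-in for importedRawData with a few missing cells. The per-column count at the end is an optional extra step that is not part of the original walkthrough, but it may help when deciding which columns to drop.

```python
import pandas as pd

# Hypothetical stand-in with a few missing cells (None becomes a null value).
importedRawData = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "release_year": [2015, None, 2017],
})

# Same shape as the original; True wherever a value is missing.
missing_mask = importedRawData.isnull()
print(missing_mask)

# Optional: summing the mask counts missing values per column (True counts as 1).
missing_per_column = missing_mask.sum()
print(missing_per_column)
```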

Footnotes

[1:Previous Dataset]
