Netflix Movies & TV Show — Exploratory Data Analysis (EDA) and Visualization



Netflix is a global streaming service that offers a wide variety of TV shows, movies, documentaries, and other content across genres. It allows subscribers to watch unlimited content on multiple devices for a monthly fee. Netflix produces original content, known as Netflix Originals, including series, films, and documentaries, which have gained popularity worldwide. The platform also provides personalized recommendations based on users’ viewing history and preferences. Overall, Netflix has revolutionized the way people consume entertainment, making it accessible anytime, anywhere.

Because Netflix streams movies and TV shows to a substantial user base, it naturally accumulates vast amounts of data. In this article, I will analyze a Netflix dataset using Python libraries and show the insights that can be derived from it.

We will use the Netflix Movies and TV Shows 2021 dataset, which can be downloaded from Kaggle.

Table of Contents :

  1. Importing Libraries
  2. Load the Dataset
  3. Data Overview and Information
  4. Cleaning Missing Data
  5. EDA (Exploratory Data Analysis), Visualization, and Conclusions

Importing Libraries

Before we start, we need to import modules first. We’ll be using Pandas, NumPy, Matplotlib, and Seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the Dataset

The Netflix Movies and TV Shows 2021 dataset is sourced from Kaggle. We can draw insights from it through analysis and present the findings as plots, which makes them easier for readers to grasp than raw data files.

data = pd.read_csv('netflix_titles_2021.csv')

Data Overview and Information

data.head()
Output 1. Data Head of Netflix Movies and TV Shows 2021

data.head() shows only the first 5 rows of the dataset.

#Show data shape

data.shape
Output 2. Data Shape

Based on the output, we can see that the dataset consists of 8,807 rows and 12 columns.

#Show columns

data.columns
Output 3. Columns of Dataset

The dataset has 12 columns: ‘show_id’, ‘type’, ‘title’, ‘director’, ‘cast’, ‘country’, ‘date_added’, ‘release_year’, ‘rating’, ‘duration’, ‘listed_in’, and ‘description’.

#Information about dataset

data.info()
Output 4. Dataset Information

Based on the output, 6 of the 12 columns contain 8,807 non-null values, while the other 6 columns have fewer. Furthermore, 11 columns are of the object data type, while 1 column is a 64-bit integer (int64).

#Number of null values per column

data.isna().sum()

Check for missing data from each column in the dataset using .isna().sum()

Output 5. Dataset Information

Based on the output provided, it’s evident that 6 columns contain null values. Specifically, the ‘director’ column has 2,634 null values, ‘cast’ has 825 null values, ‘country’ has 831 null values, ‘date_added’ has 10 null values, ‘rating’ has 4 null values, and ‘duration’ has 3 null values.
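Raw counts are easier to judge next to their share of all rows. The sketch below is a minimal illustration on a hypothetical four-row sample (the `sample` frame and its values are made up for the example, not taken from the Kaggle file); the same two lines should work on `data`.

```python
import pandas as pd

# Hypothetical four-row sample; the real data comes from netflix_titles_2021.csv.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["TV-MA", "PG", None, "TV-14"],
})

# Count of missing values per column, plus their share of all rows in percent.
missing = sample.isna().sum().to_frame("missing")
missing["percent"] = (sample.isna().mean() * 100).round(2)
print(missing)
```

Applied to the figures reported above, 2,634 missing directors out of 8,807 rows is roughly 29.9% of the dataset, which is worth keeping in mind before dropping those rows.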

Cleaning Missing Data

data = data.dropna(subset=['director'])
data = data.dropna(subset=['cast'])
data = data.dropna(subset=['country'])
data = data.dropna(subset=['duration'])
data = data.dropna(subset=['rating'])

The .dropna function deletes rows with missing data in a specific column, one column at a time. We then check whether any missing data remains in those columns:

data.isna().sum()
Output 6. Dataset Information After Drop Missing Data
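The column-by-column calls can also be written as a single call, since `subset` accepts a list of column names. A minimal sketch on a hypothetical sample frame (the values are made up) that checks both approaches give the same result:

```python
import pandas as pd

# Hypothetical sample with one missing value in each of three columns.
sample = pd.DataFrame({
    "director": ["X", None, "Y", "Z"],
    "cast": ["a", "b", None, "c"],
    "country": ["India", "Canada", "Japan", None],
})

# One column at a time, as in the article.
step_by_step = sample.dropna(subset=["director"])
step_by_step = step_by_step.dropna(subset=["cast"])
step_by_step = step_by_step.dropna(subset=["country"])

# All columns in one call.
one_call = sample.dropna(subset=["director", "cast", "country"])

# Both approaches keep exactly the same rows.
print(one_call.equals(step_by_step))
```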

EDA (Exploratory Data Analysis) and Visualization

  1. The different types of shows and movies uploaded on Netflix.

For our initial analysis, we’ll explore the types of content available on the Netflix platform. The following code counts the titles of each type.

data.groupby('type')['title'].count().sort_values(ascending = False)
Output 7. Count for Different Types of Show or Movie in Netflix

We have determined that there are 5185 movies and 147 TV shows in our dataset.

data.type.value_counts()[data.type.unique()].plot(kind='bar')
Output 8. Plot for Different Types of Show or Movie in Netflix

From the chart above, it’s apparent that there are two types of shows or movies available on Netflix. Movies take the lead with 5185 content items on the platform.

2. Correlation between the features.

We want to know the correlation between the features, because correlation analysis helps in understanding patterns, making predictions, selecting features, and measuring the strength of relationships in a dataset. First, we check the data type of ‘date_added’ with .dtypes.

data.dtypes
Output 9. Information for Data Types Every Column

The ‘date_added’ column is stored as object (text), so we convert it to datetime using to_datetime and check just 2 rows with .head(2).

data['date_added'] = pd.to_datetime(data['date_added'], format='mixed')
data.head(2)
Output 10. Check after Change Data Type

As you can see, the values in the ‘date_added’ column are now displayed as YYYY-MM-DD.
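Note that `format='mixed'` is only available in pandas 2.0 and later. As an alternative sketch, assuming the raw strings look like "September 25, 2021" and may carry stray whitespace (the sample values below are hypothetical), the dates can be parsed with an explicit format, turning anything unparseable into NaT instead of raising an error:

```python
import pandas as pd

# Hypothetical values in "Month day, year" style: one with a leading space
# and one that is not a date at all.
raw = pd.Series(["September 25, 2021", " August 14, 2019", "unknown"])

# Strip whitespace, parse with an explicit format, and turn failures into NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparsed rows:", parsed.isna().sum())
```

Counting the NaT values afterwards is a quick way to see how many rows failed to parse before extracting year or month from the column.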

data['year_added'] = data['date_added'].dt.year.astype('int64')
data['month_added'] = data['date_added'].dt.month.astype('int64')
data.head(2)

Here we extract the year from the date into a new column named ‘year_added’ and the month into a new column named ‘month_added’, then check the first 2 rows with .head(2).

Output 11. Check after Add New Column

If you scroll to the right, ‘year_added’ and ‘month_added’ will appear. With these numeric columns in place, we can compute correlations using data.corr(numeric_only=True).

Output 12. Correlations
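The correlation call itself can be sketched as follows, on a hypothetical four-row sample with the same numeric column names (the values are invented). `numeric_only=True` makes pandas skip text columns such as the title:

```python
import pandas as pd

# Hypothetical sample: one text column and three numeric columns.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2000, 2010, 2015, 2020],
    "year_added": [2016, 2018, 2019, 2021],
    "month_added": [1, 6, 3, 12],
})

# Pairwise Pearson correlation between the numeric columns only.
corr = sample.corr(numeric_only=True)
print(corr.round(2))
```

The resulting square table can be passed to sns.heatmap(corr, annot=True) if a visual version is preferred.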

3. Most watched show on Netflix.

This section resembles the first one. The dataset contains no viewing figures, so we use the number of titles per type as a stand-in for which content type dominates the platform: movies or TV shows.

data.groupby('type')['title'].count().sort_values(ascending = False)
Output 13. Count for Different Types of Show or Movie in Netflix

We have determined that there are 5185 movies and 147 TV shows in our dataset.

Output 14. Plot for Different Types of Show or Movie in Netflix

According to the plot, movies are the most common content on Netflix, accounting for 5185 titles. To show the same comparison as a pie chart with percentages, use this code:

type_show = ['Movie', 'TV Show']
value_count = [5185, 147]
plt.pie(value_count, labels = type_show, autopct='%2.2f%%')
plt.show()
Output 14. Pie Plot for Different Types of Show or Movie in Netflix

According to the plot, movies make up 97.24% of the titles in the cleaned dataset.
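The labels and counts for the pie chart do not have to be typed by hand. A minimal sketch, using a hypothetical four-row sample, that derives them from the ‘type’ column so they stay in sync with the data:

```python
import pandas as pd

# Hypothetical sample of content types.
sample = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Labels and counts taken straight from the column.
counts = sample["type"].value_counts()
type_show = counts.index.tolist()
value_count = counts.tolist()

# Share of each type in percent.
share = (sample["type"].value_counts(normalize=True) * 100).round(2)

print(type_show, value_count)
print(share)
# The two lists can then be passed to
# plt.pie(value_count, labels=type_show, autopct='%2.2f%%').
```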

4. Distribution of Ratings.

In this section, our objective is to analyze the distribution of ratings across each title in our dataset.

sns.countplot(x=data['rating'])
plt.xticks(rotation=90)
Output 15. Distribution of Ratings

When comparing ratings across titles on the Netflix platform, it becomes evident that titles with the rating “TV-MA” dominate the platform, surpassing other rating categories.

5. Highest Rating TV Show or Movies.

Next, we want to know which rating appears most often for TV shows and for movies on the Netflix platform.

data.groupby('type')['rating'].agg(pd.Series.mode)
Output 16. Distribution of Ratings

Based solely on this output, the most frequent rating for both content types on Netflix is ‘TV-MA’. Note that ‘TV-MA’ is a maturity rating, not a quality score, so this shows that mature-audience titles dominate the cleaned catalog rather than how viewers rate them.
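The mode gives only the single most frequent rating per type. To see the full breakdown behind it, one option is a cross-tabulation. This is a sketch on a hypothetical six-row sample, not the article's original method:

```python
import pandas as pd

# Hypothetical sample of types and maturity ratings.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["TV-MA", "TV-MA", "PG-13", "TV-14", "TV-MA", "TV-MA"],
})

# Rows are content types, columns are ratings, cells are title counts.
table = pd.crosstab(sample["type"], sample["rating"])

# The column with the largest count in each row is the most common rating.
most_common = table.idxmax(axis=1)

print(table)
print(most_common)
```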

6. The best Month for releasing content

The best month for releasing content depends on factors such as genre, target audience, and competition, so it varies from platform to platform. Here we simply count how many titles were added to Netflix in each month.

data['month_added'].value_counts().to_frame('value_counts')
Output 17. Value Counts About Month Releasing Content

It shows the distribution of content releases across different months. It provides a count of how many titles were released in each month, giving insights into the distribution of content launches throughout the year.

data['month_added'].value_counts().plot(kind='bar')
Output 18. Plot About Month Releasing Content
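value_counts orders months by frequency, which is handy for finding the busiest month but makes seasonal patterns harder to read. A small sketch, on hypothetical month numbers, that puts the counts in calendar order and labels them with month names:

```python
import calendar
import pandas as pd

# Hypothetical month numbers, as produced by .dt.month.
sample = pd.DataFrame({"month_added": [12, 1, 7, 12, 7, 12]})

# Count titles per month, then sort by month number instead of by count.
by_month = sample["month_added"].value_counts().sort_index()

# Replace month numbers with names for readable axis labels.
by_month.index = [calendar.month_name[m] for m in by_month.index]
print(by_month)
```

Calling by_month.plot(kind='bar') on the result draws the bars from January to December.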

7. Highest Genre Watched on Netflix

We start by splitting ‘listed_in’ into a list of up to three genres, stored in ‘new_genre’.

new_genre = data['listed_in'].str.rsplit(',', n=2)
new_genre
Output 19. Make New Column about New Genre
data['Genre 1']=new_genre.str.get(0)
data['Genre 2']=new_genre.str.get(1)
data['Genre 3']=new_genre.str.get(2)
data.head(2)
Output 20. Genre_1, Genre_2, Genre_3 will appear in the Right

If you scroll to the right, ‘Genre 1’, ‘Genre 2’, and ‘Genre 3’ will appear, which makes it easy to look up the genres of each title. Next, call .describe() on each genre column.

data['Genre 1'].describe(include='all').to_frame()
Output 21. Genre_1
data['Genre 2'].describe(include='all').to_frame()
Output 22. Genre_2
data['Genre 3'].describe(include='all').to_frame()
Output 23. Genre_3

To see a pie plot of how many titles have a first, second, and third genre listed, use this code. Note that it compares the three genre columns, not individual genres:

Genre_type = ['Genre 1', 'Genre 2', 'Genre 3']
value_count = [5332, 4231, 2295]
plt.pie(value_count, labels = Genre_type, autopct = '%2.2f%%')
plt.title("Most Watched Genre on Netflix")
plt.show()
Output 24. Pie Plot Most Watched Genre on Netflix
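To rank individual genres rather than genre columns, one option is to split ‘listed_in’ and count every genre label across all titles. This is a sketch on a hypothetical three-row sample; it measures how often a genre appears in the catalog, since the dataset has no viewing figures:

```python
import pandas as pd

# Hypothetical sample with comma-separated genres per title.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, International Movies", "Comedies, Dramas", "Dramas"],
})

# Split into lists, give each genre its own row, and trim stray spaces.
genres = sample["listed_in"].str.split(",").explode().str.strip()

# Count how many titles carry each genre label.
genre_counts = genres.value_counts()
print(genre_counts)
```

genre_counts.head(10).plot(kind='bar') would then chart the ten most frequent genres.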

8. Released movie over the years

In this part, we look at how many titles were released in each year.

data.groupby(['release_year'])['release_year'].count().sort_values(ascending=False).to_frame()
Output 25. Released Movie Over The Years
# Set the figure size before plotting so it applies to this chart
sns.set(rc={'figure.figsize':(40,20)})
sns.countplot(x='release_year', data=data)
plt.show()
Output 26. Bar Plot Released Movie Over The Years

The bar plot shows that 2017 had the highest number of releases, with 657 titles.

9. Movies Made On Year Basis

This section groups titles by the year they were added to Netflix (‘year_added’) and counts how many movies and TV shows fall in each year.

data.groupby('year_added')['type'].value_counts().sort_values(ascending=False).to_frame()
Output 26. Movies Made on Year Basis
data['year_added'].value_counts().plot(kind='bar')
Output 27. Bar Plot Movies Made on Year Basis

According to the output, 2009 was the year with the highest number of movies added, at 1236.

10. Show all the movies that were released in year 2000

We want to see all the movies that were released in the year 2000.

data[(data['type']=='Movie') & (data['release_year']==2000)]
Output 28. Movies That Were Released In Year 2000

The output contains 32 rows, one for each movie in the dataset that was released in the year 2000.

11. Show only the title of all TV shows that were released in India only

In this section, we want to know the titles of all TV shows whose only listed country is India.

data[(data['type']=='TV Show') & (data['country']=='India')]['title'].to_frame()
Output 29. TV Shows That Were Released In India Only
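The `==` comparison matches only rows where ‘country’ is exactly "India". Assuming co-productions are stored as one comma-separated string (an assumption about the data layout, illustrated here with hypothetical rows), the difference from a substring match looks like this:

```python
import pandas as pd

# Hypothetical sample: one India-only show, one co-production, one movie.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["TV Show", "TV Show", "Movie", "TV Show"],
    "country": ["India", "India, United States", "India", "Japan"],
})

is_tv = sample["type"] == "TV Show"

# Exact match: India must be the only listed country.
india_only = sample[is_tv & (sample["country"] == "India")]["title"]

# Substring match: India appears anywhere in the country string.
india_any = sample[is_tv & sample["country"].str.contains("India", na=False)]["title"]

print(india_only.tolist())
print(india_any.tolist())
```

For the "India only" question in this section, the exact match is the right choice; the substring version answers the broader question of shows made in or with India.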

12. Show the top 10 directors who gave the highest number of TV shows & movies to Netflix

This shows the top 10 directors with the highest number of TV shows and movies on Netflix.

data['director'].value_counts().head(10)
Output 30. Top 10 Director the Highest Number of TV Shows and Movies
data['director'].value_counts().head(10).plot(kind='bar')
Output 31. Bar Plot Top 10 Director the Highest Number of TV Shows and Movies

The top entry is ‘Raúl Campos, Jan Suter’, a pair of directors credited together in a single ‘director’ value.
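Because value_counts treats each ‘director’ string as one entry, co-directed titles are counted for the pair rather than for each person. A sketch of counting per individual instead, on hypothetical names and assuming co-directors are comma-separated:

```python
import pandas as pd

# Hypothetical sample with one co-directed title.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Director One, Director Two", "Director One", "Director Three"],
})

# Counts per distinct string, as value_counts does out of the box.
per_entry = sample["director"].value_counts()

# Counts per person: split the string, one row per name, trim spaces.
per_person = sample["director"].str.split(",").explode().str.strip().value_counts()

print(per_entry)
print(per_person)
```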

13. In how many movies/TV shows was ‘Tom Cruise’ cast

We want to know how many movies or TV shows feature Tom Cruise in the cast.

data[data['cast']=='Tom Cruise']

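If this returns no rows, it is because an exact match only finds titles where ‘Tom Cruise’ is the entire ‘cast’ value, while the column usually lists several actors in one string. Search for the name inside the string instead:

data[data['cast'].str.contains('Tom Cruise', na=False)]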
Output 32. Tom Cruise Cast in Movies or TV Shows

Only 2 movies in the dataset feature Tom Cruise in the cast.
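Since the question asks "how many", the count can be read directly from the boolean mask instead of from the displayed rows. A minimal sketch on a hypothetical sample (the other actor names are placeholders):

```python
import pandas as pd

# Hypothetical sample: one matching cast list, one non-match, one missing.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "cast": ["Tom Cruise, Actor Two", "Actor Three", None],
})

# Exact equality misses cast lists that contain other names as well.
exact = sample[sample["cast"] == "Tom Cruise"]

# Plain substring search; na=False treats missing cast values as no match.
mask = sample["cast"].str.contains("Tom Cruise", na=False, regex=False)

# Summing a boolean mask counts the True values.
print(len(exact), int(mask.sum()))
```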

14. How many movies got the “TV-14” rating in Canada

We want to know how many movies got “TV-14” Ratings in Canada

data.loc[(data['type']=='Movie') & (data['rating']=='TV-14') & (data['country']=='Canada')]
Output 33. “TV-14” rating in the Canada

Only 10 movies got the “TV-14” rating in Canada.

Muhammad Tangguh Prayoga Rafiansyah