Netflix Movies & TV Shows: Exploratory Data Analysis (EDA) and Visualization
Netflix is a global streaming service that offers a wide variety of TV shows, movies, documentaries, and other content across genres. It allows subscribers to watch unlimited content on multiple devices for a monthly fee. Netflix produces original content, known as Netflix Originals, including series, films, and documentaries, which have gained popularity worldwide. The platform also provides personalized recommendations based on users’ viewing history and preferences. Overall, Netflix has revolutionized the way people consume entertainment, making it accessible anytime, anywhere.
Because Netflix streams movies and TV shows to a very large user base, it naturally accumulates vast amounts of data. In this article, I will walk through a data analysis using Python libraries and show the insights that can be drawn from such a large dataset.
We will use the Netflix Movies and TV Shows 2021 dataset, which can be downloaded from Kaggle.
Table of Contents :
- Importing Libraries
- Load the Dataset
- Data Overview and Information
- Cleaning Missing Data
- EDA (Exploratory Data Analysis), Visualization, and Conclusions
Importing Libraries
Before we start, we need to import modules first. We’ll be using Pandas, NumPy, Matplotlib, and Seaborn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the Dataset
The Netflix Movies and TV Shows 2021 dataset is sourced from Kaggle. We will analyze it and visualize the findings through plots. Telling the story this way makes it easier for readers to pick up insights than parsing the raw data file.
data = pd.read_csv('netflix_titles_2021.csv')
Data Overview and Information
data.head()
By default, .head() shows only the first 5 rows of the dataset.
#Show data shape
data.shape
Based on the output, we can see that the dataset consists of 8,807 rows and 12 columns.
#Show columns
data.columns
The dataset has 12 columns: ‘show_id’, ‘type’, ‘title’, ‘director’, ‘cast’, ‘country’, ‘date_added’, ‘release_year’, ‘rating’, ‘duration’, ‘listed_in’, and ‘description’.
#Information about dataset
data.info()
Based on the output obtained, it’s evident that most columns contain a total of 8,807 non-null values. However, there are 6 columns with varying non-null value counts. Furthermore, the analysis reveals that 11 columns are of object data type, while 1 column is represented as an integer with a 64-bit data type.
#Number of null values per column
data.isna().sum()
Check for missing data from each column in the dataset using .isna().sum()
Based on the output provided, it’s evident that 6 columns contain null values. Specifically, the ‘director’ column has 2,634 null values, ‘cast’ has 825 null values, ‘country’ has 831 null values, ‘date_added’ has 10 null values, ‘rating’ has 4 null values, and ‘duration’ has 3 null values.
Cleaning Missing Data
data = data.dropna(subset=['director'])
data = data.dropna(subset=['cast'])
data = data.dropna(subset=['country'])
data = data.dropna(subset=['duration'])
data = data.dropna(subset=['rating'])
The .dropna function deletes the rows with missing data in the specified column, one column at a time. Then we check whether any missing data is still left in those columns.
data.isna().sum()
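The same cleaning step can also be written as a single call by passing all the columns to subset at once. Below is a minimal sketch on a small made-up frame (the toy rows are an assumption for illustration, not part of the Netflix data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (all values are made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', np.nan, 'Y', 'Z'],
    'cast': ['p', 'q', np.nan, 'r'],
    'country': ['India', 'Canada', 'India', np.nan],
})

# One call drops every row that is missing any of the listed columns.
required = ['director', 'cast', 'country']
cleaned = toy.dropna(subset=required)

# Confirm that no nulls remain in those columns.
remaining_nulls = cleaned[required].isna().sum()
print(cleaned)
print(remaining_nulls)
```

On the real dataset the list would also contain ‘duration’ and ‘rating’, which should give the same rows as the five separate calls.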
EDA (Exploratory Data Analysis) and Visualization
1. The different types of shows and movies uploaded on Netflix.
For our initial analysis, we’ll explore which types of content are available on the Netflix platform. The following code counts the titles for each type.
data.groupby('type')['title'].count().sort_values(ascending = False)
We have determined that there are 5185 movies and 147 TV shows in our dataset.
data.type.value_counts()[data.type.unique()].plot(kind='bar')
From the chart above, it’s apparent that there are two types of shows or movies available on Netflix. Movies take the lead with 5185 content items on the platform.
2. Correlation between the features.
We want to know about the correlation between the features because correlation analysis is crucial in various fields such as statistics, data science, and finance. It helps in understanding patterns, making predictions, selecting features, and determining the strength of relationships in datasets. First of all, we check the data type of ‘date_added’ with .dtypes.
data.dtypes
The ‘date_added’ column is stored as object (text), so we must convert it to datetime using to_datetime, and then check just 2 rows with .head(2).
data['date_added'] = pd.to_datetime(data['date_added'], format='mixed')
data.head(2)
As you can see, format in column ‘date_added’ has changed to YYYY-MM-DD
data['year_added'] = data['date_added'].dt.year.astype('int64')
data['month_added'] = data['date_added'].dt.month.astype('int64')
data.head(2)
The code above extracts the year from the date into a new column named ‘year_added’ and the month into a new column named ‘month_added’. Don’t forget to look at the first 2 rows of the result using .head(2).
If you scroll to the right, ‘year_added’ and ‘month_added’ will appear. Having them as numeric columns makes it easy to look for correlations using .corr(numeric_only = True).
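To make the correlation step concrete, here is a minimal sketch on a made-up frame (the four rows are an assumption for illustration). It converts the date text to datetime, derives the two numeric columns, and then calls .corr(numeric_only=True), which ignores the text and datetime columns:

```python
import pandas as pd

# Toy rows in the same date style as the dataset (values are made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'date_added': ['September 25, 2021', 'January 10, 2020',
                   'March 15, 2019', 'July 14, 2018'],
    'release_year': [2020, 2019, 2015, 2010],
})

# Convert the text dates to datetime, then pull out year and month.
toy['date_added'] = pd.to_datetime(toy['date_added'])
toy['year_added'] = toy['date_added'].dt.year
toy['month_added'] = toy['date_added'].dt.month

# Pairwise Pearson correlation between the numeric columns only.
corr = toy.corr(numeric_only=True)
print(corr)
```

If we want a picture instead of a table, the resulting matrix can be passed to sns.heatmap(corr, annot=True).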
3. Most watched show on Netflix.
This section resembles the first one. The dataset has no viewing figures, so we use the number of titles per type as a stand-in when comparing movies and TV shows.
data.groupby('type')['title'].count().sort_values(ascending = False)
We have determined that there are 5185 movies and 147 TV shows in our dataset.
According to these counts, movies are the dominant content type on Netflix, accounting for 5185 titles. If we want to show the same split as a pie chart with percentages, use this code:
type_show = ['Movie', 'TV Show']
value_count = [5185, 147]
plt.pie(value_count, labels = type_show, autopct='%2.2f%%')
plt.show()
According to the pie chart, movies are the dominant content type on Netflix, accounting for 97.24% of the titles.
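Instead of typing the counts by hand, the percentages can be computed directly from value_counts(). The sketch below uses a tiny made-up frame (an assumption for illustration) and also re-checks the 97.24% figure from the two counts reported above:

```python
import pandas as pd

# Toy frame with the same 'type' column as the dataset (values are made up).
toy = pd.DataFrame({'type': ['Movie', 'Movie', 'Movie', 'TV Show']})

# Count the titles per type, then turn the counts into percentages.
counts = toy['type'].value_counts()
shares = (counts / counts.sum() * 100).round(2)
print(shares)

# Re-check the article's figure from the reported counts (5185 and 147).
movie_share = round(5185 / (5185 + 147) * 100, 2)
print(movie_share)
```

With the real data, counts and counts.index can be passed to plt.pie as the values and labels, so the chart stays correct if the cleaning step changes.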
4. Distribution of Ratings.
In this section, our objective is to analyze the distribution of ratings across each title in our dataset.
sns.countplot(x=data['rating'])
plt.xticks(rotation=90)
When comparing ratings across titles on the Netflix platform, it becomes evident that titles with the rating “TV-MA” dominate the platform, surpassing other rating categories.
5. Highest Rating TV Show or Movies.
This is an important one: we want to know which rating is the most common for TV shows and for movies on the Netflix platform.
data.groupby('type')['rating'].agg(pd.Series.mode)
Based on this output, ‘TV-MA’ is the most frequent rating for both types of content on Netflix. Keep in mind that ‘rating’ here is the maturity rating and that the mode is simply its most common value, so the result tells us which maturity category dominates the catalogue, not which titles viewers score highest.
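To see how the mode relates to the underlying counts, a cross-tabulation of type against rating is a useful companion. This is a minimal sketch on made-up rows (an assumption for illustration); the lambda keeps only the first mode so that ties still give one value per type:

```python
import pandas as pd

# Toy rows with the same 'type' and 'rating' columns (values are made up).
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show'],
    'rating': ['TV-MA', 'TV-MA', 'PG-13', 'TV-14', 'TV-14', 'TV-MA'],
})

# Full table of counts: one row per type, one column per rating.
table = pd.crosstab(toy['type'], toy['rating'])

# Most frequent rating per type (first mode if there is a tie).
most_common = toy.groupby('type')['rating'].agg(lambda s: s.mode().iat[0])
print(table)
print(most_common)
```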
6. The best Month for releasing content
The best month for releasing content depends on factors like genre, target audience, and competition. As a starting point, we count how many titles were added in each month, using the ‘month_added’ column created earlier.
data['month_added'].value_counts().to_frame('value_counts')
This shows how content additions are distributed across the months, giving insight into how launches are spread throughout the year.
data['month_added'].value_counts().plot(kind='bar')
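By default value_counts() sorts by frequency, so the bars do not appear in calendar order. A small sketch, assuming the month is stored as a number in a column named ‘month_added’ as created earlier (the toy values are made up):

```python
import pandas as pd

# Toy month numbers (values are made up).
toy = pd.DataFrame({'month_added': [12, 1, 7, 12, 7, 12]})

# Count titles per month, then sort by month number for calendar order.
by_month = toy['month_added'].value_counts().sort_index()

# The month with the most additions.
busiest = by_month.idxmax()
print(by_month)
print(busiest)
```

Calling .plot(kind='bar') on by_month then draws the months from January to December, which makes seasonal patterns easier to spot.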
7. Highest Genre Watched on Netflix
The ‘listed_in’ column holds up to three comma-separated genres per title, so we split it and create a new column for each genre.
new_genre = data['listed_in'].str.rsplit(',', n=2)
new_genre
data['Genre 1']=new_genre.str.get(0)
data['Genre 2']=new_genre.str.get(1)
data['Genre 3']=new_genre.str.get(2)
data.head(2)
If you scroll to the right, ‘Genre 1’, ‘Genre 2’, and ‘Genre 3’ will appear, which makes it easy to look up the genres of each title. Then continue with .describe() on each genre column.
data['Genre 1'].describe(include='all').to_frame()
data['Genre 2'].describe(include='all').to_frame()
data['Genre 3'].describe(include='all').to_frame()
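To see a pie plot of the genre columns, use the code below. The values are the count rows from the three .describe() outputs, that is, how many titles have a first, second, and third genre: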
Genre_type = ['Genre 1', 'Genre 2', 'Genre 3']
value_count = [5332, 4231, 2295]
plt.pie(value_count, labels = Genre_type, autopct = '%2.2f%%')
plt.title("Most Watched Genre on Netflix")
plt.show()
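The pie chart above compares how many titles have a first, second, or third genre. If we want to know which individual genre appears most often, one option is to split ‘listed_in’ into one row per genre and count. This is a minimal sketch on made-up rows (an assumption for illustration), and it counts titles, not views:

```python
import pandas as pd

# Toy rows in the same comma-separated style as 'listed_in' (values are made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'listed_in': ['Dramas, International Movies',
                  'Comedies',
                  'Dramas, Comedies, Romantic Movies'],
})

# Split on commas, give each genre its own row, and trim stray spaces.
genres = toy['listed_in'].str.split(',').explode().str.strip()

# Number of titles listed under each genre.
genre_counts = genres.value_counts()
print(genre_counts)
```

The .str.strip() call matters because splitting on a comma leaves a leading space on every genre after the first one.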
8. Released movie over the years
In this part, we look at how many titles were released in each year.
data.groupby(['release_year'])['release_year'].count().sort_values(ascending=False).to_frame()
# Set the figure size before drawing so that it applies to this plot
sns.set(rc={'figure.figsize':(40,20)})
sns.countplot(x='release_year', data=data)
plt.show()
The bar plot shows that 2017 saw the highest number of releases, totaling 657 titles.
9. Movies Made On Year Basis
Here we group the titles by the year they were added to Netflix (the ‘year_added’ column) and count how many movies and TV shows fall into each year.
data.groupby('year_added')['type'].value_counts().sort_values(ascending=False).to_frame()
data['year_added'].value_counts().plot(kind='bar')
The conclusion is that the year 2009 has the highest count, with 1236 movies.
10. Show all the movies that were released in year 2000
We want to list all the movies that were released in the year 2000.
data[(data['type']=='Movie') & (data['release_year']==2000)]
The output contains 32 rows, one for each movie in the dataset that was released in the year 2000.
11. Show only the title of all TV shows that were released in India only
In this section, we want to know the titles of all TV shows that were released in India only.
data[(data['type']=='TV Show') & (data['country']=='India')]['title'].to_frame()
12. Show top 10 director, who gave the highest number of TV shows & Movies to Netflix
This will show the top 10 directors with the highest number of TV shows and movies.
data['director'].value_counts().head(10)
data['director'].value_counts().head(10).plot(kind='bar')
The top entry is Raúl Campos and Jan Suter, who are credited together. A ‘director’ cell can list several names, and value_counts() treats each combined string as one entry.
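Because a single ‘director’ cell can hold several comma-separated names, counting the raw strings ranks directing teams, not individual people. A minimal sketch of counting each person separately, using placeholder names on a made-up frame (both are assumptions for illustration):

```python
import pandas as pd

# Toy 'director' column with placeholder names (values are made up).
toy = pd.DataFrame({'director': [
    'Director A, Director B',
    'Director B',
    'Director C',
    'Director A, Director B',
]})

# Counting the raw strings treats each team as one entry.
team_counts = toy['director'].value_counts()

# Splitting first counts every person on their own.
people = toy['director'].str.split(',').explode().str.strip()
person_counts = people.value_counts()
print(team_counts.head(10))
print(person_counts.head(10))
```

Which of the two rankings is more useful depends on the question: the first one answers "which credit line appears most often", the second "which person appears most often".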
13. In how many movies/TV shows was ‘Tom Cruise’ cast
We want to know in how many movies or TV shows Tom Cruise was cast.
data[data['cast']=='Tom Cruise']
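If there’s no result, it is because ‘cast’ holds a comma-separated list of names, so an exact match fails. Use this code instead:
# Assign the result back, otherwise the rows with missing values are not actually removed
data = data.dropna()
data.head(2)
data[data['cast'].str.contains('Tom Cruise')]
There are only 2 movies with Tom Cruise in the cast.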
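The difference between the exact match and the substring match is easy to see on a small example. The sketch below uses placeholder names on a made-up frame (both are assumptions for illustration) and passes na=False so that rows with a missing cast count as "no match" without having to drop them first:

```python
import numpy as np
import pandas as pd

# Toy 'cast' column with placeholder names and one missing value (made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'cast': ['Actor One, Actor Two', np.nan, 'Actor Two', 'Actor Three'],
})

# Exact match: only rows where the whole cell equals the name.
exact = toy[toy['cast'] == 'Actor Two']

# Substring match: any row whose cast list mentions the name.
# na=False turns missing casts into False; regex=False does a literal search.
partial = toy[toy['cast'].str.contains('Actor Two', na=False, regex=False)]
print(len(exact), len(partial))
```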
14. How many movies got the “TV-14” rating in Canada
We want to know how many movies got the “TV-14” rating in Canada.
data.loc[(data['type']=='Movie') & (data['rating']=='TV-14') & (data['country']=='Canada')]
Only 10 movies got the “TV-14” rating in Canada.