Welcome to another Core assignment! Some students like to explore the assignments before they're finished reading through the lessons, and that's okay! It can be good for your brain to have a preview of what your future challenges might be. However, before you begin this assignment, it's important that you've first:
Completed the preceding lesson modules
Taken the knowledge checks to confirm your understanding
Viewed lecture material related to the assignment topics
Completed and submitted your practice assignments
Learning Objectives:
Complete the assignment and submit it per the instructions below
Business Problem
For this project, you have been hired to produce a MySQL database of movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and will provide recommendations to the stakeholder on how to make a successful movie.
Over the course of this project, you will:
Part 1: Download several files from IMDB’s movie dataset and filter out the subset of movies requested by the stakeholder.
Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.
Part 3: Construct and export a MySQL database using your data.
Part 4: Apply hypothesis testing to explore what makes a movie successful.
Part 5 (Optional): Produce a Linear Regression model to predict movie performance.
For Part 1 of the project, you will be creating your project repository, loading the official IMDB data for the requested tables, filtering out unnecessary data, and saving the filtered tables as gzip-compressed csv files (".csv.gz") in your repository.
Getting Started Tips:
Please make sure to read the following lesson "Getting Started - Project 3" for additional tips and directions!
The Data
IMDB provides several files with varied information about movies, TV shows, made-for-TV movies, etc.
Overview/Data Dictionary: https://www.imdb.com/interfaces/
Downloads page: https://datasets.imdbws.com/
Based on their previous research, your stakeholder wants to focus on the following files:
title.basics.tsv.gz
title.ratings.tsv.gz
title.akas.tsv.gz
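As a starting point, the three files can be read straight into pandas. The sketch below is one possible approach, not a required one. It assumes each file lives directly under the downloads page URL listed above, and it relies on the IMDB data dictionary's convention of writing missing values as the literal text \N. The helper name load_imdb_table is made up for illustration.

```python
import pandas as pd

# Assumed download locations: the downloads page URL plus each file name.
BASICS_URL = "https://datasets.imdbws.com/title.basics.tsv.gz"
RATINGS_URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"
AKAS_URL = "https://datasets.imdbws.com/title.akas.tsv.gz"


def load_imdb_table(path_or_buffer):
    """Read one tab-separated IMDB file into a dataframe.

    IMDB marks missing values with the two characters \\N, so they are
    converted to NaN while reading. Compression is inferred from the
    ".gz" extension when a path or URL is given.
    """
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        na_values="\\N",
        low_memory=False,
    )
```

A call such as basics = load_imdb_table(BASICS_URL) downloads and parses the file in one step. The files are large, so expect the first load to take a while.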
Specifications
Your stakeholder only wants you to include information for movies based on the following specifications:
Exclude any movie with missing values for genre or runtime
Include only full-length movies (titleType = "movie").
Include only fictional movies (exclude any movie whose genres include Documentary)
Include only movies that were released from 2000 through 2021 (inclusive)
Include only movies that were released in the United States
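The specifications above can be sketched as a single filtering function. This is a minimal sketch under stated assumptions: column names (tconst, titleType, startYear, runtimeMinutes, genres, titleId, region) follow the IMDB data dictionary, missing values have already been converted to NaN, and "released in the United States" is interpreted as having a row in title.akas with region "US". Trimming the ratings table to the surviving movies is also an assumption about what the stakeholder wants. The function name filter_movies is made up for illustration.

```python
import pandas as pd


def filter_movies(basics, akas, ratings):
    """Apply the stakeholder's specifications and return the filtered tables."""
    # Keep only the akas rows for United States releases.
    akas_us = akas[akas["region"] == "US"]

    # Exclude movies with missing genre or runtime.
    basics = basics.dropna(subset=["runtimeMinutes", "genres"])

    # Include only full-length movies.
    basics = basics[basics["titleType"] == "movie"]

    # Include only 2000 through 2021; between() is inclusive on both ends,
    # and unparseable or missing years become NaN and are dropped.
    years = pd.to_numeric(basics["startYear"], errors="coerce")
    basics = basics[years.between(2000, 2021)]

    # Exclude anything tagged as a documentary.
    is_documentary = basics["genres"].str.contains("documentary", case=False)
    basics = basics[~is_documentary]

    # Include only movies that have a US entry in akas.
    basics = basics[basics["tconst"].isin(akas_us["titleId"])]

    # Assumption: ratings should cover only the movies that remain.
    ratings = ratings[ratings["tconst"].isin(basics["tconst"])]

    return basics, akas_us, ratings
```

Filtering akas first keeps the later isin check small, which helps because title.akas is the largest of the three files.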
Deliverable
After filtering out movies that do not meet the stakeholder's specifications:
Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature
Save each dataframe as a compressed csv file (".csv.gz") in a "Data/" folder inside your repository.
Commit your changes to your repository in GitHub Desktop and Publish repository / Push Changes.
Submit the link to your repository
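The .info() and save steps of the deliverable might look like the following. This is only a sketch: the helper name save_compressed and the output file names are assumptions, while the "Data" folder and the ".csv.gz" extension come from the instructions above.

```python
import os

import pandas as pd


def save_compressed(df, name, folder="Data"):
    """Print a final summary of df, then save it as <folder>/<name>.csv.gz."""
    # Create the folder if it does not exist yet.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{name}.csv.gz")

    # Final summary: row count and datatype of each feature.
    df.info()

    # Write a gzip-compressed csv without the dataframe index.
    df.to_csv(path, compression="gzip", index=False)
    return path
```

For example, save_compressed(basics, "title_basics") would write Data/title_basics.csv.gz. Reading the file back with pd.read_csv(path) is a quick way to confirm the save worked before you commit.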