Stack 4 - Week 1 - Project 3 - Part 1 - Business Problem (IMDB)

Welcome to another Core assignment! Some students like to explore the assignments before they've finished reading through the lessons, and that's okay! It can be good for your brain to have a preview of what your future challenges might be. However, before you begin this assignment, it's important that you have first:

  • Completed the preceding lesson modules

  • Taken the knowledge checks to confirm your understanding

  • Viewed lecture material related to the assignment topics

  • Completed and submitted your practice assignments

Learning Objectives:

  • Complete the assignment and submit it per the instructions below

Business Problem

For this project, you have been hired to build a MySQL database of movies from a subset of IMDB's publicly available dataset. Ultimately, you will use this database to analyze what makes a movie successful and provide recommendations to the stakeholder on how to make a successful movie.

Over the course of this project, you will:

  • Part 1: Download several files from IMDB’s movie data set and filter out the subset of movies requested by the stakeholder.

  • Part 2: Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.

  • Part 3: Construct and export a MySQL database using your data.

  • Part 4: Apply hypothesis testing to explore what makes a movie successful.

  • Part 5 (Optional): Produce a Linear Regression model to predict movie performance.

For Part 1 of the project, you will create your project repository, load the official IMDB data for the requested tables, filter out unnecessary data, and save the filtered tables as gzip-compressed CSV files (".csv.gz") in your repository.

Getting Started Tips:

The Data

  • IMDB provides several files with varied information on movies, TV shows, made-for-TV movies, etc.

  • Based on previous research, the stakeholder wants to focus on the following files:

    • title.basics.tsv.gz

    • title.ratings.tsv.gz

    • title.akas.tsv.gz
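As a starting point, loading one of these files with pandas might look like the sketch below. It assumes the files are tab-separated, that missing values are written as `\N`, and that the public dumps are hosted at the `BASE_URL` shown; confirm all three against the data source before relying on them. The helper name `load_imdb_table` is made up for this example.

```python
import pandas as pd

# Assumed location of the public IMDB dumps; verify before use.
BASE_URL = "https://datasets.imdbws.com/"
FILES = ["title.basics.tsv.gz", "title.ratings.tsv.gz", "title.akas.tsv.gz"]


def load_imdb_table(path_or_url):
    """Read one tab-separated, gzip-compressed file into a DataFrame.

    pandas infers gzip compression from the ".gz" extension.
    """
    return pd.read_csv(
        path_or_url,
        sep="\t",
        na_values="\\N",   # assumption: missing values are stored as \N
        low_memory=False,  # read in one pass to avoid mixed-dtype warnings
    )
```

For example, `basics = load_imdb_table(BASE_URL + FILES[0])` would load the basics table. The files are large, so saving a local copy after the first download can save time.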

Specifications

Your stakeholder only wants you to include information for movies based on the following specifications:

  • Exclude any movie with missing values for genre or runtime

  • Include only full-length movies (titleType = "movie").

  • Include only fictional movies (exclude any movie whose genres include Documentary)

  • Include only movies that were released from 2000 through 2021, inclusive

  • Include only movies that were released in the United States
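One possible way to apply the specifications above is sketched here. Apart from `titleType`, the column names (`tconst`, `genres`, `runtimeMinutes`, `startYear`, `titleId`, `region`) are assumptions about the IMDB schema, and treating a "US" region in the akas table as "released in the United States" is also an assumption. Check both against your loaded data. The function name `filter_movies` is made up for this example.

```python
import pandas as pd


def filter_movies(basics, akas):
    """Return (filtered basics, US-only akas) per the stakeholder's specs."""
    # Assumption: a US release is a row in akas with region == "US".
    akas_us = akas[akas["region"] == "US"]

    # Exclude movies missing genre or runtime.
    movies = basics.dropna(subset=["genres", "runtimeMinutes"])

    # Keep only full-length movies.
    movies = movies[movies["titleType"] == "movie"]

    # Exclude documentaries (genres is assumed to be a comma-separated string).
    is_documentary = movies["genres"].str.contains("Documentary", case=False)
    movies = movies[~is_documentary]

    # Keep 2000 through 2021; between() is inclusive on both ends.
    years = pd.to_numeric(movies["startYear"], errors="coerce")
    movies = movies[years.between(2000, 2021)]

    # Keep only movies that have a US entry in akas.
    movies = movies[movies["tconst"].isin(akas_us["titleId"])]

    return movies.reset_index(drop=True), akas_us.reset_index(drop=True)
```

The ratings table could be reduced the same way, keeping only rows whose `tconst` appears in the filtered movies.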

Deliverable

After filtering out movies that do not meet the stakeholder's specifications:

  • Before saving, run a final .info() for each of the dataframes to show a summary of how many movies remain and the datatypes of each feature

  • Save each dataframe as a gzip-compressed CSV file (".csv.gz") in a "Data/" folder inside your repository.

  • Commit your changes to your repository in GitHub Desktop, then Publish repository / Push Changes.

  • Submit the link to your repository
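The summary and save steps can be combined in a small helper like the sketch below. The function name `summarize_and_save` and the example file name are illustrative only; the "Data" folder default follows the deliverable above.

```python
import os

import pandas as pd


def summarize_and_save(df, filename, folder="Data"):
    """Print df.info() and write df to a gzip-compressed CSV in `folder`."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    df.info()                           # row count and dtype of each column
    path = os.path.join(folder, filename)
    df.to_csv(path, compression="gzip", index=False)
    return path
```

A call such as `summarize_and_save(basics, "title_basics.csv.gz")` would write the file to `Data/title_basics.csv.gz`. Reading it back with `pd.read_csv(path)` is a quick check that the file saved correctly.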
