Stack 4 - Week 1 - Project 3 - Part 1 - Getting Started (Continued)

Getting Started - Project 3 (Continued)

Create Your Repository using GitHub Desktop

Make sure you have created your repository using GitHub Desktop (refer to the practice assignment for instructions if you are unsure).

Downloading the Files

  1. Navigate to the IMDB data downloads page: https://datasets.imdbws.com/

    • These URLs point to the files themselves, and pandas can read them directly.

  2. Find the 3 files needed (title basics, title akas, and title ratings). Right-click on each one, select "Copy Link Address", and paste the link into your notebook.

  • Example:

basics_url="https://datasets.imdbws.com/title.basics.tsv.gz"
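
The akas and ratings URLs can be copied the same way; for reference, they currently look like this:

akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url = "https://datasets.imdbws.com/title.ratings.tsv.gz"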

Loading TSVs with Pandas

From the data dictionary page: "Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set."

  • What this means is that the files are actually .tsv.gz files (gzip-compressed and tab-separated) rather than .csv files.

  • We can still load them into dataframes using pd.read_csv(url)

    • However, we will add sep='\t' to indicate they are tab-separated instead of comma-separated.

    • Additionally, these files can be rather large and you may see a Pandas warning about mixed datatypes. The warning message will suggest adding low_memory=False to read_csv to prevent this message.

    • Example:

    basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
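
    • You can load the other two files the same way (a sketch, assuming you saved the akas and ratings URLs as akas_url and ratings_url):

    akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
    ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)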

Note that there is a lot of data, and this may take some time!

Troubleshooting: How to Handle "Out of Memory" Errors

If you are unable to load all of the data at once on your computer due to an Out of Memory error, there are several fixes. The simplest is to close any programs running in the background that you do not need. This may free up enough memory to process the data.

If you are still getting this error after closing applications, check out "(Optional) Handling Large Files" for how to load files in small chunks with pandas; a preview of that idea is sketched below.
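
One way to do this is with the chunksize parameter of pd.read_csv, which reads the file in pieces that you then concatenate into a single dataframe; a minimal sketch (the chunk size here is arbitrary):

import pandas as pd
# Read the file 100,000 rows at a time, then combine the pieces into one dataframe
chunks = pd.read_csv(basics_url, sep='\t', chunksize=100_000)
basics = pd.concat(chunks, ignore_index=True)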

Handling \N Placeholder Values

According to the data dictionary, null values have been encoded as \N.

  • You will want to find those and replace them with np.nan.

  • However, the backslash (\) is a special "escape" character that changes how the computer interprets the character that comes next.

    • So if we were to say df.replace({'\N':np.nan}), the computer would try to interpret \N as an escape sequence rather than as the two literal characters we want.

    • To fix this, add a second backslash character, which will tell the computer that you actually WANTED to use a literal \.

    • df.replace({'\\N':np.nan})

    • Don't forget to make these replacements permanent!
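
    • Example (a minimal sketch; shown for the basics dataframe, repeat for each table):

    import numpy as np
    # Replace the \N placeholders with real missing values and reassign to keep the change
    basics = basics.replace({'\\N': np.nan})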

Required Preprocessing - Details

  • Filtering/Cleaning Steps (a code sketch of the Title Basics steps follows this list):

    • Title Basics:

      • Replace "\N" with np.nan

      • Eliminate movies that are null for runtimeMinutes

      • Eliminate movies that are null for genre

      • keep only titleType==Movie

      • keep startYear 2000-2022

      • Eliminate movies that include "Documentary" in genre (see tip below)

      • Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)

    • AKAs:

      • Keep only US movies.

      • Replace "\N" with np.nan

    • Ratings:

      • Replace "\N" with np.nan (if any)

      • Keep only US movies (Use AKAs table, see "Filtering one dataframe based on another" section below)
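
  • Example (a rough sketch of the Title Basics steps above, minus the US filter, which is covered below; it assumes "\N" has already been replaced with np.nan):

    # Drop movies missing runtimeMinutes or genres
    basics = basics.dropna(subset=['runtimeMinutes', 'genres'])
    # Keep only full-length movies (check the exact label with basics['titleType'].unique())
    basics = basics.loc[basics['titleType'] == 'movie'].copy()
    # Keep only movies released 2000-2022 (startYear is loaded as text)
    basics['startYear'] = basics['startYear'].astype(float)
    basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2022)]
    # Documentaries are excluded in the tip below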

Tip: Excluding Documentaries

  • To filter out documentaries, you will need to check whether the movie's value in the genres column contains the word "documentary" (instead of checking =='documentary').

    • You will also want to use the ~ operator to invert the resulting Trues/Falses.

  • Example:

    # Exclude movies that are included in the documentary category.
    is_documentary = df['genres'].str.contains('documentary',case=False)
    df = df[~is_documentary]

Filtering one dataframe based on another

Next, you will filter the basics df to only include the movies that are present in your filtered akas dataframe. This is how you will ultimately restrict the movies to those released in the US region.

Here is how you can achieve this:

# Filter the basics table down to only include the US by using the filter akas dataframe
keepers = basics['tconst'].isin(akas['titleId'])
keepers

Now filter basics

basics = basics[keepers]
basics
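
The same approach applies to the ratings table (a sketch, assuming your ratings dataframe is named ratings):

# Filter ratings down to only the titles kept in the US-filtered akas dataframe
keepers_ratings = ratings['tconst'].isin(akas['titleId'])
ratings = ratings[keepers_ratings]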

Saving the Files in Your Repository

  • After downloading all of the files (this can take a couple of minutes), it is recommended that you save the DataFrames as .csv.gz files on your hard drive after previewing them and verifying they downloaded correctly.

  • This will allow you to skip over downloading the files again and continue your work without waiting.

Creating a "Data" folder.

  • Create a folder called "Data" and save each DataFrame as a separate file in the Data folder.

  • You can do this in several different ways.

    • A programmatic (and recommended) way is to use the os.makedirs function from Python's os module.

      • We provide the name of the folder we want to create and specify that it's OK if the folder already exists.

    • We can then run os.listdir('Data/') to confirm that the folder has been created. (It will error if the folder does not exist).

# example making new folder with os
import os
os.makedirs('Data/', exist_ok=True)
# Confirm folder created
os.listdir("Data/")

If the output is just empty brackets [], it worked: the folder exists (and is empty so far). If the folder had not been created, you would get an error message instead.

A less programmatic (but still acceptable) approach would be to use Jupyter Notebook or File Explorer (Windows)/Finder (Mac) to create the new folder.

Saving Compressed .csv.gz Files

  • After displaying a preview of your DataFrame and confirming that the data downloaded correctly, save each dataframe to a new file in the "Data" folder.

    • Use df.to_csv, but make sure your filename ends in .csv.gz and add compression="gzip".

    • You will want to add "index=False" to avoid creating an "Unnamed: 0" column.

## Save current dataframe to file.
basics.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)
  • Finally, replace your original dataframe variable by reading in the file you just saved and confirming that it is correct.

# Open saved file and preview again
basics = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
basics.head()
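
You can repeat the same save-and-reload pattern for your akas and ratings DataFrames (a sketch; the file names here are only suggestions):

akas.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)
ratings.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)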

Note: now that you've confirmed you can load your dataframe from your local file, you can comment out the earlier code where you downloaded the file from the URL.

If you go to your Jupyter homepage, you will see the files you just saved in your new Data folder.

Once you've saved your .csv.gz files, it is recommended to save your notebook, hop back to GitHub Desktop, and commit your repo with the added files.

  • NOTE: If you apply all of the requested filtering conditions to your movies and save them as gzip-compressed CSVs, your files should not be over the 100 MB limit!
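
  • If you want to double-check, here is a quick sketch that prints the size of each file saved in your Data folder:

    import os
    # Print each saved file's size in megabytes
    for fname in os.listdir('Data/'):
        size_mb = os.path.getsize(os.path.join('Data', fname)) / 1_000_000
        print(fname, round(size_mb, 1), 'MB')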

Dealing with Files >100 MB

GitHub has a single-file-size limit of 100 MB. This means that attempting to commit any file larger than 100 MB will result in a fatal error when attempting to push your work to GitHub.

  • GitHub Desktop will warn you with a pop-up if you are about to commit a file that is over this limit.


  • However, should you run into the following warning from GitHub Desktop about your data files being too big, follow the instructions below:

    • DO NOT CLICK COMMIT ANYWAY!!!

    • Click Cancel instead. We are going to tell Git to ignore those large files. This means that it will not try to include them in your commit, but it will let you keep them in the folder without complaining.

  • For EACH file that was too large:

    • Right click on it from the sidebar of GitHub Desktop.

    • Select "Ignore file"

  • You will see the large files' names disappear from the Changes sidebar. Instead, you will see that a file called .gitignore has appeared. This file is how Git knows which files to ignore.


Full PDF Assignment