What is ETL?
Learning Objectives:
By the end of this module, you should be able to:
Define the Extract, Transform, and Load steps of the ETL process
Explain the purpose and basic principles of ETL
What Is ETL?
ETL is a process that Extracts, Transforms, and Loads data from multiple sources to a data warehouse or other unified data repository.
How Is It Used?
Through a series of business rules, ETL cleanses and organizes data to address specific business intelligence needs, like monthly reporting, but it can also tackle more advanced analytics that improve back-end processes or end-user experiences. An organization often uses ETL to extract data from legacy systems, cleanse the data to improve quality and establish consistency, and load the data into a target database.
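The three steps can be sketched in a few lines of Python, with pandas and an in-memory SQLite database standing in for the target warehouse (the source records, table name, and cleaning rules below are invented for illustration):

```python
import sqlite3
import pandas as pd

# EXTRACT: pull raw records from a source system
# (a list of dicts stands in for a legacy database or API here)
raw = [{"name": " Alice ", "sales": "100"},
       {"name": "Bob",     "sales": "250"}]
df = pd.DataFrame(raw)

# TRANSFORM: cleanse the data to improve quality and establish consistency
df["name"] = df["name"].str.strip()     # remove stray whitespace
df["sales"] = df["sales"].astype(int)   # enforce a numeric type

# LOAD: write the cleaned data into a target database
conn = sqlite3.connect(":memory:")      # an in-memory stand-in for a warehouse
df.to_sql("monthly_sales", conn, index=False)
loaded = pd.read_sql("SELECT * FROM monthly_sales", conn)
```

The same extract/transform/load shape holds whether the source is a CSV, a legacy system, or an API, and whether the target is SQLite or a full data warehouse.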
Intro to APIs
API (Application Programming Interface)
APIs define a set of rules through which you can interact with a particular application. For data scientists, APIs allow us to access data that programmers have made available to outside users. For example, let's say we have a huge database of Pokémon, and developers have created an API to access this database to build cool Pokémon applications like a Pokédex. They create a series of URLs that map to important information in a predictable manner. As data scientists, we can access the Pokémon data, but we will need to read the API's documentation to learn how to properly interact with the URLs and get predictable results. As an example, you can explore the PokéAPI and its documentation here.
A few things to know before making your first API call:
1) You will often need an API key for access. We will explore this along with how to safely save your credentials in a future lesson.
2) The results of your API call are commonly returned in JSON format. We will explore JSON in this module.
3) Every API is different. The more APIs you interact with, the more you will learn to read and understand the documentation to get the results you are looking for!
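As a first sketch of what an API call looks like in Python, here is how you might query the PokéAPI with the popular requests library (this particular API happens to require no key; the URL pattern comes from its documentation):

```python
import requests

BASE_URL = "https://pokeapi.co/api/v2"

def get_pokemon(name):
    """Request one Pokémon's data from the PokéAPI and return the parsed JSON."""
    response = requests.get(f"{BASE_URL}/pokemon/{name}")
    response.raise_for_status()  # stop with an error if the call failed
    return response.json()       # results are returned in JSON format

# Example usage (requires an internet connection):
# pikachu = get_pokemon("pikachu")
# print(pikachu["name"], pikachu["height"])
```

Notice that the "rules" are just a base URL plus a predictable path; everything else comes from reading the documentation.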
Every API is Different
Even though every API is designed with a set of governing principles in mind, every API you interact with will be different. It is up to the developer who wrote the program to specify the protocols needed to interact with that API, and it is up to you to study that documentation. Some APIs will let you access information as long as you ask for it in the URL. Some APIs will give you better information if you submit more data through a form. Some APIs will only allow you to request a certain URL a limited number of times per day. Some APIs will require you to sign up on their website to be provided an authenticity token to pass with every API request you make.
APIs are usually web URLs where you can pass information either in the URL or by posting a form to the URL and the server returns JSON (or other similar data formats). In essence, if a URL takes some data and outputs JSON (or other similar formats), it is an API. Take a quick look at some of the APIs listed below.
Open Weather Map: http://openweathermap.org/api
GitHub API: https://developer.github.com/v3/
Google Maps Directions API: https://developers.google.com/maps/documentation/directions/
Twitter API: https://dev.twitter.com/rest/public
Flickr API: https://www.flickr.com/services/api/
Although you have to interact with each API differently, notice how they are all collections of URLs that you can interact with as long as you follow the protocols that the developer has specified in the documentation.
How a RESTful API Communicates
REST stands for Representational State Transfer and refers to an architectural style that provides standards between applications, making it easier for separate systems to communicate with each other as both systems change over time. The basic idea is to separate the concerns of the client (the requesting application) and the server (the application receiving the request) so that they can communicate without either having to know the state of the other. This separation is achieved through a series of constraints on how communication can happen: the API exchanges representations of each application's state rather than the actual states themselves.
API Keys
Learning Objectives:
Recognize the purpose of an API key
Obtain an API key from the Yelp API
An API key is a code used to identify and authenticate users of an API. Not all APIs require a key, but most API creators use the key to track how often you access their information, when you access it, and which information you are accessing.
Some APIs even place a limit on how many calls you are allowed to make per day. APIs are all about information. When the information is free, it is called an "open" or "public" API. If there are charges to obtain the information, it is often called a "premium" API.
Getting an API Key
In order to get an API key, you will need to create a developer account with the API's provider (Twitter, Yelp, Google Maps, etc.). You will then need to create an "app". In the context of APIs, an "app" is a sub-account that will be assigned credentials to access the API. You will often need to provide a URL for your app. Generally, you can use any URL for this; a recommended option is the URL of a repository in which you will be practicing using the API.
Let's use the Yelp Fusion API as an example. We will use this extensively throughout the week!
Saving and Using API Credentials
Learning Objectives:
By the end of this lesson, students will be able to
Store API credentials in JSON files locally.
Use APIs within their notebooks without exposing their credentials.
Saving Your Yelp Fusion API Credentials
After following the directions linked in the previous lesson, you should now see the credentials for your new App, as in the image below. If you haven't gotten to this step, please go back to the "API Keys" lesson before proceeding! We will walk through safely storing and using these credentials without revealing or exposing your access key.
Saving Your API Credentials Privately
Since our API keys are attached to our individual accounts and we are allotted a maximum number of API calls per day/month, it is very important not to display our API credentials publicly, especially in our project repositories.
To prevent this, we are going to create a new hidden folder in your User folder called .secret/.
Open your terminal/gitbash and navigate to your user folder.
Mac Users:
Use this keyboard shortcut from inside Finder: Cmd + Shift + .
Use the same shortcut again to re-hide hidden files.
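Once the .secret/ folder exists, one common pattern is to store each key as a small JSON file inside it. The filename and key name below are just examples (paste your real key in place of the placeholder, and never commit this file to a repository):

```python
import json
from pathlib import Path

# Build the path to a hidden credentials file in your user folder
secret_dir = Path.home() / ".secret"
secret_dir.mkdir(exist_ok=True)          # create the folder if it doesn't exist
creds_path = secret_dir / "yelp_api.json"

# Save the key (replace the placeholder with your real key)
with open(creds_path, "w") as f:
    json.dump({"api-key": "PASTE-YOUR-KEY-HERE"}, f)

# Later, in a notebook, load it back without ever displaying the key
with open(creds_path) as f:
    creds = json.load(f)
```

Because the file lives outside your project repository, your notebooks can load the key without the key itself ever appearing in your code or on GitHub.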
Working with JSON Files
Learning Objectives:
Understand how to use the "open" function in Python
Understand the "with" statement in Python.
Use the json module to load and save JSON files.
What is a JSON file?
JSON files are text-based files structured as a dictionary or a list of dictionaries (also called "records"). JSON is used to store and transport data between computers or programs. The results of most API calls will be returned in JSON format. JSON stands for "JavaScript Object Notation." It is plain text written in JavaScript object notation and is considered "language independent," meaning code to read and write JSON exists in most programming languages.
We will be using Python's json module to make it easier to load and save data to and from JSON files in the correct format. If you were to open and read the contents of a JSON file without using the json module, you would wind up with one large string.
We will learn more about how to interpret JSON files when we obtain our first API results, but first we are going to take some time to ensure you know how to read and write files in Python using "with open". We will start with a simple text file, and then we will look more specifically at JSON files.
(Note that "JSON" as a file format is capitalized, but the file extension (.json) and the Python module "json" is not.)
Before working with .json files, we will go through an example of working with a .txt file.
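The same "with open" pattern handles both file types; only the json module calls change. A minimal sketch (filenames here are examples):

```python
import json

# Writing and reading a plain .txt file with "with open"
with open("example.txt", "w") as f:
    f.write("Hello, files!")

with open("example.txt") as f:   # mode defaults to "r" (read)
    text = f.read()              # the whole file comes back as one string

# Writing and reading a .json file: same pattern, plus the json module
record = {"name": "Ada", "languages": ["Python", "SQL"]}
with open("example.json", "w") as f:
    json.dump(record, f)         # serialize the dictionary to JSON text

with open("example.json") as f:
    loaded = json.load(f)        # parse the JSON back into a dictionary
```

The "with" statement closes the file automatically, even if an error occurs partway through, which is why it is the safe way to open files.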
Summary
Working with JSON files is a required skill for data scientists and programmers. JSON is a file format designed for communication between computers. This lesson covered some fundamentals of opening files safely in Python using "with open". The json module can be combined with "with open" to easily work with JSON files in Python. We will learn more about how to interpret the format in a future lesson.
Traversing JSON with Python
Learning Objectives
Explore the contents of a JSON file using Python
Recognize when JSON data is stored in "records" format.
Load and save JSON with pandas read_json/to_json.
Exploring JSON Files
As a refresher, JSON files are text files that contain dictionaries and/or lists.
We can use the json module's load function to store the file contents in a variable.
Since JSON data can be lists or key/value pairs, it can be tricky to get used to exploring their contents.
We will discuss some helpful tips/tactics when exploring new JSON data.
Questions to Ask
What is the very top-level of the JSON data? Is it a list or a dictionary?
If it's a dictionary, what are the keys?
If it's a list, how long is it?
What does the first entry look like?
Repeat these questions for each level of the dictionary.
If the next level is a dictionary, what are its keys?
If the next level is a list, how long is the list?
What are the items in the list? Integers? Dictionaries?
If they're dictionaries, do they seem to have the same keys for each dict in the list?
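The questions above translate directly into a few lines of Python. The sample data below is invented to show the pattern:

```python
import json

# A small invented JSON string to practice on
data = json.loads('{"total": 2, "businesses": [{"name": "A"}, {"name": "B"}]}')

# 1) What is the very top level? A list or a dictionary?
print(type(data))    # <class 'dict'>
print(data.keys())   # dict_keys(['total', 'businesses'])

# 2) Drill into the next level: is it a list? How long?
businesses = data["businesses"]
print(type(businesses), len(businesses))   # <class 'list'> 2

# 3) What does the first entry look like?
first = businesses[0]
print(type(first), first.keys())   # a dict with the key 'name'
```

Repeating type(), .keys(), and len() one level at a time is usually all it takes to map out unfamiliar JSON.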
Reviewing the JSON Schema
We have a dictionary of Yelp query results.
The top level contains:
The Total # of businesses that matched the query.
The region that was searched.
A list of records for the businesses that matched the criteria.
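Because the "businesses" entry is a list of records, it drops straight into a DataFrame. The miniature results dictionary below mimics the schema described above (the business names and values are invented):

```python
import pandas as pd

# A miniature stand-in for the Yelp query results described above
results = {
    "total": 2,
    "region": {"center": {"latitude": 40.7, "longitude": -74.0}},
    "businesses": [
        {"name": "Slice House", "rating": 4.5},
        {"name": "Pie Corner",  "rating": 4.0},
    ],
}

# A list of records converts directly into a DataFrame: one row per business
df = pd.DataFrame(results["businesses"])
```

This is why recognizing "records" format matters: once you find the list of dictionaries inside the JSON, pandas does the rest.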
Yelp API Package
Lesson Objectives:
Describe how API calls work.
Define what an "endpoint" means.
Make API calls using the YelpAPI package
Navigate "pagination" in API results using "offset".
Summary
Making API calls requires our API Key and knowledge of the endpoints we want to use.
Reading the documentation is particularly important for using APIs.
The output of your API call may be broken into bite-sized chunks known as pages ("pagination"). You will need to use the "offset" parameter to access all of the relevant information.
The results can then be concatenated into a single data frame.
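The pagination arithmetic can be sketched before touching the API: from the total number of results and the per-call limit, compute the offset for each page. (The actual call with the YelpAPI package is shown only as a comment, since it requires a real API key; the total and limit values below are made up.)

```python
import math

total = 95   # e.g., results["total"] from a first exploratory API call
limit = 20   # maximum results the endpoint returns per call

n_pages = math.ceil(total / limit)             # round UP so no results are lost
offsets = [page * limit for page in range(n_pages)]
print(offsets)  # [0, 20, 40, 60, 80]

# For each offset, repeat the same query, then concatenate the pages, e.g.:
# results = yelp_api.search_query(location="NY, NY", term="pizza", offset=offset)
# (search_query is from the YelpAPI package and needs your API key)
```

Each call returns one page; looping over the offsets and concatenating the pages yields the single data frame mentioned above.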
The Structure of an API Call
Yelp Fusion API Documentation: https://www.yelp.com/developers/documentation/v3/get_started
API Calls are URLs that are constructed to request information from a specific branch of an API.
An "Endpoint" is a branch of an API that is designed to return a specific type of data.
In the table below (from the Yelp Fusion API Link above), we can see that there are 7 different Endpoints.
Take note of the URL in the first sentence below the "Endpoints" header: "https://api.yelp.com/v3"
Also, take note that each endpoint in the table has a Path that starts with a /.
Constructing an API Call
To construct an API call, we must combine the base URL with the Path for the specific endpoint we want to use, as well as any parameters required for that endpoint.
Click on the blue text for the Path of the endpoint you are using to get detailed information on what parameters the endpoint accepts.
Example: Using the businesses search endpoint
We ultimately want to retrieve information from restaurants in NY, NY where "Pizza" is included as a search term.
Endpoint Documentation: https://www.yelp.com/developers/documentation/v3/business_search
We can see the URL for GETTING results from the API.
Note: as data scientists, we will be focusing on Extracting data from an API, and will therefore be focusing on GET methods.
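Combining the base URL, the /businesses/search path, and the documented parameters gives the full call. A sketch with the standard library (the Authorization header is how Yelp's documentation says the key should be sent):

```python
from urllib.parse import urlencode

BASE_URL = "https://api.yelp.com/v3"
path = "/businesses/search"

# Parameters this endpoint accepts (see its documentation page)
params = {"term": "pizza", "location": "NY, NY"}

url = BASE_URL + path + "?" + urlencode(params)
print(url)
# https://api.yelp.com/v3/businesses/search?term=pizza&location=NY%2C+NY

# A GET request to this URL must also carry your API key, e.g. with requests:
# requests.get(url, headers={"Authorization": f"Bearer {api_key}"})
```

Note how urlencode handles the characters that are not allowed in URLs (the comma and space in "NY, NY") for you.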
Concepts of Efficient API Extraction
Learning Objectives:
Check your API usage and be aware of limits
Describe scenarios that necessitate the need for efficient API calls with safeguards
Consider conceptual solutions to these issues
Introduction
In a previous assignment, you gained experience with API calls and practiced the concept of pagination. We limited you to calling two pages just to introduce how the API calls work, what data is returned, and how you must use the offset to continue to build your results a page at a time. You can quickly see that having to call each page separately could be quite cumbersome! This lesson will consider API limits and some problems you could encounter with long API calls, approaching the solutions from a conceptual perspective. The next lesson will address these issues with code!
Summary
This lesson presented several potential problems when making multiple API calls. First, you should be aware of the limits that may be placed on an open API, and be especially aware of what the costs are if using a premium API! Once we understand what is going on, and how pagination works, we want to be able to automate that process, but that means we want to build in several important safeguards so we don't lose any progress when our for-loop has to run for a long time! We will explore the code itself in the next lesson!
Code for Efficient API Extraction
Learning Objectives:
Create a new file without accidentally erasing prior results.
Loop through a list of queries and save the results throughout the loop.
Use the tqdm library to make a progress bar to track the time remaining in a loop
Note: If you are starting to feel a little overwhelmed by writing this code, rather than just copying it, focus on reading it! Your ability to write code will improve as you continue to explore and make sense of code written by others!
Summary
This lesson talks you through each step of developing reproducible code to efficiently make API calls and prevent some of the potential pitfalls that can occur when dealing with many calls. This code is specifically for the YelpAPI, but gives you an idea of how to think through the process with other APIs.
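A sketch of the two central safeguards: only create the results file if it does not already exist, and save after every page so a crash costs at most one page of progress. (The filename and the fake page of results are illustrative; in the real loop, tqdm would wrap the iteration to show a progress bar.)

```python
import json
import os

RESULTS_FILE = "results_in_progress.json"

# Safeguard 1: create the file WITHOUT erasing any prior results
if not os.path.isfile(RESULTS_FILE):
    with open(RESULTS_FILE, "w") as f:
        json.dump([], f)          # start from an empty list of records

# Safeguard 2: inside the loop, extend previous results and save each page
fake_page = [{"id": "abc", "name": "Pizza Place"}]   # stand-in for one API page
with open(RESULTS_FILE) as f:
    previous = json.load(f)
previous.extend(fake_page)
with open(RESULTS_FILE, "w") as f:
    json.dump(previous, f)
# Re-running this code appends to the file instead of starting over.
```

Because progress lives on disk rather than only in memory, an interrupted loop can pick up where it left off.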
Using the TMDB API
Learning Objectives:
Have access to the TMDB API
Use the tmdbsimple package to make API Calls
Write a function to make your API calls more efficiently
Summary
This lesson helped you get set up for the next step of your project. You should have created an account with TMDB and saved your credentials in a secret file. You also got a chance to explore the vast amount of information available with the API and created a function to combine the ratings (certification) with the other key information you will need for the project. Now that you have some familiarity with the TMDB API and the results you will get, the next task is to determine which movie ids you want to include in your API calls.
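The combining step can be sketched with plain dictionaries. The structures below are modeled on what tmdbsimple's movie .info() and releases results look like, but treat the field names as assumptions to verify against the TMDB documentation:

```python
def add_certification(info, releases):
    """Return a copy of the movie-info dict with the US certification added."""
    combined = dict(info)                      # copy so we don't mutate the input
    for entry in releases.get("countries", []):
        if entry.get("iso_3166_1") == "US":    # keep only the US rating
            combined["certification"] = entry.get("certification")
    return combined

# Invented example records shaped like TMDB responses
info = {"id": 603, "title": "The Matrix"}
releases = {"countries": [{"iso_3166_1": "US", "certification": "R"}]}
movie = add_certification(info, releases)
```

The function returns one flat dictionary per movie, which is exactly the "record" shape that stacks neatly into a DataFrame later.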
Efficient TMDB API Calls
You have already written a function to combine the certification with the rest of the .info() results from the TMDB API. This lesson will help prepare you for the project. You may want to check out the specifications of Project Part 2 for an overview of the task prior to working through this lesson.
Learning Objectives:
Plan steps in complex code
Understand code snippets for making efficient TMDB API calls
Planning
Before jumping into the code, it is important to outline in plain language what you are trying to do. Before you can ask the computer to do it, you have to really understand what you are asking. This week has introduced some new code that you may still be getting used to, so this lesson will help walk you through the task. We will go through the individual pieces of code, but for the project, you will need to put it all together, in a logical order, with correct formatting! There will be an OUTER and INNER loop: a loop within a loop!
Summary
This lesson exemplifies the importance of planning complex coding tasks so that you are clear on what you are trying to do in plain language before translating it to code. While this lesson shows examples of the different segments of code that you may want to use in the next phase of the project, remember, it is still up to you to read and understand each step so that you can put together the final product! You will need to be mindful of the order of the information and the appropriate format (especially tabs).
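The outer/inner loop structure can be sketched as a skeleton: the OUTER loop walks through each year, the INNER loop walks through every movie id for that year, and each call is wrapped in try/except so one bad id cannot crash the whole run. (get_movie_data below is a hypothetical stub standing in for your real API function, and in practice tqdm would wrap each loop to show progress.)

```python
def get_movie_data(movie_id):
    """Stand-in stub for the real TMDB API call function."""
    if movie_id == "bad_id":
        raise ValueError("movie not found")
    return {"imdb_id": movie_id, "title": f"Movie {movie_id}"}

movie_ids_by_year = {2000: ["tt001", "bad_id"], 2001: ["tt002"]}
errors = []

for year in movie_ids_by_year:                 # OUTER loop: one pass per year
    year_results = []
    for movie_id in movie_ids_by_year[year]:   # INNER loop: each movie that year
        try:
            year_results.append(get_movie_data(movie_id))
        except Exception as e:
            errors.append((movie_id, e))       # log the failure and keep going
    # ...save year_results for this year to its own JSON file here...
```

Logging failed ids instead of stopping means a long run finishes, and you can review the error list afterward.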
Advanced Transformations with Pandas Part 1
Now that you are extracting data via API calls, you will find that you will need to do some additional processing to get the data into a more usable format. This lesson uses results from the Yelp API that have been stored as a csv.gz file as a demonstration of some transformation techniques. While every data set may have unique transformation challenges, this lesson gives you a chance to explore some common issues you may encounter.
Lesson Objectives:
Apply json.loads to convert strings back to dictionaries or lists.
Handle dictionaries and lists stored within pandas DataFrames.
Leverage pandas string methods for filtering/separating data.
Summary
Since data rarely comes to us processed and ready to use (especially in the case of API call results!), it is essential that you gain fluency in tackling transformation challenges. There are multiple ways to tackle these issues, and we have just provided some strategies that you can explore. In the next lesson, we will discuss some additional transformations that are particularly helpful when working with a SQL database or when preparing data to be made into a SQL database.
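As a compact illustration of the first two objectives above: when API results are saved to a CSV, any column that held dictionaries comes back as plain strings, and json.loads restores them (the column names and values below are invented):

```python
import json
import pandas as pd

# After a round trip through CSV, a dictionary column is just strings
df = pd.DataFrame({
    "name": ["Slice House"],
    "coordinates": ['{"latitude": 40.7, "longitude": -74.0}'],
})
print(type(df.loc[0, "coordinates"]))   # <class 'str'>

# Apply json.loads to turn every entry back into a real dictionary
df["coordinates"] = df["coordinates"].apply(json.loads)
latitude = df.loc[0, "coordinates"]["latitude"]
```

Once the column holds real dictionaries again, the nested values can be pulled out into their own columns.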
Advanced Transformations with Pandas Part 2
Lesson Objectives:
Construct a numeric index to replace text (for database normalization)
Use the dict and zip functions
Merge DataFrames with Pandas
Summary
This lesson showed you how to convert string ids to integer ids for use in SQL databases. You created a separate table that included the original string id with its new integer id. Then, you practiced merging two dataframes. Now, you are all set to practice your advanced transformations skills with Super Heroes in your next core assignment.
Transformations to Perform
We will continue working with our Yelp API results from the Advanced Transformations - Part 1 lesson.
For Part 2, we will be focusing on transformations that are helpful for creating and working with a relational database.
1. Replacing a column of string-ids with integer-ids and creating a string_id:integer_id lookup table
The "id" column
2. Joining two DataFrames with Pandas and pd.merge.
Practice joining the lookup table with the original dataframe.
Normalizing the Data
Let's say that we want to convert this table into a SQL database.
We will need to replace our long string-ids with a numeric version.
Note: there are some scrappy ways we could use .reset_index() to accomplish this, but that would be a less flexible option. Instead, we will create a mapper dictionary that contains the current string ids as the keys and integers as the values.
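Putting the pieces together with dict, zip, .map, and pd.merge (the tiny DataFrame is invented; real Yelp ids are much longer strings):

```python
import pandas as pd

df = pd.DataFrame({"id": ["abc123", "def456", "abc123"],
                   "rating": [4.5, 3.0, 4.0]})

# Build the mapper: each unique string id -> a new integer id
unique_ids = df["id"].unique()
id_map = dict(zip(unique_ids, range(len(unique_ids))))

# Replace the string ids and build the lookup table for the database
df["id_int"] = df["id"].map(id_map)
lookup = pd.DataFrame({"string_id": list(id_map.keys()),
                       "id_int": list(id_map.values())})

# Practice merge: join the lookup table back onto the original dataframe
merged = pd.merge(df, lookup, on="id_int")
```

The lookup table preserves the original string ids, so nothing is lost: the database can use compact integer keys while the mapping back to Yelp's ids stays on record.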