Accessing Public Data Sources via REST APIs in Python

APIs are a modern shortcut for consuming data sources and services on the internet or making them available to others. Whether these sources are publicly available or reside in corporate networks, it is important for data scientists to understand how APIs can be used to access data sources and interact with services. This tutorial gives a brief introduction to using public APIs in Python to access interesting data sources and covers the following two approaches:

  • In the first part, we use the requests package and the statworx API to access COVID-19 data
  • In the second part, we use pandas-datareader to access financial data from Yahoo Finance

After completing this tutorial, you will know how to make a request to an API and retrieve data that you can use directly in your Python projects.

What are APIs?

In a wider sense, an API is a contract between the provider and the consumer of a web service, who communicate with each other and exchange data. The contract is necessary because communication can lead to misunderstandings, for example when one party sends information that is not as expected. To avoid such misunderstandings, the contract defines standards for what is communicated between the parties and how it is communicated. The standards are specified in the API documentation in the form of input and output parameters. It is because of this standardization that it is possible to automate communication between different parties.

Communication via an API

APIs can be designed using different standards. An important architectural style used to design APIs is Representational State Transfer (REST). REST has become very popular in recent years and is often considered a simpler, more modern alternative to the traditional Simple Object Access Protocol (SOAP). Both SOAP and REST rely on established rules, and compliance with these rules is what makes automated information exchange possible.

REST APIs

If you work with public APIs in Python, you will most likely – knowingly or unknowingly – use a REST API. A service provider that offers a REST API exposes a URL that receives requests. Requests to this resource URL can carry a payload in JSON, HTML, or XML format. A REST API will typically return the response in JSON format, but other formats, such as comma-separated values (CSV), are also possible. HTTP defines several methods (GET, HEAD, POST, PUT, PATCH, DELETE, CONNECT, OPTIONS, and TRACE), but the most important ones when working with REST APIs are the following; a short sketch after the list illustrates them with the requests package:

  • GET: Used to receive data.
  • POST: Typically used to send new data to the service provider. Some APIs also use it to specify which data should be returned in the response, as in the COVID-19 example below.
  • PUT: Used to update data at the service provider.
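
To make these methods concrete, here is a minimal sketch that uses the requests package against httpbin.org, a public echo service. The endpoints and payloads are purely illustrative and are not related to the APIs used later in this tutorial.

import requests

# GET: retrieve data, optionally with query parameters
response = requests.get('https://httpbin.org/get', params={'symbol': 'DAX'})
print(response.status_code, response.json()['args'])

# POST: send new data (here as a JSON body) to the service provider
response = requests.post('https://httpbin.org/post', json={'code': 'US'})
print(response.json()['json'])

# PUT: update data at the service provider
response = requests.put('https://httpbin.org/put', json={'code': 'DE'})
print(response.json()['json'])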

There are various Python packages that offer built-in functionality for interacting with REST-based APIs, so we don’t need to program everything from scratch. Common packages are the requests package and pandas-datareader, both of which we will use in the following.

Using REST APIs in Python

Prerequisites

This tutorial assumes that you have set up a Python environment. In case you have not yet set up the environment, you can follow this tutorial on how to set up the Anaconda Python environment. Furthermore, it is assumed that you have the following packages installed: requests, pandas, pandas-datareader, and matplotlib. The json package used below is part of the Python standard library.

Access Data using the Requests Package

The requests package is an HTTP library, and we will use it as a first way to make an API call. According to GitHub statistics, it ranks among the most downloaded Python packages. It provides functionality to send an HTTP request to an API and receive the response, e.g., in JSON format.

There are many APIs out there that provide more or less reliable data on COVID-19 cases. A good one to use is api.statworx.com/covid, which provides historical data on the number of COVID-19 cases. Since the API does not require an authentication key, it is quickly set up.

In the following, we will send a POST request to the statworx API and get back COVID-19 data in JSON format as a response. We can then convert the response into a DataFrame. With the following code, we send an HTTP request to the URL provided and get the requested data back in JSON format:

import pandas as pd
import requests
import json

# Call the API
payload = {'code': 'US'} # two-letter country code, e.g. 'DE' for Germany
# To retrieve data for all countries use {'code': 'ALL'}
URL = 'https://api.statworx.com/covid'
response = requests.post(url=URL, data=json.dumps(payload))

# Convert to data frame
df = pd.DataFrame.from_dict(json.loads(response.text))
df.head()
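
Before parsing the response, it can be helpful to check whether the request actually succeeded. The following optional sketch uses standard functionality of the requests package and is not specific to the statworx API:

# Optional: verify that the request was successful before parsing
print(response.status_code)   # 200 indicates success
response.raise_for_status()   # raises an exception for 4xx/5xx status codes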

By altering the information in our request, we can change the data that we get back in the response. For example, we have set the country code to “US” in the payload, so the response contains only COVID-19 data for the US. If we instead want to retrieve data for all countries, we can set the code to ‘ALL’.

# We will get data for 'all' countries
payload = {'code': 'ALL'} 
URL = 'https://api.statworx.com/covid'
response = requests.post(url=URL, data=json.dumps(payload))

# Convert the response to a data frame
df = pd.DataFrame.from_dict(json.loads(response.text))

# Convert date column to date format
df.loc[:,'date']=pd.to_datetime(df['date'], format='%Y-%m-%d')

# Filter specific countries
# The following countries will be included
list_of_countries = ['Germany',  'Switzerland', 'France', 'Spain', 'China', 'United_States_of_America', 'Canada']
dff = df[df['country'].isin(list_of_countries)]

# Filter the data to a specific timeframe
date_start = '2020-01-15'
date_today = df['date'].max()
dff = dff[(dff['date'] > date_start) & (dff['date'] <= date_today)]

If you like, you can use the following code to export the DataFrame with the COVID-19 data to a CSV file. The CSV file will be saved in the folder of your Python notebook.

# Save the file as csv
df.to_csv('yourfilename.csv', index=False)

Now let’s make something out of the data and create a simple plot.

# Plot the cumulative cases per country
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots(figsize=(16, 10))
plt.ylabel('Total Cases', fontsize=20, color='black')
plt.grid()
for countryname in list_of_countries:
    countrydata = dff[dff['country'] == countryname]
    plt.plot(countrydata['date'], countrydata['cases_cum'], label=countryname)

plt.legend(loc='upper left')
plt.show()

Access Financial Data using Pandas Webreader and the Yahoo Finance API

A second way to make API calls is the pandas-datareader package (here imported as webreader), which was originally part of the famous pandas package and is now maintained separately. The following code example demonstrates how to use this package to retrieve data on the German stock index DAX from Yahoo Finance.

import pandas as pd
import pandas_datareader as webreader
from datetime import date

# Define the timeframe
today = date.today()
date_today = today.strftime("%Y-%m-%d")
date_start = '2010-01-01'

# Get the DAX quote from Yahoo Finance
symbol = 'DAX'
df = webreader.DataReader(symbol, start=date_start, end=date_today, data_source='yahoo')
df.head(5)
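
The returned DataFrame can be used like any other pandas DataFrame. As a quick usage example, the following lines plot the closing price over time (the 'Close' column is part of the standard Yahoo Finance response):

# Plot the DAX closing price over time
import matplotlib.pyplot as plt

df['Close'].plot(figsize=(16, 6), title='DAX closing price')
plt.ylabel('Price')
plt.show()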

pandas-datareader supports a wide range of service providers, including the World Bank, Eurostat, the OECD, and several stock markets such as the NASDAQ. Each source has its own data reader that requires specific input arguments. If you want to learn more about the supported service providers, check out the pandas-datareader documentation.
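
As an illustration of a second service provider, here is a minimal sketch that uses the World Bank reader of pandas-datareader. The indicator code 'NY.GDP.PCAP.CD' (GDP per capita in current US$) and the country selection are examples only:

# Download GDP per capita from the World Bank API
from pandas_datareader import wb

gdp = wb.download(indicator='NY.GDP.PCAP.CD',
                  country=['DE', 'CH', 'US'],
                  start=2010, end=2020)
print(gdp.head())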

Summary

In this quick tutorial, you have learned how to access data sources using REST APIs in Python. The blog post has presented two different ways of making API calls: first, we used the requests package to retrieve COVID-19 data via HTTP requests in JSON format. Second, we used pandas-datareader to retrieve data on the German stock index DAX.

I hope you learned something relevant. Let me know in the comments if you found the post helpful.

Author

  • Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.
