Scraping Play by Play Data for Sports Betting

In this post, we’re going to move beyond the NBA.com website and dig into the free, public NBA API as a data source. Many of the stats overlap, but we’ve found the API much easier to use once you get a feel for the required parameters. Specifically, we’re going to show how to get play by play data for NBA games.

The first step is installing the nba_api Python package. To do so, run this command from your terminal:

pip install nba-api

Great, now you have everything you need to get started. If you’re familiar with APIs and endpoints, this will be pretty straightforward for you. If not, we’re not going to get into the details of APIs here, but you can easily Google REST API and read as much as you want about GET, POST, and PUT requests. Fortunately for this post, all you need to do is copy the imports below into your Jupyter notebook / IDE:

from nba_api.stats.endpoints import teamgamelog
from nba_api.stats.static import teams
import time
import numpy as np
from nba_api.stats.endpoints import playbyplayv2
import pandas as pd

As was the case in our previous posts, the prerequisite to getting data for any NBA game is finding a way to uniquely identify the game you’re looking for. The NBA API assigns a unique game id to each game, which you can get by querying the TeamGameLog endpoint. The following function takes in a season, a season type, and a game date, then returns the ids of all the games played on that date.

def get_game_ids(season, season_type, game_date):

    game_ids = []
    all_teams = teams.get_teams()

    # Loop through all NBA teams and append the game id if that team played that day
    for team in all_teams:
        team_log_dataframe = teamgamelog.TeamGameLog(season=season, season_type_all_star=season_type, team_id=team.get("id"), date_to_nullable=game_date, date_from_nullable=game_date)
        team_game_logs = team_log_dataframe.get_data_frames()[0]
        if not team_game_logs.empty:
            team_game_id = int(team_game_logs["Game_ID"].values[0])
            if team_game_id not in game_ids:
                game_ids.append(team_game_id)
        time.sleep(np.random.randint(1,4))
        
    return game_ids

game_ids = get_game_ids('2021-22','Regular Season','02/01/2022')

Let’s go through the above function piece by piece.

First, we get every team’s id using the teams.get_teams() method. Then we loop through the teams and query the TeamGameLog endpoint for each one, restricted to the given date. The response’s get_data_frames() method returns a list of dataframes, and the first one holds that team’s game log, including the game id. If the team didn’t play on that day, that dataframe is empty and the loop moves on to the next team. If the team did play, we add the game id to the game_ids list, unless the opposing team has already put it there. Between requests, the function pauses for a random one to three seconds to space out the calls.
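
Because the loop visits every team, each game shows up twice, once for each side, which is why the function checks whether an id is already in the list. Here is that idea in isolation, as a minimal sketch. The helper name unique_game_ids and the sample ids are made up for illustration and are not part of the nba_api package; None stands in for a team that didn’t play.

```python
def unique_game_ids(ids_seen_per_team):
    """Return game ids in first-seen order, skipping teams with no game.

    ids_seen_per_team: one entry per team, either a game id or None
    (None is our stand-in for an empty game log).
    """
    seen = set()      # fast membership check
    ordered = []      # preserves the order in which games were first seen
    for game_id in ids_seen_per_team:
        if game_id is None:
            continue  # this team did not play on the requested date
        if game_id not in seen:
            seen.add(game_id)
            ordered.append(game_id)
    return ordered


# Illustrative values only: two games, each reported by both of its teams
sample = [22100001, None, 22100002, 22100001, None, 22100002]
print(unique_game_ids(sample))
```

For 30 teams a plain list check like the one in get_game_ids is perfectly fine; the set only matters if you adapt the approach to much longer lists.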

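With the game ids in hand, we can loop through them and pull the play by play data for each game from the PlayByPlayV2 endpoint, combining everything into a single dataframe:

# Collect one dataframe per game, then combine them once at the end
frames = []

for game_id in game_ids:
    # The ids were stored as ints above, so restore the leading zeros the endpoint expects
    appended_game_id = "00" + str(game_id)
    pbp = playbyplayv2.PlayByPlayV2(start_period=1, game_id=appended_game_id)
    frames.append(pbp.get_data_frames()[0])
    time.sleep(1)

# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
play_by_play = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()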
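
A note on the "00" prefix in the loop above: get_game_ids converts each id to an int, which drops the leading zeros, so we have to put them back before calling the endpoint. A slightly more defensive sketch pads to a fixed width instead of hard-coding two zeros. It assumes game ids are always ten characters long, which matches the ids used in this post but is our assumption rather than something we’ve confirmed for every season; format_game_id is a hypothetical helper name, and the sample id is for illustration only.

```python
def format_game_id(game_id, width=10):
    """Return the game id as a zero-padded string.

    Assumes ids are `width` characters long (10 here, an assumption
    based on the ids seen in this post).
    """
    return str(game_id).zfill(width)


# Works whether the id was stored as an int or is already a padded string
print(format_game_id(22100001))
print(format_game_id("0022100001"))
```

An alternative is to skip the int() conversion in get_game_ids and keep the ids as strings from the start, in which case no padding is needed at all.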