I want to build a model to predict the 5-year peak performance of NCAA basketball players drafted by the NBA, measured by their maximum single-season Win Shares (an advanced statistic that assigns credit for team success to individuals on the team). The majority of an NBA player's development takes place during this 5-year time frame. Statistics from a player's final NCAA season along with biographical information will be used as input for the model. Accurate performance prediction of NBA prospects is highly valued by the ownership and management of NBA teams. It helps them effectively determine which players to acquire and significantly impacts team success, which often relates directly to the organization's financial success.
The necessary dataset will be created by gathering NCAA and NBA player statistics from www.sports-reference.com using sportsreference, a third-party open-source API, and BeautifulSoup. Only players drafted by the NBA from 2011-2015 will be included in the dataset, since several NCAA advanced statistics (e.g., Box Plus/Minus) are unavailable for earlier seasons. Due to the limited size of the dataset, I will use repeated stratified nested cross-validation for model assessment. Several different statistical and supervised machine learning techniques, coupled with Pearson's rank based feature selection, will be used to model the data. Mean squared error will serve as the evaluation metric.
All necessary packages are imported, and display options and plotting styles are set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from datetime import datetime, timedelta
from urllib.request import urlopen
from bs4 import BeautifulSoup
from scipy.stats import ks_2samp
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.base import clone
import ast
from sportsreference.ncaab.roster import Player as Player_ncaa
from sportsreference.nba.roster import Player as Player_nba
from sportsreference.ncaab.teams import Teams
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)
pd.options.display.multi_sparse = False
sns.set(style='darkgrid')
blues = sns.color_palette('Blues_r')
The sportsreference API pulls statistics from the relevant websites of www.sports-reference.com: www.sports-reference.com/cbb (NCAA stats) and www.basketball-reference.com (NBA stats). sportsreference.ncaab.teams.Teams (imported as Teams) retrieves NCAA team statistics, while sportsreference.ncaab.roster.Player (imported as Player_ncaa) and sportsreference.nba.roster.Player (imported as Player_nba) retrieve NCAA and NBA player statistics, respectively.
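For example, a single player's career statistics can be pulled directly with his sports-reference ID (a minimal sketch; the IDs below are purely illustrative of the two ID formats and are not players in this dataset):
# Illustrative IDs only: NCAA pages use 'first-last-1'-style IDs, NBA pages use 'lastfi01'-style IDs
ncaa_stats = Player_ncaa('zion-williamson-1').dataframe  # one row per NCAA season plus a career row
nba_stats = Player_nba('willizi01').dataframe            # one row per NBA season plus a career row
print(ncaa_stats.shape, nba_stats.shape)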
Using the sportsreference API, a dictionary is created with multiple statistics for each NCAA basketball team from 2011 - 2015. The dictionary is then converted into the DataFrame, team_df.
team_dict = {'year': [], 'team': [], 'pace': [], 'srs': [], 'sos': []}
for year in range(2011, 2016):
for team in Teams(year):
pace = team.pace
srs = team.simple_rating_system
sos = team.strength_of_schedule
team_dict['year'].append(year)
team_dict['team'].append(team.abbreviation)
team_dict['pace'].append(pace)
team_dict['srs'].append(srs)
team_dict['sos'].append(sos)
team_df = pd.DataFrame(team_dict)
team_df.set_index(['year', 'team'], drop=True, inplace=True)
team_df.head()
The following function calculates the age of an NCAA player on the date of the NBA draft using the sportsreference API.
def get_player_age(nba_id, draft_year):
birth_date = Player_nba(nba_id).birth_date
datetimeFormat = '%Y-%m-%d'
draft_date = str(draft_year) + '-06-20'
diff = datetime.strptime(draft_date, datetimeFormat) - birth_date
days = diff.days
return round(days/365, 2)
This function uses BeautifulSoup to return the RSCI (Recruiting Services Consensus Index) Ranking of an NCAA player from www.sports-reference.com.
The RSCI ranks the top 100 high school players each year. I have decided to give unranked players an RSCI ranking of 101.
def get_rsci_rank(ncaa_id):
    url = 'https://www.sports-reference.com/cbb/players/{}.html'.format(ncaa_id)
    html = urlopen(url)
    # Specify the parser explicitly to avoid BeautifulSoup's default-parser warning
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find('strong', text='RSCI Top 100:')
    # Give unranked players an RSCI rank of 101
    if element is None:
        return 101
    rsci_text = element.parent.text
    # The text begins with 'RSCI Top 100: ' (14 characters); the rank is the integer
    # that follows, ending at the next space
    rank_text = rsci_text[14:]
    space_idx = rank_text.index(' ')
    return int(rank_text[:space_idx])
The source_data.csv file contains both the www.sports-reference.com NCAA ID and NBA ID of each player so that both their NCAA and NBA statistics can be retrieved. The file also includes other information such as ESPN pre-draft ranking and projected NBA position. The player height and weight data is taken from the NBA Draft Combine.
source_data.csv is converted into a DataFrame and then a list of dictionaries, source_dict_list.
source_df = pd.read_csv('source_data.csv')
source_dict_list = source_df.to_dict('records')
source_df.head()
For each dictionary (representing a player) in source_dict_list, a DataFrame is created containing biographical information and NCAA statistics from the player's final college season, with the use of the sportsreference API, team_df, get_player_age(), and get_rsci_rank().
A list is formed of all the player DataFrames, and then the DataFrames are all concatenated into one DataFrame, ncaa_df.
%%time
ncaa_list = []
for index in source_dict_list:
player = index['player']
ncaa_id = index['ncaa_id']
nba_id = index['nba_id']
draft_pick = index['draft_pick']
draft_year = index['draft_year']
pre_draft_rank = index['espn_pre_draft_rank']
position_nba = index['projected_pos_nba']
height = index['height']
weight = index['weight']
# Return a DataFrame of all of the player's college stats
player_df = Player_ncaa(ncaa_id).dataframe
# Remove the career total row of the DataFrame and return only the row of the last season
player_df = player_df.drop(('Career',)).tail(1)
# Assign the player's NBA ID to the player_id column for merging later
player_df.player_id = nba_id
# Reset the index, moving the original index of seasons to a column named 'season_ncaa'
player_df = player_df.reset_index().rename(columns={'level_0': 'season_ncaa'})
# Make new columns
player_df['player'] = player
player_df['draft_pick'] = draft_pick
player_df['pre_draft_rank'] = pre_draft_rank
player_df['position_nba'] = position_nba
player_df['rsci_rank'] = get_rsci_rank(ncaa_id)
player_df['age'] = get_player_age(nba_id, draft_year)
# Get value of the player's college team abbreviation in uppercase
team_ncaa = player_df.loc[0, 'team_abbreviation'].upper()
# Change values to those from source file
player_df['height'] = height
player_df['weight'] = weight
# Make new columns of stats from the player's college team
player_df['pace'] = team_df.loc[(draft_year, team_ncaa), 'pace']
player_df['sos'] = team_df.loc[(draft_year, team_ncaa), 'sos']
player_df['srs'] = team_df.loc[(draft_year, team_ncaa), 'srs']
# Append the player's DataFrame to ncaa_list
ncaa_list.append(player_df)
# Create ncaa_df DataFrame from all the player DataFrames in ncaa_list
ncaa_df = pd.concat(ncaa_list)
# Reset the index of ncaa_df
ncaa_df.reset_index(drop=True, inplace=True)
# Slice ncaa_df so that it only includes the necessary columns
ncaa_df = ncaa_df.loc[:, ['player_id', 'player', 'season_ncaa', 'draft_pick', 'pre_draft_rank', 'rsci_rank', 'position_nba',
'age', 'height', 'weight', 'assist_percentage', 'block_percentage', 'box_plus_minus', 'defensive_box_plus_minus',
'defensive_rebound_percentage', 'defensive_win_shares', 'effective_field_goal_percentage', 'field_goal_attempts',
'field_goal_percentage', 'field_goals', 'free_throw_attempt_rate', 'free_throw_percentage', 'games_played', 'games_started',
'minutes_played', 'offensive_box_plus_minus', 'offensive_rebound_percentage', 'offensive_win_shares', 'personal_fouls',
'player_efficiency_rating', 'points', 'points_produced', 'steal_percentage', 'three_point_attempt_rate',
'three_point_percentage', 'total_rebound_percentage', 'true_shooting_percentage', 'turnover_percentage',
'two_point_percentage', 'usage_percentage', 'win_shares', 'win_shares_per_40_minutes', 'pace', 'sos', 'srs']]
ncaa_df.head()
New columns, such as minutes_per_game, are added to ncaa_df consisting of statistics derived from several other columns.
ncaa_df['minutes_per_game'] = round(ncaa_df.minutes_played / ncaa_df.games_played, 1)
ncaa_df['offensive_ws_per_40_minutes'] = round(ncaa_df.offensive_win_shares * 40 / ncaa_df.minutes_played, 3)
ncaa_df['defensive_ws_per_40_minutes'] = round(ncaa_df.defensive_win_shares * 40 / ncaa_df.minutes_played, 3)
ncaa_df['fouls_per_40_minutes'] = round(ncaa_df.personal_fouls * 40 / ncaa_df.minutes_played, 1)
ncaa_df['pts_per_100_poss'] = \
round((ncaa_df.points * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df['fga_per_100_poss'] = \
round((ncaa_df.field_goal_attempts * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df['pts_produced_per_100_poss'] = \
round((ncaa_df.points_produced * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df.head()
This function removes any rows (seasons) from an NBA player's DataFrame that occur more than 5 years after their draft year.
def only_5_years_after_draft(df, draft_year):
for i in range(len(df), 0, -1):
last_index = df.index[-1][0]
last_year = int('20' + last_index[-2:])
if last_year - draft_year > 5:
df = df.drop(df.index[-1])
return df
When the sportsreference API is used to get the NBA statistics of a drafted player who never played in the NBA, it returns an empty DataFrame. In that case I create a placeholder DataFrame for the player, and it must have the same columns as the other NBA player DataFrames so that they can all be concatenated.
columns_nba contains the columns of the DataFrame returned for an NBA player by the sportsreference API.
columns_nba = ['season_nba', 'and_ones', 'assist_percentage', 'assists', 'block_percentage','blocking_fouls', 'blocks',
'box_plus_minus', 'center_percentage', 'defensive_box_plus_minus', 'defensive_rebound_percentage', 'defensive_rebounds',
'defensive_win_shares', 'dunks', 'effective_field_goal_percentage', 'field_goal_attempts',
'field_goal_perc_sixteen_foot_plus_two_pointers', 'field_goal_perc_ten_to_sixteen_feet',
'field_goal_perc_three_to_ten_feet', 'field_goal_perc_zero_to_three_feet', 'field_goal_percentage', 'field_goals',
'free_throw_attempt_rate', 'free_throw_attempts', 'free_throw_percentage', 'free_throws', 'games_played', 'games_started',
'half_court_heaves', 'half_court_heaves_made', 'height', 'lost_ball_turnovers', 'minutes_played', 'nationality',
'net_plus_minus', 'offensive_box_plus_minus', 'offensive_fouls', 'offensive_rebound_percentage', 'offensive_rebounds',
'offensive_win_shares', 'on_court_plus_minus', 'other_turnovers', 'passing_turnovers', 'percentage_field_goals_as_dunks',
'percentage_of_three_pointers_from_corner', 'percentage_shots_three_pointers', 'percentage_shots_two_pointers',
'percentage_sixteen_foot_plus_two_pointers', 'percentage_ten_to_sixteen_footers', 'percentage_three_to_ten_footers',
'percentage_zero_to_three_footers', 'personal_fouls', 'player_efficiency_rating', 'player_id', 'point_guard_percentage',
'points', 'points_generated_by_assists', 'position', 'power_forward_percentage', 'salary', 'shooting_distance',
'shooting_fouls', 'shooting_fouls_drawn', 'shooting_guard_percentage', 'shots_blocked', 'small_forward_percentage',
'steal_percentage', 'steals', 'take_fouls', 'team_abbreviation', 'three_point_attempt_rate', 'three_point_attempts',
'three_point_percentage', 'three_point_shot_percentage_from_corner', 'three_pointers', 'three_pointers_assisted_percentage',
'total_rebound_percentage', 'total_rebounds', 'true_shooting_percentage', 'turnover_percentage', 'turnovers',
'two_point_attempts', 'two_point_percentage', 'two_pointers', 'two_pointers_assisted_percentage', 'usage_percentage',
'value_over_replacement_player', 'weight', 'win_shares', 'win_shares_per_48_minutes']
Now I will create the DataFrame, nba_df, which contains every player's relevant NBA statistics.
%%time
nba_list = []
for index in source_dict_list:
nba_id = index['nba_id']
draft_year = index['draft_year']
team_games_2020 = index['team_games_2020']
# DataFrame of all of the player's NBA stats
player_df = Player_nba(nba_id).dataframe
# If a player did not play in the NBA, an empty DataFrame is returned
if player_df.empty:
        # Create a 1-row DataFrame for the player with 0 values for every column
        data = [tuple(0 for _ in columns_nba)]
        player_df = pd.DataFrame.from_records(data, columns=columns_nba)
# Assign the player's NBA ID to the player_id column for merging later
player_df.player_id = nba_id
else:
# Get the index number of the row with index ('Career',)
idx = player_df.index.get_loc(('Career',))
# Select all rows of the DataFrame before the career total row, then select only the first 5 rows of that slice
player_df = player_df.iloc[:idx,:].head(5)
# Select only the rows(seasons) that occurred at most 5 years after the player was drafted
player_df = only_5_years_after_draft(player_df, draft_year)
# Reset the index, moving the original index of seasons to a column named 'season_nba'
player_df = player_df.reset_index().rename(columns={'level_0': 'season_nba'})
# Create column for team games in 2020
player_df['team_games_2020'] = team_games_2020
# Append the player's DataFrame to nba_list
nba_list.append(player_df)
# Create nba_df DataFrame from all the player DataFrames in nba_list
nba_df = pd.concat(nba_list)
# Reset the index of nba_df
nba_df.reset_index(drop=True, inplace=True)
# Slice nba_df so that it only includes the necessary columns
nba_df = nba_df.loc[:,['player_id', 'season_nba', 'win_shares', 'team_games_2020']]
# Rename columns
nba_df.columns = ['player_id', 'season_nba', 'win_shares_nba', 'team_games_2020']
# Replace None values with 0
nba_df.fillna(0, inplace=True)
nba_df.head()
Now nba_df and ncaa_df are merged together.
df = pd.merge(ncaa_df, nba_df, on='player_id')
df.head()
If a player has negative win shares for an NBA season, the amount is converted to zero. This is necessary so that drafted NBA players who never play in the NBA (0 win shares) are not considered more valuable than NBA players with negative win shares.
For the shortened seasons in 2012 and 2020, I prorate win shares to 82 games.
df.win_shares_nba = df.win_shares_nba.apply(lambda x: 0 if x < 0 else x)
# The 2011-2012 NBA season was shortened to 66 games
# Prorate all win shares for the 2011-2012 NBA season to 82 games
df.win_shares_nba = \
df.apply(lambda x: round(82/66 * x.win_shares_nba, 1) if x.season_nba == '2011-12' else x.win_shares_nba, axis=1)
# Many of the 2015 draftees played on 2020 NBA teams that played less than 82 games
# Prorate the 2020 win shares for those players to 82 games
df.win_shares_nba = \
df.apply(lambda x: round((82/x.team_games_2020) * x.win_shares_nba, 1) \
if (x.team_games_2020 > 0 and x.season_nba == '2019-20') else x.win_shares_nba, axis=1)
I now change the DataFrame so that each player has one row containing the statistics from both their final college season and their maximum win share NBA season.
index = df.groupby(['player_id'])['win_shares_nba'].transform(max) == df['win_shares_nba']
df = df[index].drop_duplicates('player_id')
df.reset_index(drop=True, inplace=True)
df.head()
Several unnecessary columns are removed.
df.drop(columns=['player_id','season_ncaa','season_nba','team_games_2020'], inplace=True)
df.info()
three_point_percentage is the only column with null values.
The null values are replaced with zeros.
df.three_point_percentage.fillna(0, inplace=True)
Now I confirm that the remaining 'object' columns do not contain multiple data types.
{col: set(map(type, df[col])) for col in df.select_dtypes(include=[object])}
Upon inspection, I found very few errors in the dataset.
Only 2 players, Andre Drummond and Jarnell Stokes, need their RSCI rankings corrected.
df.loc[df['player'] == 'Andre Drummond', 'rsci_rank'] = 2
df.loc[df['player'] == 'Jarnell Stokes', 'rsci_rank'] = 14
The data types of several columns are now converted to 'int64'.
convert_dict = {
'field_goal_attempts': 'int64', 'field_goals': 'int64', 'games_played': 'int64', 'games_started': 'int64',
'minutes_played': 'int64', 'personal_fouls': 'int64', 'points': 'int64', 'points_produced': 'int64'
}
df = df.astype(convert_dict)
Before finalizing the dataset, I will remove redundant features in order to reduce multicollinearity as well as model complexity. Multicollinearity is problematic because it makes it difficult to determine the individual effects of features.
The redundant features consist of both linear combinations and nonlinear functions of other features. The former possess perfect multicollinearity, violating the assumptions of ordinary least squares regression.
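As a quick sanity check on the first group (a minimal sketch, assuming these statistics follow their standard definitions), Box Plus/Minus should equal the sum of its offensive and defensive components up to rounding:
# BPM is defined as OBPM + DBPM, so these three columns are perfectly collinear
# (any small deviation here is just rounding in the published statistics)
print((df.box_plus_minus - (df.offensive_box_plus_minus + df.defensive_box_plus_minus)).abs().max())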
# Linear combinations of features
lc_features = ['offensive_box_plus_minus', 'offensive_win_shares', 'defensive_rebound_percentage',
'offensive_ws_per_40_minutes']
# Nonlinear functions of features
nf_features = ['field_goal_attempts', 'win_shares', 'defensive_win_shares', 'field_goals', 'personal_fouls', 'points',
'points_produced', 'minutes_played', 'field_goal_percentage', 'effective_field_goal_percentage',
'true_shooting_percentage', 'pts_per_100_poss']
# Drop duplicative features
df.drop(columns=lc_features + nf_features, inplace=True)
The final dataset is now saved as dataset.csv.
df.to_csv('dataset.csv')
The file dataset.csv contains statistical and biographical data for each player. All statistics are assumed to be from a player's final college season unless otherwise stated. The columns of the dataset are as follows:
The dataset created in the previous section is imported as a DataFrame.
df = pd.read_csv('dataset.csv', index_col=0)
df.head()
The column containing player names is removed for now.
player = df.loc[:, 'player']
df.drop(columns='player', inplace=True)
The DataFrame now contains 235 rows (players) and 34 columns. The target variable is win_shares_nba, while the remaining 33 columns are features for prediction.
df.shape
df.describe()
The value counts are shown for the only categorical variable, position_nba. The wing position is represented the most in the dataset. This makes sense given the NBA's trend toward positionless basketball.
df.position_nba.value_counts()
The following plot shows the 5-year peak performance distribution of the players in the NBA (measured in Win Shares).
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=df, x="win_shares_nba", stat='probability', kde=True, bins=np.arange(0, 14.1, 1), color=blues[0])
ax.set_xlabel('# of NBA Win Shares')
ax.set_ylabel('% of Players')
ax.set_title('Distribution of NBA Win Shares - 5 Year Max', fontsize=16)
fig.tight_layout()
Nearly half the players have a 5-year NBA peak performance of less than 2 Win Shares. The number of players represented seems to decrease exponentially as Win Shares increase. This shows that only a small percentage of NCAA players drafted by the NBA have successful careers.
The top 50 players based on 5-year NBA peak performance are shown below. Nearly all the players who became All-Stars have more than 8 Win Shares, a level of performance that only 8% of the players were able to attain.
pd.concat([player, df.loc[:,['position_nba','win_shares_nba']]], axis=1).\
sort_values('win_shares_nba', ascending=False).reset_index(drop=True).head(50)
The following chart shows the 5-year peak performance distributions for each projected NBA position.
fig, axs = plt.subplots(nrows=3, figsize=(10, 9))
cats = ['Guard','Wing','Big']
for i, ax in enumerate(axs.ravel()):
data = df[df.position_nba == cats[i]]
sns.histplot(data=data, x="win_shares_nba", stat='probability', kde=True, bins=np.arange(0, 14.1, 1), ax=ax, color=blues[0])
ax.set_title(cats[i])
ax.set_xlim(-.7, 14.7)
ax.set_ylim(0, .5)
ax.set(xlabel='# of NBA Win Shares', ylabel='% of Players')
ax.label_outer()
fig.suptitle('NBA Win Shares (5 Year Max) Distribution by Projected NBA Position', fontsize=16, x=.53, y=1)
fig.tight_layout()
The 5-year peak performance data of guards looks the least like that of the other positions. However, based on the p-values of Kolmogorov-Smirnov tests, we cannot say with confidence that it comes from a different distribution than the others. This suggests that projected NBA position has, at most, a small effect on 5-year peak performance.
ws_guard = df.loc[df.position_nba == 'Guard', 'win_shares_nba']
ws_wing = df.loc[df.position_nba == 'Wing', 'win_shares_nba']
ws_big = df.loc[df.position_nba == 'Big', 'win_shares_nba']
dist_list = [['Guard', ws_guard], ['Wing', ws_wing], ['Big', ws_big]]
for i in range(3):
for j in range(i + 1, 3):
print(dist_list[i][0], ' vs. ', dist_list[j][0], ': ', ks_2samp(dist_list[i][1], dist_list[j][1]), sep='')
Here is the correlation matrix heatmap of the dataset. I first changed the categorical feature, position_nba, into dummy/indicator variables so that the correlations of its values would also show.
plt.figure(figsize=(13, 11))
corr = pd.get_dummies(df).corr()
ax = sns.heatmap(corr, xticklabels=True, yticklabels=True, linewidths=.004, square=True, cmap='Blues')
The features are now shown sorted in descending order based on the strength of their correlation with the target variable.
corr.win_shares_nba.sort_values(ascending=False, key=abs)[1:]
This shows that NBA Draft position, ESPN pre-draft ranking, age, Box Plus/Minus, and Defensive Box Plus/Minus have the strongest correlation with 5-year peak performance. It also confirms that projected NBA position has little correlation with the target variable.
The correlations of these features with the target variable are shown below in scatterplots.
fig, axs = plt.subplots(ncols=5, figsize=(12.5, 3.2))
cols = ['draft_pick','pre_draft_rank','age','box_plus_minus','defensive_box_plus_minus']
for i, ax in enumerate(axs.ravel()):
sns.scatterplot(x=cols[i], y='win_shares_nba', data=df, ax=ax)
ax.set(ylabel='# of NBA Win Shares')
ax.label_outer()
fig.suptitle('Feature Correlations with NBA Win Shares (5 Year Max)', fontsize=16, x=.53, y=1)
fig.tight_layout()
The plots show that draft position, pre-draft ranking, and age are negatively correlated with 5-year peak performance, while BPM and defensive BPM are positively correlated with the target variable.
The features with the strongest inter-feature correlations are now shown below.
cs = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
cs = cs.stack().sort_values(ascending=False)
cs[cs > .8]
The strong positive correlations between these features all seem reasonable. Good rebounders are generally also good offensive rebounders, players are generally ranked close to their eventual draft position, and advanced performance statistics (i.e. WS/40, PER, and BPM) tend to rate players similarly.
Now I will create dummy variables for the lone categorical variable, position_nba. I will drop the first resulting indicator, position_nba_Big, since only k - 1 indicator variables are needed to represent the k categories of a categorical variable.
# Convert feature into dummy variables
df = pd.get_dummies(df, columns=['position_nba'], drop_first=True)
The resulting dataset now has 34 features (independent variables), including 2 indicator variables.
The strong correlations shown previously between several of the features may indicate that there is still a significant amount of multicollinearity among the features of the dataset.
Multicollinearity is present when a feature in a multiple regression model can be linearly predicted from the other features with high accuracy. As I said previously, multicollinearity is problematic because it makes it difficult to determine the individual effects of features. High multicollinearity is particularly troublesome when it comes to ordinary least squares regression.
In the graph below, I show the amount of multicollinearity (as measured by the variance inflation factor) among the features as the number of features (ordered by strength of correlation with the target variable) varies.
# Filter dataset to only include features, ordered by strength of correlation with target
cols_corr = df.corr().win_shares_nba.abs().sort_values(ascending=False)[1:].index
df_corr = df.loc[:, cols_corr]
# Return the maximum VIF among the k most target-correlated features, for each k,
# dropping the least correlated remaining feature at every step
def VIF(data):
    result = {'features': [], 'vif': []}
    cols = data.shape[1]
    for i in range(cols):
        # Add an intercept column so that the VIFs are computed correctly
        data = add_constant(data)
        # Maximum VIF among the current features (column 0 is the constant)
        max_vif = max(variance_inflation_factor(data.values, j) for j in range(1, data.shape[1]))
        result['vif'].append(max_vif)
        result['features'].append(data.shape[1] - 1)
        # Drop the constant column and the last (least target-correlated) feature
        data = data.iloc[:, 1:-1]
    return result
data = pd.DataFrame(VIF(df_corr))
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(x=data.features, y=data.vif, data=data, linewidth=2)
ax.axhline(10, .04, .96, color='g', linewidth=1.4)
ax.set_xlabel('# of Features')
ax.set_ylabel('Variance Inflation Factor')
ax.set_title('Multicollinearity by Number of Features', fontsize=16)
ax.text(31, 56, 'High \n Multicollinearity \n (VIF > 10)', transform=ax.transData, color='g', horizontalalignment='center',
fontsize=13)
fig.tight_layout()
The multicollinearity of a dataset is generally considered high when the maximum VIF of the features is greater than 10. Based on this rule of thumb, the graph shows that multicollinearity becomes high approximately after the first 10 features (most correlated with the target variable) are included in the model. This information will guide some of my modeling decisions.
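For reference, the VIF of a single feature equals 1 / (1 - R^2), where R^2 comes from regressing that feature on all of the other features. Here is a minimal sketch of that calculation, using box_plus_minus (still a column of df_corr at this point) as an example:
# Compute one feature's VIF from first principles: regress it on the remaining features
# and convert the resulting R^2 into 1 / (1 - R^2)
feature = 'box_plus_minus'
X_others = df_corr.drop(columns=feature)
r_squared = LinearRegression().fit(X_others, df_corr[feature]).score(X_others, df_corr[feature])
print('VIF of', feature, '=', round(1 / (1 - r_squared), 1))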
During the modeling process, I will assess the performance of 9 different optimized modeling algorithms. The best performing algorithm will then be used to create the final model. The following regression models will be utilized: elastic net, support vector, random forest, gradient boosting, and linear regression. I will also incorporate Pearson's rank based feature selection. For each model but linear regression, 2 modeling algorithms will be created: one with feature selection and one without. Due to high multicollinearity, linear regression will only be used with feature selection.
I have chosen to use nested cross-validation during the modeling process. Normally when modeling, the dataset is split into a train and test set, cross-validation is performed on the train set, and the performance of the best model is assessed on the test set. However, nested cross-validation allows you to perform both model selection and assessment without the need to set aside a test set. This is extremely useful when your dataset is small like mine (only 235 samples), since you want as many samples for training as possible.
With nested cross-validation, the cross-validation procedure for model selection is nested inside the cross-validation procedure for model assessment. Instead of one tuned model being selected, several are selected, each is assessed on a different part of the dataset, and the scores are averaged to give a performance score. Nested cross-validation assesses the performance of the optimal cross-validatory chosen model, the model obtained by applying the inner cross-validation procedure to the entire dataset.
Both repetition and stratification will be applied when performing nested cross-validation. Repetition is necessary for reliable model assessment due to the high variance of any single nested cross-validation run. Meanwhile, stratification increases the likelihood that each cross-validation fold is representative of the dataset as a whole.
The dataset is now divided into y, the target variable, and X which contains the predictor variables.
X = df.drop(columns='win_shares_nba')
y = df.win_shares_nba
Now I calculate the MSE of a very simple baseline model (the mean of the target variable) as a reference point for which to compare the results of the more advanced algorithms.
baseline_mse = np.mean(np.square(y - np.mean(y)))
print('Baseline MSE:', round(baseline_mse, 2))
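Note that this baseline MSE is simply the (population) variance of the target variable, since the baseline prediction is the mean of y:
# Equivalent calculation: the MSE of always predicting the mean equals Var(y) with ddof=0
print('Variance of y:', round(np.var(y), 2))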
The following functions are used to bin the continuous target variable, stratify the repeated cross-validation folds based on the binned target variable, and perform nested cross-validation with repetition and stratification.
# Build a list of (train, test) index arrays for repeated stratified k-fold CV,
# stratifying on the (binned) series that is passed in
def stratify_cv(series, n_splits, n_repeats, random_state):
    cv = []
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)
    for train_index, test_index in rskf.split(series, series):
        cv.append((train_index, test_index))
    return cv

# Bin the continuous target variable into quantile-based bins so that the
# cross-validation folds can be stratified on it
def bin_series(series, bins):
    lst = [min(series), max(series) + 1]
    for i in range(1, bins):
        lst.insert(-1, series.quantile(i / bins))
    bin_arr = np.array(lst)
    indices = np.digitize(series, bin_arr)
    return pd.Series(indices).astype(str)

# Nested cross-validation: the outer loop assesses performance, while the inner loop
# (GridSearchCV) selects hyperparameters on each outer training fold
def nested_cv(X, y, model, param_grid):
    y_binned = bin_series(y, 5)
    cv_outer = stratify_cv(y_binned, n_splits=5, n_repeats=10, random_state=1)
    result = {'outer_score': [], 'inner_score': [], 'params': []}
    for train_index, test_index in cv_outer:
        X_train, X_test = X.loc[train_index, :], X.loc[test_index, :]
        y_train, y_test = y.loc[train_index], y.loc[test_index]
        y_binned_train = y_binned.loc[train_index]
        # Inner cross-validation folds, stratified on the binned target of the training fold
        cv_inner = stratify_cv(y_binned_train, n_splits=5, n_repeats=10, random_state=1)
        gs = GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            scoring='neg_mean_squared_error',
            n_jobs=-1,
            cv=cv_inner
        )
        gs.fit(X_train, y_train)
        # Assess the selected model on the held-out outer test fold
        yhat = gs.best_estimator_.predict(X_test)
        mse = mean_squared_error(y_test, yhat)
        result['outer_score'].append(mse)
        result['inner_score'].append(-gs.best_score_)
        result['params'].append(gs.best_params_)
    return result
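As a quick sanity check (a small sketch using y from above), the quantile-based binning should split the target variable into roughly equal-sized bins, which is what allows each stratified fold to mirror the overall target distribution:
# Each of the 5 quantile-based bins should contain roughly the same number of players
bin_series(y, 5).value_counts().sort_index()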
All of the modeling algorithms are defined below.
# Elastic net regression:
en = Pipeline([('scaler', StandardScaler()),
('model', ElasticNet())])
# Support vector regression:
sv = Pipeline([('scaler', StandardScaler()),
('model', SVR(kernel='rbf'))])
# Random forest regression:
rf = RandomForestRegressor(random_state=1)
# Gradient boosting regression:
gb = GradientBoostingRegressor(learning_rate=.1, random_state=1)
# Linear regression with feature selection:
lr_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', LinearRegression())])
# Elastic net regression with feature selection:
en_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', ElasticNet())])
# Support vector regression with feature selection:
sv_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', SVR(kernel='rbf'))])
# Random forest regression with feature selection:
rf_fs = Pipeline([('selector', SelectKBest(score_func=f_regression)),
('model', RandomForestRegressor(random_state=1))])
# Gradient boosting regression with feature selection:
gb_fs = Pipeline([('selector', SelectKBest(score_func=f_regression)),
('model', GradientBoostingRegressor(learning_rate=.1, random_state=1))])
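A quick aside on the selector step: SelectKBest(f_regression) scores each feature with a univariate F-statistic that is a monotonic function of its squared Pearson correlation with the target, so it ranks features exactly as the Pearson's rank based selection described earlier. A small sketch demonstrating the equivalence (using X and y from above):
# Ranking features by F-score is equivalent to ranking them by |Pearson r| with the target
f_scores, _ = f_regression(X, y)
check = pd.DataFrame({'f_score': pd.Series(f_scores, index=X.columns),
                      'abs_pearson_r': X.corrwith(y).abs()})
check.sort_values('f_score', ascending=False).head(10)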
The hyperparameter grid of each algorithm is now defined. The grids were manually narrowed during the tuning process by observing the inner scores of repeated nested cross-validation.
# Elastic net regression:
grid_en = {
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
# Support vector regression:
grid_sv = {
'model__C': [1, 5, 10, 50, 100, 500, 1000, 5000, 10000],
'model__gamma': [.0001, .0005, .001, .005, .01],
'model__epsilon': [1, 2, 4],
}
# Random forest regression:
grid_rf = {
'n_estimators': [100, 200, 300],
'max_features': [.2, .3, .4, .5, 1.0],
'max_depth': [2, 3, 4, 5, 6]
}
# Gradient boosting regression:
grid_gb = {
'n_estimators': [20, 25, 30, 35, 40, 45, 50],
'max_depth': [1, 2, 3],
'min_samples_leaf': [1, 2, 3, 4],
'max_features': [.7, .8, .9, 1.0]
}
# Linear regression with feature selection:
grid_lr_fs = {
'selector__k': range(7, 12)
}
# Elastic net regression with feature selection:
grid_en_fs = {
'selector__k': range(7, 12),
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
# Support vector regression with feature selection:
grid_sv_fs = {
'selector__k': range(7, 12),
'model__C': [1, 5, 10, 50, 100, 500, 1000, 5000, 10000],
'model__gamma': [.0001, .0005, .001, .005, .01],
'model__epsilon': [1, 2, 4],
}
# Random forest regression with feature selection:
grid_rf_fs = {
'selector__k': range(7, 12),
'model__n_estimators': [100, 200, 300],
'model__max_features': [.2, .3, .4, .5, 1.0],
'model__max_depth': [2, 3, 4, 5, 6]
}
# Gradient boosting regression with feature selection:
grid_gb_fs = {
'selector__k': range(7, 12),
'model__n_estimators': [20, 25, 30, 35, 40, 45, 50],
'model__max_depth': [1, 2, 3],
'model__min_samples_leaf': [1, 2, 3, 4],
'model__max_features': [.7, .8, .9, 1.0]
}
Repeated nested cross-validation is performed for each algorithm, and the results are saved in the dictionary, ncv_results.
I used Google Colab Pro to run the code below due to the significant computation time (several hours with a TPU) and saved the results in the file ncv_results.csv. In order to save time, I suggest that you skip ahead and import the file.
ncv_results = {}
# Elastic net regression:
ncv_en = nested_cv(X, y, en, grid_en)
ncv_results['en'] = ncv_en['outer_score']
# Support vector regression:
ncv_sv = nested_cv(X, y, sv, grid_sv)
ncv_results['sv'] = ncv_sv['outer_score']
# Random forest regression:
ncv_rf = nested_cv(X, y, rf, grid_rf)
ncv_results['rf'] = ncv_rf['outer_score']
# Gradient boosting regression:
ncv_gb = nested_cv(X, y, gb, grid_gb)
ncv_results['gb'] = ncv_gb['outer_score']
# Linear regression with feature selection:
ncv_lr_fs = nested_cv(X, y, lr_fs, grid_lr_fs)
ncv_results['lr_fs'] = ncv_lr_fs['outer_score']
# Elastic net regression with feature selection:
ncv_en_fs = nested_cv(X, y, en_fs, grid_en_fs)
ncv_results['en_fs'] = ncv_en_fs['outer_score']
# Support vector regression with feature selection:
ncv_sv_fs = nested_cv(X, y, sv_fs, grid_sv_fs)
ncv_results['sv_fs'] = ncv_sv_fs['outer_score']
# Random forest regression with feature selection:
ncv_rf_fs = nested_cv(X, y, rf_fs, grid_rf_fs)
ncv_results['rf_fs'] = ncv_rf_fs['outer_score']
# Gradient boosting regression with feature selection:
ncv_gb_fs = nested_cv(X, y, gb_fs, grid_gb_fs)
ncv_results['gb_fs'] = ncv_gb_fs['outer_score']
ncv_results = pd.DataFrame(ncv_results)
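The notebook assumes these results were then written to disk; presumably something like the following (a sketch, since the exact export step is not shown here), with the per-algorithm detail files such as ncv_en_fs.csv saved analogously:
# Hypothetical export step: persist the nested cross-validation results so they can be
# re-imported later without re-running the very slow computation
ncv_results.to_csv('ncv_results.csv', index=False)
pd.DataFrame(ncv_en_fs).to_csv('ncv_en_fs.csv')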
The nested cross-validation results from above are imported as a DataFrame.
ncv_results = pd.read_csv('ncv_results.csv')
The following chart shows the repeated nested cross-validation score distributions of each algorithm (with MSE as the scoring metric). The mean and quartiles of each distribution are displayed.
ncv_results.columns = ['Elastic Net','SVM','Random Forest','GBM','Linear Regression\n with FS','Elastic Net\n with FS',
'SVM\n with FS','Random Forest\n with FS','GBM\n with FS']
fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=ncv_results, orient='h', palette='Spectral', showmeans=True,
meanprops={'marker':'s', 'markerfacecolor':'white', 'markeredgecolor':'black'})
ax.set_xlabel('Mean Squared Error')
ax.set_title('Nested Cross Validation Score Distribution by Algorithm', fontsize=16)
ax.set_xticks(range(3, 13))
fig.tight_layout()
Each algorithm performs significantly better than the baseline model (9.75 MSE). Elastic net regression with Pearson's rank based feature selection has the lowest mean and median nested cross-validation score. Meanwhile, gradient boosting with feature selection, which performs slightly worse, has less score variance.
The results also show that Pearson's rank based feature selection improves the performance of every algorithm (linear regression was only performed with feature selection due to multicollinearity concerns).
The numerical results are all shown in the DataFrame below.
ncv = pd.melt(ncv_results, var_name='algorithm')
ncv = ncv.groupby('algorithm').value.agg(['mean', ('std', lambda x: np.std(x)), 'median']).round(2).sort_values('mean')
ncv.style.format('{:.2f}').set_properties(**{'text-align': 'center'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).set_table_attributes('style="font-size: 14px"')
The following function is used to perform repeated cross-validation with stratification.
def cv(X, y, model, param_grid):
y_binned = bin_series(y, 5)
cv_inner = stratify_cv(y_binned, n_splits=5, n_repeats=10, random_state=1)
gs = GridSearchCV(
estimator=model,
param_grid=param_grid,
scoring='neg_mean_squared_error',
n_jobs=-1,
cv=cv_inner
)
gs.fit(X, y)
return gs
To reiterate in more detail: nested cross-validation is the cross-validation assessment of the large-sample performance of a model chosen by a specific cross-validation protocol, where the protocol is the procedure for selecting the optimal cross-validatory chosen model given a pre-defined hyperparameter grid, number of folds, and number of repeats.
For each algorithm, I will now find the hyperparameters of the model chosen by its cross-validation protocol. The inner loop of the nested cross-validation procedure is applied to the entire dataset, and the results are saved in the dictionary, cv_results.
In order to save time, I suggest you skip past the code below and import the file containing the pre-computed results, cv_results.csv.
cv_results = {}
alg_names = ncv.index.to_list()
algs = [en_fs, gb_fs, rf_fs, lr_fs, rf, gb, sv_fs, en, sv]
param_grids = [grid_en_fs, grid_gb_fs, grid_rf_fs, grid_lr_fs, grid_rf, grid_gb, grid_sv_fs, grid_en, grid_sv]
for name, alg, grid in zip(alg_names, algs, param_grids):
cv_results[name] = [cv(X, y, alg, grid).best_params_]
The cross-validation results are imported as a DataFrame.
cv_results = pd.read_csv('cv_results.csv', index_col=0)
# Strings converted to dictionaries
for col in cv_results.columns:
cv_results[col] = cv_results[col].apply(lambda x: ast.literal_eval(x))
The following DataFrame displays the hyperparameters of the optimal cross-validatory chosen model for each algorithm.
def edit_dict(x):
x = {k: round(v, 1) if v > .01 else v for k, v in x.items()}
if any('__' in k for k in x):
return {k.split('__')[1]: v for k, v in x.items()}
return x
cv_results = pd.DataFrame(cv_results).T
cv_results[0] = cv_results[0].apply(edit_dict)
cv_results.columns = ['parameters']
cv_results.index.name = 'algorithm'
with pd.option_context('display.max_colwidth', 200):
display(cv_results.style.set_properties(**{'text-align': 'left'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).\
set_table_attributes('style="font-size: 14px"'))
For the best performing algorithm (elastic net regression with Pearson's rank based feature selection), I will now show how the optimal cross-validation score varies by number of features selected, k.
To accomplish this, I perform cross-validation with an expanded hyperparameter grid, where k takes on all values. As I previously stated, the hyperparameter grids were manually narrowed by checking the inner scores of nested cross-validation while tuning.
In order to save time, I suggest you skip past the code below and import the file containing the pre-computed results, cv_en_fs.csv.
grid_k_en_fs = {
'selector__k': range(1, 35),
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
cv_results_en_fs = cv(X, y, en_fs, grid_k_en_fs)
cv_en_fs = pd.DataFrame(cv_results_en_fs.cv_results_['params'])
cv_en_fs['mean_test_score'] = -cv_results_en_fs.cv_results_['mean_test_score']
cv_en_fs['std_test_score'] = cv_results_en_fs.cv_results_['std_test_score']
cv_en_fs = cv_en_fs.sort_values('mean_test_score').groupby('selector__k', as_index=False).first()
The cross-validation results from above are imported as a DataFrame.
cv_en_fs = pd.read_csv('cv_en_fs.csv', index_col=0)
The upper and lower bounds of the shaded area represent the standard deviation of the cross-validation scores of the optimal model for each value of k.
upper_bound = cv_en_fs.mean_test_score + cv_en_fs.std_test_score
lower_bound = cv_en_fs.mean_test_score - cv_en_fs.std_test_score
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=cv_en_fs, x='selector__k', y='mean_test_score', linewidth=2)
ax.fill_between(cv_en_fs.selector__k, upper_bound, lower_bound, alpha=.3)
ax.axvline(9, color='k', linestyle='--', alpha=.6)
ax.set_xlabel('Number of Features Selected (k)')
ax.set_ylabel('Mean Squared Error')
ax.set_title('Optimal Cross Validation Score by Features Selected (k)', fontsize=16)
fig.tight_layout()
This graphically shows that the value of the hyperparameter, k, of the optimal cross-validatory chosen model of elastic net regression with feature selection is indeed 9.
Now I will more closely analyze the repeated nested cross-validation results of elastic net regression with feature selection. The nested cross-validation results are imported as a DataFrame.
ncv_en_fs = pd.read_csv('ncv_en_fs.csv', index_col=0)
# Column of strings converted into dictionaries
ncv_en_fs['params'] = ncv_en_fs['params'].apply(lambda x: ast.literal_eval(x))
# Column of dictionaries converted into separate columns
ncv_en_fs = pd.DataFrame(ncv_en_fs, columns=['outer_score','inner_score']).join(pd.DataFrame(ncv_en_fs['params'].to_list()))
The average inner and outer scores of the repeated nested cross-validation are shown below.
The models are selected within the inner cross-validation, while they are assessed within the outer cross-validation of the procedure. The average inner score is slightly better than the average outer score. This is expected: when cross-validation is used for model selection, its error estimate is optimistically biased, whereas nested cross-validation gives a nearly unbiased estimate of the true error. The small difference between the inner and outer scores suggests that little overfitting occurred within the inner cross-validation, which bodes well for generalizability.
inner_score_en = round(ncv_en_fs.inner_score.mean(), 2)
outer_score_en = round(ncv_en_fs.outer_score.mean(), 2)
print('Inner CV Score:', inner_score_en)
print('Outer CV Score:', outer_score_en)
The following plots show the distribution of hyperparameters from the models selected within the inner cross-validation of the nested cross-validation procedure.
fig, axs = plt.subplots(ncols=3, figsize=(12.4, 3.5))
cols = ['selector__k','model__alpha','model__l1_ratio']
bins = [np.arange(6.5, 12, 1), np.arange(.05, 1.1, .1), np.arange(.05, 1.1, .1)]
for i, ax in enumerate(axs.ravel()):
sns.histplot(data=ncv_en_fs, x=cols[i], stat='probability', bins=bins[i], ax=ax, color=blues[0])
ax.set_ylim(0, 1)
ax.set(ylabel='Frequency')
ax.label_outer()
fig.suptitle('Hyperparameter Distributions from Nested Cross Validation', fontsize=16, x=.53, y=.97)
fig.tight_layout()
For each hyperparameter, there is one value present in at least 40% of the selected models. This shows that the optimal cross-validatory chosen model assessed via nested cross-validation is relatively stable, meaning that its predictions do not change much when the training data is slightly modified. This is a good sign, since model stability is also associated with generalizability.
Elastic net and gradient boosting regression, each combined with Pearson's rank based feature selection, are the best performing algorithms based on the results of repeated nested cross-validation. Gradient boosting only slightly underperforms elastic net in terms of mean and median scores, doing so with less score variance. However, elastic net is a less complex model with far greater interpretability. Also, the nested cross-validation results of the elastic net model show that it likely did not overfit the training data, is relatively stable, and therefore likely generalizes well to new data.
For these reasons, I have chosen to use elastic net regression with Pearson's rank based feature selection to create my final model.
The full algorithm is shown below.
en_fs
Now the hyperparameters of the optimal cross-validatory chosen model for elastic net regression with Pearson's rank based feature selection are displayed.
fm_params = cv_results.iloc[0,0]
fm_params
The hyperparameters are added to the algorithm, and it is fit on the full dataset.
final_model = clone(en_fs)
final_model.set_params(selector__k=fm_params['k'], model__alpha=fm_params['alpha'], model__l1_ratio=fm_params['l1_ratio'])
final_model.fit(X, y)
With an L1 ratio of 0.1, the performance of the model is likely very similar to that of ridge regression (L1 ratio of zero).
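For context, scikit-learn's ElasticNet minimizes 1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, so an L1 ratio of 0.1 weights the penalty mix heavily toward the ridge-style L2 term. A small sketch decomposing the penalty of the fitted coefficients:
# Split the elastic net penalty of the fitted model into its L1 (lasso-style)
# and L2 (ridge-style) parts using scikit-learn's parameterization
w = final_model.named_steps['model'].coef_
alpha, l1_ratio = fm_params['alpha'], fm_params['l1_ratio']
print('L1 penalty term:', round(alpha * l1_ratio * np.abs(w).sum(), 3))
print('L2 penalty term:', round(0.5 * alpha * (1 - l1_ratio) * (w ** 2).sum(), 3))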
The model works as follows: first the data is standardized, then 9 features are selected, and lastly the model multiplies each selected feature by its coefficient, sums the results, and adds an intercept to make a prediction.
The features and corresponding coefficients of the final model sorted by absolute value are displayed in the DataFrame below. Since the model standardizes the features, the impact of each feature on the prediction of the model is proportional to the absolute value of its coefficient.
mask = final_model.named_steps['selector'].get_support()
features_selected = X.loc[:, mask].columns.to_list()
coefs = final_model.named_steps['model'].coef_
coef_dict = {'feature': features_selected, 'coefficient': coefs}
coef_df = pd.DataFrame(coef_dict).set_index('feature').sort_values('coefficient', ascending=False, key=abs)
coef_df.style.format('{:.2f}').set_properties(**{'text-align': 'center'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).set_table_attributes('style="font-size: 14px"')
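To make the mechanics concrete, here is a minimal sketch (using the fitted final_model, the selector mask, and X from above) that reproduces the pipeline's prediction for a single player directly from the standardized features, the selected-feature mask, and the coefficients:
# Reproduce the pipeline prediction for the first player manually:
# standardize -> keep the selected features -> dot with coefficients and add the intercept
x_row = X.iloc[[0]]
x_scaled = final_model.named_steps['scaler'].transform(x_row)
x_selected = x_scaled[:, mask]
manual_pred = x_selected @ final_model.named_steps['model'].coef_ + final_model.named_steps['model'].intercept_
print(round(manual_pred[0], 3), round(final_model.predict(x_row)[0], 3))  # the two values should match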
The dataset initially contained 34 features, but feature selection reduced them to 9 for the model. When the model is fitted on the full dataset, the coefficient of Defensive WS/40 is reduced to zero, effectively giving the final model only 8 features. Since correlation rank based feature selection is incorporated, the 8 features of the final model are among those most correlated with 5-year peak performance.
According to my final model, the NBA Draft position of an NCAA player has the most influence on the player's 5-year NBA peak performance. The 5 most important features of the model (draft position, pre-draft ranking, rebound percentage, Defensive BPM, and age) provide 80% of the influence for its predictions. It feels great to arrive at such a simple, interpretable final model after all this work!
With a performance of 2.62 RMSE and a high likelihood of generalizability, I am very pleased with the results of my model, but it can certainly be improved. There are several avenues for improvement, including increasing the size of the dataset, adding more advanced features based on player tracking data, using more sophisticated feature selection methods such as sequential feature selection, and using Bayesian optimization for hyperparameter tuning. While these improvements may be complicated by data availability and computing power limitations, I am optimistic that I can implement them going forward. I may also consider using VORP or RAPTOR WAR instead of Win Shares to measure peak NBA performance in future modeling iterations, as they seem to value player contributions more accurately.
In the end, this project has taught me that it's no small feat to accurately predict the performance of NBA prospects, so maybe Celtics fans should go a little easier on Danny Ainge for drafting James Young after all...