I want to build a model to predict the 5-year peak performance of NCAA basketball players drafted by the NBA, measured by their maximum single-season Win Shares (an advanced statistic that assigns credit for team success to individuals on the team). The majority of an NBA player's development takes place during this 5-year time frame. Statistics from a player's final NCAA season along with biographical information will be used as input for the model. Accurate performance prediction of NBA prospects is highly valued by the ownership and management of NBA teams. It helps them effectively determine which players to acquire and significantly impacts team success, which often relates directly to the organization's financial success.
The necessary dataset will be created by gathering NCAA and NBA player statistics from www.sports-reference.com using sportsreference, a third-party open-source API, and BeautifulSoup. Only players drafted by the NBA from 2011-2015 will be included in the dataset, since several NCAA advanced statistics (e.g., Box Plus/Minus) are unavailable for earlier seasons. Due to the limited size of the dataset, I will use repeated stratified nested cross-validation for model assessment. Several different statistical and supervised machine learning techniques, coupled with Pearson's rank based feature selection, will be used to model the data. Mean squared error will serve as the evaluation metric.
All necessary packages are imported, and display options and plotting styles are set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from datetime import datetime, timedelta
from urllib.request import urlopen
from bs4 import BeautifulSoup
from scipy.stats import ks_2samp
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.base import clone
import ast
from sportsreference.ncaab.roster import Player as Player_ncaa
from sportsreference.nba.roster import Player as Player_nba
from sportsreference.ncaab.teams import Teams
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)
pd.options.display.multi_sparse = False
sns.set(style='darkgrid')
blues = sns.color_palette('Blues_r')
The sportsreference API pulls statistics from the relevant websites of www.sports-reference.com: www.sports-reference.com/cbb (NCAA stats) and www.basketball-reference.com (NBA stats). sportsreference.ncaab.teams.Teams (imported as Teams) retrieves NCAA team statistics, while sportsreference.ncaab.roster.Player (imported as Player_ncaa) and sportsreference.nba.roster.Player (imported as Player_nba) retrieve NCAA and NBA player statistics, respectively.
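For example, a single player's career statistics can be pulled directly with his sports-reference ID (a minimal sketch; the IDs below are purely illustrative of the two ID formats and are not players in this dataset):
# Illustrative IDs only: NCAA pages use 'first-last-1'-style IDs, NBA pages use 'lastfi01'-style IDs
ncaa_stats = Player_ncaa('zion-williamson-1').dataframe  # one row per NCAA season plus a career row
nba_stats = Player_nba('willizi01').dataframe            # one row per NBA season plus a career row
print(ncaa_stats.shape, nba_stats.shape)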
Using the sportsreference API, a dictionary is created with multiple statistics for each NCAA basketball team from 2011 - 2015. The dictionary is then converted into the DataFrame, team_df.
team_dict = {'year': [], 'team': [], 'pace': [], 'srs': [], 'sos': []}
for year in range(2011, 2016):
for team in Teams(year):
pace = team.pace
srs = team.simple_rating_system
sos = team.strength_of_schedule
team_dict['year'].append(year)
team_dict['team'].append(team.abbreviation)
team_dict['pace'].append(pace)
team_dict['srs'].append(srs)
team_dict['sos'].append(sos)
team_df = pd.DataFrame(team_dict)
team_df.set_index(['year', 'team'], drop=True, inplace=True)
team_df.head()
The following function calculates the age of an NCAA player on the date of the NBA draft using the sportsreference API.
def get_player_age(nba_id, draft_year):
birth_date = Player_nba(nba_id).birth_date
datetimeFormat = '%Y-%m-%d'
draft_date = str(draft_year) + '-06-20'
diff = datetime.strptime(draft_date, datetimeFormat) - birth_date
days = diff.days
return round(days/365, 2)
This function uses BeautifulSoup to return the RSCI (Recruiting Services Consensus Index) Ranking of an NCAA player from www.sports-reference.com.
The RSCI ranks the top 100 high school players each year. I have decided to give unranked players an RSCI ranking of 101.
def get_rsci_rank(ncaa_id):
    url = 'https://www.sports-reference.com/cbb/players/{}.html'.format(ncaa_id)
    html = urlopen(url)
    # Specify the parser explicitly to avoid BeautifulSoup's default-parser warning
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find('strong', text='RSCI Top 100:')
    # Give unranked players an RSCI rank of 101
    if element is None:
        return 101
    rsci_text = element.parent.text
    # The text begins with 'RSCI Top 100: ' (14 characters); the rank is the integer
    # that follows, ending at the next space
    rank_text = rsci_text[14:]
    space_idx = rank_text.index(' ')
    return int(rank_text[:space_idx])
The source_data.csv file contains both the www.sports-reference.com NCAA ID and NBA ID of each player so that both their NCAA and NBA statistics can be retrieved. The file also includes other information such as ESPN pre-draft ranking and projected NBA position. The player height and weight data is taken from the NBA Draft Combine.
source_data.csv is converted into a DataFrame and then a list of dictionaries, source_dict_list.
source_df = pd.read_csv('source_data.csv')
source_dict_list = source_df.to_dict('records')
source_df.head()
For each dictionary (representing a player) in source_dict_list, a DataFrame is created containing biographical information and NCAA statistics from the player's final college season, with the use of the sportsreference API, team_df, get_player_age(), and get_rsci_rank().
A list is formed of all the player DataFrames, and then the DataFrames are all concatenated into one DataFrame, ncaa_df.
%%time
ncaa_list = []
for index in source_dict_list:
player = index['player']
ncaa_id = index['ncaa_id']
nba_id = index['nba_id']
draft_pick = index['draft_pick']
draft_year = index['draft_year']
pre_draft_rank = index['espn_pre_draft_rank']
position_nba = index['projected_pos_nba']
height = index['height']
weight = index['weight']
# Return a DataFrame of all of the player's college stats
player_df = Player_ncaa(ncaa_id).dataframe
# Remove the career total row of the DataFrame and return only the row of the last season
player_df = player_df.drop(('Career',)).tail(1)
# Assign the player's NBA ID to the player_id column for merging later
player_df.player_id = nba_id
# Reset the index, moving the original index of seasons to a column named 'season_ncaa'
player_df = player_df.reset_index().rename(columns={'level_0': 'season_ncaa'})
# Make new columns
player_df['player'] = player
player_df['draft_pick'] = draft_pick
player_df['pre_draft_rank'] = pre_draft_rank
player_df['position_nba'] = position_nba
player_df['rsci_rank'] = get_rsci_rank(ncaa_id)
player_df['age'] = get_player_age(nba_id, draft_year)
# Get value of the player's college team abbreviation in uppercase
team_ncaa = player_df.loc[0, 'team_abbreviation'].upper()
# Change values to those from source file
player_df['height'] = height
player_df['weight'] = weight
# Make new columns of stats from the player's college team
player_df['pace'] = team_df.loc[(draft_year, team_ncaa), 'pace']
player_df['sos'] = team_df.loc[(draft_year, team_ncaa), 'sos']
player_df['srs'] = team_df.loc[(draft_year, team_ncaa), 'srs']
# Append the player's DataFrame to ncaa_list
ncaa_list.append(player_df)
# Create ncaa_df DataFrame from all the player DataFrames in ncaa_list
ncaa_df = pd.concat(ncaa_list)
# Reset the index of ncaa_df
ncaa_df.reset_index(drop=True, inplace=True)
# Slice ncaa_df so that it only includes the necessary columns
ncaa_df = ncaa_df.loc[:, ['player_id', 'player', 'season_ncaa', 'draft_pick', 'pre_draft_rank', 'rsci_rank', 'position_nba',
'age', 'height', 'weight', 'assist_percentage', 'block_percentage', 'box_plus_minus', 'defensive_box_plus_minus',
'defensive_rebound_percentage', 'defensive_win_shares', 'effective_field_goal_percentage', 'field_goal_attempts',
'field_goal_percentage', 'field_goals', 'free_throw_attempt_rate', 'free_throw_percentage', 'games_played', 'games_started',
'minutes_played', 'offensive_box_plus_minus', 'offensive_rebound_percentage', 'offensive_win_shares', 'personal_fouls',
'player_efficiency_rating', 'points', 'points_produced', 'steal_percentage', 'three_point_attempt_rate',
'three_point_percentage', 'total_rebound_percentage', 'true_shooting_percentage', 'turnover_percentage',
'two_point_percentage', 'usage_percentage', 'win_shares', 'win_shares_per_40_minutes', 'pace', 'sos', 'srs']]
ncaa_df.head()
New columns, such as minutes_per_game, are added to ncaa_df consisting of statistics derived from several other columns.
ncaa_df['minutes_per_game'] = round(ncaa_df.minutes_played / ncaa_df.games_played, 1)
ncaa_df['offensive_ws_per_40_minutes'] = round(ncaa_df.offensive_win_shares * 40 / ncaa_df.minutes_played, 3)
ncaa_df['defensive_ws_per_40_minutes'] = round(ncaa_df.defensive_win_shares * 40 / ncaa_df.minutes_played, 3)
ncaa_df['fouls_per_40_minutes'] = round(ncaa_df.personal_fouls * 40 / ncaa_df.minutes_played, 1)
ncaa_df['pts_per_100_poss'] = \
round((ncaa_df.points * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df['fga_per_100_poss'] = \
round((ncaa_df.field_goal_attempts * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df['pts_produced_per_100_poss'] = \
round((ncaa_df.points_produced * 40 * 100) / (ncaa_df.pace * ncaa_df.minutes_played), 1)
ncaa_df.head()
This function removes any rows (seasons) from an NBA player's DataFrame that occur more than 5 years after their draft year.
def only_5_years_after_draft(df, draft_year):
for i in range(len(df), 0, -1):
last_index = df.index[-1][0]
last_year = int('20' + last_index[-2:])
if last_year - draft_year > 5:
df = df.drop(df.index[-1])
return df
When the sportsreference API is used to get the NBA statistics of a drafted player who never played in the NBA, it returns an empty DataFrame. In that case I create a placeholder DataFrame for the player, and it must have the same columns as the other NBA player DataFrames so that they can all be concatenated.
columns_nba contains the columns of the DataFrame returned for an NBA player by the sportsreference API.
columns_nba = ['season_nba', 'and_ones', 'assist_percentage', 'assists', 'block_percentage','blocking_fouls', 'blocks',
'box_plus_minus', 'center_percentage', 'defensive_box_plus_minus', 'defensive_rebound_percentage', 'defensive_rebounds',
'defensive_win_shares', 'dunks', 'effective_field_goal_percentage', 'field_goal_attempts',
'field_goal_perc_sixteen_foot_plus_two_pointers', 'field_goal_perc_ten_to_sixteen_feet',
'field_goal_perc_three_to_ten_feet', 'field_goal_perc_zero_to_three_feet', 'field_goal_percentage', 'field_goals',
'free_throw_attempt_rate', 'free_throw_attempts', 'free_throw_percentage', 'free_throws', 'games_played', 'games_started',
'half_court_heaves', 'half_court_heaves_made', 'height', 'lost_ball_turnovers', 'minutes_played', 'nationality',
'net_plus_minus', 'offensive_box_plus_minus', 'offensive_fouls', 'offensive_rebound_percentage', 'offensive_rebounds',
'offensive_win_shares', 'on_court_plus_minus', 'other_turnovers', 'passing_turnovers', 'percentage_field_goals_as_dunks',
'percentage_of_three_pointers_from_corner', 'percentage_shots_three_pointers', 'percentage_shots_two_pointers',
'percentage_sixteen_foot_plus_two_pointers', 'percentage_ten_to_sixteen_footers', 'percentage_three_to_ten_footers',
'percentage_zero_to_three_footers', 'personal_fouls', 'player_efficiency_rating', 'player_id', 'point_guard_percentage',
'points', 'points_generated_by_assists', 'position', 'power_forward_percentage', 'salary', 'shooting_distance',
'shooting_fouls', 'shooting_fouls_drawn', 'shooting_guard_percentage', 'shots_blocked', 'small_forward_percentage',
'steal_percentage', 'steals', 'take_fouls', 'team_abbreviation', 'three_point_attempt_rate', 'three_point_attempts',
'three_point_percentage', 'three_point_shot_percentage_from_corner', 'three_pointers', 'three_pointers_assisted_percentage',
'total_rebound_percentage', 'total_rebounds', 'true_shooting_percentage', 'turnover_percentage', 'turnovers',
'two_point_attempts', 'two_point_percentage', 'two_pointers', 'two_pointers_assisted_percentage', 'usage_percentage',
'value_over_replacement_player', 'weight', 'win_shares', 'win_shares_per_48_minutes']
Now I will create the DataFrame, nba_df, which contains every player's relevant NBA statistics.
%%time
nba_list = []
for index in source_dict_list:
nba_id = index['nba_id']
draft_year = index['draft_year']
team_games_2020 = index['team_games_2020']
# DataFrame of all of the player's NBA stats
player_df = Player_nba(nba_id).dataframe
# If a player did not play in the NBA, an empty DataFrame is returned
if player_df.empty:
        # Create a 1-row DataFrame for the player with 0 values for every column
        data = [tuple(0 for _ in columns_nba)]
        player_df = pd.DataFrame.from_records(data, columns=columns_nba)
# Assign the player's NBA ID to the player_id column for merging later
player_df.player_id = nba_id
else:
# Get the index number of the row with index ('Career',)
idx = player_df.index.get_loc(('Career',))
# Select all rows of the DataFrame before the career total row, then select only the first 5 rows of that slice
player_df = player_df.iloc[:idx,:].head(5)
# Select only the rows(seasons) that occurred at most 5 years after the player was drafted
player_df = only_5_years_after_draft(player_df, draft_year)
# Reset the index, moving the original index of seasons to a column named 'season_nba'
player_df = player_df.reset_index().rename(columns={'level_0': 'season_nba'})
# Create column for team games in 2020
player_df['team_games_2020'] = team_games_2020
# Append the player's DataFrame to nba_list
nba_list.append(player_df)
# Create nba_df DataFrame from all the player DataFrames in nba_list
nba_df = pd.concat(nba_list)
# Reset the index of nba_df
nba_df.reset_index(drop=True, inplace=True)
# Slice nba_df so that it only includes the necessary columns
nba_df = nba_df.loc[:,['player_id', 'season_nba', 'win_shares', 'team_games_2020']]
# Rename columns
nba_df.columns = ['player_id', 'season_nba', 'win_shares_nba', 'team_games_2020']
# Replace None values with 0
nba_df.fillna(0, inplace=True)
nba_df.head()
Now nba_df and ncaa_df are merged together.
df = pd.merge(ncaa_df, nba_df, on='player_id')
df.head()
If a player has negative win shares for an NBA season, the amount is converted to zero. This is necessary so that drafted NBA players who never play in the NBA (0 win shares) are not considered more valuable than NBA players with negative win shares.
For the shortened seasons in 2012 and 2020, I prorate win shares to 82 games.
df.win_shares_nba = df.win_shares_nba.apply(lambda x: 0 if x < 0 else x)
# The 2011-2012 NBA season was shortened to 66 games
# Prorate all win shares for the 2011-2012 NBA season to 82 games
df.win_shares_nba = \
df.apply(lambda x: round(82/66 * x.win_shares_nba, 1) if x.season_nba == '2011-12' else x.win_shares_nba, axis=1)
# Many of the 2015 draftees played on 2020 NBA teams that played less than 82 games
# Prorate the 2020 win shares for those players to 82 games
df.win_shares_nba = \
df.apply(lambda x: round((82/x.team_games_2020) * x.win_shares_nba, 1) \
if (x.team_games_2020 > 0 and x.season_nba == '2019-20') else x.win_shares_nba, axis=1)
I now change the DataFrame so that each player has one row containing the statistics from both their final college season and their maximum win share NBA season.
index = df.groupby(['player_id'])['win_shares_nba'].transform(max) == df['win_shares_nba']
df = df[index].drop_duplicates('player_id')
df.reset_index(drop=True, inplace=True)
df.head()
Several unnecessary columns are removed.
df.drop(columns=['player_id','season_ncaa','season_nba','team_games_2020'], inplace=True)
df.info()
three_point_percentage is the only column with null values.
The null values are replaced with zeros.
df.three_point_percentage.fillna(0, inplace=True)
Now I confirm that the remaining 'object' columns do not contain multiple data types.
{col: set(map(type, df[col])) for col in df.select_dtypes(include=[object])}
Upon inspection, I found very few errors in the dataset.
Only 2 players, Andre Drummond and Jarnell Stokes, need their RSCI rankings corrected.
df.loc[df['player'] == 'Andre Drummond', 'rsci_rank'] = 2
df.loc[df['player'] == 'Jarnell Stokes', 'rsci_rank'] = 14
The data types of several columns are now converted to 'int64'.
convert_dict = {
'field_goal_attempts': 'int64', 'field_goals': 'int64', 'games_played': 'int64', 'games_started': 'int64',
'minutes_played': 'int64', 'personal_fouls': 'int64', 'points': 'int64', 'points_produced': 'int64'
}
df = df.astype(convert_dict)
Before finalizing the dataset, I will remove redundant features in order to reduce multicollinearity as well as model complexity. Multicollinearity is problematic because it makes it difficult to determine the individual effects of features.
The redundant features consist of both linear combinations and nonlinear functions of other features. The former possess perfect multicollinearity, violating the assumptions of ordinary least squares regression.
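As a quick sanity check on the first group (a minimal sketch, assuming these statistics follow their standard definitions), Box Plus/Minus should equal the sum of its offensive and defensive components up to rounding:
# BPM is defined as OBPM + DBPM, so these three columns are perfectly collinear
# (any small deviation here is just rounding in the published statistics)
print((df.box_plus_minus - (df.offensive_box_plus_minus + df.defensive_box_plus_minus)).abs().max())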
# Linear combinations of features
lc_features = ['offensive_box_plus_minus', 'offensive_win_shares', 'defensive_rebound_percentage',
'offensive_ws_per_40_minutes']
# Nonlinear functions of features
nf_features = ['field_goal_attempts', 'win_shares', 'defensive_win_shares', 'field_goals', 'personal_fouls', 'points',
'points_produced', 'minutes_played', 'field_goal_percentage', 'effective_field_goal_percentage',
'true_shooting_percentage', 'pts_per_100_poss']
# Drop duplicative features
df.drop(columns=lc_features + nf_features, inplace=True)
The final dataset is now saved as dataset.csv.
df.to_csv('dataset.csv')
The file dataset.csv contains statistical and biographical data for each player. All statistics are assumed to be from a player's final college season unless otherwise stated. The columns of the dataset are as follows:
The dataset created in the previous section is imported as a DataFrame.
df = pd.read_csv('dataset.csv', index_col=0)
df.head()
The column containing player names is removed for now.
player = df.loc[:, 'player']
df.drop(columns='player', inplace=True)
The DataFrame now contains 235 rows (players) and 34 columns. The target variable is win_shares_nba, while the remaining 33 columns are features for prediction.
df.shape
df.describe()
The value counts are shown for the only categorical variable, position_nba. The wing position is represented the most in the dataset. This makes sense given the NBA's trend toward positionless basketball.
df.position_nba.value_counts()
The following plot shows the 5-year peak performance distribution of the players in the NBA (measured in Win Shares).
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=df, x="win_shares_nba", stat='probability', kde=True, bins=np.arange(0, 14.1, 1), color=blues[0])
ax.set_xlabel('# of NBA Win Shares')
ax.set_ylabel('% of Players')
ax.set_title('Distribution of NBA Win Shares - 5 Year Max', fontsize=16)
fig.tight_layout()
Nearly half the players have a 5-year NBA peak performance of less than 2 Win Shares. The number of players represented seems to decrease exponentially as Win Shares increase. This shows that only a small percentage of NCAA players drafted by the NBA have successful careers.
The top 50 players based on 5-year NBA peak performance are shown below. Nearly all the players who became All-Stars have more than 8 Win Shares, a level of performance that only 8% of the players were able to attain.
pd.concat([player, df.loc[:,['position_nba','win_shares_nba']]], axis=1).\
sort_values('win_shares_nba', ascending=False).reset_index(drop=True).head(50)
The following chart shows the 5-year peak performance distributions for each projected NBA position.
fig, axs = plt.subplots(nrows=3, figsize=(10, 9))
cats = ['Guard','Wing','Big']
for i, ax in enumerate(axs.ravel()):
data = df[df.position_nba == cats[i]]
sns.histplot(data=data, x="win_shares_nba", stat='probability', kde=True, bins=np.arange(0, 14.1, 1), ax=ax, color=blues[0])
ax.set_title(cats[i])
ax.set_xlim(-.7, 14.7)
ax.set_ylim(0, .5)
ax.set(xlabel='# of NBA Win Shares', ylabel='% of Players')
ax.label_outer()
fig.suptitle('NBA Win Shares (5 Year Max) Distribution by Projected NBA Position', fontsize=16, x=.53, y=1)
fig.tight_layout()
The 5-year peak performance data of guards looks the least like that of the other positions. However, based on the p-values of Kolmogorov-Smirnov tests, we cannot say with confidence that it comes from a different distribution than the others. This suggests that projected NBA position has, at most, a small effect on 5-year peak performance.
ws_guard = df.loc[df.position_nba == 'Guard', 'win_shares_nba']
ws_wing = df.loc[df.position_nba == 'Wing', 'win_shares_nba']
ws_big = df.loc[df.position_nba == 'Big', 'win_shares_nba']
dist_list = [['Guard', ws_guard], ['Wing', ws_wing], ['Big', ws_big]]
for i in range(3):
for j in range(i + 1, 3):
print(dist_list[i][0], ' vs. ', dist_list[j][0], ': ', ks_2samp(dist_list[i][1], dist_list[j][1]), sep='')
Here is the correlation matrix heatmap of the dataset. I first changed the categorical feature, position_nba, into dummy/indicator variables so that the correlations of its values would also show.
plt.figure(figsize=(13, 11))
corr = pd.get_dummies(df).corr()
ax = sns.heatmap(corr, xticklabels=True, yticklabels=True, linewidths=.004, square=True, cmap='Blues')
The features are now shown sorted in descending order based on the strength of their correlation with the target variable.
corr.win_shares_nba.sort_values(ascending=False, key=abs)[1:]
This shows that NBA Draft position, ESPN pre-draft ranking, age, Box Plus/Minus, and Defensive Box Plus/Minus have the strongest correlation with 5-year peak performance. It also confirms that projected NBA position has little correlation with the target variable.
The correlations of these features with the target variable are shown below in scatterplots.
fig, axs = plt.subplots(ncols=5, figsize=(12.5, 3.2))
cols = ['draft_pick','pre_draft_rank','age','box_plus_minus','defensive_box_plus_minus']
for i, ax in enumerate(axs.ravel()):
sns.scatterplot(x=cols[i], y='win_shares_nba', data=df, ax=ax)
ax.set(ylabel='# of NBA Win Shares')
ax.label_outer()
fig.suptitle('Feature Correlations with NBA Win Shares (5 Year Max)', fontsize=16, x=.53, y=1)
fig.tight_layout()
The plots show that draft position, pre-draft ranking, and age are negatively correlated with 5-year peak performance, while BPM and defensive BPM are positively correlated with the target variable.
The features with the strongest inter-feature correlations are now shown below.
cs = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
cs = cs.stack().sort_values(ascending=False)
cs[cs > .8]
The strong positive correlations between these features all seem reasonable. Good rebounders are generally also good offensive rebounders, players are generally ranked close to their eventual draft position, and advanced performance statistics (i.e. WS/40, PER, and BPM) tend to rate players similarly.
Now I will create dummy variables for the lone categorical variable, position_nba. I will drop the first resulting indicator, position_nba_Big, since only k - 1 indicator variables are needed to represent the k categories of a categorical variable.
# Convert feature into dummy variables
df = pd.get_dummies(df, columns=['position_nba'], drop_first=True)
The resulting dataset now has 34 features (independent variables), including 2 indicator variables.
The strong correlations shown previously between several of the features may indicate that there is still a significant amount of multicollinearity among the features of the dataset.
Multicollinearity is present when a feature in a multiple regression model can be linearly predicted from the other features with high accuracy. As I said previously, multicollinearity is problematic because it makes it difficult to determine the individual effects of features. High multicollinearity is particularly troublesome when it comes to ordinary least squares regression.
In the graph below, I show the amount of multicollinearity (as measured by the variance inflation factor) among the features as the number of features (ordered by strength of correlation with the target variable) varies.
# Filter dataset to only include features, ordered by strength of correlation with target
cols_corr = df.corr().win_shares_nba.abs().sort_values(ascending=False)[1:].index
df_corr = df.loc[:, cols_corr]
# Return the maximum VIF among the k most target-correlated features, for each k,
# dropping the least correlated remaining feature at every step
def VIF(data):
    result = {'features': [], 'vif': []}
    cols = data.shape[1]
    for i in range(cols):
        # Add an intercept column so that the VIFs are computed correctly
        data = add_constant(data)
        # Maximum VIF among the current features (column 0 is the constant)
        max_vif = max(variance_inflation_factor(data.values, j) for j in range(1, data.shape[1]))
        result['vif'].append(max_vif)
        result['features'].append(data.shape[1] - 1)
        # Drop the constant column and the last (least target-correlated) feature
        data = data.iloc[:, 1:-1]
    return result
data = pd.DataFrame(VIF(df_corr))
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(x=data.features, y=data.vif, data=data, linewidth=2)
ax.axhline(10, .04, .96, color='g', linewidth=1.4)
ax.set_xlabel('# of Features')
ax.set_ylabel('Variance Inflation Factor')
ax.set_title('Multicollinearity by Number of Features', fontsize=16)
ax.text(31, 56, 'High \n Multicollinearity \n (VIF > 10)', transform=ax.transData, color='g', horizontalalignment='center',
fontsize=13)
fig.tight_layout()
The multicollinearity of a dataset is generally considered high when the maximum VIF of the features is greater than 10. Based on this rule of thumb, the graph shows that multicollinearity becomes high approximately after the first 10 features (most correlated with the target variable) are included in the model. This information will guide some of my modeling decisions.
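For reference, the VIF of a single feature equals 1 / (1 - R^2), where R^2 comes from regressing that feature on all of the other features. Here is a minimal sketch of that calculation, using box_plus_minus (still a column of df_corr at this point) as an example:
# Compute one feature's VIF from first principles: regress it on the remaining features
# and convert the resulting R^2 into 1 / (1 - R^2)
feature = 'box_plus_minus'
X_others = df_corr.drop(columns=feature)
r_squared = LinearRegression().fit(X_others, df_corr[feature]).score(X_others, df_corr[feature])
print('VIF of', feature, '=', round(1 / (1 - r_squared), 1))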
During the modeling process, I will assess the performance of 9 different optimized modeling algorithms. The best performing algorithm will then be used to create the final model. The following regression models will be utilized: elastic net, support vector, random forest, gradient boosting, and linear regression. I will also incorporate Pearson's rank based feature selection. For each model but linear regression, 2 modeling algorithms will be created: one with feature selection and one without. Due to high multicollinearity, linear regression will only be used with feature selection.
I have chosen to use nested cross-validation during the modeling process. Normally when modeling, the dataset is split into a train and test set, cross-validation is performed on the train set, and the performance of the best model is assessed on the test set. However, nested cross-validation allows you to perform both model selection and assessment without the need to set aside a test set. This is extremely useful when your dataset is small like mine (only 235 samples), since you want as many samples for training as possible.
With nested cross-validation, the cross-validation procedure for model selection is nested inside the cross-validation procedure for model assessment. Instead of one tuned model being selected, several are selected, each is assessed on a different part of the dataset, and the scores are averaged to give a performance score. Nested cross-validation assesses the performance of the optimal cross-validatory chosen model, the model obtained by applying the inner cross-validation procedure to the entire dataset.
Both repetition and stratification will be applied when performing nested cross-validation. Repetition is necessary for reliable model assessment due to the high variance of any single nested cross-validation run. Meanwhile, stratification increases the likelihood that each cross-validation fold is representative of the dataset as a whole.
The dataset is now divided into y, the target variable, and X which contains the predictor variables.
X = df.drop(columns='win_shares_nba')
y = df.win_shares_nba
Now I calculate the MSE of a very simple baseline model (the mean of the target variable) as a reference point for which to compare the results of the more advanced algorithms.
baseline_mse = np.mean(np.square(y - np.mean(y)))
print('Baseline MSE:', round(baseline_mse, 2))
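Note that this baseline MSE is simply the (population) variance of the target variable, since the baseline prediction is the mean of y:
# Equivalent calculation: the MSE of always predicting the mean equals Var(y) with ddof=0
print('Variance of y:', round(np.var(y), 2))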
The following functions are used to bin the continuous target variable, stratify the repeated cross-validation folds based on the binned target variable, and perform nested cross-validation with repetition and stratification.
# Build a list of (train, test) index arrays for repeated stratified k-fold CV,
# stratifying on the (binned) series that is passed in
def stratify_cv(series, n_splits, n_repeats, random_state):
    cv = []
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)
    for train_index, test_index in rskf.split(series, series):
        cv.append((train_index, test_index))
    return cv

# Bin the continuous target variable into quantile-based bins so that the
# cross-validation folds can be stratified on it
def bin_series(series, bins):
    lst = [min(series), max(series) + 1]
    for i in range(1, bins):
        lst.insert(-1, series.quantile(i / bins))
    bin_arr = np.array(lst)
    indices = np.digitize(series, bin_arr)
    return pd.Series(indices).astype(str)

# Nested cross-validation: the outer loop assesses performance, while the inner loop
# (GridSearchCV) selects hyperparameters on each outer training fold
def nested_cv(X, y, model, param_grid):
    y_binned = bin_series(y, 5)
    cv_outer = stratify_cv(y_binned, n_splits=5, n_repeats=10, random_state=1)
    result = {'outer_score': [], 'inner_score': [], 'params': []}
    for train_index, test_index in cv_outer:
        X_train, X_test = X.loc[train_index, :], X.loc[test_index, :]
        y_train, y_test = y.loc[train_index], y.loc[test_index]
        y_binned_train = y_binned.loc[train_index]
        # Inner cross-validation folds, stratified on the binned target of the training fold
        cv_inner = stratify_cv(y_binned_train, n_splits=5, n_repeats=10, random_state=1)
        gs = GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            scoring='neg_mean_squared_error',
            n_jobs=-1,
            cv=cv_inner
        )
        gs.fit(X_train, y_train)
        # Assess the selected model on the held-out outer test fold
        yhat = gs.best_estimator_.predict(X_test)
        mse = mean_squared_error(y_test, yhat)
        result['outer_score'].append(mse)
        result['inner_score'].append(-gs.best_score_)
        result['params'].append(gs.best_params_)
    return result
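As a quick sanity check (a small sketch using y from above), the quantile-based binning should split the target variable into roughly equal-sized bins, which is what allows each stratified fold to mirror the overall target distribution:
# Each of the 5 quantile-based bins should contain roughly the same number of players
bin_series(y, 5).value_counts().sort_index()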
All of the modeling algorithms are defined below.
# Elastic net regression:
en = Pipeline([('scaler', StandardScaler()),
('model', ElasticNet())])
# Support vector regression:
sv = Pipeline([('scaler', StandardScaler()),
('model', SVR(kernel='rbf'))])
# Random forest regression:
rf = RandomForestRegressor(random_state=1)
# Gradient boosting regression:
gb = GradientBoostingRegressor(learning_rate=.1, random_state=1)
# Linear regression with feature selection:
lr_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', LinearRegression())])
# Elastic net regression with feature selection:
en_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', ElasticNet())])
# Support vector regression with feature selection:
sv_fs = Pipeline([('scaler', StandardScaler()),
('selector', SelectKBest(score_func=f_regression)),
('model', SVR(kernel='rbf'))])
# Random forest regression with feature selection:
rf_fs = Pipeline([('selector', SelectKBest(score_func=f_regression)),
('model', RandomForestRegressor(random_state=1))])
# Gradient boosting regression with feature selection:
gb_fs = Pipeline([('selector', SelectKBest(score_func=f_regression)),
('model', GradientBoostingRegressor(learning_rate=.1, random_state=1))])
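A quick aside on the selector step: SelectKBest(f_regression) scores each feature with a univariate F-statistic that is a monotonic function of its squared Pearson correlation with the target, so it ranks features exactly as the Pearson's rank based selection described earlier. A small sketch demonstrating the equivalence (using X and y from above):
# Ranking features by F-score is equivalent to ranking them by |Pearson r| with the target
f_scores, _ = f_regression(X, y)
check = pd.DataFrame({'f_score': pd.Series(f_scores, index=X.columns),
                      'abs_pearson_r': X.corrwith(y).abs()})
check.sort_values('f_score', ascending=False).head(10)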
The hyperparameter grid of each algorithm is now defined. The grids were manually narrowed during the tuning process by observing the inner scores of repeated nested cross-validation.
# Elastic net regression:
grid_en = {
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
# Support vector regression:
grid_sv = {
'model__C': [1, 5, 10, 50, 100, 500, 1000, 5000, 10000],
'model__gamma': [.0001, .0005, .001, .005, .01],
'model__epsilon': [1, 2, 4],
}
# Random forest regression:
grid_rf = {
'n_estimators': [100, 200, 300],
'max_features': [.2, .3, .4, .5, 1.0],
'max_depth': [2, 3, 4, 5, 6]
}
# Gradient boosting regression:
grid_gb = {
'n_estimators': [20, 25, 30, 35, 40, 45, 50],
'max_depth': [1, 2, 3],
'min_samples_leaf': [1, 2, 3, 4],
'max_features': [.7, .8, .9, 1.0]
}
# Linear regression with feature selection:
grid_lr_fs = {
'selector__k': range(7, 12)
}
# Elastic net regression with feature selection:
grid_en_fs = {
'selector__k': range(7, 12),
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
# Support vector regression with feature selection:
grid_sv_fs = {
'selector__k': range(7, 12),
'model__C': [1, 5, 10, 50, 100, 500, 1000, 5000, 10000],
'model__gamma': [.0001, .0005, .001, .005, .01],
'model__epsilon': [1, 2, 4],
}
# Random forest regression with feature selection:
grid_rf_fs = {
'selector__k': range(7, 12),
'model__n_estimators': [100, 200, 300],
'model__max_features': [.2, .3, .4, .5, 1.0],
'model__max_depth': [2, 3, 4, 5, 6]
}
# Gradient boosting regression with feature selection:
grid_gb_fs = {
'selector__k': range(7, 12),
'model__n_estimators': [20, 25, 30, 35, 40, 45, 50],
'model__max_depth': [1, 2, 3],
'model__min_samples_leaf': [1, 2, 3, 4],
'model__max_features': [.7, .8, .9, 1.0]
}
Repeated nested cross-validation is performed for each algorithm, and the results are saved in the dictionary, ncv_results.
I used Google Colab Pro to run the code below due to the significant computation time (several hours with a TPU) and saved the results in the file ncv_results.csv. In order to save time, I suggest that you skip ahead and import the file.
ncv_results = {}
# Elastic net regression:
ncv_en = nested_cv(X, y, en, grid_en)
ncv_results['en'] = ncv_en['outer_score']
# Support vector regression:
ncv_sv = nested_cv(X, y, sv, grid_sv)
ncv_results['sv'] = ncv_sv['outer_score']
# Random forest regression:
ncv_rf = nested_cv(X, y, rf, grid_rf)
ncv_results['rf'] = ncv_rf['outer_score']
# Gradient boosting regression:
ncv_gb = nested_cv(X, y, gb, grid_gb)
ncv_results['gb'] = ncv_gb['outer_score']
# Linear regression with feature selection:
ncv_lr_fs = nested_cv(X, y, lr_fs, grid_lr_fs)
ncv_results['lr_fs'] = ncv_lr_fs['outer_score']
# Elastic net regression with feature selection:
ncv_en_fs = nested_cv(X, y, en_fs, grid_en_fs)
ncv_results['en_fs'] = ncv_en_fs['outer_score']
# Support vector regression with feature selection:
ncv_sv_fs = nested_cv(X, y, sv_fs, grid_sv_fs)
ncv_results['sv_fs'] = ncv_sv_fs['outer_score']
# Random forest regression with feature selection:
ncv_rf_fs = nested_cv(X, y, rf_fs, grid_rf_fs)
ncv_results['rf_fs'] = ncv_rf_fs['outer_score']
# Gradient boosting regression with feature selection:
ncv_gb_fs = nested_cv(X, y, gb_fs, grid_gb_fs)
ncv_results['gb_fs'] = ncv_gb_fs['outer_score']
ncv_results = pd.DataFrame(ncv_results)
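The notebook assumes these results were then written to disk; presumably something like the following (a sketch, since the exact export step is not shown here), with the per-algorithm detail files such as ncv_en_fs.csv saved analogously:
# Hypothetical export step: persist the nested cross-validation results so they can be
# re-imported later without re-running the very slow computation
ncv_results.to_csv('ncv_results.csv', index=False)
pd.DataFrame(ncv_en_fs).to_csv('ncv_en_fs.csv')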
The nested cross-validation results from above are imported as a DataFrame.
ncv_results = pd.read_csv('ncv_results.csv')
The following chart shows the repeated nested cross-validation score distributions of each algorithm (with MSE as the scoring metric). The mean and quartiles of each distribution are displayed.
ncv_results.columns = ['Elastic Net','SVM','Random Forest','GBM','Linear Regression\n with FS','Elastic Net\n with FS',
'SVM\n with FS','Random Forest\n with FS','GBM\n with FS']
fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=ncv_results, orient='h', palette='Spectral', showmeans=True,
meanprops={'marker':'s', 'markerfacecolor':'white', 'markeredgecolor':'black'})
ax.set_xlabel('Mean Squared Error')
ax.set_title('Nested Cross Validation Score Distribution by Algorithm', fontsize=16)
ax.set_xticks(range(3, 13))
fig.tight_layout()
Each algorithm performs significantly better than the baseline model (9.75 MSE). Elastic net regression with Pearson's rank based feature selection has the lowest mean and median nested cross-validation score. Meanwhile, gradient boosting with feature selection, which performs slightly worse, has less score variance.
The results also show that Pearson's rank based feature selection improves the performance of every algorithm (linear regression was only performed with feature selection due to multicollinearity concerns).
The numerical results are all shown in the DataFrame below.
ncv = pd.melt(ncv_results, var_name='algorithm')
ncv = ncv.groupby('algorithm').value.agg(['mean', ('std', lambda x: np.std(x)), 'median']).round(2).sort_values('mean')
ncv.style.format('{:.2f}').set_properties(**{'text-align': 'center'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).set_table_attributes('style="font-size: 14px"')
The following function is used to perform repeated cross-validation with stratification.
def cv(X, y, model, param_grid):
y_binned = bin_series(y, 5)
cv_inner = stratify_cv(y_binned, n_splits=5, n_repeats=10, random_state=1)
gs = GridSearchCV(
estimator=model,
param_grid=param_grid,
scoring='neg_mean_squared_error',
n_jobs=-1,
cv=cv_inner
)
gs.fit(X, y)
return gs
To reiterate in more detail: nested cross-validation is the cross-validation assessment of the large-sample performance of a model chosen by a specific cross-validation protocol, where the protocol is the procedure for selecting the optimal cross-validatory chosen model given a pre-defined hyperparameter grid, number of folds, and number of repeats.
For each algorithm, I will now find the hyperparameters of the model chosen by its cross-validation protocol. The inner loop of the nested cross-validation procedure is applied to the entire dataset, and the results are saved in the dictionary, cv_results.
In order to save time, I suggest you skip past the code below and import the file containing the pre-computed results, cv_results.csv.
cv_results = {}
alg_names = ncv.index.to_list()
algs = [en_fs, gb_fs, rf_fs, lr_fs, rf, gb, sv_fs, en, sv]
param_grids = [grid_en_fs, grid_gb_fs, grid_rf_fs, grid_lr_fs, grid_rf, grid_gb, grid_sv_fs, grid_en, grid_sv]
for name, alg, grid in zip(alg_names, algs, param_grids):
cv_results[name] = [cv(X, y, alg, grid).best_params_]
The cross-validation results are imported as a DataFrame.
cv_results = pd.read_csv('cv_results.csv', index_col=0)
# Strings converted to dictionaries
for col in cv_results.columns:
cv_results[col] = cv_results[col].apply(lambda x: ast.literal_eval(x))
The following DataFrame displays the hyperparameters of the optimal cross-validatory chosen model for each algorithm.
def edit_dict(x):
x = {k: round(v, 1) if v > .01 else v for k, v in x.items()}
if any('__' in k for k in x):
return {k.split('__')[1]: v for k, v in x.items()}
return x
cv_results = pd.DataFrame(cv_results).T
cv_results[0] = cv_results[0].apply(edit_dict)
cv_results.columns = ['parameters']
cv_results.index.name = 'algorithm'
with pd.option_context('display.max_colwidth', 200):
display(cv_results.style.set_properties(**{'text-align': 'left'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).\
set_table_attributes('style="font-size: 14px"'))
For the best performing algorithm (elastic net regression with Pearson's rank based feature selection), I will now show how the optimal cross-validation score varies by number of features selected, k.
To accomplish this, I perform cross-validation with an expanded hyperparameter grid, where k takes on all values. As I previously stated, the hyperparameter grids were manually narrowed by checking the inner scores of nested cross-validation while tuning.
In order to save time, I suggest you skip past the code below and import the file containing the pre-computed results, cv_en_fs.csv.
grid_k_en_fs = {
'selector__k': range(1, 35),
'model__alpha': np.linspace(.1, 1, 10),
'model__l1_ratio': np.linspace(.1, 1, 10)
}
cv_results_en_fs = cv(X, y, en_fs, grid_k_en_fs)
cv_en_fs = pd.DataFrame(cv_results_en_fs.cv_results_['params'])
cv_en_fs['mean_test_score'] = -cv_results_en_fs.cv_results_['mean_test_score']
cv_en_fs['std_test_score'] = cv_results_en_fs.cv_results_['std_test_score']
cv_en_fs = cv_en_fs.sort_values('mean_test_score').groupby('selector__k', as_index=False).first()
The cross-validation results from above are imported as a DataFrame.
cv_en_fs = pd.read_csv('cv_en_fs.csv', index_col=0)
The upper and lower bounds of the shaded area represent the standard deviation of the cross-validation scores of the optimal model for each value of k.
upper_bound = cv_en_fs.mean_test_score + cv_en_fs.std_test_score
lower_bound = cv_en_fs.mean_test_score - cv_en_fs.std_test_score
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=cv_en_fs, x='selector__k', y='mean_test_score', linewidth=2)
ax.fill_between(cv_en_fs.selector__k, upper_bound, lower_bound, alpha=.3)
ax.axvline(9, color='k', linestyle='--', alpha=.6)
ax.set_xlabel('Number of Features Selected (k)')
ax.set_ylabel('Mean Squared Error')
ax.set_title('Optimal Cross Validation Score by Features Selected (k)', fontsize=16)
fig.tight_layout()
This graphically shows that the value of the hyperparameter, k, of the optimal cross-validatory chosen model of elastic net regression with feature selection is indeed 9.
Now I will more closely analyze the repeated nested cross-validation results of elastic net regression with feature selection. The nested cross-validation results are imported as a DataFrame.
ncv_en_fs = pd.read_csv('ncv_en_fs.csv', index_col=0)
# Column of strings converted into dictionaries
ncv_en_fs['params'] = ncv_en_fs['params'].apply(lambda x: ast.literal_eval(x))
# Column of dictionaries converted into separate columns
ncv_en_fs = pd.DataFrame(ncv_en_fs, columns=['outer_score','inner_score']).join(pd.DataFrame(ncv_en_fs['params'].to_list()))
The average inner and outer scores of the repeated nested cross-validation are shown below.
The models are selected within the inner cross-validation, while they are assessed within the outer cross-validation of the procedure. The average inner score is slightly better than the average outer score. This is expected: when cross-validation is used for model selection, its error estimate is optimistically biased, whereas nested cross-validation gives a nearly unbiased estimate of the true error. The small difference between the inner and outer scores suggests that little overfitting occurred within the inner cross-validation, which bodes well for generalizability.
inner_score_en = round(ncv_en_fs.inner_score.mean(), 2)
outer_score_en = round(ncv_en_fs.outer_score.mean(), 2)
print('Inner CV Score:', inner_score_en)
print('Outer CV Score:', outer_score_en)
The following plots show the distribution of hyperparameters from the models selected within the inner cross-validation of the nested cross-validation procedure.
fig, axs = plt.subplots(ncols=3, figsize=(12.4, 3.5))
cols = ['selector__k','model__alpha','model__l1_ratio']
bins = [np.arange(6.5, 12, 1), np.arange(.05, 1.1, .1), np.arange(.05, 1.1, .1)]
for i, ax in enumerate(axs.ravel()):
sns.histplot(data=ncv_en_fs, x=cols[i], stat='probability', bins=bins[i], ax=ax, color=blues[0])
ax.set_ylim(0, 1)
ax.set(ylabel='Frequency')
ax.label_outer()
fig.suptitle('Hyperparameter Distributions from Nested Cross Validation', fontsize=16, x=.53, y=.97)
fig.tight_layout()
For each hyperparameter, there is one value present in at least 40% of the selected models. This shows that the optimal cross-validatory chosen model assessed via nested cross-validation is relatively stable, meaning that its predictions do not change much when the training data is slightly modified. This is a good sign, since model stability is also associated with generalizability.
Elastic net and gradient boosting regression, each combined with Pearson's rank based feature selection, are the best performing algorithms based on the results of repeated nested cross-validation. Gradient boosting only slightly underperforms elastic net in terms of mean and median scores, doing so with less score variance. However, elastic net is a less complex model with far greater interpretability. Also, the nested cross-validation results of the elastic net model show that it likely did not overfit the training data, is relatively stable, and therefore likely generalizes well to new data.
For these reasons, I have chosen to use elastic net regression with Pearson's rank based feature selection to create my final model.
The full algorithm is shown below.
en_fs
Now the hyperparameters of the optimal cross-validatory chosen model for elastic net regression with Pearson's rank based feature selection are displayed.
fm_params = cv_results.iloc[0,0]
fm_params
The hyperparameters are added to the algorithm, and it is fit on the full dataset.
final_model = clone(en_fs)
final_model.set_params(selector__k=fm_params['k'], model__alpha=fm_params['alpha'], model__l1_ratio=fm_params['l1_ratio'])
final_model.fit(X, y)
With an L1 ratio of 0.1, the performance of the model is likely very similar to that of ridge regression (L1 ratio of zero).
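For context, scikit-learn's ElasticNet minimizes 1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, so an L1 ratio of 0.1 weights the penalty mix heavily toward the ridge-style L2 term. A small sketch decomposing the penalty of the fitted coefficients:
# Split the elastic net penalty of the fitted model into its L1 (lasso-style)
# and L2 (ridge-style) parts using scikit-learn's parameterization
w = final_model.named_steps['model'].coef_
alpha, l1_ratio = fm_params['alpha'], fm_params['l1_ratio']
print('L1 penalty term:', round(alpha * l1_ratio * np.abs(w).sum(), 3))
print('L2 penalty term:', round(0.5 * alpha * (1 - l1_ratio) * (w ** 2).sum(), 3))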
The model works as follows: first the data is standardized, then 9 features are selected, and lastly the model multiplies each selected feature by its coefficient, sums the results, and adds an intercept to make a prediction.
The features and corresponding coefficients of the final model sorted by absolute value are displayed in the DataFrame below. Since the model standardizes the features, the impact of each feature on the prediction of the model is proportional to the absolute value of its coefficient.
mask = final_model.named_steps['selector'].get_support()
features_selected = X.loc[:, mask].columns.to_list()
coefs = final_model.named_steps['model'].coef_
coef_dict = {'feature': features_selected, 'coefficient': coefs}
coef_df = pd.DataFrame(coef_dict).set_index('feature').sort_values('coefficient', ascending=False, key=abs)
coef_df.style.format('{:.2f}').set_properties(**{'text-align': 'center'}).\
set_table_styles([dict(selector='th', props=[('text-align', 'left')])]).set_table_attributes('style="font-size: 14px"')
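To make the mechanics concrete, here is a minimal sketch (using the fitted final_model, the selector mask, and X from above) that reproduces the pipeline's prediction for a single player directly from the standardized features, the selected-feature mask, and the coefficients:
# Reproduce the pipeline prediction for the first player manually:
# standardize -> keep the selected features -> dot with coefficients and add the intercept
x_row = X.iloc[[0]]
x_scaled = final_model.named_steps['scaler'].transform(x_row)
x_selected = x_scaled[:, mask]
manual_pred = x_selected @ final_model.named_steps['model'].coef_ + final_model.named_steps['model'].intercept_
print(round(manual_pred[0], 3), round(final_model.predict(x_row)[0], 3))  # the two values should match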
The dataset initially contained 34 features, but feature selection reduced them to 9 for the model. When the model is fitted on the full dataset, the coefficient of Defensive WS/40 is reduced to zero, effectively giving the final model only 8 features. Since correlation rank based feature selection is incorporated, the 8 features of the final model are among those most correlated with 5-year peak performance.
According to my final model, the NBA Draft position of an NCAA player has the most influence on the player's 5-year NBA peak performance. The 5 most important features of the model (draft position, pre-draft ranking, rebound percentage, Defensive BPM, and age) provide 80% of the influence for its predictions. It feels great to arrive at such a simple, interpretable final model after all this work!
With a performance of 2.62 RMSE and a high likelihood of generalizability, I am very pleased with the results of my model, but it can certainly be improved. There are several avenues for improvement, including increasing the size of the dataset, adding more advanced features based on player tracking data, using more sophisticated feature selection methods such as sequential feature selection, and using Bayesian optimization for hyperparameter tuning. While these improvements may be complicated by data availability and computing power limitations, I am optimistic that I can implement them going forward. I may also consider using VORP or RAPTOR WAR instead of Win Shares to measure peak NBA performance in future modeling iterations, as they seem to value player contributions more accurately.
In the end, this project has taught me that it's no small feat to accurately predict the performance of NBA prospects, so maybe Celtics fans should go a little easier on Danny Ainge for drafting James Young after all...