NBA Player Statistics: Exploratory Data Analysis

This notebook explores a dataset of NBA player statistics to uncover insights about player performance, draft outcomes, physical attributes, and more.

1. Setup and Data Loading

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandasql import sqldf  # For running SQL queries on pandas DataFrames
In [25]:
# Setup for pandasql
pysqldf = lambda q: sqldf(q, globals())

# Plotting preferences
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 12
In [26]:
# Load the dataset
try:
    df = pd.read_csv('../data/PlayerIndex_nba_stats.csv')
except FileNotFoundError:
    df = pd.read_csv('data/PlayerIndex_nba_stats.csv') # Fallback for running directly in notebooks folder

print("Dataset Loaded. Shape:", df.shape)
df.head()
Dataset Loaded. Shape: (5025, 26)
Out[26]:
PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION ... DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME FROM_YEAR TO_YEAR
0 76001 Abdelnaby Alaa alaa-abdelnaby 1610612757 blazers 0 Portland Trail Blazers POR ... 1990.0 1.0 25.0 NaN 5.7 3.3 0.3 Career 1990 1994
1 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1610612745 rockets 0 Houston Rockets HOU ... 1968.0 1.0 5.0 NaN 9.0 8.0 1.2 Career 1968 1977
2 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1610612747 lakers 0 Los Angeles Lakers LAL ... 1969.0 1.0 1.0 NaN 24.6 11.2 3.6 Career 1969 1988
3 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1610612743 nuggets 0 Denver Nuggets DEN ... 1990.0 1.0 3.0 NaN 14.6 1.9 3.5 Career 1990 2000
4 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1610612758 kings 0 Sacramento Kings SAC ... 1997.0 1.0 11.0 NaN 7.8 3.3 1.1 Career 1997 2003

5 rows × 26 columns

2. Data Cleaning and Preprocessing

In [27]:
print("Initial data types:\n", df.dtypes)
print("\nMissing values before cleaning:\n", df.isnull().sum())


# Combine First and Last Name
df['PLAYER_NAME'] = df['PLAYER_FIRST_NAME'] + ' ' + df['PLAYER_LAST_NAME']

# Function to convert height (e.g., "6-10") to inches
def height_to_inches(height_str):
    if pd.isna(height_str) or not isinstance(height_str, str) or '-' not in height_str:
        return np.nan
    try:
        feet, inches = map(int, height_str.split('-'))
        return feet * 12 + inches
    except ValueError:
        return np.nan

df['HEIGHT_INCHES'] = df['HEIGHT'].apply(height_to_inches)

# Convert WEIGHT to numeric, handling potential non-numeric entries
df['WEIGHT_LBS'] = pd.to_numeric(df['WEIGHT'], errors='coerce')

# Convert PTS, REB, AST to numeric, coercing errors
for col in ['PTS', 'REB', 'AST']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Handle missing numerical stats (PTS, REB, AST) - fill with 0 or mean/median
# For this EDA, filling with 0 might be acceptable if we assume missing means no recorded stat.
# Alternatively, drop rows or impute. Let's fill with 0 for simplicity in aggregation.
df[['PTS', 'REB', 'AST']] = df[['PTS', 'REB', 'AST']].fillna(0)

# Handle missing DRAFT_NUMBER - important for draft analysis
# We can fill NaN DRAFT_NUMBER with a high value (e.g., max_draft_pick + 1) or a specific category for "Undrafted"
# For numerical analysis, a high number might skew things. Let's create an 'IS_UNDRAFTED' column.
df['DRAFT_NUMBER'] = pd.to_numeric(df['DRAFT_NUMBER'], errors='coerce')
df['IS_UNDRAFTED'] = df['DRAFT_NUMBER'].isnull()
# Fill NaN DRAFT_NUMBER with a value that indicates undrafted if we want to keep it numeric for some plots
# Or, for draft-specific analysis, we might drop these NaNs.
# For now, let's keep NaNs for DRAFT_NUMBER and use IS_UNDRAFTED for categorical.

# Clean up categorical columns
for col in ['POSITION', 'COLLEGE', 'COUNTRY']:
    df[col] = df[col].fillna('Unknown')

# Drop original HEIGHT and WEIGHT columns as we have numeric versions
df.drop(columns=['HEIGHT', 'WEIGHT'], inplace=True)

print("\nMissing values after cleaning:\n", df.isnull().sum())
print(df.describe(include=[np.number]))  # Not the last expression in the cell, so print it explicitly


df.info()
Initial data types:
 PERSON_ID              int64
PLAYER_LAST_NAME      object
PLAYER_FIRST_NAME     object
PLAYER_SLUG           object
TEAM_ID                int64
TEAM_SLUG             object
IS_DEFUNCT             int64
TEAM_CITY             object
TEAM_NAME             object
TEAM_ABBREVIATION     object
JERSEY_NUMBER         object
POSITION              object
HEIGHT                object
WEIGHT               float64
COLLEGE               object
COUNTRY               object
DRAFT_YEAR           float64
DRAFT_ROUND          float64
DRAFT_NUMBER         float64
ROSTER_STATUS        float64
PTS                  float64
REB                  float64
AST                  float64
STATS_TIMEFRAME       object
FROM_YEAR              int64
TO_YEAR                int64
dtype: object

Missing values before cleaning:
 PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION               48
HEIGHT                 47
WEIGHT                 53
COLLEGE                 1
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                    24
REB                   316
AST                    24
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
dtype: int64

Missing values after cleaning:
 PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION                0
COLLEGE                 0
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                     0
REB                     0
AST                     0
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
PLAYER_NAME             1
HEIGHT_INCHES          47
WEIGHT_LBS             53
IS_UNDRAFTED            0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5025 entries, 0 to 5024
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PERSON_ID          5025 non-null   int64  
 1   PLAYER_LAST_NAME   5025 non-null   object 
 2   PLAYER_FIRST_NAME  5024 non-null   object 
 3   PLAYER_SLUG        5025 non-null   object 
 4   TEAM_ID            5025 non-null   int64  
 5   TEAM_SLUG          4759 non-null   object 
 6   IS_DEFUNCT         5025 non-null   int64  
 7   TEAM_CITY          5025 non-null   object 
 8   TEAM_NAME          5025 non-null   object 
 9   TEAM_ABBREVIATION  5025 non-null   object 
 10  JERSEY_NUMBER      4674 non-null   object 
 11  POSITION           5025 non-null   object 
 12  COLLEGE            5025 non-null   object 
 13  COUNTRY            5025 non-null   object 
 14  DRAFT_YEAR         3700 non-null   float64
 15  DRAFT_ROUND        3502 non-null   float64
 16  DRAFT_NUMBER       3434 non-null   float64
 17  ROSTER_STATUS      534 non-null    float64
 18  PTS                5025 non-null   float64
 19  REB                5025 non-null   float64
 20  AST                5025 non-null   float64
 21  STATS_TIMEFRAME    5025 non-null   object 
 22  FROM_YEAR          5025 non-null   int64  
 23  TO_YEAR            5025 non-null   int64  
 24  PLAYER_NAME        5024 non-null   object 
 25  HEIGHT_INCHES      4978 non-null   float64
 26  WEIGHT_LBS         4972 non-null   float64
 27  IS_UNDRAFTED       5025 non-null   bool   
dtypes: bool(1), float64(9), int64(5), object(13)
memory usage: 1.0+ MB
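The height parser defined in the cleaning cell above can be sanity-checked in isolation; restated here as a standalone snippet with a few example inputs:

```python
import numpy as np
import pandas as pd

def height_to_inches(height_str):
    # Convert a "feet-inches" string such as "6-10" into total inches.
    if pd.isna(height_str) or not isinstance(height_str, str) or '-' not in height_str:
        return np.nan
    try:
        feet, inches = map(int, height_str.split('-'))
        return feet * 12 + inches
    except ValueError:
        return np.nan

print(height_to_inches("6-10"))  # 82
print(height_to_inches("7-0"))   # 84
print(height_to_inches("bad"))   # nan
```

Malformed strings and missing values both map to NaN, which is why HEIGHT_INCHES ends up with 47 nulls, matching the 47 missing HEIGHT entries.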

3. Exploratory Data Analysis (EDA)

3.1 Overall Player Statistics (Focus on Career Stats for broader view)

For a general overview, we focus on players with 'Career' stats; for active players, 'Season' rows capture their latest season and are explored separately below.

In [28]:
career_df = df[df['STATS_TIMEFRAME'] == 'Career'].copy()
season_df = df[df['STATS_TIMEFRAME'] == 'Season'].copy() # Latest season for active players


# Distributions of key career stats
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(career_df['PTS'], kde=True, ax=axes[0], bins=30)
axes[0].set_title('Distribution of Career Points')
sns.histplot(career_df['REB'], kde=True, ax=axes[1], bins=30)
axes[1].set_title('Distribution of Career Rebounds')
sns.histplot(career_df['AST'], kde=True, ax=axes[2], bins=30)
axes[2].set_title('Distribution of Career Assists')
plt.tight_layout()
plt.show()


# Relationship between PTS, REB, AST (Career)
sns.pairplot(career_df[['PTS', 'REB', 'AST']].dropna(), kind='scatter', plot_kws={'alpha':0.5})
plt.suptitle('Pairplot of Career PTS, REB, AST', y=1.02)
plt.show()

3.2 Positional Analysis

In [29]:
# Average stats by position (using all data, acknowledging Career/Season mix)
position_stats = df.groupby('POSITION')[['PTS', 'REB', 'AST', 'HEIGHT_INCHES', 'WEIGHT_LBS']].mean().sort_values(by='PTS', ascending=False)
print("Average Stats by Position:\n", position_stats)
Average Stats by Position:
                PTS       REB       AST  HEIGHT_INCHES  WEIGHT_LBS
POSITION                                                         
F-G       8.519753  3.185185  1.725926      78.666667  217.370370
C-F       7.537719  5.187719  1.120175      82.359649  249.850877
G-F       7.106633  2.586224  1.501531      78.061224  211.622449
F-C       6.872857  4.467143  1.023571      82.035714  242.878571
G         6.461704  1.773255  2.089784      74.859343  189.695786
F         5.994951  3.077223  0.977827      79.246019  217.992303
C         5.760947  4.078698  0.843047      82.766272  242.146884
Unknown   3.391667  0.302083  0.750000      78.500000  215.000000
In [30]:

# Boxplot of Points by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='PTS', data=df, order=position_stats.index)
plt.title('Points Distribution by Position')
plt.show()


# Boxplot of Rebounds by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='REB', data=df, order=df.groupby('POSITION')['REB'].median().sort_values(ascending=False).index)
plt.title('Rebounds Distribution by Position')
plt.show()


# Boxplot of Assists by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='AST', data=df, order=df.groupby('POSITION')['AST'].median().sort_values(ascending=False).index)
plt.title('Assists Distribution by Position')
plt.show()


# Height distribution by position
plt.figure(figsize=(12, 7))
sns.violinplot(x='POSITION', y='HEIGHT_INCHES', data=df[df['HEIGHT_INCHES'].notna()], order=df.groupby('POSITION')['HEIGHT_INCHES'].median().sort_values(ascending=False).index)
plt.title('Height Distribution by Position')
plt.ylabel('Height (Inches)')
plt.show()

3.3 Draft Analysis (Focus on 'Career' stats for players with completed/longer careers)

In [31]:
# Relationship between Draft Number and Career Points
# Filter out obvious outliers or undrafted players for a cleaner plot if DRAFT_NUMBER is high
drafted_career_df = career_df[career_df['DRAFT_NUMBER'].notna() & (career_df['DRAFT_NUMBER'] <= 60)].copy() # Modern drafts have 60 picks; .copy() avoids SettingWithCopyWarning later

plt.figure(figsize=(12, 7))
sns.scatterplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, alpha=0.5)
# Add a lowess trend line for non-linear trends (requires the statsmodels package)
sns.regplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, scatter=False, lowess=True, color='red', line_kws={'linewidth': 2})
plt.title('Draft Pick Number vs. Average Career Points')
plt.xlabel('Draft Pick Number (Lower is better)')
plt.ylabel('Average Career Points Per Game')
plt.show()
print("Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).")


# Average Career Points by Draft Round
drafted_career_df['DRAFT_ROUND'] = pd.to_numeric(drafted_career_df['DRAFT_ROUND'], errors='coerce').fillna(0).astype(int)
avg_pts_by_round = drafted_career_df[drafted_career_df['DRAFT_ROUND'] > 0].groupby('DRAFT_ROUND')['PTS'].mean().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
avg_pts_by_round.plot(kind='bar')
plt.title('Average Career Points by Draft Round')
plt.xlabel('Draft Round')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=0)
plt.show()
Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).

3.4 Physical Attributes vs. Performance

In [32]:
# Height vs. Rebounds
plt.figure(figsize=(10,6))
sns.scatterplot(x='HEIGHT_INCHES', y='REB', data=df[df['HEIGHT_INCHES'].notna()], hue='POSITION', alpha=0.6)
plt.title('Height vs. Rebounds')
plt.xlabel('Height (Inches)')
plt.ylabel('Rebounds Per Game')
plt.show()


# Weight vs. Points
plt.figure(figsize=(10,6))
sns.scatterplot(x='WEIGHT_LBS', y='PTS', data=df[df['WEIGHT_LBS'].notna()], hue='POSITION', alpha=0.6)
plt.title('Weight vs. Points')
plt.xlabel('Weight (LBS)')
plt.ylabel('Points Per Game')
plt.show()

3.5 Geographical Diversity & College Impact

In [33]:
# Top Countries (excluding USA for diversity view)
top_countries = df[df['COUNTRY'] != 'USA']['COUNTRY'].value_counts().nlargest(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_countries.index, y=top_countries.values)
plt.title('Top 10 Non-USA Countries Producing NBA Players')
plt.xlabel('Country')
plt.ylabel('Number of Players')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Average career points for players from top non-USA countries
country_avg_pts = df[(df['COUNTRY'].isin(top_countries.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COUNTRY')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(12, 7))
country_avg_pts.plot(kind='bar', color='skyblue')
plt.title('Average Career Points for Top 10 Non-USA Countries')
plt.xlabel('Country')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Top Colleges producing NBA players
top_colleges = df[df['COLLEGE'] != 'Unknown']['COLLEGE'].value_counts().nlargest(15)
plt.figure(figsize=(14, 8))
sns.barplot(x=top_colleges.index, y=top_colleges.values)
plt.title('Top 15 Colleges Producing NBA Players')
plt.xlabel('College')
plt.ylabel('Number of Players')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()


# Average career points for players from top colleges
college_avg_pts = df[(df['COLLEGE'].isin(top_colleges.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COLLEGE')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
college_avg_pts.plot(kind='bar', color='coral')
plt.title('Average Career Points for Players from Top 15 Colleges')
plt.xlabel('College')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()

3.6 Current Season Standouts (Active Players - STATS_TIMEFRAME == 'Season')

In [34]:
# Top 10 active players by PTS in their latest season
top_season_pts = season_df.sort_values(by='PTS', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='PTS', y='PLAYER_NAME', data=top_season_pts, palette='viridis', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Points')
plt.xlabel('Points Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()


# Top 10 active players by REB in their latest season
top_season_reb = season_df.sort_values(by='REB', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='REB', y='PLAYER_NAME', data=top_season_reb, palette='magma', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Rebounds')
plt.xlabel('Rebounds Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()


# Top 10 active players by AST in their latest season
top_season_ast = season_df.sort_values(by='AST', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='AST', y='PLAYER_NAME', data=top_season_ast, palette='coolwarm', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Assists')
plt.xlabel('Assists Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()

4. SQL Query Demonstrations (using pandasql)

In [35]:
# Query 1: Top 10 players by career points
query1 = """
SELECT PLAYER_NAME, PTS
FROM career_df
ORDER BY PTS DESC
LIMIT 10;
"""
top_career_scorers_sql = pysqldf(query1)
print("Top 10 Career Scorers (SQL):\n", top_career_scorers_sql)


# Query 2: Average points per game by position (all players)
query2 = """
SELECT POSITION, AVG(PTS) AS AVG_PTS
FROM df
WHERE POSITION != 'Unknown'
GROUP BY POSITION
ORDER BY AVG_PTS DESC;
"""
avg_pts_by_position_sql = pysqldf(query2)
print("\nAverage PTS by Position (SQL):\n", avg_pts_by_position_sql)


# Query 3: Players from Duke with > 15 career PPG
query3 = """
SELECT PLAYER_NAME, COLLEGE, PTS
FROM career_df
WHERE COLLEGE = 'Duke' AND PTS > 15
ORDER BY PTS DESC;
"""
duke_high_scorers_sql = pysqldf(query3)
print("\nDuke Players with > 15 Career PPG (SQL):\n", duke_high_scorers_sql)


# Query 4: Number of players and average DRAFT_NUMBER by DRAFT_YEAR (for players with 'Career' stats)
query4 = """
SELECT DRAFT_YEAR, COUNT(PERSON_ID) AS Num_Players, AVG(DRAFT_NUMBER) AS Avg_Draft_Pick
FROM career_df
WHERE DRAFT_YEAR IS NOT NULL AND DRAFT_NUMBER IS NOT NULL
GROUP BY DRAFT_YEAR
ORDER BY DRAFT_YEAR DESC
LIMIT 10;
"""
draft_year_summary_sql = pysqldf(query4)
print("\nDraft Year Summary (SQL, last 10 available years with career data):\n", draft_year_summary_sql)
Top 10 Career Scorers (SQL):
         PLAYER_NAME   PTS
0  Wilt Chamberlain  30.1
1    Michael Jordan  30.1
2      Elgin Baylor  27.4
3        Jerry West  27.0
4     Allen Iverson  26.7
5        Bob Pettit  26.4
6     George Gervin  26.2
7   Oscar Robertson  25.7
8       Kobe Bryant  25.0
9       Karl Malone  25.0

Average PTS by Position (SQL):
   POSITION   AVG_PTS
0      F-G  8.519753
1      C-F  7.537719
2      G-F  7.106633
3      F-C  6.872857
4        G  6.461704
5        F  5.994951
6        C  5.760947

Duke Players with > 15 Career PPG (SQL):
       PLAYER_NAME COLLEGE   PTS
0      Grant Hill    Duke  16.7
1   Carlos Boozer    Duke  16.2
2    Jeff Mullins    Duke  16.2
3  Corey Maggette    Duke  16.0
4     Elton Brand    Duke  15.9

Draft Year Summary (SQL, last 10 available years with career data):
    DRAFT_YEAR  Num_Players  Avg_Draft_Pick
0      2024.0            2       17.000000
1      2023.0            4       47.750000
2      2022.0            4       36.000000
3      2021.0           14       39.714286
4      2020.0           26       39.076923
5      2019.0           27       38.222222
6      2018.0           25       40.480000
7      2017.0           34       34.411765
8      2016.0           37       34.675676
9      2015.0           25       26.200000
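Each of these SQL queries has a direct pandas equivalent. As one illustration, query 2 restated in pure pandas on a small made-up frame (column names match the dataset; the values are purely illustrative):

```python
import pandas as pd

# Toy stand-in for df, with the two columns query 2 touches
toy = pd.DataFrame({
    'POSITION': ['G', 'F', 'G', 'Unknown'],
    'PTS':      [10.0, 9.0, 6.0, 2.0],
})

# SELECT POSITION, AVG(PTS) ... WHERE POSITION != 'Unknown' GROUP BY POSITION ORDER BY AVG_PTS DESC
avg_pts = (toy[toy['POSITION'] != 'Unknown']
           .groupby('POSITION')['PTS'].mean()
           .sort_values(ascending=False))
print(avg_pts)
```

The pandasql version is often easier to read for people coming from SQL, while the pandas version avoids the round-trip through an in-memory SQLite database.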

5. Predictive Modeling: Career Points Per Game (PTS)

In this section, we'll attempt to build a model to predict a player's Career Points Per Game (PTS) based on information available around the time they entered the league (draft details, college, position, physical attributes).

Target Variable: PTS (for players with STATS_TIMEFRAME == 'Career')

Features: DRAFT_NUMBER, DRAFT_ROUND, HEIGHT_INCHES, WEIGHT_LBS, POSITION, COLLEGE.

In [36]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer

5.1 Data Preparation for Modeling

In [37]:
# Use the 'career_df' created earlier, which is df[df['STATS_TIMEFRAME'] == 'Career']
model_df = career_df.copy()
In [38]:
# Select features and target
features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS', 'POSITION', 'COLLEGE']
target = 'PTS'
In [39]:
# Filter out rows where essential features or target are missing for modeling
# DRAFT_NUMBER is crucial. Let's focus on drafted players.
model_df = model_df[model_df['DRAFT_NUMBER'].notna()]
model_df = model_df[model_df[target].notna()] # PTS should already be handled, but good check
In [40]:
# For simplicity, we'll drop rows if HEIGHT_INCHES or WEIGHT_LBS are missing in this subset
model_df.dropna(subset=['HEIGHT_INCHES', 'WEIGHT_LBS'], inplace=True)

# Handle DRAFT_ROUND: Ensure it's numeric; fill NaNs for undrafted (though we filtered by DRAFT_NUMBER)
model_df['DRAFT_ROUND'] = pd.to_numeric(model_df['DRAFT_ROUND'], errors='coerce').fillna(0) # 0 for undrafted/unknown

# Reset index for clean processing
model_df.reset_index(drop=True, inplace=True)

X = model_df[features]
y = model_df[target]

print(f"Shape of X: {X.shape}, Shape of y: {y.shape}")
X.head()
Shape of X: (3010, 6), Shape of y: (3010,)
Out[40]:
DRAFT_NUMBER DRAFT_ROUND HEIGHT_INCHES WEIGHT_LBS POSITION COLLEGE
0 25.0 1.0 82.0 240.0 F Duke
1 5.0 1.0 81.0 235.0 C Iowa State
2 1.0 1.0 86.0 225.0 C UCLA
3 3.0 1.0 73.0 162.0 G Louisiana State
4 11.0 1.0 78.0 235.0 F-G San Jose State
In [41]:
X.isnull().sum() # Check for NaNs in features before preprocessing
Out[41]:
DRAFT_NUMBER     0
DRAFT_ROUND      0
HEIGHT_INCHES    0
WEIGHT_LBS       0
POSITION         0
COLLEGE          0
dtype: int64

5.2 Feature Preprocessing and Pipeline Setup

In [42]:
# Identify categorical and numerical features
categorical_features = ['POSITION', 'COLLEGE']
numerical_features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS']

# Preprocessing for numerical features: Impute NaNs (if any) and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # Handles any stray NaNs
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: Impute NaNs and one-hot encode
# For 'COLLEGE', there are many unique values. We'll use handle_unknown='ignore'
# which means if a college seen in test wasn't in train, its OHE columns will be all zeros.
# A more robust approach for 'COLLEGE' might involve feature hashing or limiting to top N colleges.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # Keep other columns (if any)
)

5.3 Train-Test Split

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
X_train shape: (2408, 6), X_test shape: (602, 6)
y_train shape: (2408,), y_test shape: (602,)

5.4 Model Training and Evaluation

In [44]:
# Define models to train
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10),
    "Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=5)
}

results = {}

for name, model in models.items():
    print(f"Training {name}...")
    # Create the full pipeline: preprocessor + model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', model)])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {'RMSE': rmse, 'R2 Score': r2}
    print(f"{name} - RMSE: {rmse:.4f}, R2 Score: {r2:.4f}\n")


# Display results
results_df = pd.DataFrame(results).T.sort_values(by='R2 Score', ascending=False)
print("Model Performance Summary:")
print(results_df)
Training Linear Regression...
Linear Regression - RMSE: 4.4395, R2 Score: 0.0745

Training Ridge Regression...
Ridge Regression - RMSE: 4.3185, R2 Score: 0.1243

Training Random Forest Regressor...
Random Forest Regressor - RMSE: 3.9918, R2 Score: 0.2518

Training Gradient Boosting Regressor...
Gradient Boosting Regressor - RMSE: 4.0023, R2 Score: 0.2478

Model Performance Summary:
                                 RMSE  R2 Score
Random Forest Regressor      3.991784  0.251754
Gradient Boosting Regressor  4.002261  0.247821
Ridge Regression             4.318512  0.124254
Linear Regression            4.439528  0.074485

5.5 Feature Importances (for tree-based models)

Let's look at feature importances from the Random Forest Regressor, which often performs well.

In [45]:
# Get the trained Random Forest pipeline
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10))])
rf_pipeline.fit(X_train, y_train) # Fit again to ensure we have the fitted preprocessor

# Get feature names after one-hot encoding
# The preprocessor must be fitted first to get feature names
# We access the 'onehot' step from the 'cat' transformer within the 'preprocessor'
try:
    ohe_feature_names = rf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
    all_feature_names = numerical_features + list(ohe_feature_names)
except Exception as e:
    print(f"Error getting OHE feature names: {e}")
    all_feature_names = None # Fallback

if all_feature_names and hasattr(rf_pipeline.named_steps['regressor'], 'feature_importances_'):
    importances = rf_pipeline.named_steps['regressor'].feature_importances_
    
    feature_importance_df = pd.DataFrame({'feature': all_feature_names, 'importance': importances})
    feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
    
    print("\nTop 20 Feature Importances (Random Forest):")
    print(feature_importance_df.head(20))
    
    plt.figure(figsize=(10, 8))
    sns.barplot(x='importance', y='feature', data=feature_importance_df.head(20), palette='crest', hue='feature', dodge=False, legend=False)
    plt.title('Top 20 Feature Importances for Predicting Career PTS')
    plt.tight_layout()
    plt.show()
else:
    print("Could not retrieve feature importances. Ensure the model is a tree-based model and preprocessor is fitted.")
Top 20 Feature Importances (Random Forest):
                      feature  importance
0                DRAFT_NUMBER    0.582012
2               HEIGHT_INCHES    0.039265
3                  WEIGHT_LBS    0.036991
1                 DRAFT_ROUND    0.017066
121  COLLEGE_Eastern Michigan    0.014176
9                  POSITION_G    0.012894
258    COLLEGE_North Carolina    0.009230
203    COLLEGE_Louisiana Tech    0.007947
176     COLLEGE_Indiana State    0.007173
182      COLLEGE_Jacksonville    0.007045
144      COLLEGE_Gardner-Webb    0.006637
274        COLLEGE_Notre Dame    0.005701
4                  POSITION_C    0.005602
156          COLLEGE_Guilford    0.005532
252  COLLEGE_New Mexico State    0.005111
65      COLLEGE_Buffalo State    0.004914
384         COLLEGE_Tennessee    0.004706
344    COLLEGE_South Carolina    0.004582
35             COLLEGE_Auburn    0.004376
223           COLLEGE_Memphis    0.004340

5.6 Model Interpretation and Limitations

Interpretation:

  • The R-squared value tells us the proportion of the variance in career PTS that can be explained by our features. Higher is better.
  • RMSE (Root Mean Squared Error) gives us an idea of the typical error in our PTS predictions, in the same units as PTS. Lower is better.
  • Feature importances highlight which factors (like draft number, height, specific colleges/positions) the model found most influential in predicting career points. DRAFT_NUMBER is often a very strong predictor.
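Concretely, the two metrics reported in section 5.4 reduce to a few lines of arithmetic; a minimal sketch on made-up true/predicted PPG values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only, not model output
y_true = np.array([10.0, 5.0, 20.0, 8.0])
y_pred = np.array([12.0, 4.0, 18.0, 9.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # typical error, in PPG units
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(f"RMSE: {rmse:.4f}, R2: {r2:.4f}")
```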

Limitations:

  • Data Granularity: We are predicting career average PTS. This smooths out year-to-year variations and doesn't capture the peak performance or longevity aspects directly as a target.
  • College Feature: The COLLEGE feature has high cardinality. While OneHotEncoder with handle_unknown='ignore' is a start, more advanced techniques like target encoding (with careful cross-validation to prevent leakage), embedding layers, or grouping colleges by conference/tier could improve performance or interpretability. For this example, we used basic OHE.
  • "Talent Not Captured": Many intangible factors (work ethic, injury luck, coaching, team fit) that significantly impact a player's career are not present in this dataset.
  • Model Complexity: More complex models might yield slightly better R-squared values but could be harder to interpret and prone to overfitting if not carefully tuned.
  • Definition of "Performance": PTS is just one aspect of performance. A more holistic measure (like PER or Win Shares, if available) could be a more comprehensive target, but this dataset focuses on basic box score stats.
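As one illustration of the target-encoding idea mentioned under the COLLEGE limitation, here is a hedged sketch using a smoothed per-category mean on made-up data (the smoothing constant and college names are arbitrary; in practice the encoding must be fit inside cross-validation folds to avoid target leakage):

```python
import pandas as pd

# Toy training data: PTS by college, purely illustrative
train = pd.DataFrame({
    'COLLEGE': ['Duke', 'Duke', 'UCLA', 'UCLA', 'UCLA', 'Guilford'],
    'PTS':     [16.0,   12.0,   10.0,   8.0,    9.0,    4.0],
})

global_mean = train['PTS'].mean()
smoothing = 2.0  # pseudo-count pulling rare colleges toward the global mean

stats = train.groupby('COLLEGE')['PTS'].agg(['mean', 'count'])
encoding = ((stats['count'] * stats['mean'] + smoothing * global_mean)
            / (stats['count'] + smoothing))

# A college seen only once (Guilford) is shrunk strongly toward the global mean
train['COLLEGE_ENC'] = train['COLLEGE'].map(encoding)
print(encoding)
```

Compared with one-hot encoding, this replaces hundreds of sparse COLLEGE columns with a single numeric feature, at the cost of needing careful leakage control.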

This predictive modeling exercise serves as a demonstration of applying machine learning techniques to sports data. The results should be viewed as exploratory rather than definitive predictions of player success.