NBA Player Statistics: Exploratory Data Analysis
This notebook explores a dataset of NBA player statistics to uncover insights about player performance, draft outcomes, physical attributes, and more.
1. Setup and Data Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandasql import sqldf # For running SQL queries on pandas DataFrames
# Setup for pandasql
pysqldf = lambda q: sqldf(q, globals())
# Plotting preferences
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 12
# Load the dataset
try:
df = pd.read_csv('../data/PlayerIndex_nba_stats.csv')
except FileNotFoundError:
df = pd.read_csv('data/PlayerIndex_nba_stats.csv') # Fallback for running directly in notebooks folder
print("Dataset Loaded. Shape:", df.shape)
df.head()
Dataset Loaded. Shape: (5025, 26)
| | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | ... | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | PTS | REB | AST | STATS_TIMEFRAME | FROM_YEAR | TO_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76001 | Abdelnaby | Alaa | alaa-abdelnaby | 1610612757 | blazers | 0 | Portland | Trail Blazers | POR | ... | 1990.0 | 1.0 | 25.0 | NaN | 5.7 | 3.3 | 0.3 | Career | 1990 | 1994 |
| 1 | 76002 | Abdul-Aziz | Zaid | zaid-abdul-aziz | 1610612745 | rockets | 0 | Houston | Rockets | HOU | ... | 1968.0 | 1.0 | 5.0 | NaN | 9.0 | 8.0 | 1.2 | Career | 1968 | 1977 |
| 2 | 76003 | Abdul-Jabbar | Kareem | kareem-abdul-jabbar | 1610612747 | lakers | 0 | Los Angeles | Lakers | LAL | ... | 1969.0 | 1.0 | 1.0 | NaN | 24.6 | 11.2 | 3.6 | Career | 1969 | 1988 |
| 3 | 51 | Abdul-Rauf | Mahmoud | mahmoud-abdul-rauf | 1610612743 | nuggets | 0 | Denver | Nuggets | DEN | ... | 1990.0 | 1.0 | 3.0 | NaN | 14.6 | 1.9 | 3.5 | Career | 1990 | 2000 |
| 4 | 1505 | Abdul-Wahad | Tariq | tariq-abdul-wahad | 1610612758 | kings | 0 | Sacramento | Kings | SAC | ... | 1997.0 | 1.0 | 11.0 | NaN | 7.8 | 3.3 | 1.1 | Career | 1997 | 2003 |
5 rows × 26 columns
2. Data Cleaning and Preprocessing
print("Initial data types:\n", df.dtypes)
print("\nMissing values before cleaning:\n", df.isnull().sum())
# Combine First and Last Name
df['PLAYER_NAME'] = df['PLAYER_FIRST_NAME'] + ' ' + df['PLAYER_LAST_NAME']
# Function to convert height (e.g., "6-10") to inches
def height_to_inches(height_str):
if pd.isna(height_str) or not isinstance(height_str, str) or '-' not in height_str:
return np.nan
try:
feet, inches = map(int, height_str.split('-'))
return feet * 12 + inches
except ValueError:
return np.nan
df['HEIGHT_INCHES'] = df['HEIGHT'].apply(height_to_inches)
# Convert WEIGHT to numeric, handling potential non-numeric entries
df['WEIGHT_LBS'] = pd.to_numeric(df['WEIGHT'], errors='coerce')
# Convert PTS, REB, AST to numeric, coercing errors
for col in ['PTS', 'REB', 'AST']:
df[col] = pd.to_numeric(df[col], errors='coerce')
# Handle missing numerical stats (PTS, REB, AST) - fill with 0 or mean/median
# For this EDA, filling with 0 might be acceptable if we assume missing means no recorded stat.
# Alternatively, drop rows or impute. Let's fill with 0 for simplicity in aggregation.
df[['PTS', 'REB', 'AST']] = df[['PTS', 'REB', 'AST']].fillna(0)
# Handle missing DRAFT_NUMBER - important for draft analysis
# We can fill NaN DRAFT_NUMBER with a high value (e.g., max_draft_pick + 1) or a specific category for "Undrafted"
# For numerical analysis, a high number might skew things. Let's create an 'IS_UNDRAFTED' column.
df['DRAFT_NUMBER'] = pd.to_numeric(df['DRAFT_NUMBER'], errors='coerce')
df['IS_UNDRAFTED'] = df['DRAFT_NUMBER'].isnull()
# Fill NaN DRAFT_NUMBER with a value that indicates undrafted if we want to keep it numeric for some plots
# Or, for draft-specific analysis, we might drop these NaNs.
# For now, let's keep NaNs for DRAFT_NUMBER and use IS_UNDRAFTED for categorical.
# Clean up categorical columns
for col in ['POSITION', 'COLLEGE', 'COUNTRY']:
df[col] = df[col].fillna('Unknown')
# Drop original HEIGHT and WEIGHT columns as we have numeric versions
df.drop(columns=['HEIGHT', 'WEIGHT'], inplace=True)
print("\nMissing values after cleaning:\n", df.isnull().sum())
df.describe(include=[np.number])
df.info()
Initial data types:
PERSON_ID              int64
PLAYER_LAST_NAME      object
PLAYER_FIRST_NAME     object
PLAYER_SLUG           object
TEAM_ID                int64
TEAM_SLUG             object
IS_DEFUNCT             int64
TEAM_CITY             object
TEAM_NAME             object
TEAM_ABBREVIATION     object
JERSEY_NUMBER         object
POSITION              object
HEIGHT                object
WEIGHT               float64
COLLEGE               object
COUNTRY               object
DRAFT_YEAR           float64
DRAFT_ROUND          float64
DRAFT_NUMBER         float64
ROSTER_STATUS        float64
PTS                  float64
REB                  float64
AST                  float64
STATS_TIMEFRAME       object
FROM_YEAR              int64
TO_YEAR                int64
dtype: object

Missing values before cleaning:
PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION               48
HEIGHT                 47
WEIGHT                 53
COLLEGE                 1
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                    24
REB                   316
AST                    24
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
dtype: int64

Missing values after cleaning:
PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION                0
COLLEGE                 0
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                     0
REB                     0
AST                     0
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
PLAYER_NAME             1
HEIGHT_INCHES          47
WEIGHT_LBS             53
IS_UNDRAFTED            0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5025 entries, 0 to 5024
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   PERSON_ID          5025 non-null   int64
 1   PLAYER_LAST_NAME   5025 non-null   object
 2   PLAYER_FIRST_NAME  5024 non-null   object
 3   PLAYER_SLUG        5025 non-null   object
 4   TEAM_ID            5025 non-null   int64
 5   TEAM_SLUG          4759 non-null   object
 6   IS_DEFUNCT         5025 non-null   int64
 7   TEAM_CITY          5025 non-null   object
 8   TEAM_NAME          5025 non-null   object
 9   TEAM_ABBREVIATION  5025 non-null   object
 10  JERSEY_NUMBER      4674 non-null   object
 11  POSITION           5025 non-null   object
 12  COLLEGE            5025 non-null   object
 13  COUNTRY            5025 non-null   object
 14  DRAFT_YEAR         3700 non-null   float64
 15  DRAFT_ROUND        3502 non-null   float64
 16  DRAFT_NUMBER       3434 non-null   float64
 17  ROSTER_STATUS      534 non-null    float64
 18  PTS                5025 non-null   float64
 19  REB                5025 non-null   float64
 20  AST                5025 non-null   float64
 21  STATS_TIMEFRAME    5025 non-null   object
 22  FROM_YEAR          5025 non-null   int64
 23  TO_YEAR            5025 non-null   int64
 24  PLAYER_NAME        5024 non-null   object
 25  HEIGHT_INCHES      4978 non-null   float64
 26  WEIGHT_LBS         4972 non-null   float64
 27  IS_UNDRAFTED       5025 non-null   bool
dtypes: bool(1), float64(9), int64(5), object(13)
memory usage: 1.0+ MB
3. Exploratory Data Analysis (EDA)
3.1 Overall Player Statistics (Career Stats for a Broader View)
For a general overview, we focus on rows with 'Career' stats. For active players, rows with 'Season' stats reflect their latest season; those are examined in Section 3.6.
career_df = df[df['STATS_TIMEFRAME'] == 'Career'].copy()
season_df = df[df['STATS_TIMEFRAME'] == 'Season'].copy() # Latest season for active players
# Distributions of key career stats
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(career_df['PTS'], kde=True, ax=axes[0], bins=30)
axes[0].set_title('Distribution of Career Points')
sns.histplot(career_df['REB'], kde=True, ax=axes[1], bins=30)
axes[1].set_title('Distribution of Career Rebounds')
sns.histplot(career_df['AST'], kde=True, ax=axes[2], bins=30)
axes[2].set_title('Distribution of Career Assists')
plt.tight_layout()
plt.show()
# Relationship between PTS, REB, AST (Career)
sns.pairplot(career_df[['PTS', 'REB', 'AST']].dropna(), kind='scatter', plot_kws={'alpha':0.5})
plt.suptitle('Pairplot of Career PTS, REB, AST', y=1.02)
plt.show()
3.2 Positional Analysis
# Average stats by position (using all data, acknowledging Career/Season mix)
position_stats = df.groupby('POSITION')[['PTS', 'REB', 'AST', 'HEIGHT_INCHES', 'WEIGHT_LBS']].mean().sort_values(by='PTS', ascending=False)
print("Average Stats by Position:\n", position_stats)
Average Stats by Position:
PTS REB AST HEIGHT_INCHES WEIGHT_LBS
POSITION
F-G 8.519753 3.185185 1.725926 78.666667 217.370370
C-F 7.537719 5.187719 1.120175 82.359649 249.850877
G-F 7.106633 2.586224 1.501531 78.061224 211.622449
F-C 6.872857 4.467143 1.023571 82.035714 242.878571
G 6.461704 1.773255 2.089784 74.859343 189.695786
F 5.994951 3.077223 0.977827 79.246019 217.992303
C 5.760947 4.078698 0.843047 82.766272 242.146884
Unknown 3.391667 0.302083 0.750000 78.500000 215.000000
# Boxplot of Points by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='PTS', data=df, order=position_stats.index)
plt.title('Points Distribution by Position')
plt.show()
# Boxplot of Rebounds by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='REB', data=df, order=df.groupby('POSITION')['REB'].median().sort_values(ascending=False).index)
plt.title('Rebounds Distribution by Position')
plt.show()
# Boxplot of Assists by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='AST', data=df, order=df.groupby('POSITION')['AST'].median().sort_values(ascending=False).index)
plt.title('Assists Distribution by Position')
plt.show()
# Height distribution by position
plt.figure(figsize=(12, 7))
sns.violinplot(x='POSITION', y='HEIGHT_INCHES', data=df[df['HEIGHT_INCHES'].notna()], order=df.groupby('POSITION')['HEIGHT_INCHES'].median().sort_values(ascending=False).index)
plt.title('Height Distribution by Position')
plt.ylabel('Height (Inches)')
plt.show()
3.3 Draft Analysis (Focus on 'Career' stats for players with completed/longer careers)
# Relationship between Draft Number and Career Points
# Filter out obvious outliers or undrafted players for a cleaner plot if DRAFT_NUMBER is high
drafted_career_df = career_df[career_df['DRAFT_NUMBER'].notna() & (career_df['DRAFT_NUMBER'] <= 60)].copy() # Modern drafts have 60 picks; .copy() avoids SettingWithCopyWarning on later assignments
plt.figure(figsize=(12, 7))
sns.scatterplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, alpha=0.5)
# Add a lowess trend line for non-linear trends (lowess=True requires the statsmodels package)
sns.regplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, scatter=False, lowess=True, color='red', line_kws={'linewidth': 2})
plt.title('Draft Pick Number vs. Average Career Points')
plt.xlabel('Draft Pick Number (Lower is better)')
plt.ylabel('Average Career Points Per Game')
plt.show()
print("Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).")
# Average Career Points by Draft Round
drafted_career_df['DRAFT_ROUND'] = pd.to_numeric(drafted_career_df['DRAFT_ROUND'], errors='coerce').fillna(0).astype(int)
avg_pts_by_round = drafted_career_df[drafted_career_df['DRAFT_ROUND'] > 0].groupby('DRAFT_ROUND')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(8, 5))
avg_pts_by_round.plot(kind='bar')
plt.title('Average Career Points by Draft Round')
plt.xlabel('Draft Round')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=0)
plt.show()
Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).
3.4 Physical Attributes vs. Performance
# Height vs. Rebounds
plt.figure(figsize=(10,6))
sns.scatterplot(x='HEIGHT_INCHES', y='REB', data=df[df['HEIGHT_INCHES'].notna()], hue='POSITION', alpha=0.6)
plt.title('Height vs. Rebounds')
plt.xlabel('Height (Inches)')
plt.ylabel('Rebounds Per Game')
plt.show()
# Weight vs. Points
plt.figure(figsize=(10,6))
sns.scatterplot(x='WEIGHT_LBS', y='PTS', data=df[df['WEIGHT_LBS'].notna()], hue='POSITION', alpha=0.6)
plt.title('Weight vs. Points')
plt.xlabel('Weight (LBS)')
plt.ylabel('Points Per Game')
plt.show()
3.5 Geographical Diversity & College Impact
# Top Countries (excluding USA for diversity view)
top_countries = df[df['COUNTRY'] != 'USA']['COUNTRY'].value_counts().nlargest(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_countries.index, y=top_countries.values)
plt.title('Top 10 Non-USA Countries Producing NBA Players')
plt.xlabel('Country')
plt.ylabel('Number of Players')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Average career points for players from top non-USA countries
country_avg_pts = df[(df['COUNTRY'].isin(top_countries.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COUNTRY')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(12, 7))
country_avg_pts.plot(kind='bar', color='skyblue')
plt.title('Average Career Points for Top 10 Non-USA Countries')
plt.xlabel('Country')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Top Colleges producing NBA players
top_colleges = df[df['COLLEGE'] != 'Unknown']['COLLEGE'].value_counts().nlargest(15)
plt.figure(figsize=(14, 8))
sns.barplot(x=top_colleges.index, y=top_colleges.values)
plt.title('Top 15 Colleges Producing NBA Players')
plt.xlabel('College')
plt.ylabel('Number of Players')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()
# Average career points for players from top colleges
college_avg_pts = df[(df['COLLEGE'].isin(top_colleges.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COLLEGE')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
college_avg_pts.plot(kind='bar', color='coral')
plt.title('Average Career Points for Players from Top 15 Colleges')
plt.xlabel('College')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()
3.6 Current Season Standouts (Active Players - STATS_TIMEFRAME == 'Season')
# Top 10 active players by PTS in their latest season
top_season_pts = season_df.sort_values(by='PTS', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='PTS', y='PLAYER_NAME', data=top_season_pts, palette='viridis', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Points')
plt.xlabel('Points Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
# Top 10 active players by REB in their latest season
top_season_reb = season_df.sort_values(by='REB', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='REB', y='PLAYER_NAME', data=top_season_reb, palette='magma', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Rebounds')
plt.xlabel('Rebounds Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
# Top 10 active players by AST in their latest season
top_season_ast = season_df.sort_values(by='AST', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='AST', y='PLAYER_NAME', data=top_season_ast, palette='coolwarm', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Assists')
plt.xlabel('Assists Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
4. SQL Query Demonstrations (using pandasql)
# Query 1: Top 10 players by career points
query1 = """
SELECT PLAYER_NAME, PTS
FROM career_df
ORDER BY PTS DESC
LIMIT 10;
"""
top_career_scorers_sql = pysqldf(query1)
print("Top 10 Career Scorers (SQL):\n", top_career_scorers_sql)
# Query 2: Average points per game by position (all players)
query2 = """
SELECT POSITION, AVG(PTS) AS AVG_PTS
FROM df
WHERE POSITION != 'Unknown'
GROUP BY POSITION
ORDER BY AVG_PTS DESC;
"""
avg_pts_by_position_sql = pysqldf(query2)
print("\nAverage PTS by Position (SQL):\n", avg_pts_by_position_sql)
# Query 3: Players from Duke with > 15 career PPG
query3 = """
SELECT PLAYER_NAME, COLLEGE, PTS
FROM career_df
WHERE COLLEGE = 'Duke' AND PTS > 15
ORDER BY PTS DESC;
"""
duke_high_scorers_sql = pysqldf(query3)
print("\nDuke Players with > 15 Career PPG (SQL):\n", duke_high_scorers_sql)
# Query 4: Number of players and average DRAFT_NUMBER by DRAFT_YEAR (for players with 'Career' stats)
query4 = """
SELECT DRAFT_YEAR, COUNT(PERSON_ID) AS Num_Players, AVG(DRAFT_NUMBER) AS Avg_Draft_Pick
FROM career_df
WHERE DRAFT_YEAR IS NOT NULL AND DRAFT_NUMBER IS NOT NULL
GROUP BY DRAFT_YEAR
ORDER BY DRAFT_YEAR DESC
LIMIT 10;
"""
draft_year_summary_sql = pysqldf(query4)
print("\nDraft Year Summary (SQL, last 10 available years with career data):\n", draft_year_summary_sql)
Top 10 Career Scorers (SQL):
PLAYER_NAME PTS
0 Wilt Chamberlain 30.1
1 Michael Jordan 30.1
2 Elgin Baylor 27.4
3 Jerry West 27.0
4 Allen Iverson 26.7
5 Bob Pettit 26.4
6 George Gervin 26.2
7 Oscar Robertson 25.7
8 Kobe Bryant 25.0
9 Karl Malone 25.0
Average PTS by Position (SQL):
POSITION AVG_PTS
0 F-G 8.519753
1 C-F 7.537719
2 G-F 7.106633
3 F-C 6.872857
4 G 6.461704
5 F 5.994951
6 C 5.760947
Duke Players with > 15 Career PPG (SQL):
PLAYER_NAME COLLEGE PTS
0 Grant Hill Duke 16.7
1 Carlos Boozer Duke 16.2
2 Jeff Mullins Duke 16.2
3 Corey Maggette Duke 16.0
4 Elton Brand Duke 15.9
Draft Year Summary (SQL, last 10 available years with career data):
DRAFT_YEAR Num_Players Avg_Draft_Pick
0 2024.0 2 17.000000
1 2023.0 4 47.750000
2 2022.0 4 36.000000
3 2021.0 14 39.714286
4 2020.0 26 39.076923
5 2019.0 27 38.222222
6 2018.0 25 40.480000
7 2017.0 34 34.411765
8 2016.0 37 34.675676
9 2015.0 25 26.200000
5. Predictive Modeling: Career Points Per Game (PTS)
In this section, we'll attempt to build a model to predict a player's Career Points Per Game (PTS) based on information available around the time they entered the league (draft details, college, position, physical attributes).
Target Variable: PTS (for players with STATS_TIMEFRAME == 'Career')
Features: DRAFT_NUMBER, DRAFT_ROUND, HEIGHT_INCHES, WEIGHT_LBS, POSITION, COLLEGE.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
5.1 Data Preparation for Modeling
# Use the 'career_df' created earlier, which is df[df['STATS_TIMEFRAME'] == 'Career']
model_df = career_df.copy()
# Select features and target
features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS', 'POSITION', 'COLLEGE']
target = 'PTS'
# Filter out rows where essential features or target are missing for modeling
# DRAFT_NUMBER is crucial. Let's focus on drafted players.
model_df = model_df[model_df['DRAFT_NUMBER'].notna()]
model_df = model_df[model_df[target].notna()] # PTS should already be handled, but good check
# For simplicity, we'll drop rows if HEIGHT_INCHES or WEIGHT_LBS are missing in this subset
model_df.dropna(subset=['HEIGHT_INCHES', 'WEIGHT_LBS'], inplace=True)
# Handle DRAFT_ROUND: Ensure it's numeric; fill NaNs for undrafted (though we filtered by DRAFT_NUMBER)
model_df['DRAFT_ROUND'] = pd.to_numeric(model_df['DRAFT_ROUND'], errors='coerce').fillna(0) # 0 for undrafted/unknown
# Reset index for clean processing
model_df.reset_index(drop=True, inplace=True)
X = model_df[features]
y = model_df[target]
print(f"Shape of X: {X.shape}, Shape of y: {y.shape}")
X.head()
Shape of X: (3010, 6), Shape of y: (3010,)
| | DRAFT_NUMBER | DRAFT_ROUND | HEIGHT_INCHES | WEIGHT_LBS | POSITION | COLLEGE |
|---|---|---|---|---|---|---|
| 0 | 25.0 | 1.0 | 82.0 | 240.0 | F | Duke |
| 1 | 5.0 | 1.0 | 81.0 | 235.0 | C | Iowa State |
| 2 | 1.0 | 1.0 | 86.0 | 225.0 | C | UCLA |
| 3 | 3.0 | 1.0 | 73.0 | 162.0 | G | Louisiana State |
| 4 | 11.0 | 1.0 | 78.0 | 235.0 | F-G | San Jose State |
X.isnull().sum() # Check for NaNs in features before preprocessing
DRAFT_NUMBER     0
DRAFT_ROUND      0
HEIGHT_INCHES    0
WEIGHT_LBS       0
POSITION         0
COLLEGE          0
dtype: int64
5.2 Feature Preprocessing and Pipeline Setup
# Identify categorical and numerical features
categorical_features = ['POSITION', 'COLLEGE']
numerical_features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS']
# Preprocessing for numerical features: Impute NaNs (if any) and scale
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # Handles any stray NaNs
('scaler', StandardScaler())
])
# Preprocessing for categorical features: Impute NaNs and one-hot encode
# For 'COLLEGE', there are many unique values. We'll use handle_unknown='ignore'
# which means if a college seen in test wasn't in train, its OHE columns will be all zeros.
# A more robust approach for 'COLLEGE' might involve feature hashing or limiting to top N colleges.
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
],
remainder='passthrough' # Keep other columns (if any)
)
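The comments above note that COLLEGE's high cardinality makes plain one-hot encoding unwieldy. One lightweight alternative mentioned there is limiting to the top N colleges; here is a minimal sketch (the helper name, top_n value, and sample data are illustrative, not part of this notebook's pipeline):

```python
import pandas as pd

# Hypothetical helper: keep only the top_n most frequent categories and
# map everything else to a single 'Other' bucket before one-hot encoding.
def collapse_rare_categories(s: pd.Series, top_n: int = 30) -> pd.Series:
    keep = s.value_counts().nlargest(top_n).index
    return s.where(s.isin(keep), 'Other')

# Toy data (illustrative only)
colleges = pd.Series(['Duke', 'Duke', 'Kentucky', 'Guilford', 'UCLA', 'UCLA'])
print(collapse_rare_categories(colleges, top_n=2).tolist())
# -> ['Duke', 'Duke', 'Other', 'Other', 'UCLA', 'UCLA']
```

Applied to X['COLLEGE'] before the ColumnTransformer, this would shrink the one-hot matrix from hundreds of columns to at most top_n + 1.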
5.3 Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
X_train shape: (2408, 6), X_test shape: (602, 6)
y_train shape: (2408,), y_test shape: (602,)
5.4 Model Training and Evaluation
# Define models to train
models = {
"Linear Regression": LinearRegression(),
"Ridge Regression": Ridge(alpha=1.0),
"Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10),
"Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=5)
}
results = {}
for name, model in models.items():
print(f"Training {name}...")
# Create the full pipeline: preprocessor + model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', model)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
results[name] = {'RMSE': rmse, 'R2 Score': r2}
print(f"{name} - RMSE: {rmse:.4f}, R2 Score: {r2:.4f}\n")
# Display results
results_df = pd.DataFrame(results).T.sort_values(by='R2 Score', ascending=False)
print("Model Performance Summary:")
print(results_df)
Training Linear Regression...
Linear Regression - RMSE: 4.4395, R2 Score: 0.0745
Training Ridge Regression...
Ridge Regression - RMSE: 4.3185, R2 Score: 0.1243
Training Random Forest Regressor...
Random Forest Regressor - RMSE: 3.9918, R2 Score: 0.2518
Training Gradient Boosting Regressor...
Gradient Boosting Regressor - RMSE: 4.0023, R2 Score: 0.2478
Model Performance Summary:
RMSE R2 Score
Random Forest Regressor 3.991784 0.251754
Gradient Boosting Regressor 4.002261 0.247821
Ridge Regression 4.318512 0.124254
Linear Regression 4.439528 0.074485
5.5 Feature Importances (for tree-based models)
Let's look at feature importances from the Random Forest Regressor, which often performs well.
# Get the trained Random Forest pipeline
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10))])
rf_pipeline.fit(X_train, y_train) # Fit again to ensure we have the fitted preprocessor
# Get feature names after one-hot encoding
# The preprocessor must be fitted first to get feature names
# We access the 'onehot' step from the 'cat' transformer within the 'preprocessor'
try:
ohe_feature_names = rf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(ohe_feature_names)
except Exception as e:
print(f"Error getting OHE feature names: {e}")
all_feature_names = None # Fallback
if all_feature_names and hasattr(rf_pipeline.named_steps['regressor'], 'feature_importances_'):
importances = rf_pipeline.named_steps['regressor'].feature_importances_
feature_importance_df = pd.DataFrame({'feature': all_feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
print("\nTop 20 Feature Importances (Random Forest):")
print(feature_importance_df.head(20))
plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=feature_importance_df.head(20), palette='crest', hue='feature', dodge=False, legend=False)
plt.title('Top 20 Feature Importances for Predicting Career PTS')
plt.tight_layout()
plt.show()
else:
print("Could not retrieve feature importances. Ensure the model is a tree-based model and preprocessor is fitted.")
Top 20 Feature Importances (Random Forest):
feature importance
0 DRAFT_NUMBER 0.582012
2 HEIGHT_INCHES 0.039265
3 WEIGHT_LBS 0.036991
1 DRAFT_ROUND 0.017066
121 COLLEGE_Eastern Michigan 0.014176
9 POSITION_G 0.012894
258 COLLEGE_North Carolina 0.009230
203 COLLEGE_Louisiana Tech 0.007947
176 COLLEGE_Indiana State 0.007173
182 COLLEGE_Jacksonville 0.007045
144 COLLEGE_Gardner-Webb 0.006637
274 COLLEGE_Notre Dame 0.005701
4 POSITION_C 0.005602
156 COLLEGE_Guilford 0.005532
252 COLLEGE_New Mexico State 0.005111
65 COLLEGE_Buffalo State 0.004914
384 COLLEGE_Tennessee 0.004706
344 COLLEGE_South Carolina 0.004582
35 COLLEGE_Auburn 0.004376
223 COLLEGE_Memphis 0.004340
5.6 Model Interpretation and Limitations
Interpretation:
- The R-squared value is the proportion of the variance in career PTS that our features explain. Higher is better.
- RMSE (Root Mean Squared Error) is the typical prediction error, in the same units as PTS (points per game). Lower is better.
- Feature importances highlight which factors (draft number, height, specific colleges/positions) the model found most influential in predicting career points. DRAFT_NUMBER is often a very strong predictor, as it is here.
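To make the two metrics concrete, here is a tiny worked example with made-up numbers (not output from the model above):

```python
import numpy as np

# Three fictitious players: actual vs. predicted career PPG
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE: typical prediction error, in points per game
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r2, 3))  # 2.38 0.915
```

An R2 of 0.915 would mean 91.5% of the variance is explained; our real models sit near 0.25, so most of the variance in career PTS is left unexplained.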
Limitations:
- Data Granularity: We predict career-average PTS, which smooths out year-to-year variation and does not directly capture peak performance or longevity.
- College Feature: The COLLEGE feature has high cardinality. OneHotEncoder with handle_unknown='ignore' is a start, but more advanced techniques such as target encoding (with careful cross-validation to prevent leakage), embedding layers, or grouping colleges by conference/tier could improve performance or interpretability. For this example, we used basic OHE.
- "Talent Not Captured": Many intangible factors (work ethic, injury luck, coaching, team fit) that significantly shape a player's career are not present in this dataset.
- Model Complexity: More complex models might yield slightly better R-squared values but can be harder to interpret and are prone to overfitting if not carefully tuned.
- Definition of "Performance": PTS is just one aspect of performance. A more holistic measure (like PER or Win Shares, if available) could be a more comprehensive target, but this dataset focuses on basic box-score stats.
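The target-encoding alternative mentioned in the limitations can be sketched in a few lines. This is a simplified illustration with made-up numbers (the helper and smoothing value are assumptions); in a real pipeline the encoding must be fit on training folds only, via cross-fitting, to avoid leaking the target:

```python
import pandas as pd

# Minimal sketch of smoothed target (mean) encoding for a high-cardinality
# column such as COLLEGE. Each category is replaced by a blend of its mean
# target and the global mean, weighted by how many rows the category has.
def target_encode(train_col, train_target, smoothing=10.0):
    global_mean = train_target.mean()
    stats = train_target.groupby(train_col).agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    encoding = weight * stats['mean'] + (1 - weight) * global_mean
    return train_col.map(encoding).fillna(global_mean)

# Toy data (illustrative only): global mean PPG is 11.4
col = pd.Series(['Duke', 'Duke', 'UCLA', 'UCLA', 'UCLA'])
pts = pd.Series([16.0, 14.0, 8.0, 10.0, 9.0])
print(target_encode(col, pts, smoothing=2.0).round(3).tolist())
# -> [13.2, 13.2, 9.96, 9.96, 9.96]
```

Small categories are pulled toward the global mean, which tames the noisy per-college averages that basic OHE leaves exposed.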
This predictive modeling exercise serves as a demonstration of applying machine learning techniques to sports data. The results should be viewed as exploratory rather than definitive predictions of player success.