NBA Player Statistics: Exploratory Data Analysis
This notebook explores a dataset of NBA player statistics to uncover insights about player performance, draft outcomes, physical attributes, and more.
1. Setup and Data Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandasql import sqldf # For running SQL queries on pandas DataFrames
# Setup for pandasql
pysqldf = lambda q: sqldf(q, globals())
# Plotting preferences
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 7)
plt.rcParams['font.size'] = 12
# Load the dataset
try:
df = pd.read_csv('../data/PlayerIndex_nba_stats.csv')
except FileNotFoundError:
df = pd.read_csv('data/PlayerIndex_nba_stats.csv') # Fallback for running directly in notebooks folder
print("Dataset Loaded. Shape:", df.shape)
df.head()
Dataset Loaded. Shape: (5025, 26)
| | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | ... | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | PTS | REB | AST | STATS_TIMEFRAME | FROM_YEAR | TO_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76001 | Abdelnaby | Alaa | alaa-abdelnaby | 1610612757 | blazers | 0 | Portland | Trail Blazers | POR | ... | 1990.0 | 1.0 | 25.0 | NaN | 5.7 | 3.3 | 0.3 | Career | 1990 | 1994 |
| 1 | 76002 | Abdul-Aziz | Zaid | zaid-abdul-aziz | 1610612745 | rockets | 0 | Houston | Rockets | HOU | ... | 1968.0 | 1.0 | 5.0 | NaN | 9.0 | 8.0 | 1.2 | Career | 1968 | 1977 |
| 2 | 76003 | Abdul-Jabbar | Kareem | kareem-abdul-jabbar | 1610612747 | lakers | 0 | Los Angeles | Lakers | LAL | ... | 1969.0 | 1.0 | 1.0 | NaN | 24.6 | 11.2 | 3.6 | Career | 1969 | 1988 |
| 3 | 51 | Abdul-Rauf | Mahmoud | mahmoud-abdul-rauf | 1610612743 | nuggets | 0 | Denver | Nuggets | DEN | ... | 1990.0 | 1.0 | 3.0 | NaN | 14.6 | 1.9 | 3.5 | Career | 1990 | 2000 |
| 4 | 1505 | Abdul-Wahad | Tariq | tariq-abdul-wahad | 1610612758 | kings | 0 | Sacramento | Kings | SAC | ... | 1997.0 | 1.0 | 11.0 | NaN | 7.8 | 3.3 | 1.1 | Career | 1997 | 2003 |
5 rows × 26 columns
2. Data Cleaning and Preprocessing
print("Initial data types:\n", df.dtypes)
print("\nMissing values before cleaning:\n", df.isnull().sum())
# Combine First and Last Name
df['PLAYER_NAME'] = df['PLAYER_FIRST_NAME'] + ' ' + df['PLAYER_LAST_NAME']
# Function to convert height (e.g., "6-10") to inches
def height_to_inches(height_str):
if pd.isna(height_str) or not isinstance(height_str, str) or '-' not in height_str:
return np.nan
try:
feet, inches = map(int, height_str.split('-'))
return feet * 12 + inches
except ValueError:
return np.nan
df['HEIGHT_INCHES'] = df['HEIGHT'].apply(height_to_inches)
# Convert WEIGHT to numeric, handling potential non-numeric entries
df['WEIGHT_LBS'] = pd.to_numeric(df['WEIGHT'], errors='coerce')
# Convert PTS, REB, AST to numeric, coercing errors
for col in ['PTS', 'REB', 'AST']:
df[col] = pd.to_numeric(df[col], errors='coerce')
# Handle missing numerical stats (PTS, REB, AST) - fill with 0 or mean/median
# For this EDA, filling with 0 might be acceptable if we assume missing means no recorded stat.
# Alternatively, drop rows or impute. Let's fill with 0 for simplicity in aggregation.
df[['PTS', 'REB', 'AST']] = df[['PTS', 'REB', 'AST']].fillna(0)
# Handle missing DRAFT_NUMBER - important for draft analysis
# We can fill NaN DRAFT_NUMBER with a high value (e.g., max_draft_pick + 1) or a specific category for "Undrafted"
# For numerical analysis, a high number might skew things. Let's create an 'IS_UNDRAFTED' column.
df['DRAFT_NUMBER'] = pd.to_numeric(df['DRAFT_NUMBER'], errors='coerce')
df['IS_UNDRAFTED'] = df['DRAFT_NUMBER'].isnull()
# Fill NaN DRAFT_NUMBER with a value that indicates undrafted if we want to keep it numeric for some plots
# Or, for draft-specific analysis, we might drop these NaNs.
# For now, let's keep NaNs for DRAFT_NUMBER and use IS_UNDRAFTED for categorical.
# Clean up categorical columns
for col in ['POSITION', 'COLLEGE', 'COUNTRY']:
df[col] = df[col].fillna('Unknown')
# Drop original HEIGHT and WEIGHT columns as we have numeric versions
df.drop(columns=['HEIGHT', 'WEIGHT'], inplace=True)
print("\nMissing values after cleaning:\n", df.isnull().sum())
df.describe(include=[np.number])
df.info()
Initial data types:
PERSON_ID              int64
PLAYER_LAST_NAME      object
PLAYER_FIRST_NAME     object
PLAYER_SLUG           object
TEAM_ID                int64
TEAM_SLUG             object
IS_DEFUNCT             int64
TEAM_CITY             object
TEAM_NAME             object
TEAM_ABBREVIATION     object
JERSEY_NUMBER         object
POSITION              object
HEIGHT                object
WEIGHT               float64
COLLEGE               object
COUNTRY               object
DRAFT_YEAR           float64
DRAFT_ROUND          float64
DRAFT_NUMBER         float64
ROSTER_STATUS        float64
PTS                  float64
REB                  float64
AST                  float64
STATS_TIMEFRAME       object
FROM_YEAR              int64
TO_YEAR                int64
dtype: object

Missing values before cleaning:
PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION               48
HEIGHT                 47
WEIGHT                 53
COLLEGE                 1
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                    24
REB                   316
AST                    24
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
dtype: int64

Missing values after cleaning:
PERSON_ID               0
PLAYER_LAST_NAME        0
PLAYER_FIRST_NAME       1
PLAYER_SLUG             0
TEAM_ID                 0
TEAM_SLUG             266
IS_DEFUNCT              0
TEAM_CITY               0
TEAM_NAME               0
TEAM_ABBREVIATION       0
JERSEY_NUMBER         351
POSITION                0
COLLEGE                 0
COUNTRY                 0
DRAFT_YEAR           1325
DRAFT_ROUND          1523
DRAFT_NUMBER         1591
ROSTER_STATUS        4491
PTS                     0
REB                     0
AST                     0
STATS_TIMEFRAME         0
FROM_YEAR               0
TO_YEAR                 0
PLAYER_NAME             1
HEIGHT_INCHES          47
WEIGHT_LBS             53
IS_UNDRAFTED            0
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5025 entries, 0 to 5024
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   PERSON_ID          5025 non-null   int64
 1   PLAYER_LAST_NAME   5025 non-null   object
 2   PLAYER_FIRST_NAME  5024 non-null   object
 3   PLAYER_SLUG        5025 non-null   object
 4   TEAM_ID            5025 non-null   int64
 5   TEAM_SLUG          4759 non-null   object
 6   IS_DEFUNCT         5025 non-null   int64
 7   TEAM_CITY          5025 non-null   object
 8   TEAM_NAME          5025 non-null   object
 9   TEAM_ABBREVIATION  5025 non-null   object
 10  JERSEY_NUMBER      4674 non-null   object
 11  POSITION           5025 non-null   object
 12  COLLEGE            5025 non-null   object
 13  COUNTRY            5025 non-null   object
 14  DRAFT_YEAR         3700 non-null   float64
 15  DRAFT_ROUND        3502 non-null   float64
 16  DRAFT_NUMBER       3434 non-null   float64
 17  ROSTER_STATUS      534 non-null    float64
 18  PTS                5025 non-null   float64
 19  REB                5025 non-null   float64
 20  AST                5025 non-null   float64
 21  STATS_TIMEFRAME    5025 non-null   object
 22  FROM_YEAR          5025 non-null   int64
 23  TO_YEAR            5025 non-null   int64
 24  PLAYER_NAME        5024 non-null   object
 25  HEIGHT_INCHES      4978 non-null   float64
 26  WEIGHT_LBS         4972 non-null   float64
 27  IS_UNDRAFTED       5025 non-null   bool
dtypes: bool(1), float64(9), int64(5), object(13)
memory usage: 1.0+ MB
3. Exploratory Data Analysis (EDA)
3.1 Overall Player Statistics (Career Stats for a Broader View)
For a general overview, we focus on rows with 'Career' stats. For active players, rows with 'Season' stats reflect their latest season; those are examined in Section 3.6.
career_df = df[df['STATS_TIMEFRAME'] == 'Career'].copy()
season_df = df[df['STATS_TIMEFRAME'] == 'Season'].copy() # Latest season for active players
# Distributions of key career stats
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.histplot(career_df['PTS'], kde=True, ax=axes[0], bins=30)
axes[0].set_title('Distribution of Career Points')
sns.histplot(career_df['REB'], kde=True, ax=axes[1], bins=30)
axes[1].set_title('Distribution of Career Rebounds')
sns.histplot(career_df['AST'], kde=True, ax=axes[2], bins=30)
axes[2].set_title('Distribution of Career Assists')
plt.tight_layout()
plt.show()
# Relationship between PTS, REB, AST (Career)
sns.pairplot(career_df[['PTS', 'REB', 'AST']].dropna(), kind='scatter', plot_kws={'alpha':0.5})
plt.suptitle('Pairplot of Career PTS, REB, AST', y=1.02)
plt.show()
3.2 Positional Analysis
# Average stats by position (using all data, acknowledging Career/Season mix)
position_stats = df.groupby('POSITION')[['PTS', 'REB', 'AST', 'HEIGHT_INCHES', 'WEIGHT_LBS']].mean().sort_values(by='PTS', ascending=False)
print("Average Stats by Position:\n", position_stats)
Average Stats by Position:
PTS REB AST HEIGHT_INCHES WEIGHT_LBS
POSITION
F-G 8.519753 3.185185 1.725926 78.666667 217.370370
C-F 7.537719 5.187719 1.120175 82.359649 249.850877
G-F 7.106633 2.586224 1.501531 78.061224 211.622449
F-C 6.872857 4.467143 1.023571 82.035714 242.878571
G 6.461704 1.773255 2.089784 74.859343 189.695786
F 5.994951 3.077223 0.977827 79.246019 217.992303
C 5.760947 4.078698 0.843047 82.766272 242.146884
Unknown 3.391667 0.302083 0.750000 78.500000 215.000000
# Boxplot of Points by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='PTS', data=df, order=position_stats.index)
plt.title('Points Distribution by Position')
plt.show()
# Boxplot of Rebounds by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='REB', data=df, order=df.groupby('POSITION')['REB'].median().sort_values(ascending=False).index)
plt.title('Rebounds Distribution by Position')
plt.show()
# Boxplot of Assists by Position
plt.figure(figsize=(10, 6))
sns.boxplot(x='POSITION', y='AST', data=df, order=df.groupby('POSITION')['AST'].median().sort_values(ascending=False).index)
plt.title('Assists Distribution by Position')
plt.show()
# Height distribution by position
plt.figure(figsize=(12, 7))
sns.violinplot(x='POSITION', y='HEIGHT_INCHES', data=df[df['HEIGHT_INCHES'].notna()], order=df.groupby('POSITION')['HEIGHT_INCHES'].median().sort_values(ascending=False).index)
plt.title('Height Distribution by Position')
plt.ylabel('Height (Inches)')
plt.show()
3.3 Draft Analysis (Focus on 'Career' stats for players with completed/longer careers)
# Relationship between Draft Number and Career Points
# Filter out obvious outliers or undrafted players for a cleaner plot if DRAFT_NUMBER is high
drafted_career_df = career_df[career_df['DRAFT_NUMBER'].notna() & (career_df['DRAFT_NUMBER'] <= 60)].copy() # Modern drafts have 60 picks; .copy() avoids SettingWithCopyWarning on later assignments
plt.figure(figsize=(12, 7))
sns.scatterplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, alpha=0.5)
# Add a lowess trend line for non-linear trends (lowess=True requires the statsmodels package)
sns.regplot(x='DRAFT_NUMBER', y='PTS', data=drafted_career_df, scatter=False, lowess=True, color='red', line_kws={'linewidth': 2})
plt.title('Draft Pick Number vs. Average Career Points')
plt.xlabel('Draft Pick Number (Lower is better)')
plt.ylabel('Average Career Points Per Game')
plt.show()
print("Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).")
# Average Career Points by Draft Round
drafted_career_df['DRAFT_ROUND'] = pd.to_numeric(drafted_career_df['DRAFT_ROUND'], errors='coerce').fillna(0).astype(int)
avg_pts_by_round = drafted_career_df[drafted_career_df['DRAFT_ROUND'] > 0].groupby('DRAFT_ROUND')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(8, 5))
avg_pts_by_round.plot(kind='bar')
plt.title('Average Career Points by Draft Round')
plt.xlabel('Draft Round')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=0)
plt.show()
Note: This plot focuses on players with 'Career' stats and a valid draft number (<=60).
3.4 Physical Attributes vs. Performance
# Height vs. Rebounds
plt.figure(figsize=(10,6))
sns.scatterplot(x='HEIGHT_INCHES', y='REB', data=df[df['HEIGHT_INCHES'].notna()], hue='POSITION', alpha=0.6)
plt.title('Height vs. Rebounds')
plt.xlabel('Height (Inches)')
plt.ylabel('Rebounds Per Game')
plt.show()
# Weight vs. Points
plt.figure(figsize=(10,6))
sns.scatterplot(x='WEIGHT_LBS', y='PTS', data=df[df['WEIGHT_LBS'].notna()], hue='POSITION', alpha=0.6)
plt.title('Weight vs. Points')
plt.xlabel('Weight (LBS)')
plt.ylabel('Points Per Game')
plt.show()
3.5 Geographical Diversity & College Impact
# Top Countries (excluding USA for diversity view)
top_countries = df[df['COUNTRY'] != 'USA']['COUNTRY'].value_counts().nlargest(10)
plt.figure(figsize=(12, 7))
sns.barplot(x=top_countries.index, y=top_countries.values)
plt.title('Top 10 Non-USA Countries Producing NBA Players')
plt.xlabel('Country')
plt.ylabel('Number of Players')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Average career points for players from top non-USA countries
country_avg_pts = df[(df['COUNTRY'].isin(top_countries.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COUNTRY')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(12, 7))
country_avg_pts.plot(kind='bar', color='skyblue')
plt.title('Average Career Points for Top 10 Non-USA Countries')
plt.xlabel('Country')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Top Colleges producing NBA players
top_colleges = df[df['COLLEGE'] != 'Unknown']['COLLEGE'].value_counts().nlargest(15)
plt.figure(figsize=(14, 8))
sns.barplot(x=top_colleges.index, y=top_colleges.values)
plt.title('Top 15 Colleges Producing NBA Players')
plt.xlabel('College')
plt.ylabel('Number of Players')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()
# Average career points for players from top colleges
college_avg_pts = df[(df['COLLEGE'].isin(top_colleges.index)) & (df['STATS_TIMEFRAME'] == 'Career')].groupby('COLLEGE')['PTS'].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 8))
college_avg_pts.plot(kind='bar', color='coral')
plt.title('Average Career Points for Players from Top 15 Colleges')
plt.xlabel('College')
plt.ylabel('Average Career PTS')
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()
3.6 Current Season Standouts (Active Players - STATS_TIMEFRAME == 'Season')
# Top 10 active players by PTS in their latest season
top_season_pts = season_df.sort_values(by='PTS', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='PTS', y='PLAYER_NAME', data=top_season_pts, palette='viridis', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Points')
plt.xlabel('Points Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
# Top 10 active players by REB in their latest season
top_season_reb = season_df.sort_values(by='REB', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='REB', y='PLAYER_NAME', data=top_season_reb, palette='magma', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Rebounds')
plt.xlabel('Rebounds Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
# Top 10 active players by AST in their latest season
top_season_ast = season_df.sort_values(by='AST', ascending=False).head(10)
plt.figure(figsize=(12, 7))
sns.barplot(x='AST', y='PLAYER_NAME', data=top_season_ast, palette='coolwarm', hue='PLAYER_NAME', dodge=False, legend=False)
plt.title('Top 10 Active Players by Latest Season Assists')
plt.xlabel('Assists Per Game (Latest Season)')
plt.ylabel('Player')
plt.tight_layout()
plt.show()
4. SQL Query Demonstrations (using pandasql)
# Query 1: Top 10 players by career points
query1 = """
SELECT PLAYER_NAME, PTS
FROM career_df
ORDER BY PTS DESC
LIMIT 10;
"""
top_career_scorers_sql = pysqldf(query1)
print("Top 10 Career Scorers (SQL):\n", top_career_scorers_sql)
# Query 2: Average points per game by position (all players)
query2 = """
SELECT POSITION, AVG(PTS) AS AVG_PTS
FROM df
WHERE POSITION != 'Unknown'
GROUP BY POSITION
ORDER BY AVG_PTS DESC;
"""
avg_pts_by_position_sql = pysqldf(query2)
print("\nAverage PTS by Position (SQL):\n", avg_pts_by_position_sql)
# Query 3: Players from Duke with > 15 career PPG
query3 = """
SELECT PLAYER_NAME, COLLEGE, PTS
FROM career_df
WHERE COLLEGE = 'Duke' AND PTS > 15
ORDER BY PTS DESC;
"""
duke_high_scorers_sql = pysqldf(query3)
print("\nDuke Players with > 15 Career PPG (SQL):\n", duke_high_scorers_sql)
# Query 4: Number of players and average DRAFT_NUMBER by DRAFT_YEAR (for players with 'Career' stats)
query4 = """
SELECT DRAFT_YEAR, COUNT(PERSON_ID) AS Num_Players, AVG(DRAFT_NUMBER) AS Avg_Draft_Pick
FROM career_df
WHERE DRAFT_YEAR IS NOT NULL AND DRAFT_NUMBER IS NOT NULL
GROUP BY DRAFT_YEAR
ORDER BY DRAFT_YEAR DESC
LIMIT 10;
"""
draft_year_summary_sql = pysqldf(query4)
print("\nDraft Year Summary (SQL, last 10 available years with career data):\n", draft_year_summary_sql)
Top 10 Career Scorers (SQL):
PLAYER_NAME PTS
0 Wilt Chamberlain 30.1
1 Michael Jordan 30.1
2 Elgin Baylor 27.4
3 Jerry West 27.0
4 Allen Iverson 26.7
5 Bob Pettit 26.4
6 George Gervin 26.2
7 Oscar Robertson 25.7
8 Kobe Bryant 25.0
9 Karl Malone 25.0
Average PTS by Position (SQL):
POSITION AVG_PTS
0 F-G 8.519753
1 C-F 7.537719
2 G-F 7.106633
3 F-C 6.872857
4 G 6.461704
5 F 5.994951
6 C 5.760947
Duke Players with > 15 Career PPG (SQL):
PLAYER_NAME COLLEGE PTS
0 Grant Hill Duke 16.7
1 Carlos Boozer Duke 16.2
2 Jeff Mullins Duke 16.2
3 Corey Maggette Duke 16.0
4 Elton Brand Duke 15.9
Draft Year Summary (SQL, last 10 available years with career data):
DRAFT_YEAR Num_Players Avg_Draft_Pick
0 2024.0 2 17.000000
1 2023.0 4 47.750000
2 2022.0 4 36.000000
3 2021.0 14 39.714286
4 2020.0 26 39.076923
5 2019.0 27 38.222222
6 2018.0 25 40.480000
7 2017.0 34 34.411765
8 2016.0 37 34.675676
9 2015.0 25 26.200000
5. Predictive Modeling: Career Points Per Game (PTS)
In this section, we'll attempt to build a model to predict a player's Career Points Per Game (PTS) based on information available around the time they entered the league (draft details, college, position, physical attributes).
Target Variable: PTS (for players with STATS_TIMEFRAME == 'Career')
Features: DRAFT_NUMBER, DRAFT_ROUND, HEIGHT_INCHES, WEIGHT_LBS, POSITION, COLLEGE.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
5.1 Data Preparation for Modeling
# Use the 'career_df' created earlier, which is df[df['STATS_TIMEFRAME'] == 'Career']
model_df = career_df.copy()
# Select features and target
features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS', 'POSITION', 'COLLEGE']
target = 'PTS'
# Filter out rows where essential features or target are missing for modeling
# DRAFT_NUMBER is crucial. Let's focus on drafted players.
model_df = model_df[model_df['DRAFT_NUMBER'].notna()]
model_df = model_df[model_df[target].notna()] # PTS should already be handled, but good check
# For simplicity, we'll drop rows if HEIGHT_INCHES or WEIGHT_LBS are missing in this subset
model_df.dropna(subset=['HEIGHT_INCHES', 'WEIGHT_LBS'], inplace=True)
# Handle DRAFT_ROUND: Ensure it's numeric; fill NaNs for undrafted (though we filtered by DRAFT_NUMBER)
model_df['DRAFT_ROUND'] = pd.to_numeric(model_df['DRAFT_ROUND'], errors='coerce').fillna(0) # 0 for undrafted/unknown
# Reset index for clean processing
model_df.reset_index(drop=True, inplace=True)
X = model_df[features]
y = model_df[target]
print(f"Shape of X: {X.shape}, Shape of y: {y.shape}")
X.head()
Shape of X: (3010, 6), Shape of y: (3010,)
| | DRAFT_NUMBER | DRAFT_ROUND | HEIGHT_INCHES | WEIGHT_LBS | POSITION | COLLEGE |
|---|---|---|---|---|---|---|
| 0 | 25.0 | 1.0 | 82.0 | 240.0 | F | Duke |
| 1 | 5.0 | 1.0 | 81.0 | 235.0 | C | Iowa State |
| 2 | 1.0 | 1.0 | 86.0 | 225.0 | C | UCLA |
| 3 | 3.0 | 1.0 | 73.0 | 162.0 | G | Louisiana State |
| 4 | 11.0 | 1.0 | 78.0 | 235.0 | F-G | San Jose State |
X.isnull().sum() # Check for NaNs in features before preprocessing
DRAFT_NUMBER     0
DRAFT_ROUND      0
HEIGHT_INCHES    0
WEIGHT_LBS       0
POSITION         0
COLLEGE          0
dtype: int64
5.2 Feature Preprocessing and Pipeline Setup
# Identify categorical and numerical features
categorical_features = ['POSITION', 'COLLEGE']
numerical_features = ['DRAFT_NUMBER', 'DRAFT_ROUND', 'HEIGHT_INCHES', 'WEIGHT_LBS']
# Preprocessing for numerical features: Impute NaNs (if any) and scale
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')), # Handles any stray NaNs
('scaler', StandardScaler())
])
# Preprocessing for categorical features: Impute NaNs and one-hot encode
# For 'COLLEGE', there are many unique values. We'll use handle_unknown='ignore'
# which means if a college seen in test wasn't in train, its OHE columns will be all zeros.
# A more robust approach for 'COLLEGE' might involve feature hashing or limiting to top N colleges.
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
],
remainder='passthrough' # Keep other columns (if any)
)
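The comments above note that COLLEGE's high cardinality makes plain one-hot encoding unwieldy. One lightweight alternative mentioned there is limiting to the top N colleges; here is a minimal sketch (the helper name, top_n value, and sample data are illustrative, not part of this notebook's pipeline):

```python
import pandas as pd

# Hypothetical helper: keep only the top_n most frequent categories and
# map everything else to a single 'Other' bucket before one-hot encoding.
def collapse_rare_categories(s: pd.Series, top_n: int = 30) -> pd.Series:
    keep = s.value_counts().nlargest(top_n).index
    return s.where(s.isin(keep), 'Other')

# Toy data (illustrative only)
colleges = pd.Series(['Duke', 'Duke', 'Kentucky', 'Guilford', 'UCLA', 'UCLA'])
print(collapse_rare_categories(colleges, top_n=2).tolist())
# -> ['Duke', 'Duke', 'Other', 'Other', 'UCLA', 'UCLA']
```

Applied to X['COLLEGE'] before the ColumnTransformer, this would shrink the one-hot matrix from hundreds of columns to at most top_n + 1.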
5.3 Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
X_train shape: (2408, 6), X_test shape: (602, 6)
y_train shape: (2408,), y_test shape: (602,)
5.4 Model Training and Evaluation
# Define models to train
models = {
"Linear Regression": LinearRegression(),
"Ridge Regression": Ridge(alpha=1.0),
"Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10),
"Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=5)
}
results = {}
for name, model in models.items():
print(f"Training {name}...")
# Create the full pipeline: preprocessor + model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', model)])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
results[name] = {'RMSE': rmse, 'R2 Score': r2}
print(f"{name} - RMSE: {rmse:.4f}, R2 Score: {r2:.4f}\n")
# Display results
results_df = pd.DataFrame(results).T.sort_values(by='R2 Score', ascending=False)
print("Model Performance Summary:")
print(results_df)
Training Linear Regression...
Linear Regression - RMSE: 4.4395, R2 Score: 0.0745
Training Ridge Regression...
Ridge Regression - RMSE: 4.3185, R2 Score: 0.1243
Training Random Forest Regressor...
Random Forest Regressor - RMSE: 3.9918, R2 Score: 0.2518
Training Gradient Boosting Regressor...
Gradient Boosting Regressor - RMSE: 4.0023, R2 Score: 0.2478
Model Performance Summary:
RMSE R2 Score
Random Forest Regressor 3.991784 0.251754
Gradient Boosting Regressor 4.002261 0.247821
Ridge Regression 4.318512 0.124254
Linear Regression 4.439528 0.074485
5.5 Feature Importances (for tree-based models)
Let's look at feature importances from the Random Forest Regressor, which often performs well.
# Get the trained Random Forest pipeline
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10, min_samples_split=10))])
rf_pipeline.fit(X_train, y_train) # Fit again to ensure we have the fitted preprocessor
# Get feature names after one-hot encoding
# The preprocessor must be fitted first to get feature names
# We access the 'onehot' step from the 'cat' transformer within the 'preprocessor'
try:
ohe_feature_names = rf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(ohe_feature_names)
except Exception as e:
print(f"Error getting OHE feature names: {e}")
all_feature_names = None # Fallback
if all_feature_names and hasattr(rf_pipeline.named_steps['regressor'], 'feature_importances_'):
importances = rf_pipeline.named_steps['regressor'].feature_importances_
feature_importance_df = pd.DataFrame({'feature': all_feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False)
print("\nTop 20 Feature Importances (Random Forest):")
print(feature_importance_df.head(20))
plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=feature_importance_df.head(20), palette='crest', hue='feature', dodge=False, legend=False)
plt.title('Top 20 Feature Importances for Predicting Career PTS')
plt.tight_layout()
plt.show()
else:
print("Could not retrieve feature importances. Ensure the model is a tree-based model and preprocessor is fitted.")
Top 20 Feature Importances (Random Forest):
feature importance
0 DRAFT_NUMBER 0.582012
2 HEIGHT_INCHES 0.039265
3 WEIGHT_LBS 0.036991
1 DRAFT_ROUND 0.017066
121 COLLEGE_Eastern Michigan 0.014176
9 POSITION_G 0.012894
258 COLLEGE_North Carolina 0.009230
203 COLLEGE_Louisiana Tech 0.007947
176 COLLEGE_Indiana State 0.007173
182 COLLEGE_Jacksonville 0.007045
144 COLLEGE_Gardner-Webb 0.006637
274 COLLEGE_Notre Dame 0.005701
4 POSITION_C 0.005602
156 COLLEGE_Guilford 0.005532
252 COLLEGE_New Mexico State 0.005111
65 COLLEGE_Buffalo State 0.004914
384 COLLEGE_Tennessee 0.004706
344 COLLEGE_South Carolina 0.004582
35 COLLEGE_Auburn 0.004376
223 COLLEGE_Memphis 0.004340
5.6 Model Interpretation and Limitations
Interpretation:
- The R-squared value is the proportion of the variance in career PTS that our features explain. Higher is better.
- RMSE (Root Mean Squared Error) is the typical prediction error, in the same units as PTS (points per game). Lower is better.
- Feature importances highlight which factors (draft number, height, specific colleges/positions) the model found most influential in predicting career points. DRAFT_NUMBER is often a very strong predictor, as it is here.
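To make the two metrics concrete, here is a tiny worked example with made-up numbers (not output from the model above):

```python
import numpy as np

# Three fictitious players: actual vs. predicted career PPG
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE: typical prediction error, in points per game
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r2, 3))  # 2.38 0.915
```

An R2 of 0.915 would mean 91.5% of the variance is explained; our real models sit near 0.25, so most of the variance in career PTS is left unexplained.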
Limitations:
- Data Granularity: We predict career-average PTS, which smooths out year-to-year variation and does not directly capture peak performance or longevity.
- College Feature: The COLLEGE feature has high cardinality. OneHotEncoder with handle_unknown='ignore' is a start, but more advanced techniques such as target encoding (with careful cross-validation to prevent leakage), embedding layers, or grouping colleges by conference/tier could improve performance or interpretability. For this example, we used basic OHE.
- "Talent Not Captured": Many intangible factors (work ethic, injury luck, coaching, team fit) that significantly shape a player's career are not present in this dataset.
- Model Complexity: More complex models might yield slightly better R-squared values but can be harder to interpret and are prone to overfitting if not carefully tuned.
- Definition of "Performance": PTS is just one aspect of performance. A more holistic measure (like PER or Win Shares, if available) could be a more comprehensive target, but this dataset focuses on basic box-score stats.
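The target-encoding alternative mentioned in the limitations can be sketched in a few lines. This is a simplified illustration with made-up numbers (the helper and smoothing value are assumptions); in a real pipeline the encoding must be fit on training folds only, via cross-fitting, to avoid leaking the target:

```python
import pandas as pd

# Minimal sketch of smoothed target (mean) encoding for a high-cardinality
# column such as COLLEGE. Each category is replaced by a blend of its mean
# target and the global mean, weighted by how many rows the category has.
def target_encode(train_col, train_target, smoothing=10.0):
    global_mean = train_target.mean()
    stats = train_target.groupby(train_col).agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    encoding = weight * stats['mean'] + (1 - weight) * global_mean
    return train_col.map(encoding).fillna(global_mean)

# Toy data (illustrative only): global mean PPG is 11.4
col = pd.Series(['Duke', 'Duke', 'UCLA', 'UCLA', 'UCLA'])
pts = pd.Series([16.0, 14.0, 8.0, 10.0, 9.0])
print(target_encode(col, pts, smoothing=2.0).round(3).tolist())
# -> [13.2, 13.2, 9.96, 9.96, 9.96]
```

Small categories are pulled toward the global mean, which tames the noisy per-college averages that basic OHE leaves exposed.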
This predictive modeling exercise serves as a demonstration of applying machine learning techniques to sports data. The results should be viewed as exploratory rather than definitive predictions of player success.