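This post is a data analysis project in Python.
I mostly used pandas, matplotlib, and numpy, the packages for data analysis and data visualization.
I am deep into computer science these days, but as a former physical education major I love sports.
Basketball is the sport I played most. I started in middle school after falling for an NBA player named Ray Allen, a legendary shooting guard (SG) who, at the time of writing, has made the most three-pointers in NBA history and is famous for his clean, beautiful shooting form.
So I used Python to collect this player's per-season stats by category, the league-wide NBA averages for the same seasons, and the stats of players regarded as the best shooting guards ever, and plotted them myself (visualization) to show why he is such a great player (a rather abstract topic, I admit, lol).
Everything in the project is, of course, in English.
Below is the project description and content, copied as-is from the repository on my personal GitHub.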
=======================================================================
- Rough summary -
Title - Why NBA player Ray Allen is the best shooter
Data source - www.basketball-reference.com
Data analysis plots / packages and methods used
Line plot / pandas, matplotlib.pyplot
Scatter plot / pandas, matplotlib.pyplot
Histogram plot / pandas, matplotlib.pyplot
Heatmap / seaborn, numpy, matplotlib.pyplot
Plot Animation / matplotlib.patches, matplotlib.path, matplotlib.animation, matplotlib.pyplot
References
- www.github.com
- Book ('파이썬으로 데이터 주무르기'): a Python data analysis book I personally recommend (I will post a separate review)
- www.stackoverflow.com
- www.basketball-reference.com
- https://matplotlib.org
- http://www.espn.com/nba/story/_/page/nbarankSGs/ranking-top-10-shooting-guards-ever
Software used - Jupyter Notebook
NBA-Ray-Allen-analysis
Description
Title
Why NBA player Ray Allen is the best shooter
Title Selection Reason
1) Love of NBA basketball
2) Favorite player in NBA is Ray Allen
3) Many people don't know about this player even though he is one of the best shooters in NBA history
4) Wanted to show why he is the best shooter in NBA through NBA stats
Hypothesis
Ray Allen is the best shooter in NBA history
Attaining Internet Data Source
On www.basketball-reference.com I looked up the player (Ray Allen) and the NBA overall per-season average stats, saved the data as xlsx files, and read them with the pandas package. On http://www.espn.com/nba/story/ I looked up the top ten shooting guards in NBA history, built the Excel data by hand, and sorted out the ...
Data Analysis / Visualization Method
- Line plot / pandas, matplotlib.pyplot
- Scatter plot / pandas, matplotlib.pyplot
- Histogram plot / pandas, matplotlib.pyplot
- Heatmap / seaborn, numpy, matplotlib.pyplot
- Plot Animation / matplotlib.patches, matplotlib.path, matplotlib.animation, matplotlib.pyplot
Reference material
- www.github.com
- Book('파이썬으로 데이터 주무르기')
- www.stackoverflow.com
- www.basketball-reference.com
- https://matplotlib.org
- http://www.espn.com/nba/story/_/page/nbarankSGs/ranking-top-10-shooting-guards-ever
Selected Software
Jupyter Notebook - clearly shows every step of the code
Effectiveness of NBA data Plot Analysis
How NBA players play can be shown explicitly by visualizing the stat data. Here are a few examples of how effectively data visualization does this.
Let's take a look at how the average weight of NBA players has changed over the years.
Import the required packages using import ~ as ~, which shortens the package names for easier use, and read the data with the pandas package.
Read the Excel data using pd.read_excel(file_location) and sort it with the .sort_values method.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
print("\n\n The Chart Below is NBA average stats\n\n")
NBA_stat = pd.read_excel('../data/NBA_stats.xlsx')
NBA_stat_sorted = NBA_stat.sort_values(by='Season', ascending = True)
print(NBA_stat)
print("\n\n The Chart Below is Ray Allen's NBA career stats\n\n")
Allen_stat = pd.read_excel('../data/Ray_Allen_average.xlsx')
print(Allen_stat)
print('\nFirstly I will extract important columns from NBA stats\n')
NBA_want = pd.read_excel('../data/NBA_stats.xlsx')
NBA_want_sorted = NBA_want.sort_values(by='Season', ascending = True)
print(NBA_want)
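The load-and-sort step can be tried without the Excel files. Below is a minimal sketch, assuming season labels of the form '1996-97' and using made-up weights in place of the real NBA_stats.xlsx values. The names nba_demo and nba_demo_sorted are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for NBA_stats.xlsx (newest season first).
nba_demo = pd.DataFrame({
    "Season": ["2000-01", "1999-00", "1998-99"],
    "Wt": [223, 222, 221],
})

# sort_values returns a new DataFrame; the original keeps its order.
nba_demo_sorted = nba_demo.sort_values(by="Season", ascending=True)

# reset_index(drop=True) renumbers the rows so positional slices
# such as frame[18:36] follow the sorted order.
nba_demo_sorted = nba_demo_sorted.reset_index(drop=True)

print(nba_demo_sorted)
```

Note that printing the original frame (as print(NBA_stat) does above) still shows the unsorted order, since only the returned frame is sorted.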
Pull out the wanted columns with variable[column_name]
NBA_season = NBA_want_sorted['Season']
NBA_weight = NBA_want_sorted['Wt']
Draw the line plot
The plt.figure(figsize) method creates the figure that holds the plot, and plt.plot(details) puts the wanted data on the x and y axes and sets the style options for the line. The plt.xlabel(label name) and plt.ylabel(label name) methods set the label of each axis, plt.title(title name) sets the title of the plot, and plt.grid() shows the grid, which is the set of reference lines inside the plot.
Last but not least, plt.show() is the final call that displays the plot.
plt.figure(figsize=(40,20))
plt.plot(NBA_season, NBA_weight, color = 'red', linestyle='dashed', marker = 'o', markerfacecolor ='blue', markersize=10, lw=4)
plt.xlabel('Season', size =20)
plt.ylabel('Player Weight', size = 20)
plt.title('NBA Player Weight Change', size=30)
plt.xticks(rotation = 90)
plt.grid()
plt.show()
The plot shows an almost steady increase in weight across the seasons, followed by a slight decrease in recent years.
One reading is that players came to need muscle more than lung capacity.
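The same kind of line plot can also be written against an Axes object instead of the plt.* functions. This is only a sketch with made-up weights, not the real league data, and it uses the off-screen Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

# Made-up values, not the real league averages.
seasons = ["1996-97", "1997-98", "1998-99", "1999-00"]
weights = [220, 221, 223, 222]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(seasons, weights, color="red", linestyle="dashed", marker="o",
        markerfacecolor="blue", markersize=10, lw=2)
ax.set_xlabel("Season")
ax.set_ylabel("Player Weight")
ax.set_title("NBA Player Weight Change (illustrative data)")
ax.tick_params(axis="x", labelrotation=90)  # rotate the season labels
ax.grid()
fig.canvas.draw()  # render once; in a notebook the figure displays itself
```

Keeping a handle on ax makes it easier to adjust one plot later without touching whichever figure happens to be current.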
Let's look at how other important stats changed by creating more line plots, starting with shooting.
Pull out the wanted columns with variable[column_name] and draw two separate line plots.
NBA_FG_percent = NBA_want_sorted['FG%']
NBA_three_percent = NBA_want_sorted['3P%']
plt.figure(figsize = (80,25))
plt.plot(NBA_season, NBA_FG_percent, lw=10)
plt.xlabel('Season', size =50)
plt.ylabel('FieldGoal Percentage', size = 50)
plt.title('NBA Player FieldGoal Percentage', size=65)
plt.grid()
plt.show()
plt.figure(figsize=(80,25))
plt.plot(NBA_season, NBA_three_percent, color = 'purple', linestyle='-.', marker = 'p', markerfacecolor ='white', markersize=20, lw=13)
plt.xlabel('Season', size =50)
plt.ylabel('3point Percentage', size = 50)
plt.title('NBA Player 3point Percentage', size=65)
plt.grid()
plt.show()
Due to NBA rule changes and strategic shifts, field goal percentage showed a clear drop until the late 1990s.
It has risen again over the last 3 seasons.
3-point percentage showed a steady increase throughout the seasons.
Ray Allen Comparison Analysis
Now let's compare these overall NBA average stats with Ray Allen's stats.
Ray Allen is an SG (Shooting Guard) famous for specializing in 3-point shooting.
Restrict the NBA average stat data to the 1996-97 through 2013-14 seasons so that it matches Ray Allen's career.
NBA_short_stat = NBA_want_sorted[18:36]
NBA_short_stat_season = NBA_short_stat['Season']
NBA_short_stat_three_percent = NBA_short_stat['3P%']
NBA_short_stat_FG_percent = NBA_short_stat['FG%']
NBA_short_stat_freethrow_percent = NBA_short_stat['FT%']
Allen_FG_percent = Allen_stat['FG%']
Allen_season = Allen_stat['Season']
Allen_freethrow_percent = Allen_stat['FT%']
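The positional slice [18:36] only works if the rows sit at exactly those positions. A sketch of selecting the same span by season label instead, assuming labels of the form '1996-97'. The numbers are made up and the names league and league_short are hypothetical.

```python
import pandas as pd

# Made-up league rows; only the Season labels matter here.
league = pd.DataFrame({
    "Season": ["1994-95", "1995-96", "1996-97", "1997-98", "2013-14", "2014-15"],
    "FT%": [0.73, 0.74, 0.74, 0.73, 0.75, 0.75],
})

# The labels share one fixed-width format, so plain string comparison
# follows season order. between() includes both end points.
in_career = league["Season"].between("1996-97", "2013-14")
league_short = league[in_career]

print(league_short["Season"].tolist())
```

Selecting by label keeps working if seasons are later added to or removed from the spreadsheet.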
Pull out the wanted values again. This time use plt.scatter to draw a scatter plot.
plt.legend adds the key for the two marker shapes; loc = 0 lets Matplotlib pick the best location.
plt.figure(figsize = (40,18))
plt.scatter(NBA_short_stat_season, NBA_short_stat_FG_percent, marker = 'd', s = 200, label = 'NBA')
plt.scatter(Allen_season, Allen_FG_percent, marker = 'p', s = 200, label = 'Allen')
plt.xlabel('Season', size = 33)
plt.ylabel('FieldGoal Percentage', size = 30)
plt.title('NBA and Ray Allen FieldGoal Percentage Comparison per Season', size = 40)
plt.grid()
plt.legend(loc = 0, prop ={'size' : 32})
plt.show()
Allen_three_percent = Allen_stat['3P%']
plt.figure(figsize=(40,18))
plt.scatter(NBA_short_stat_season, NBA_short_stat_three_percent, color='b', label = 'NBA', marker = '>', s=200)
plt.scatter(Allen_season, Allen_three_percent,color='g', label = 'Allen', s=200)
plt.title('NBA and Ray Allen 3point Percentage Comparison per Season', size = 40)
plt.xlabel('Season', size = 33)
plt.ylabel('3Point Percentage', size = 30)
plt.grid()
plt.legend(loc=0, prop = {'size':32})
plt.show()
plt.figure(figsize=(40,18))
plt.scatter(NBA_short_stat_season, NBA_short_stat_freethrow_percent, color = 'purple', label = 'NBA', marker = '.', s = 450)
plt.scatter(Allen_season, Allen_freethrow_percent, color = 'r', label = 'Allen', marker = '^', s = 250)
plt.title('NBA and Ray Allen Freethrow Percentage Comparison per Season', size = 40)
plt.xlabel('Season', size = 30)
plt.ylabel('Freethrow Percentage', size = 30)
plt.grid()
plt.legend(loc = 0, prop = {'size':32})
plt.show()
The results are mixed for field goal percentage.
However, the 3-point and free throw percentages (especially free throw) clearly show that Ray Allen has a higher success rate than the NBA average.
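The scatter plots compare the two series by eye. The gap can also be computed per season by aligning both tables on Season. A minimal sketch with made-up percentages, where the frame and column names are hypothetical:

```python
import pandas as pd

# Made-up percentages for illustration only.
league = pd.DataFrame({"Season": ["1996-97", "1997-98", "1998-99"],
                       "FT%": [0.74, 0.74, 0.73]})
player = pd.DataFrame({"Season": ["1996-97", "1997-98", "1998-99"],
                       "FT%": [0.82, 0.88, 0.90]})

# Align rows by season, then subtract the league value from the player value.
both = player.merge(league, on="Season", suffixes=("_player", "_league"))
both["FT%_gap"] = both["FT%_player"] - both["FT%_league"]

# Count the seasons in which the player beat the league average.
seasons_above = int((both["FT%_gap"] > 0).sum())

print(both)
print(seasons_above)
```

Merging on Season avoids relying on both tables having their rows in the same order.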
As a matter of fact, at the time of writing Ray Allen is the all-time leader in 3-pointers made in NBA history.
Let me show the top 20 ranking of the NBA all-time 3-point leaders.
Read the data for the 20 NBA 3-point leaders.
NBA_three_leader = pd.read_excel('../data/NBA_3pt_leader.xlsx')
Split the data into the 3-point column and the Player column to build the axes of the plot
NBA_three_leader_point = NBA_three_leader['3P']
NBA_three_leader_name = NBA_three_leader['Player']
Pull out Ray Allen's row separately with the .loc[] method so that his bar can be given a different color
NBA_three_leader_Allen = NBA_three_leader.loc[0]
NBA_three_leader_Allen_three = NBA_three_leader_Allen['3P']
NBA_three_leader_Allen_name = NBA_three_leader_Allen['Player']
Sort the other 19 players in ascending order, so that the horizontal bars run from the highest total at the top
NBA_three_leader_else = NBA_three_leader[1:20]
NBA_three_leader_else_sort = NBA_three_leader_else.sort_values(by = '3P', ascending = True)
NBA_three_leader_else_sort_player = NBA_three_leader_else_sort['Player']
NBA_three_leader_else_sort_three = NBA_three_leader_else_sort['3P']
Draw the bar chart
The graph is built from two separate barh calls, since I wanted Ray Allen's bar to have a different thickness and color
plt.figure(figsize=(20,12))
plt.barh(NBA_three_leader_else_sort_player, NBA_three_leader_else_sort_three, height = 0.6)
plt.barh(NBA_three_leader_Allen_name, NBA_three_leader_Allen_three, height = 0.8, color = 'r', align= 'center')
plt.title('3point All Time NBA Leaders', size = 25)
plt.xlabel('3point Made', size = 20)
plt.ylabel('Player', size = 20, rotation = 60)
plt.show()
This bar plot of the all-time NBA 3-point leaders shows that Ray Allen has the most 3-pointers made,
which is an explicit indicator of why Ray Allen can be called the best shooter in NBA history
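As an alternative to two barh calls, one bar can be highlighted by passing a list of colors to a single call. This is a sketch with made-up totals and hypothetical player names, drawn off-screen with the Agg backend.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals; "Player A" plays the role of the highlighted leader.
leaders = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "3P": [2900, 2500, 2300, 2100],
})

# barh draws the first row at the bottom, so an ascending sort
# puts the largest total at the top of the chart.
ranked = leaders.sort_values(by="3P", ascending=True)
colors = ["r" if name == "Player A" else "C0" for name in ranked["Player"]]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(ranked["Player"], ranked["3P"], color=colors)
ax.set_xlabel("3point Made")
ax.set_ylabel("Player")
ax.set_title("Highlighting one bar (illustrative data)")
```

This variant gives every bar the same thickness. The two-call approach is still the simpler choice when the highlighted bar should also be thicker.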
Use variable.mean() to get the average of the data
print('10 Leader Three Point Average')
print(NBA_three_leader_point.mean())
print('\nAllen Three Point Average')
print(NBA_three_leader_Allen_three.mean())
print('\nDifference')
NBA_three_leader_Allen_three.mean() - NBA_three_leader_point.mean()
Allen's total is 961 above the leaders' average
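The size of this gap depends on whether the leader is counted in the average. A small worked example with made-up totals and hypothetical player names, showing both versions:

```python
import pandas as pd

# Made-up career totals, indexed by hypothetical player names.
totals = pd.Series([2900, 2500, 2300, 2100],
                   index=["Player A", "Player B", "Player C", "Player D"])

leader = totals["Player A"]

# Gap against the mean of the whole list (the leader is part of that mean).
gap_vs_all = leader - totals.mean()

# Gap against the mean of everyone else.
gap_vs_others = leader - totals.drop("Player A").mean()

print(gap_vs_all, gap_vs_others)
```

The calculation above corresponds to the first version, because NBA_three_leader_point includes Allen's own row.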
Being among the top ten SGs (Shooting Guards) in NBA history means being one of the best shooters in the NBA, and Ray Allen ranks 8th on that list
The main indicators for the best SG are the actual 3-point field goal and free throw percentages
Here I gathered Excel data with each top 10 SG's 3-point and free throw percentages over seasons 1 through 12
- Reference - www.basketball-reference.com
(George Gervin and Jerry West were left out since they were players without 3-point stats)
With these two datasets (3-point percentage and free throw percentage), I will draw a heatmap and an animation using the seaborn, numpy and matplotlib.pyplot packages
Heatmap
Read the two datasets using pandas
print('\n\n Top SG 3point percentage(12 seasons)\n')
Top_SG_3p = pd.read_excel('../data/Top_SG_3percent.xlsx')
print(Top_SG_3p)
print('\n\n Top SG freethrow percentage(12 seasons)\n')
Top_SG_free = pd.read_excel('../data/Top_SG_Freepercent.xlsx')
print(Top_SG_free)
Create the lists of axis labels, then build a 2D array with np.array([...]) that holds the heatmap values, one row per player
player = ['Jordan','Bryant','Wade', 'Iverson', 'Drexler', 'Mcgrady', 'Allen', 'Miller', 'Carter', 'Ginobili']
season = ['1st','2nd','3rd','4th','5th','6th','7th','8th', '9th', '10th', '11th','12th']
three_heatmap_data = np.array([
Top_SG_3p['Jordan'],
Top_SG_3p['Bryant'],
Top_SG_3p['Wade'],
Top_SG_3p['Iverson'],
Top_SG_3p['Drexler'],
Top_SG_3p['Mcgrady'],
Top_SG_3p['Allen'],
Top_SG_3p['Miller'],
Top_SG_3p['Carter'],
Top_SG_3p['Ginobili']
])
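The same player-by-season matrix can be built in one step by selecting the player columns and transposing. A sketch on a made-up table with hypothetical column names, checked against the column-by-column version:

```python
import numpy as np
import pandas as pd

# Made-up table: one row per season, one column per player.
demo = pd.DataFrame({
    "Season": ["1st", "2nd", "3rd"],
    "Player A": [0.17, 0.18, 0.13],
    "Player B": [0.39, 0.36, 0.42],
})
players = ["Player A", "Player B"]

# Selecting the player columns and transposing gives one row per player.
matrix = demo[players].to_numpy().T

# Same result as stacking the columns one by one with np.array([...]).
stacked = np.array([demo["Player A"], demo["Player B"]])

print(matrix.shape)
```

Driving the selection from the players list keeps the row order of the matrix and the y-axis labels in sync.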
Draw the heatmap using seaborn
ax.invert_yaxis() flips the y-axis, so the first player in the list is drawn at the bottom
fig, ax = plt.subplots(figsize = (13,10))
sns.heatmap(three_heatmap_data, annot =True, fmt = 'g', linewidths =1,
cmap = 'Oranges', vmin = 0.15, vmax = 0.5 )
ax.set_xticklabels(season)
ax.set_yticklabels(player, rotation = 360)
ax.invert_yaxis()
plt.ylabel('Player', size = 15)
plt.xlabel('Season', size = 15)
plt.title('3 Point make Percentage of NBA top 10 best Shooting Guards Heatmap', size = 20)
As you can see, the darker the cell, the higher the player's 3-point success rate
Jordan seemed to have a hot hand in his 10th season, showing a 50% 3-point success rate
But the players with consistently high percentages are Miller and Allen
This heatmap indicates that Ray Allen did not just make a lot of 3-pointers but also made them at a high success rate!
Use the data.describe() method to check the details. It is quite useful, since it automatically reports 8 summary statistics (count, mean, std, min, 25%, 50%, 75%, max)
Top_SG_3p.describe()
As you can see, Allen shows the second highest percentage in mean (average) and max (maximum) among the 10 greatest shooters
In the min (minimum) row, Allen has the highest percentage.
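Instead of reading the ranks off the table by eye, they can be computed from the describe() output. A minimal sketch with made-up percentages and hypothetical column names:

```python
import pandas as pd

# Made-up season percentages for three hypothetical players.
demo = pd.DataFrame({
    "A": [0.30, 0.40, 0.53],
    "B": [0.38, 0.40, 0.42],
    "C": [0.20, 0.30, 0.25],
})

summary = demo.describe()

# rank(ascending=False) gives rank 1 to the largest value in the row.
mean_rank = summary.loc["mean"].rank(ascending=False)
min_rank = summary.loc["min"].rank(ascending=False)

print(mean_rank)
print(min_rank)
```

Here player B is second by mean but first by minimum, which is the kind of pattern described above: a steady shooter ranks higher on the minimum than on the average.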
Animation
Import the patches, path and animation modules from the matplotlib package, then pull out each season's row with the variable.iloc[] method
import matplotlib.patches as patches
import matplotlib.path as path
import matplotlib.animation as animation
data = Top_SG_free.drop(columns=['Season'])
first_season = data.iloc[0]
second_season = data.iloc[1]
third_season = data.iloc[2]
fourth_season = data.iloc[3]
fifth_season = data.iloc[4]
sixth_season = data.iloc[5]
seventh_season = data.iloc[6]
eighth_season = data.iloc[7]
ninth_season = data.iloc[8]
tenth_season = data.iloc[9]
eleventh_season = data.iloc[10]
twelveth_season = data.iloc[11]
Stack the season rows into a 2D array using numpy.array
ani_data = np.array([
first_season,
second_season,
third_season,
fourth_season,
fifth_season,
sixth_season,
seventh_season,
eighth_season,
ninth_season,
tenth_season,
eleventh_season,
twelveth_season
])
print(ani_data)
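Since every row of the table is used, the twelve iloc lookups can be collapsed into a single to_numpy() call. A sketch on a made-up three-season table with hypothetical column names, checking that both routes give the same array:

```python
import numpy as np
import pandas as pd

# Made-up table: one row per season, one column per player.
demo = pd.DataFrame({
    "Season": ["1st", "2nd", "3rd"],
    "A": [0.80, 0.85, 0.90],
    "B": [0.70, 0.75, 0.78],
})
data = demo.drop(columns=["Season"])

# Row-by-row stacking, as in the text above.
row_by_row = np.array([data.iloc[0], data.iloc[1], data.iloc[2]])

# One call that returns the same seasons-by-players array.
all_at_once = data.to_numpy()

print(all_at_once.shape)
```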
Make the animation runnable inside Jupyter Notebook with the %matplotlib notebook magic command
%matplotlib notebook
Compute the histogram bins and set the outline of each bar in the animation frame
n, bins = np.histogram(ani_data, 12)
left = np.array(bins[:-1])
right = np.array(bins[1:])
bottom = np.zeros(len(left))
top = bottom + n
nrects = len(left)
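To see what np.histogram returns here, it helps to run it on a handful of numbers. This sketch uses made-up free throw percentages written as whole percents. It also shows that passing the bin edges back in counts a subset on the same bins.

```python
import numpy as np

# Made-up free throw percentages, written as whole percents.
all_values = np.array([70, 74, 78, 82, 86, 90])

# 4 equal-width bins over the full range: edges 70, 75, 80, 85, 90.
n, bins = np.histogram(all_values, 4)
left, right = bins[:-1], bins[1:]

# Passing the edges back in counts a subset on the same bins,
# so bar positions stay comparable from frame to frame.
one_season = np.array([72, 88, 89])
n_season, _ = np.histogram(one_season, bins)

print(n, bins)
print(n_season)
```

n holds the count in each bin, and bins always has one more entry than n, which is why left and right are taken as bins[:-1] and bins[1:].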
Set up the vertex and path code arrays using path.Path.MOVETO, path.Path.LINETO and path.Path.CLOSEPOLY for each rect.
We need 1 MOVETO per rectangle, which sets the initial point. We need 3 LINETOs, which tell Matplotlib to draw lines from vertex 1 to vertex 2, v2 to v3, and v3 to v4. We then need one CLOSEPOLY, which tells Matplotlib to draw a line from v4 to our initial vertex (the MOVETO vertex), in order to close the polygon.
nverts = nrects *(1+3+1)
verts = np.zeros((nverts, 2))
codes = np.ones(nverts, int) * path.Path.LINETO
codes[0::5] = path.Path.MOVETO
codes[4::5] = path.Path.CLOSEPOLY
verts[0::5, 0] = left
verts[0::5, 1] = bottom
verts[1::5, 0] = left
verts[1::5, 1] = top
verts[2::5, 0] = right
verts[2::5, 1] = top
verts[3::5, 0] = right
verts[3::5, 1] = bottom
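The strided assignments are easier to follow on a single rectangle. Below is a sketch of the five vertices and codes for one bar, with made-up coordinates (left 0, right 1, bottom 0, top 2):

```python
import numpy as np
import matplotlib.path as path

# One rectangle: left=0, right=1, bottom=0, top=2.
verts = np.array([
    [0.0, 0.0],  # MOVETO: start at bottom-left
    [0.0, 2.0],  # LINETO: up to top-left
    [1.0, 2.0],  # LINETO: across to top-right
    [1.0, 0.0],  # LINETO: down to bottom-right
    [0.0, 0.0],  # CLOSEPOLY: this vertex is ignored, the path returns to the start
])

# Start with LINETO everywhere, then overwrite the first and last codes.
codes = np.full(len(verts), path.Path.LINETO, dtype=path.Path.code_type)
codes[0] = path.Path.MOVETO
codes[4] = path.Path.CLOSEPOLY

rect = path.Path(verts, codes)
print(rect.codes)
```

With several bars the same five-entry pattern repeats, which is why the arrays above are filled with the slices 0::5 through 4::5.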
Make an animate function, which generates and updates the locations of the vertices for the histogram. patch will eventually be a Patch object.
patch = None
def animate(i):
    # Frame i is season i: count that row of ani_data on the fixed bin edges
    n, _ = np.histogram(ani_data[i], bins)
    top = bottom + n
    verts[1::5, 1] = top
    verts[2::5, 1] = top
    return [patch, ]
Make the animation plot
Run the animation; interval is the delay per frame in milliseconds (1000 = 1 second)
fig, ax = plt.subplots()
barpath = path.Path(verts, codes)
patch = patches.PathPatch(
    barpath, facecolor = 'red', edgecolor = 'black', alpha = 0.5)
ax.add_patch(patch)
ax.set_xlim(left[0], right[-1])
ax.set_ylim(bottom.min(), 7)
# One frame per season, so the animation stops after the last row
ani = animation.FuncAnimation(fig, animate, frames = len(ani_data), repeat = False, blit = False, interval = 1000)
plt.title('Top10 Shooting Guard Freethrow Proportion per Season', size = 15)
plt.ylabel('Rate Proportion', size = 15)
plt.xlabel('Each Season(per slide)', size = 15)
plt.show()
Now, check Ray Allen's average free throw percentage for each season
Allen_freethrow = Allen_stat['FT%']
print(Allen_freethrow)
We can see that his FT% mostly falls in the 0.85 ~ 0.90 range of the animation chart above, and in some seasons even goes over 0.90
Let's look at the actual overall averages for each player and for Ray Allen, and lastly check the gap between each player's percentage and Allen's
print('10 Top SG Freethrow Average\n')
print(Top_SG_free.drop(columns=['Season']).mean())
print('\n\nRay Allen Freethrow Average\n')
print(Allen_freethrow.mean())
print('\n\nNBA Freethrow Average\n')
print(NBA_stat['FT%'].mean())
print('\n\nAllen / NBA overall freethrow Difference\n')
gap = Allen_freethrow.mean() - NBA_stat['FT%'].mean()
print(gap)
print('\n\nAllen / Top 10 SG freethrow Difference\n')
Allen_freethrow.mean() - Top_SG_free.drop(columns=['Season']).mean()
Allen has the highest free throw success rate (0.895) among the top 10 shooting guards. His rate is also more than 14 percentage points higher than the NBA overall average.
Conclusion
Ray Allen shows clearly high rates on the standards used to judge a great shooting guard.
- Shows up to an 11% higher 3-point field goal success rate
- Ranks 1st in 3-point field goals made in NBA history, with almost 3000
- Keeps a high and steady 3-point percentage, averaging 0.401, even when compared with the other top 10 shooting guards in the NBA
- Shows an extremely high free throw percentage (0.894) compared to the NBA, and a clearly high rate in the top 10 SG comparison as well
- More than 14 percentage points higher free throw success rate than the NBA overall average
The hypothesis
'Ray Allen is the best shooter in NBA history'
is proved, as shown through the data analysis and visualizations above.