More OAA stats
Since I wrote my last piece, I have kept playing with the data. I wanted to be able to compare players to each other on both total OAA and OAA per play. I think total OAA is a great statistic, but what if a player has a lower total only because they haven’t had as many reps? Does that still make player A better than player B?
And, if I am being honest, I really wanted to see if my favorite Tiger, and fan enemy of the last few months, Niko Goodrum, was truly as good at SS as I had thought and hoped. Or was this just me wearing rose-colored glasses and wanting him to be here?
This one was interesting for a few different reasons, and it turned up at least one issue that probably needs fixing in my previous script as well.
Up first: generating the list of players I want to compare. I used a very simple counter and threw it in a while loop so I could specify how many players I wanted to compare at any given time. In that while loop, I could enter the name of each player. The question here was how I would then pull the stats. Looking at my previous script, I was able to see how I would do that. Here’s the link:
https://baseballsavant.mlb.com/visuals/oaa-data?type=Fielder&playerId=592348&startYear=2015&endYear=2021
So the part I need to update is the “playerId=” value. It should be easy enough in Python to put a variable in its place. But how do I get that ID easily? I know that MLB.com has pages dedicated to players, so I started looking there first. And, as luck would have it, there is a link that returns a player’s first name, last name, full name, and a name/ID slug that I can use:
https://statsapi.mlb.com/api/v1/sports/1/players?fields=people,fullName,firstName,lastName,nameSlug
Run through Python with a little magic, that can return something like this:
           fullName  firstName  lastName              nameSlug  playerID
0       Cory Abbott       Cory    Abbott    cory-abbott-676265    676265
1      Albert Abreu     Albert     Abreu   albert-abreu-656061    656061
...             ...        ...       ...                   ...       ...
1237  Brett de Geus      Brett   de Geus  brett-de-geus-676969    676969
1238   Jacob deGrom      Jacob    deGrom   jacob-degrom-594798    594798
[1239 rows x 5 columns]
Perfect! Now I can look up a name and get a playerID. I played around with the above link a little more and realized that I could get a lot more than just current MLB players: I could also pull any of the minor leagues, some international leagues, and even collegiate players. This would help me out significantly later on, as I’ll explain.
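To make the name-to-ID-to-link chain concrete, here is a minimal sketch using two rows copied from the sample output above. The helper names `slug_to_id` and `oaa_url` are made up for illustration and are not part of my script, and the sketch assumes the ID is always the run of digits at the end of nameSlug.

```python
import re

import pandas as pd

# Two rows copied from the sample output above.
players = pd.DataFrame({
    "fullName": ["Cory Abbott", "Jacob deGrom"],
    "nameSlug": ["cory-abbott-676265", "jacob-degrom-594798"],
})


def slug_to_id(slug: str) -> str:
    """Return the trailing digits of a nameSlug (hypothetical helper)."""
    match = re.search(r"(\d+)$", slug)
    if match is None:
        raise ValueError(f"no numeric ID at the end of {slug!r}")
    return match.group(1)


def oaa_url(player_id: str, start_year: int = 2015, end_year: int = 2021) -> str:
    """Build the Savant OAA link for one fielder (hypothetical helper)."""
    return (
        "https://baseballsavant.mlb.com/visuals/oaa-data"
        f"?type=Fielder&playerId={player_id}"
        f"&startYear={start_year}&endYear={end_year}"
    )


players["playerID"] = players["nameSlug"].map(slug_to_id)

# Look up one player by full name, then build the link for that ID.
degrom_id = players.loc[players["fullName"] == "Jacob deGrom", "playerID"].item()
print(oaa_url(degrom_id))
```

Keeping the ID as a string means it can be dropped straight into the link without any conversion.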
When I started testing my script, I realized that the data I was pulling didn’t match what I was expecting to see. Using Niko as the test case, I knew he had +10 OAA at SS, but my results were showing +16 OAA. This had me scratching my head for a little bit, but then I realized what I was doing wrong… I wasn’t filtering my data down to only OAA at SS. D’oh. Applied my filter and re-ran, +10 = +10 now. I’m in business.
For my list, I primarily wanted to compare Niko to those ahead of him. Since he is ranked 19th, that gave me this list of names to run against:
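The mismatch is easy to reproduce on a toy frame. The numbers below are invented purely for illustration; only the column names `target_id` and `outs_above_average` and the use of 6 as the shortstop code come from my script.

```python
import pandas as pd

# Invented play-level rows: two at shortstop (6), one each at two other positions.
plays = pd.DataFrame({
    "target_id": [6, 6, 4, 8],
    "outs_above_average": [0.5, -0.25, 0.75, 0.25],
})

# Summing everything mixes positions together.
all_positions_total = plays["outs_above_average"].sum()

# Filtering first keeps only the shortstop plays.
ss_plays = plays[plays["target_id"] == 6]
ss_total = ss_plays["outs_above_average"].sum()

print(all_positions_total, ss_total)
```

The same filter also changes the play count, so the per-play average shifts along with the total.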
Nick Ahmed
Andrelton Simmons
Francisco Lindor
Addison Russell
Javier Baez
Brandon Crawford
Freddy Galvis
Trevor Story
Jose Iglesias
Orlando Arcia
Adalberto Mondesi
Carlos Correa
JT Riddle
Jose Peraza
Adeiny Hechavarria
Wilmer Difo
Trea Turner
Jordy Mercer
Niko Goodrum
Things went very smoothly until I got to Addison Russell, then an error. Maybe I mistyped the name. Ran it again, same result. That led me to go to the JSON page with the players’ names and search for him. Hmm, he doesn’t exist there. Fortunately I quickly remembered that I had access to other leagues’ lists of players as well. So I checked out the AAA JSON, and there he is! This one wouldn’t be too bad: just create a variable for each league and then combine them into one. Easy peasy. I got that running and figured I had better test it again. Ran it on Niko again… failed. What in the world? This one took me a bit longer to figure out. Come to find out, he’s in there multiple times, all with the exact same data, and my script doesn’t really like that. I don’t like that. I only need one entry. Some quick research and I found this nice, efficient option for pandas: drop_duplicates(). Let’s try again and… success!
Time to get my big list going once more. The first three work, and when I try Addison Russell again, it works. I get almost to the end, reach Adeiny Hechavarria, and it fails. Welp. Same error as with Addison Russell. Time to search.
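Here is a small sketch of why the duplicate rows broke the lookup and how drop_duplicates() fixes it. The two league frames are stand-ins built from the sample rows earlier in the post; putting Cory Abbott in both lists is an assumption made only to show the effect.

```python
import pandas as pd

# Stand-in league lists; the same player appears in both.
mlb_players = pd.DataFrame({
    "fullName": ["Cory Abbott", "Albert Abreu"],
    "nameSlug": ["cory-abbott-676265", "albert-abreu-656061"],
})
aaa_players = pd.DataFrame({
    "fullName": ["Cory Abbott"],
    "nameSlug": ["cory-abbott-676265"],
})

combined = pd.concat([mlb_players, aaa_players], ignore_index=True)
matches = combined[combined["fullName"] == "Cory Abbott"]

# .item() only works when exactly one row matches, so two identical rows fail.
try:
    matches["nameSlug"].item()
    lookup_failed = False
except ValueError as err:
    lookup_failed = True
    print("lookup failed:", err)

# Dropping the identical rows leaves one entry per player.
deduped = combined.drop_duplicates().reset_index(drop=True)
slug = deduped.loc[deduped["fullName"] == "Cory Abbott", "nameSlug"].item()
print(slug)
```

drop_duplicates() compares whole rows by default, which is exactly what is needed when the repeated entries are identical.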
And search I did. This time was at least easy, I could check out my data locally rather than digging through a bunch of separate JSON. He doesn’t exist. In any of these. Come to find out, he’s over in Japan playing this year. Well then.
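One way to make a missing name easier to spot, sketched here rather than taken from my script, is a lookup that returns None instead of raising. The helper name `find_player_id` is hypothetical, and the small frame reuses IDs from the sample output earlier in the post.

```python
from typing import Optional

import pandas as pd

# A tiny stand-in for the combined player list.
players = pd.DataFrame({
    "fullName": ["Cory Abbott", "Jacob deGrom"],
    "playerID": ["676265", "594798"],
})


def find_player_id(players: pd.DataFrame, full_name: str) -> Optional[str]:
    """Return the playerID for full_name, or None if nobody matches (hypothetical helper)."""
    matches = players.loc[players["fullName"] == full_name, "playerID"].drop_duplicates()
    if matches.empty:
        return None
    # If two different players share a name, this simply takes the first one.
    return matches.iloc[0]


for name in ["Jacob deGrom", "Adeiny Hechavarria"]:
    player_id = find_player_id(players, name)
    if player_id is None:
        print(f"{name}: not found in any of the loaded lists")
    else:
        print(f"{name}: {player_id}")
```

With something like this, the loop could skip a missing player and keep going instead of stopping on the first error.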
At this point I really wanted to get some data, so this sort of troubleshooting can wait until later. I altered my list to remove Hechavarria and got the following:
Nick Ahmed
Andrelton Simmons
Francisco Lindor
Addison Russell
Javier Baez
Brandon Crawford
Freddy Galvis
Trevor Story
Jose Iglesias
Orlando Arcia
Adalberto Mondesi
Carlos Correa
JT Riddle
Jose Peraza
Wilmer Difo
Trea Turner
Jordy Mercer
Niko Goodrum
And I get through that list with no errors. Score! Along the way, I created two different arrays, one for average OAA per play and one for total OAA, each sorted so the highest value is at the top. From there, I concatenated the two so I could see both lists side by side. And here are the results!
player_name        num_plays  avg_oaa  player_name         oaa_sum
Nick Ahmed            2020.0  0.04596  Nick Ahmed         92.83724
Addison Russell       1381.0  0.03988  Andrelton Simmons  77.68226
Andrelton Simmons     2236.0  0.03474  Francisco Lindor   74.18007
Francisco Lindor      2541.0  0.02919  Addison Russell    55.07214
Wilmer Difo            374.0  0.02813  Javier Baez        37.34850
Javier Baez           1470.0  0.02541  Brandon Crawford   29.98972
Niko Goodrum           469.0  0.02227  Freddy Galvis      28.54004
JT Riddle              574.0  0.02148  Trevor Story       27.16912
Adalberto Mondesi      977.0  0.01524  Jose Iglesias      20.10135
Jose Peraza            958.0  0.01239  Orlando Arcia      19.32260
Freddy Galvis         2529.0  0.01129  Carlos Correa      14.99469
Trevor Story          2423.0  0.01121  Adalberto Mondesi  14.89005
Brandon Crawford      2714.0  0.01105  JT Riddle          12.32999
Orlando Arcia         1885.0  0.01025  Jose Peraza        11.87025
Jose Iglesias         2240.0  0.00897  Wilmer Difo        10.51963
Carlos Correa         1914.0  0.00783  Niko Goodrum       10.44344
Jordy Mercer          1732.0  0.00583  Jordy Mercer       10.10170
Trea Turner           1842.0  0.00525  Trea Turner         9.66452
That’s pretty cool. And it’s interesting as well. I’m honestly loving what I see from Nick Ahmed. The guy is a monster all the way around. Also, three of the bottom five in total OAA make pretty significant jumps when ranked by average per play and, for that matter, so do all the players with fewer than 1,000 plays credited to them. I think I might work on a graph here as well that will display these data points, maybe even attempt my first 3D graph.
Now comes the next piece of research on this topic: when is a sample size too small for the per-play number to mean anything? Does it matter? You can reasonably say a batter can get lucky in 100 PA, but can you say the same of a defender in 100 plays?
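For anyone curious how the two rankings end up next to each other, here is a minimal sketch using four rows taken from the results table. Each ranking is sorted on its own and its index is reset before the column-wise concat; without the reset, pandas would line the rows up by their old index rather than by rank.

```python
import pandas as pd

# Four players copied from the results table.
stats = pd.DataFrame({
    "player_name": ["Nick Ahmed", "Andrelton Simmons", "Addison Russell", "Wilmer Difo"],
    "num_plays": [2020, 2236, 1381, 374],
    "avg_oaa": [0.04596, 0.03474, 0.03988, 0.02813],
    "oaa_sum": [92.83724, 77.68226, 55.07214, 10.51963],
})

# Rank by average OAA per play.
by_avg = (
    stats[["player_name", "num_plays", "avg_oaa"]]
    .sort_values("avg_oaa", ascending=False)
    .reset_index(drop=True)
)

# Rank by total OAA.
by_total = (
    stats[["player_name", "oaa_sum"]]
    .sort_values("oaa_sum", ascending=False)
    .reset_index(drop=True)
)

# axis=1 places the two rankings side by side, aligned on the fresh 0..n-1 index.
side_by_side = pd.concat([by_avg, by_total], axis=1)
print(side_by_side)
```

In this small sample, Russell and Simmons swap places between the two rankings, just as they do in the full table.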
Until then, thank you for reading once more. And here’s some more code!
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.display.max_rows = 9999
pd.options.display.max_columns = 9999

# Player lists from the MLB Stats API. Only the sport ID in the URL changes per league.
player_url = "https://statsapi.mlb.com/api/v1/sports/{sport_id}/players?fields=people,fullName,firstName,lastName,nameSlug"
sport_ids = {
    'mlb': 1,
    'aaa': 11,
    'aa': 12,
    'high_a': 13,
    'low_a': 14,
    'rookie': 16,
    'independent_league': 23,
    'international': 51,
}

# Read every league and stack the per-league frames into one.
url_data = pd.concat(
    [pd.read_json(player_url.format(sport_id=sport_id)) for sport_id in sport_ids.values()],
    ignore_index=True,
)

# One row per player; a player can show up in more than one league list.
normalized_data = pd.json_normalize(url_data['people'].tolist())
normalized_data = normalized_data.drop_duplicates().reset_index(drop=True)

# Strip the lowercase letters and hyphens from nameSlug, leaving only the numeric ID.
normalized_data['playerID'] = normalized_data['nameSlug'].apply(lambda x: re.sub(r'[a-z-]', '', str(x)))

num_players = int(input("Please enter how many players you want to compare: "))

players = []
counter = 1
while counter <= num_players:
    player_name = input("Please choose a player: ")
    # .item() raises ValueError unless exactly one row matches the name.
    player_id = normalized_data[normalized_data['fullName'] == player_name]['playerID'].item()
    players.append({'player_id': player_id, 'player_name': player_name})
    counter += 1
player_array = pd.DataFrame(players, columns=['player_id', 'player_name'])

avg_rows = []
total_rows = []
for x, y in zip(player_array['player_id'], player_array['player_name']):
    data = pd.read_json('https://baseballsavant.mlb.com/visuals/oaa-data?type=Fielder&playerId=' + x + '&startYear=2015&endYear=2021')
    # Keep only shortstop plays (target_id 6); otherwise OAA from every position is summed.
    data = data[data['target_id'] == 6]
    total_oaa = data['outs_above_average'].sum().round(5)
    total_mean = data['outs_above_average'].mean().round(5)
    total_count = len(data)
    total_rows.append({'player_name': y, 'oaa_sum': total_oaa})
    avg_rows.append({'player_name': y, 'num_plays': total_count, 'avg_oaa': total_mean})

# Sort each ranking on its own, then place them side by side.
rankings = pd.DataFrame(avg_rows, columns=['player_name', 'num_plays', 'avg_oaa']).sort_values('avg_oaa', ascending=False).reset_index(drop=True)
total_rankings = pd.DataFrame(total_rows, columns=['player_name', 'oaa_sum']).sort_values('oaa_sum', ascending=False).reset_index(drop=True)
overall_rankings = pd.concat([rankings, total_rankings], axis=1)
print(" ")
print(overall_rankings)