wOBA modification

jerrymckennan
7 min readSep 10, 2021

--

So anyone who enjoys analytics, that I know, use wOBA to describe offensive stats. It’s a great stat that takes the Run Expectancy of each type of hit and walks, multiples it by the number of times a player does that, and then divides it by the number of at-bats. Then, it multiplies it by a scaled number so that league wOBA equals league on-base percentage. I truly enjoy it and, to play off of my previous post, it plays directly into the Batting Runs section of fWAR.

One day, while surfing baseball savant, I wondered what it would look like if instead of treating all hits as the same, what would happen if you separated hard hit balls from the rest? This would take some work.

With this particular project, I made many mistakes and finished with many takeaways. In all, I think it really helped in both expanding my database querying abilities and to understand this particular stat a bit more.

Coming into the start of this project I knew a few guaranteed needs to be able to get my version of wOBA:

  1. I need a run matrix.
  2. I need all plays in a given year.
  3. I need to get a list of all hard hit balls (95+ MPH exit velocity).

For me, the hardest part of this was going to get my run matrix created. I could have probably taken one from somewhere else, but to get the full experience I really wanted to create my own.

Using Retrosheets and baseball savant, I was able to get my data for points two and three. Or at least my start. They both had great starts even. I found that as I went on, I would need to modify the given data to match up with the different tables. E.g. instead of a player ID for the “who is on 1B” field, I updated it with a “1” to signify someone, doesn’t matter who, was on there. Same deal with other fields. I also needed to know how many runs were scored on a particular play — but I knew each base runners destination. So if the destination = 4, runs += 1. Very simple yet logical updates.

Once I got all of those updates finished, I began to develop my run matrix. I created a standard one that seemed to be good. It had what seemed like a good progression and fit what I saw of other models. I got so excited that my next move was my first mistake on this journey… I decided to create a run matrix using only hard hit balls. I initially thought I was genius and that it was going to be amazing. And when my results came back, it was as glorious as I had imagined it would be! But one teeny, tiny problem — a run matrix is supposed to tell you how many runs you’re expected to score from a given situation until the end of the inning. My “hard hit run matrix” show me only how many runs were scored during those specific at bats. Basically rendering it useless and a lot of time wasted. After my epiphany, I decided to scratch that idea and move forward.

Next up was creating the run expectancy tables. I created one for the start and another one for the end. These weren’t too bad to create on their own. It was when I tried to combine them together to get my RE24 that it became a bit more of an issue.

I thought I had it all figured out though. Link up the game ID, player name, inning and it should be all set. Still, my newly formed RE24 table wasn’t correct. That’s when I noticed that sometimes a player would bat twice in the same inning of a game. Therefore, each starting RE would match to each ending RE. Leading me to also add in outs for both the starting and ending RE to match as well as linking the end results together (single = single). This seemed to clear up my issues. Testing it out, I was able to get some of my constants for my new fangled wOBA. Plug them in, get my league average, get my scale to have it match OBP (I’m using only 2020 stats at this time, so it was needing to match .322 — it was at .211 giving me a scale of 1.527 to work with)

Next I wanted to create a table that would allow for me to do the same, but for hard hit balls. I thought I had a really good process down for this. I had spent a number of hours creating my previous one, worked out a number of errors, seemed to be on a really good path.

That was not at all the case here.

I knew one issues for sure would come long. Retrosheets has no game dates, only game IDs. If you’re not familiar with it, it has a format like this: “AAAYYYYMMDDB” As you might expect, the “YYYYMMDD” is the date. The “AAA” represents the home team of the game, the B represents whether it’s a single game on the day, game one of a double header or game two of a double header. With a little bit of time (and some INSERT/RegEx/SUBSTRING usage), I was able to extract a date from this. Well. When I run this over 16,500 records, I wanted to try to simplify that. So I created a separate table that included both the game ID as well as the date for easy linking.

My thoughts were now this: use my newly found RE table, link it to my newly found date table, and then link thatup to my hard hit table, and voilà! New table completed! But no, that wasn’t even close. After my first attempt returning tens of thousands more results that I had hoped, I needed to dig into these data sets further. It took a lot of trial and error. I needed to link together names, outs, innings, dates/game ids, starting REs, and still it wasn’t working out for me…

Then there was a column that gave me a light bulb. In my table for hard hit balls, I had decided to leave in the number of pitches seen in an at-bat. In Retrosheets, they have a pitch sequence. I could link the two of them together! pitches_seen = CHAR_LENGTH(pitch_seq)! Run my new query, hoping this fixes the issues…. and not quite. But closer! So that must be a key.

Running through the text for Retrosheets, trying to figure out what could be used I noticed a little chart for additional characters in the pitch sequence field for the likes of things like pickoff attempts. Well… those aren’t technically pitches. So if I remove them, it should link up… right? I tested that theory out by removing one of the characters. Bingo, closer yet to my results. A few more removed, test again. Even closer. Remove the rest and, you’d guess it, closer yet.

But not quite there. Have I mentioned that I am really in love with temp tables at this point?

The data still wasn’t fully synced up though. I mean, I’m talking 15–20 of the 16,521 records were missing. So I did an export of the data I have in my hard hit table and again of my hard hit RE table that shows player name, date, and count of times for each on. This was I could compare the two and see where it would be different and query my tables to see what could be missing. These all missing data points had one thing in common — they were all in extra innings. What in the world could be happening there? And then it dawns on me what it is..

Retrosheets wasn’t accounting for the new extra innings rule of a player starting on 2B. And because of that, the data couldn’t link up. Jeez. OK. That’s not a big deal. Since it’s so few ABs, I’ll just ignore those and drop them. I can’t imagine it would cause too many issues.

So here I am. One final step. Get a third table for only non-hard hit balls. This one, I am happy to admit, was as easy as I had hoped. SELECT * FROM initial re WHERE NOT EXISTS (SELECT * FROM hard hit re WHERE a bunch of stuff matches). That easy.

And… drum roll please.. here’s my results!

              Hard hit         Soft hit           All 
Entities 16841 16841 50125 50125 66966 66966
wOBA outs -0.196 0.000 -0.205 0.000 -0.203 0.000
wOBA BB 0.247 0.451 0.247 0.451 0.247 0.451
wOBA HBP 0.256 0.459 0.256 0.459 0.256 0.459
wOBA 1B 0.342 0.538 0.261 0.466 0.281 0.484
wOBA 2B 0.759 0.955 0.725 0.930 0.748 0.951
wOBA 3B 1.242 1.438 1.253 1.458 1.245 1.448
wOBA HR 1.427 1.623 1.405 1.610 1.427 1.630

So to break this down. Each header has two column underneath them. The left column for each one shows the actual value (except for BB and HBP which are only the values to All since the others are for balls in play only) and he right shows the values when you make outs = 0 runs. The columns on the right are the actual figures that I would use to calculate wOBA. So my equations separately would look like this:

wOBA = (BB*.451+HBP*.459+1B*.484+2b*.951+3B*1.448+HR*1.630)/(AB+BB-IBB+SF+HBP)hhwOBA = (BB*.451+HBP*.459+1B*.538+2b*.955+3B*1.438+HR*1.623)/(AB+BB-IBB+SF+HBP)shwOBA = (BB*.451+HBP*.459+1B*.466+2b*.930+3B*1.458+HR*1.610)/(AB+BB-IBB+SF+HBP)

And for a wOBA that separates the two:

wOBA = (BB*.451+HBP*.459)+(1B*.466+2b*.930+3B*1.458+HR*1.610)+(1B*.538+2b*.955+3B*1.438+HR*1.623)/(AB+BB-IBB+SF+HBP)

A couple of take aways here — the values don’t change all that much except for singles (.538 for hard hits, .466 for all others). RE24 value for hard hit singles made up 30% of the value despite being 25% of the singles hit. I was also slightly surprised to see that softly hit triples were more valuable in the wOBA equation than hard hit, though a chunk of that is like due to outs giving a bigger boost.

Now that I have this data, I can actually apply it to players. For fun, and to keep with my theme, I applied it to the big 4 shortstop free agents as well as Marcus Semien. I compared their new wOBA totals to that of what FanGraphs has listed for each of them and I included the percent change for each:

Player         New wOBA   FG wOBA   % difference
Javier Baez 0.269 0.256 5.078%
Trevor Story 0.393 0.370 6.216%
Carlos Correa 0.305 0.308 -0.974%
Corey Seager 0.423 0.394 7.360%
Marcus Semien 0.307 0.299 2.676%

This might have been the biggest surprise to me — I was not anticipating anyone being lower than what FanGraphs has. Yet, somehow, Carlos Correa had done it.

This was a fun experiment. I’m thinking I might apply these numbers to the WAR tables I previously had posted to see what changes would happen there. And, maybe/hopefully/possibly, update this and make it better.

--

--

jerrymckennan
jerrymckennan

Written by jerrymckennan

Learning and writing about the journey

No responses yet