Continued model magic
In my previous post (which seems like ages ago), I was working on a prediction model for SP fWAR. I had read somewhere that when you create a model, the tinkering never ends. And… well… it surely hasn't ended for me. Not only did I continue to tinker with it, I also included the outliers and relievers in my next set of data. I was really curious what would happen if I did. Here are some visuals of my results:
Blue dots all over the place. Not terrible, though, I don't think. Which led me to my biggest tinker: applying regression to my data.
I used the simple linear equation y = a + bx to apply my regression and created a separate column with that data, so I could compare the two and see what difference it would make, and how much. Before I get too far into my comparison, here are the same graphs as above, but with the regressed data:
Not too different, but there are some differences in there.
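For reference, here's a minimal sketch of how that regressed column could be built. The dataframe and column names (df, predicted_fWAR, actual_fWAR, regressed_fWAR) are stand-ins for my actual data, not the real names:

```python
import numpy as np

# Fit y = a + b*x, with the original predictions as x and actual fWAR as y.
# np.polyfit returns coefficients highest-degree first, so b comes before a.
b, a = np.polyfit(df["predicted_fWAR"], df["actual_fWAR"], deg=1)

# Keep the regressed predictions in their own column so the two versions
# can be compared side by side.
df["regressed_fWAR"] = a + b * df["predicted_fWAR"]
```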
I wanted to compare it all side by side to see which one was more accurate. I did so two different ways: a histplot for all three right next to each other, and all three in one ecdfplot. Let's take a look at those:
Looking at the histplot, it looks like the original predictions lined up more closely than the regressed data. However, the ecdfplot shows the regressed data with a slight edge. Being completely honest here: I was just happy to see that my prediction data was distributed similarly to the actual data, even if it wasn't correct.
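In case it helps, here's roughly what those plots look like in seaborn code. Again, df and the column names are hypothetical stand-ins:

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["actual_fWAR", "predicted_fWAR", "regressed_fWAR"]

# Three histplots right next to each other, one per distribution.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)
for ax, col in zip(axes, cols):
    sns.histplot(df[col], ax=ax)
    ax.set_title(col)

# All three distributions together in one ecdfplot.
plt.figure(figsize=(8, 5))
for col in cols:
    sns.ecdfplot(df[col], label=col)
plt.legend()
plt.show()
```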
So for my final test, I was curious how the differences lined up. I used 0.25, 0.5, 1, 1.5, 2, and 3 fWAR differences. In this case I didn't really care whether I was over or under on each one, just how far off I was, so I used the absolute value of the difference for this exercise:
Standard predicted data:
Within +/- 0.25 WAR: 20.0100%
Within +/- 0.5 WAR: 41.1163%
Within +/- 1.0 WAR: 65.9267%
Within +/- 1.5 WAR: 79.9032%
Within +/- 2.0 WAR: 88.2571%
Within +/- 3.0 WAR: 95.8208%
Regressed predicted data:
Within +/- 0.25 WAR: 22.9332%
Within +/- 0.5 WAR: 46.2227%
Within +/- 1.0 WAR: 71.5127%
Within +/- 1.5 WAR: 84.5026%
Within +/- 2.0 WAR: 91.2168%
Within +/- 3.0 WAR: 97.1453%
In this case, there is a clear winner with the regressed data. Not that the standard predictions did badly at all! But the regressed data gained roughly 1 to 6 percentage points at each level.
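For the curious, here's a minimal sketch of how those within-threshold percentages can be computed. The function and column names are hypothetical, not my actual code:

```python
def within_pct(df, pred_col, actual_col,
               thresholds=(0.25, 0.5, 1.0, 1.5, 2.0, 3.0)):
    """Print the share of rows whose absolute error is within each threshold."""
    abs_diff = (df[pred_col] - df[actual_col]).abs()
    for t in thresholds:
        pct = (abs_diff <= t).mean() * 100
        print(f"Within +/- {t} WAR: {pct:.4f}%")

within_pct(df, "predicted_fWAR", "actual_fWAR")   # standard predictions
within_pct(df, "regressed_fWAR", "actual_fWAR")   # regressed predictions
```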
Finally, I wanted to see how my data fared in the OLS model, and…:
My R-squared number improved! Going from 0.419 to 0.448 seems like a fairly significant jump… though having nearly 3x more observations probably helped with that.
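If you want to pull the same kind of summary, something like this works; I'm assuming statsmodels-style OLS here, and df and the column names are stand-ins as before:

```python
import statsmodels.api as sm

# Regress actual fWAR on the regressed predictions and print the summary,
# which includes the R-squared figure quoted above.
X = sm.add_constant(df["regressed_fWAR"])
model = sm.OLS(df["actual_fWAR"], X).fit()
print(model.summary())
print(model.rsquared)  # just the R-squared on its own
```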
Still, I think I'm getting at least a decent first model built, and one that I have learned so, so much from.