Problem data

jerrymckennan
2 min read · Jan 20, 2022

I wanted to follow up my last post by trying to figure out where my problem areas were. Going into this, I had two suspects: season and age.

Part of what made me wonder about season was the “value_counts()” function available in pandas. I used the “normalize=True” option and took only the top 10 results. Using cutoffs of +1.5 WAR, -1.5 WAR, and an absolute 1.5 WAR, I found that seasons from the 2000s consistently came back as the highest percentage of problem children. But I also know that the data is limited at some ages:

index Age Count
24 19 6
21 20 50
18 21 135
14 22 398
10 23 830
7 24 1385
4 25 1858
2 26 2253
0 27 2367
1 28 2286
3 29 2032
5 30 1750
6 31 1504
8 32 1234
9 33 1017
11 34 794
12 35 617
13 36 458
15 37 334
16 38 225
17 39 150
19 40 91
20 41 61
22 42 39
23 43 14
25 44 3
26 45 2
27 46 1
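As a rough sketch of how counts like these can be produced, the snippet below runs “value_counts()” twice: once with “normalize=True” and “head(10)” to get the share of big misses per season, and once without it to count rows per age. The DataFrame and its column names (“season”, “age”, “diff”) are made up for illustration and are not the actual data or schema behind this post; only the absolute 1.5 WAR cutoff is shown, and the +1.5 and -1.5 cases would be the same filter without “abs()”.

```python
import pandas as pd

# Hypothetical stand-in data; column names are assumptions, not the real schema.
results = pd.DataFrame({
    "season": [2001, 2001, 2004, 1985, 1985, 1993, 2004, 2001],
    "age":    [27, 24, 31, 22, 41, 27, 27, 24],
    "diff":   [1.8, -0.2, -2.1, 0.4, 1.6, -1.7, 0.1, 2.3],
})

threshold = 1.5  # WAR cutoff mentioned in the post

# Keep rows that miss by at least the cutoff in either direction.
big_miss = results[results["diff"].abs() >= threshold]

# Share of those misses per season, largest first, top 10 only.
season_share = big_miss["season"].value_counts(normalize=True).head(10)

# Rows per age, sorted by age, to show where the data gets thin.
age_counts = (
    results["age"].value_counts()
    .sort_index()
    .rename_axis("Age")
    .reset_index(name="Count")
)

print(season_share)
print(age_counts)
```

With “normalize=True” the values are fractions that sum to 1 across all seasons, so the top 10 can be read directly as percentages of the flagged rows.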

The bulk of the data lies between ages 24 and 34. So I decided to get the mean() difference and absolute difference when grouped by season and by age, then graphed them against the overall mean for each. Here are my results for season:

And again for age:

The first set of graphs surprised me. I was not expecting to see the dip in the standard differences for the mid-1980s. But for the most part, there wasn’t too much deviation there.

The age results, to be honest, don’t surprise me. My model uses data based on age over periods of time. So if there isn’t much data, there will be big swings, as we can see from ages 19–24 and again from 40–46.
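To make the small-sample point concrete, here is a minimal sketch of the grouped means next to the number of rows behind each one. Again, the data and the column names (“age”, “diff”, “abs_diff”) are invented for illustration and are not the real results; swapping “age” for a season column would give the per-season version.

```python
import pandas as pd

# Hypothetical data; "age" and "diff" are assumed column names.
results = pd.DataFrame({
    "age":  [19, 27, 27, 27, 27, 28, 28, 28, 45],
    "diff": [3.0, 0.5, -0.5, 1.0, -1.0, 0.2, -0.4, 0.2, -2.5],
})
results["abs_diff"] = results["diff"].abs()

# Mean difference, mean absolute difference, and row count per age.
by_age = results.groupby("age").agg(
    mean_diff=("diff", "mean"),
    mean_abs_diff=("abs_diff", "mean"),
    n=("diff", "count"),
)

# Compare each age's mean absolute difference with the overall mean.
overall_abs = results["abs_diff"].mean()
by_age["abs_gap_vs_overall"] = by_age["mean_abs_diff"] - overall_abs

print(by_age)
```

In this toy data the ages with a single row (19 and 45) sit far from the overall mean, while the better-populated ages average out close to it, which is the same pattern as the swings at the edges of the age range.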

And since the other years/ages seem to fall in line with the average, I’m pretty happy once again with the model I’ve created.
