NCAA SOS Metric is Both a Strong AND Unreliable Metric

One would hope the 2nd of the 5 criteria (the NCAA SOS Metric) the NCAA Selection Committee uses for its critical decisions would have a high degree of reliability. Does it, though?

A question posited last week regarding the NCAA SOS Metric
Try watching this one-minute video again after reading the post, too.

All SOS Models are Strong:

There are 5 different models that report SOS metrics for each of 120+ teams in D3 MVB:

NCAA, KPI, T100, Massey, & Inside Hitter

What if I standardized all 600+ of these reported values and then plotted each one as a function of the AZR-SOS for each volleyball team across the whole landscape? (The AZR concept – a Wisdom-of-the-Crowd composite metric, the Average Z Score Rating for any team – has shown itself to be very well aligned with the AVCA Coaches Poll all season long, and just a few days ago it was shown to align with 26 of the 28 teams the NCAA Selection Committee chose for its initial regional rankings.)
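
For the curious, here is a minimal sketch of what that standardizing and averaging looks like in Python. The table layout, column names, and numbers are my own invention purely for illustration; they are not the actual data behind the plot.

```python
import pandas as pd

# Hypothetical table: one row per team, one reported SOS column per model.
sos = pd.DataFrame({
    "team":   ["Team A", "Team B", "Team C", "Team D"],
    "ncaa":   [0.512, 0.548, 0.479, 0.530],
    "kpi":    [0.498, 0.561, 0.470, 0.525],
    "t100":   [0.505, 0.552, 0.468, 0.533],
    "massey": [0.509, 0.557, 0.473, 0.528],
    "ih":     [0.501, 0.549, 0.465, 0.531],
})
models = ["ncaa", "kpi", "t100", "massey", "ih"]

# Standardize each model's column: z = (value - column mean) / column std. dev.
z_scores = (sos[models] - sos[models].mean()) / sos[models].std()

# The composite is simply each team's average z-score across the five models.
sos["azr_sos"] = z_scores.mean(axis=1)
print(sos[["team", "azr_sos"]].sort_values("azr_sos", ascending=False))
```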

The question quoted at the top asks, “Does the NCAA SOS Metric have a high degree of reliability?” It was precipitated by observing an unusually large discrepancy between the NCAA SOS Metric’s perception of Baruch (ranked 32nd highest) and that of the 4 other models’ SOS (combined rank = 78th). For this reason Baruch is the only labeled data point of the 600+ shown above.

That sure seems like a pretty solid linear signal above, doesn’t it? It is a fairly unwieldy viewpoint of the SOS concept at large, however. What does each of the 5 models look like within it?

Four of the 5 models are best represented by quadratic curves, two concave up and two concave down, with the T100 the only one better suited to linear modeling. This is why all of them together appear linear. All five show really strong signals. How strong? I could tell you that R-squared for each is the percent of variability in SOS explained by the median for each team, but that doesn’t mean “diddly” to most. So how about I simply say the T100 and Massey models each earn an A+, being above 95%; Inside Hitter’s gets an A at 94%; and the NCAA and KPI each earn B+ grades, both being in the high 80s.

For those “stat-heads” out there: “Yes, it is rigged to be high for all 5 because the horizontal axis isn’t independent.” Not so unlike high school grades, where every teacher, administrator, and parent seems to think it is not only possible for more than three-quarters of students to perform above average, but it has become the expectation! A neat trick only possible with smoke and mirrors and the reinvention of the definition of the word “average,” I think! LOL However, none of these realities, particularly the one about high school grades, diminishes the fact that each of these models is an independently high-functioning signal of SOS for D3 Men’s Volleyball teams, including the one most in question, the NCAA SOS Metric.
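
If you would like to see where those letter grades come from, here is a rough sketch of computing R-squared for a linear versus a quadratic fit using NumPy. The arrays are placeholder numbers, not the real standardized SOS values; the point is only the mechanics of the calculation.

```python
import numpy as np

# Placeholder data: composite (AZR-SOS) on the x-axis, one model's
# standardized SOS on the y-axis.
x = np.array([-1.8, -1.1, -0.4, 0.0, 0.5, 1.0, 1.6])
y = np.array([-1.7, -1.2, -0.5, 0.1, 0.4, 1.1, 1.5])

def r_squared(x, y, degree):
    """R^2 = 1 - SS_residual / SS_total for a polynomial fit of the given degree."""
    coefficients = np.polyfit(x, y, degree)
    predicted = np.polyval(coefficients, x)
    ss_residual = np.sum((y - predicted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1 - ss_residual / ss_total

print("linear    R^2:", round(r_squared(x, y, 1), 3))
print("quadratic R^2:", round(r_squared(x, y, 2), 3))
```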

Hopefully this is you now: “That’s kind of cool! Okay, I’m convinced all these models carry strong signals with relatively minimal noise. But when can we put just the NCAA SOS Metric under the microscope? You know, the one that plays a role in Regional Rankings and, through them, National Bid selection?”

I’ll get there. Just hang on for one more moment, as I would be remiss not to share this: not only can these 5 models’ signals reliably stand alone, but for the middle 85% of the data of each, ignoring the fringes of the best 6 and the weakest 12 teams, their signals are pretty much one and the same!

(Circled below, you can see all 5 colored lines come together at the data’s “heart.”)

I got it – Understood! They are all good fitting models AND they are essentially the same, having nearly identical signals, except for the few teams we really don’t need an SOS metric for in the first place. Now, can I see the one I really care about, please?

Sure. The NCAA Model for SOS and its output is bold in blue on the plot below.

Alright, so now I see why it gets a B+ with R-Sq. = .891: there are a few noisy data points here and there. However, it’s still pretty decent, right?

Sure, as models go, the noise about this one is certainly not too loud. However, there is just one more thing to consider. Before I share it, I think you should read about the archer below, especially if you are not a “stat-head” or particularly skilled at reading plots. I promise it will crystallize what’s right around the corner:

The NCAA SOS Metric is Unreliable:

Now, let’s see what the NCAA SOS Metric looks like when accounting for regions and conferences:

If the only thing you do with the above scatter plot is take note of the color of the noise about the light blue signal, that should be sufficient. Just pretend these colored points were arrows leaving their mark in an attempt to hit the target.

If the NCAA SOS Metric were not biased, then I would be the first to agree that any SOS values differing by .017 or more are statistically significantly different. However, with 3 conferences from region #4 having average noise (residuals) more than 33% higher than that, i.e. .023, be mighty careful. Especially if the teams you’re comparing are from the NACC and the CUNY, the most extreme cases in opposite directions. The average CUNY team, with its “BIAS FOR / +.032,” compared to the average NACC team, with its “BIAS AGAINST / -.031,” means the NACC team’s SOS is handicapped by .063 (.032 - (-.031) = .063). So “Team Typical” from the CUNY, with a .063 higher SOS Metric to its name, would have played the same strength of schedule as “Standard Squad” from the NACC with a .063 lower SOS Metric to its credit.
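
To make that handicap arithmetic concrete, here is a small sketch of averaging residuals by conference and computing the gap between the two extremes. The residual values and team counts below are stand-ins chosen to land on the .032 and -.031 figures from the plot, not the actual underlying data.

```python
import pandas as pd

# Hypothetical residuals: each team's actual NCAA SOS minus the value
# the overall signal predicts for a team of its strength.
residuals = pd.DataFrame({
    "conference": ["CUNY", "CUNY", "CUNY", "NACC", "NACC", "NACC"],
    "residual":   [0.030, 0.034, 0.032, -0.029, -0.033, -0.031],
})

# A conference's average residual is its systematic bias.
bias = residuals.groupby("conference")["residual"].mean()
print(bias)  # CUNY ~ +.032 (bias for), NACC ~ -.031 (bias against)

# The handicap between the two is the difference of the biases:
# +.032 - (-.031) = .063 of SOS that has nothing to do with who anyone played.
handicap = bias["CUNY"] - bias["NACC"]
print(round(handicap, 3))  # ~0.063
```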

Here is the good news, though. The SOS Metric might, at most, occasionally be the faulty culprit that keeps a deserving team off the Regional Rankings list – one of the reasons Rivier, rather than Baruch, would have been favored to be on Region 1’s list last week, and a contributing factor in the case for Augustana rather than Benedictine in Region #4, too. However, considering the limited number of at-large bids available this year, my guess is that the other four criteria for the Selection Committee’s decisions will carry enough weight for it not to matter when it comes to the big-ticket item of who goes to the National Tournament. With the debate heating up between Nichols and Santa Cruz over a third Pool B bid, though, I would recommend adding no less than another .01 to Nichols’s NCAA SOS Metric to make that part of the selection criteria a “fair fight.”

Statistics is wonderful at raising concerns about the truth of things, but less so at pinpointing the reasons for them. This is why we can only reject a hypothesis and never accept it. It is much like our system of law, where a verdict is never stated as “Innocent”; it can only be stated as “Not Guilty!” I could go on and on about specific reasons why this SOS statistic is biased to the degree it is, but instead I will illustrate the larger idea below:

I searched for the 5 teams from Midwest conferences and the 5 teams not from any Midwest conference that were closest to having an AZR = 0. (A z-score of zero is the quintessential “average” team in America.) The 5 from the Midwest have won 31% (22 of 70) of their non-conference matches so far this year. Those not from the Midwest have won 58% (50 of 86). When equivalent middling teams win roughly twice as often outside their conference when they are not from the Midwest as when they are, it is a terribly confounding condition, and it plays havoc with a metric whose foundation is built on winning percent. I.e., when 90% of the 40 least skilled volleyball teams live in your backyard, you are bound to win lots more games, and that isn’t necessarily attributable to how these teams choose to build their non-conference schedules, though admittedly it sometimes can be for a select few.
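
The win-percentage gap is simple arithmetic, but here it is spelled out, with the match counts taken straight from the paragraph above:

```python
# Non-conference records for the ten near-average (AZR ~ 0) teams described above.
midwest_wins, midwest_matches = 22, 70
other_wins, other_matches = 50, 86

midwest_pct = midwest_wins / midwest_matches   # ~0.314 -> 31%
other_pct = other_wins / other_matches         # ~0.581 -> 58%

print(f"Midwest near-average teams:     {midwest_pct:.0%} non-conference wins")
print(f"Non-Midwest near-average teams: {other_pct:.0%} non-conference wins")
print(f"Ratio: {other_pct / midwest_pct:.2f}x")  # roughly twice as often
```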

It seems to me the NCAA needs either to do better at finding a reliable SOS statistic, or at least to educate the people who interpret it about exactly where it is most vulnerable and why, before putting it out there in the public domain. Their list of decision-making criteria is produced and made available to provide accountability to their stakeholders. The only thing I despise more than having to make a decision without enough information is to find out I made it on what later turns out to have been a false premise, or at least one less truthful than it was purported to be.

The margin for error in interpreting a difference in SOS metrics between teams is not consistent across conferences or regions because of this high degree of unreliability (officially referred to as a biased statistic). My guess is any astute stakeholder who engages in D3 MVB could do at least as good a job determining the quantity and caliber of good opponents any team has played all year as the NCAA Metric does. I trust the 4 committee members more without the NCAA SOS Metric in their hands than with it. I am certain of this, and I am a math nerd who digs numbers. What’s that tell you?