What is Crashing the Dance?

Each spring, bracketologists around the country do their best to trump the NCAA Men's Basketball Committee in selecting, seeding, and filling out the bracket for the NCAA Division I Men's Basketball Championship (a.k.a. the Big Dance). They pore over the same "nitty gritty" reports seen by the committee, analyzing RPI, polls, wins in the last 10 games, and conference performance.

The committee comprises a rotating set of athletic directors and conference commissioners. Each year, several members rotate off and several others take their place, so much of the body carries over from one season to the next. This creates continuity in the body building the bracket, presumably leading to continuity in the process itself. The principles and procedures (PDF version) themselves are also fairly well defined (not to mention analyzed and simulated). However, the committee's deliberations are kept secret, which makes it difficult to know whether they weigh certain factors more than others.

Hmm... so we have known input data (team information), known output data (the selected teams and their assigned seeds), and an unknown process (the committee's deliberation) that creates the output from the input. This sounds like a classic supervised machine learning problem!

Our approach applies statistical machine learning techniques (a form of artificial intelligence) to understand how the committee selects the 34 at-large teams and determines the 1-65 seeding (formally known as the S-Curve), and then attempts to predict those decisions. Others have applied simple statistical models (e.g., linear regression) to identify the at-large selections, but we are aware of neither the use of more advanced machine learning techniques nor automated efforts to predict the S-Curve.
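To make the supervised-learning framing concrete, here is a minimal Python sketch of the at-large half of the problem: learn a scoring function from past selections, then rank this season's teams that lack an automatic bid and keep the top slots. It is only an illustration. The text above does not say which algorithm or features are actually used, so the logistic-regression scorer (a deliberately simple stand-in), the two features (rpi_strength, last10_win_frac), the team names, and every number below are made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit a logistic-regression scorer by batch gradient descent."""
    n = len(rows[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def select_at_large(teams, weights, bias, slots):
    """Rank teams without an automatic bid by score and keep the top `slots`."""
    def score(team):
        return bias + sum(w * f for w, f in zip(weights, team["features"]))
    pool = [t for t in teams if not t["auto_bid"]]
    ranked = sorted(pool, key=score, reverse=True)
    return [t["name"] for t in ranked[:slots]]

# Toy training data (hypothetical): (rpi_strength, last10_win_frac) -> selected?
past_rows = [(0.9, 0.8), (0.8, 0.7), (0.85, 0.6),
             (0.3, 0.5), (0.4, 0.3), (0.2, 0.4)]
past_labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(past_rows, past_labels)

# Toy current season (hypothetical teams); Team D already holds an automatic bid.
this_season = [
    {"name": "Team A", "features": (0.9, 0.9), "auto_bid": False},
    {"name": "Team B", "features": (0.7, 0.7), "auto_bid": False},
    {"name": "Team C", "features": (0.3, 0.3), "auto_bid": False},
    {"name": "Team D", "features": (0.95, 0.95), "auto_bid": True},
]
picks = select_at_large(this_season, weights, bias, slots=2)
print(picks)
```

In the real setting the number of slots would be 34, and predicting the S-Curve would need a model that orders all 65 teams rather than a yes/no selection.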

Unfortunately, the committee's output (i.e., the bracket) is not a 100% accurate representation of their deliberations. We will discuss this and other problems as we go along. However, we believe that because the committee applies their principles and procedures consistently when selecting and seeding, our approach can be as accurate on average as any human bracketologist.

For the 2005 tournament, after training on data from the 2000-2004 brackets, we correctly predicted 31 of the 34 at-large teams (91%) and placed 56 of the 65 teams (86%) within one line of their actual seed. In 2006 (using 2000-2005 data for training), we got 29 of 34 at-large teams (85%) correct and 43 of 65 seeds (66%) within one line. As we accumulate more seasons of training data, performance should in theory improve. Of course, just like any flesh-and-blood bracketologist, this approach will have good years and bad years.
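The two scores quoted above can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, and the text does not say how a team that was predicted but left out of the actual field is scored, so here any team in the actual field that is missing from the prediction simply counts as a miss.

```python
def at_large_accuracy(predicted, actual):
    """Fraction of the actual at-large teams that were also predicted."""
    actual = set(actual)
    return len(actual & set(predicted)) / len(actual)

def within_one_seed(predicted_seeds, actual_seeds):
    """Fraction of teams in the actual field whose predicted seed line is
    within one line of the actual one. Teams absent from the prediction
    count as misses (an assumption, not stated in the text)."""
    hits = sum(
        1
        for team, line in actual_seeds.items()
        if team in predicted_seeds and abs(predicted_seeds[team] - line) <= 1
    )
    return hits / len(actual_seeds)

# Counts reported in the text: (correct, total).
reported = {
    "2005 at-large": (31, 34),
    "2005 within one seed": (56, 65),
    "2006 at-large": (29, 34),
    "2006 within one seed": (43, 65),
}
percentages = {
    name: round(100 * correct / total)
    for name, (correct, total) in reported.items()
}
print(percentages)

# Tiny made-up example: A is one line off (hit), B is two lines off (miss),
# and C was not predicted at all (miss).
toy_actual = {"A": 1, "B": 2, "C": 5}
toy_predicted = {"A": 2, "B": 4, "D": 5}
toy_seed_score = within_one_seed(toy_predicted, toy_actual)
toy_at_large_score = at_large_accuracy(["A", "B", "D"], ["A", "B", "C"])
```

The rounded percentages reproduce the figures quoted above (91, 86, 85, and 66).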

Crashing the Dance is run by Andy Cox and is based on work done by Andy and Yushi Jing while they were computer science graduate students at Georgia Tech. Thanks to Jerry Palm for past RPI data and Ken Pomeroy for game data.