MLB Draft Research Sample



For the first edition of this database, I have decided to use the 1997-2007 MLB draft classes, an 11-year sample, for three reasons: (1) the number of teams and draft picks per round is stable from year to year, following the 1997 expansion into Arizona and Tampa Bay that created today’s 30-team environment; (2) these draft classes predate significant changes to resource allocation, primarily the implementation of draft pools as dictated by the 2011 Collective Bargaining Agreement; (3) all drafted players have completed their age 29 season.

Draft History

Draft class history is provided by Baseball Reference, which provides round-by-round signing and background data with very limited exceptions. If signing status is listed as ‘Unknown’, it’s assumed that the player was never signed, provided he made no professional appearances at any level.

Player Data

I record the professional level by age of any drafted player through age 29, regardless of whether the player was traded or released by his original organization.

I award credit to the player for the highest level of affiliated baseball he reached at any given age, regardless of sample size. A player in rookie ball who plays one game in AA on an emergency basis is credited for having spent the season in question at AA. If he does not play at any affiliated level in a given season, or his professional career is already over, I denote his affiliated level as N for null.
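As a rough illustration, the “highest level reached” rule can be sketched in Python. The function name and the idea of passing in a list of level codes are assumptions made for this example; the level codes themselves are the ones listed under LvlAgeX.

```python
# Affiliated levels ordered from lowest to highest (codes as used for LvlAgeX).
LEVEL_ORDER = ["R", "A-", "A", "A+", "AA", "AAA", "MLB"]


def season_level(levels_played):
    """Return the highest affiliated level played in one age season.

    levels_played is a hypothetical list of level codes for that season.
    Codes outside LEVEL_ORDER are ignored, and a season with no
    affiliated appearance is recorded as "N".
    """
    affiliated = [lvl for lvl in levels_played if lvl in LEVEL_ORDER]
    if not affiliated:
        return "N"
    return max(affiliated, key=LEVEL_ORDER.index)


# A rookie-ball player with one emergency game at AA is credited with AA.
print(season_level(["R", "AA"]))  # AA
print(season_level([]))           # N
```

Sample size plays no part here: one game at a level counts the same as a full season there.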

For MLB seasons through age 29, I record the player’s fWAR (FanGraphs Wins Above Replacement). Because I want to analyze the frequency, distribution and probability of a given value threshold occurring across the sample (e.g. any 3rd round pick’s chance at a 3 fWAR season), I take an extra step and also place each season in a “bucket”. I break WAR into columns of 0.5 increments, from less than -1 WAR (think Chris Davis’ 2018 season) through greater than 10 WAR. Buckets are inclusive: that is, if a player had a 3 WAR season, a 5 WAR season, and a 7 WAR season, in my database this will be reflected as three seasons of 3+ WAR, two seasons of 5+ WAR, and one season of 7+ WAR.
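The inclusive bucket counting can be sketched as follows. This is a minimal illustration, not the actual database layout: the function name and the threshold list are assumptions, and the "less than -1 WAR" column would need its own separate count of seasons below -1.

```python
def seasons_at_or_above(season_wars, thresholds):
    """Count, for each threshold, the seasons with WAR at or above it."""
    return {t: sum(1 for war in season_wars if war >= t) for t in thresholds}


# Thresholds in 0.5 steps from -1.0 through 10.0 (an assumed reading of the columns).
THRESHOLDS = [step / 2 for step in range(-2, 21)]

# The example from the text: a 3 WAR, a 5 WAR and a 7 WAR season.
counts = seasons_at_or_above([3.0, 5.0, 7.0], THRESHOLDS)
print(counts[3.0], counts[5.0], counts[7.0])  # 3 2 1
```

Counting "at or above" rather than "exactly in" each bucket is what lets a single column answer a question like "how many 3+ WAR seasons did this player have?"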

Here are a number of variables I am isolating, which will perhaps be renamed further on into this project:

  • DraftAge – the player’s age on draft day using June 30th (customary) as a cutoff. Let’s say that it’s 2019, like it is. A player born on June 29th, or June 30th for that matter, in 1994 would be 25 years old for the purposes of his 2019 baseball season. A player born on July 1st, 1994 would be considered 24 years old.
  • LvlAgeX – Each player has 13 columns of LvlAgeX, one for each age season from his potential age 17 season through his age 29 season. For players who appeared in an affiliated professional game, this can only be R, A-, A, A+, AA, AAA or MLB. If the player did not appear in a game in his DraftAge year, or in any year that follows, the column for that year is marked null (N).
  • fWARAgeX – These are the accompanying fWAR columns for each age season and are only filled when the player appeared in an MLB game for the age season in question.
  • PeakLvl – The highest affiliated level of professional baseball that the drafted player ever played in through age 29.
  • AgeMLBDebut – If the player ever appeared in any MLB game, this is the age season when he made his debut.
  • MLB (Y/N) – Did the player ever appear in MLB?
  • MLB30+ – I added this column to account for the rare cases in which a player did not reach MLB during my sample of age 17-29 seasons, but did eventually reach MLB. I decided that as long as I was doing all this work, it would be nice to capture all MLB debuts, regardless of age.
  • YrsNonMLB – This is how many full seasons a drafted player spent in affiliated professional baseball before either (a) making his MLB debut the following season or (b) his affiliated professional career ending.
  • DraftDebut (Y/N) – Did the player make an appearance at any level of affiliated professional baseball in the summer immediately following the June draft he was selected in?
  • AgeDebut – The age season in which the player made his first appearance at any level of affiliated professional baseball.
  • LvlDebut – The highest level of affiliated professional baseball a player reached during the season in which the player made his first professional appearance.
  • AgeLast – The age season in which the player made his last appearance at any level of affiliated professional baseball. Note that some players continue playing beyond age 29. Age 29 is the endpoint of this sample, so in those cases we refer to the Age 29 season as the player’s last year of ball.
  • LvlLast – The highest level of affiliated professional baseball a player reached in his AgeLast season: either the final season of his affiliated professional career or, if he kept playing beyond age 29, his age 29 season.
  • YearsToMLB – This variable is only for players who made an MLB debut at some point during or before their age 29 season. It is otherwise the same as YrsNonMLB, except it includes the MLB debut season in its count, whereas YrsNonMLB isolates only the non-MLB seasons prior to debut.
  • MLBYears – The number of seasons during which a player spent any time in MLB between age 17 through age 29.
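Several of these variables can be derived mechanically from a player’s birth date and his LvlAgeX values. The sketch below is only an illustration under stated assumptions: the function names, the {age: level} dictionary, and the use of "N" for an AgeMLBDebut that never happened are all hypothetical choices, not the database’s actual structure.

```python
from datetime import date

# Affiliated levels ordered from lowest to highest.
LEVEL_ORDER = ["R", "A-", "A", "A+", "AA", "AAA", "MLB"]


def season_age(birth_date, season_year):
    """Age for a season, using June 30 of that year as the cutoff."""
    age = season_year - birth_date.year
    # A birthday after June 30 has not happened yet by the cutoff.
    if (birth_date.month, birth_date.day) > (6, 30):
        age -= 1
    return age


def summarize(lvl_by_age):
    """Derive summary variables from a {age: level} mapping ("N" = no appearance)."""
    played = {age: lvl for age, lvl in lvl_by_age.items() if lvl != "N"}
    mlb_ages = sorted(age for age, lvl in played.items() if lvl == "MLB")
    return {
        "PeakLvl": max(played.values(), key=LEVEL_ORDER.index) if played else "N",
        "AgeDebut": min(played) if played else "N",
        "LvlDebut": played[min(played)] if played else "N",
        "AgeLast": max(played) if played else "N",
        "AgeMLBDebut": mlb_ages[0] if mlb_ages else "N",  # "N" here is an assumption
        "MLB": "Y" if mlb_ages else "N",
        "MLBYears": len(mlb_ages),
    }


# The cutoff example from the DraftAge definition.
print(season_age(date(1994, 6, 30), 2019))  # 25
print(season_age(date(1994, 7, 1), 2019))   # 24

# A hypothetical player, not a real draftee.
print(summarize({21: "A-", 22: "A+", 23: "AA", 24: "MLB", 25: "AAA", 26: "MLB", 27: "N"}))
```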

Study Design Choices

Ten years ago, I created a different data-intensive draft study that used $/WAR to attempt to assign draft slot values. It was on the internet for several years, but I deleted the blog that hosted it, and I can only hope a backup somehow exists somewhere. Darn.

Something I learned from working through that study was that projects are as complicated as you choose to make them. I would go through each player’s entire career history and attempt to identify the point at which he would have exhausted the full six seasons of cost control (three league minimum seasons followed by, usually, three years of arbitration) potentially available to the team that originally drafted him. This was a time-intensive nightmare that likely added little value to the findings of the study.

I chose to stratify seasons by age rather than worry about service time at all. For most players, this has no effect whatsoever, since most draftees never reach MLB at all, and many who do fall out of the league well before playing six full seasons there. Still, most players reach free agency after their first six full seasons in MLB, so this choice means that a small handful of very successful draftees are credited for more MLB years than the six cost-controlled years available to the team. On one hand, this is a limitation of my methodology, in that this small handful of very successful draftees will have their success somewhat overstated as it relates to the team that originally drafted them.

On the other hand, when I look at the 2006 draft file, two of the first names to appear are star players Clayton Kershaw and Evan Longoria. Because each player ascended rapidly through the minor leagues, Kershaw appeared in MLB in 10 years through age 29, and Longoria in 8. It should be noted that both Kershaw and Longoria avoided free agency by signing what are generally regarded to be team-friendly extensions with their original teams. On this basis, there is arguably significant value in drafting the rights to a future star MLB player. In these two cases, both players provided value significantly below market cost to the teams that drafted them, well beyond the conclusion of each player’s sixth full season in MLB. Acquiring productive players at below market cost is a primary purpose of the MLB draft, and I don’t feel my methodology choice here negates the viability of the research.

All draftees’ professional timelines begin with the age season in which they were drafted, regardless of whether they played any affiliated professional ball in that season or not. This is a mouthful. If a player was drafted in 1997 but made his pro debut in rookie ball in 1998, I count 1998 as his second season. It’s a fair criticism to say this penalizes players for circumstances, like finite roster sizes or workload management, that are beyond their control. I did not make this choice with an intent to punish a player whose debut was held back. Age relative to league is a big factor in the career trajectory of any given player. The player has lost potential development time in each successive season whether or not he appeared in affiliated professional ball. I also want to be able to compare all draftees to one another on how many years it took them to either reach MLB or fall out of affiliated professional ball. If four years elapsed before a draftee reached MLB, but he only played in the minor leagues for the final three of those four years, it still took four years following the draft selection to develop an MLB contributor.
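One way to read this counting rule, together with the YrsNonMLB and YearsToMLB definitions, is sketched below. The formulas are an interpretation rather than a confirmed specification: they assume the draft-age season is year one, that every season from then on counts whether or not the player appeared, and that age 29 caps the count. The function and parameter names are hypothetical.

```python
def years_non_mlb(draft_age, age_mlb_debut=None, age_last=None):
    """Seasons counted before an MLB debut, or before the affiliated career ended.

    The count starts at the draft-age season, whether or not the player
    appeared in a game that year.
    """
    if age_mlb_debut is not None:
        # Every season from the draft year up to, but not including, the debut.
        return age_mlb_debut - draft_age
    if age_last is None:
        # Signed but never appeared in affiliated ball: 0 years.
        return 0
    # Never reached MLB: count through the last season, capped at age 29.
    return min(age_last, 29) - draft_age + 1


def years_to_mlb(draft_age, age_mlb_debut):
    """Same as YrsNonMLB, but including the MLB debut season itself."""
    return years_non_mlb(draft_age, age_mlb_debut) + 1


# Drafted at 18, no games until 19, MLB debut at 22:
# the age 18, 19, 20 and 21 seasons all count, so four non-MLB years.
print(years_non_mlb(18, age_mlb_debut=22))  # 4
print(years_to_mlb(18, 22))                 # 5
```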

Only affiliated professional baseball is considered a part of a draftee’s career. The Mexican League (MEX) is considered a AAA professional level by MLB and is reflected as such. Nippon Professional Baseball (NPB; Japanese baseball) is not affiliated with MLB and is not counted in the sample. Neither are professional independent baseball leagues, like the Frontier League. Even if a player continues his professional career in a non-affiliated capacity, I only track his progress through the affiliated levels of MLB. Note that a handful of draftees play some minor league baseball, are released, play several seasons in an independent league, and return to play more affiliated minor league baseball. In these cases, I count each age season in between stints in affiliated minor league baseball as null (N), and I continue tracking a player’s progress through either his final affiliated minor league baseball season or his age 29 season, whichever comes first.
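A minimal sketch of how these rules could shape a player’s LvlAgeX timeline follows. The raw season codes are an assumption for illustration ("IND" for an independent league is hypothetical, and "MEX" is taken from the text), as are the function names and the choice to start the timeline at the draft-age season.

```python
AFFILIATED = {"R", "A-", "A", "A+", "AA", "AAA", "MLB"}
LEAGUE_TO_LEVEL = {"MEX": "AAA"}  # the Mexican League is recorded as AAA


def recorded_level(code):
    """Map a raw season code to an affiliated level, or "N" if unaffiliated."""
    code = LEAGUE_TO_LEVEL.get(code, code)
    return code if code in AFFILIATED else "N"


def level_timeline(draft_age, raw_by_age, last_age=29):
    """Build LvlAge values from the draft-age season through age 29.

    Seasons with no entry, or with an unaffiliated league, become "N".
    """
    return {
        age: recorded_level(raw_by_age.get(age, "N"))
        for age in range(draft_age, last_age + 1)
    }


# A hypothetical player: released after two seasons, two years in an
# independent league, then back to affiliated ball.
raw = {22: "A", 23: "A+", 24: "IND", 25: "IND", 26: "AA", 27: "MEX"}
print(level_timeline(22, raw))
```

The two independent-league seasons show up as N between the A+ and AA seasons, so the gap stays visible in the timeline.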

Drafted players who sign but never appear in affiliated professional baseball are not excluded from the sample. These cases are uncommon. Excluding them from the sample would reduce (slightly) the failure rates of draftees, as well as exaggerate (slightly) the number of years any given draftee spends in affiliated professional baseball prior to reaching MLB or his career coming to an end. In these uncommon cases, draftees spent 0 years in affiliated professional baseball and variables like DraftDebut, AgeDebut and LvlDebut are all null (N). They are still data points for the given draft class, as well as for the subset of draftees who sign professional contracts. They can always be excluded with relative ease from specific analyses that are built off of this database.

Lazy and Dumb II

This is the time of year to do offseason prospect lists. Even though I enjoy other sports, we’re missing baseball for another several weeks, and nothing compares to it. To each their own, but that’s mine.

On the smiley face days, I’ve been able to keep up with everything I decided I want to do with the first six months of this year. There’s a super-cereal calendar staring back at me and everything.

New Year’s optimism makes you forget that there’s going to be a bunch of frowny face days in there, too, because stupid brain chemistry. I’m not optimistic about doing the two a week that would have the series finished by the end of March.

In reality, a lot of other things bring me joy, and the internet feels like a second world that sucks me away from it and makes those things less joyful by comparison. I (and many others) have now lived more than half of my life in this paradigm, so I hesitate to mentally commit myself to spending even more time on it. I don’t think this is a healthy world for mental health. Meanwhile, the real world sure is going to shit.

As for scouting Baseball Reference minor league numbers and doing my best with video (disclaimer: I am not experienced at this), I’m just not sure it’s adding much to the conversation anyway. I’d rather find it in me to complete the series than not, but it’s not an exciting project right now. I also don’t want to be derivative of the work of others, and there’s some real good work out there in this niche.

I don’t know how some people manage to stay on top of a couple thousand minor league players, in addition to all the kids pouring in every year from American high schools and colleges, not to mention the Dominican Republic and Venezuela and a couple dozen other countries. You’ve got to maintain contacts and weigh a bunch of sometimes disparate information to corroborate your evaluations.

Every year, dozens and dozens and dozens of minor leaguers are traded, and only a few of them are on the chosen upper echelon list of prospects. Most of them are interesting, developing players, however. There’s a large number of prospects who could make it, and a much smaller subgroup of them will make it. I don’t think anyone would dispute that. So how do you know which ones will and which ones won’t?

I think that’s a very, very difficult question to answer, and projecting outward is a very, very difficult job. In the course of thinking about this, I began a draft database project. I don’t have objective draft-based metrics to guide the probability of success or failure outcomes, and I can’t be the only one who’d be interested in some. I know that every year, a bunch of advanced college arms will be drafted, and a few of them will turn into tomorrow’s 5th starters and middle relievers and most of them won’t, and figuring out who is who is an imperfect science of guesswork.

My project will attempt to offer probabilistic guidance on outcomes. I want to answer questions like:

  • What are the odds of a 5th round pick reaching MLB and producing at least one 2 WAR season?
  • How often do draft-sourced 23-year-olds reach MLB in any capacity when they haven’t yet reached AA?
  • How are 3, 4 and 5+ WAR seasons distributed among the draft pool?
  • What are the odds of a draftee at any minor league level reaching MLB after spending four full seasons in the minors?
  • How is MLB debut age distributed among the draft pool?
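To give a sense of how a question like the first one might eventually be answered, here is a minimal sketch. The record layout (a "round" field and an "fwar_by_age" dictionary) and the function name are assumptions for illustration, and the rows are made up, not real draftees or real results.

```python
def share_with_season_at_least(players, draft_round, war_threshold):
    """Share of picks in a round with at least one season at or above the threshold."""
    picks = [p for p in players if p["round"] == draft_round]
    if not picks:
        return None  # no picks from that round in the data
    hits = sum(
        1 for p in picks
        if any(war >= war_threshold for war in p["fwar_by_age"].values())
    )
    return hits / len(picks)


# Made-up rows for illustration only; these are not real draftees.
players = [
    {"round": 5, "fwar_by_age": {}},                  # never reached MLB
    {"round": 5, "fwar_by_age": {25: 0.4, 26: 2.3}},  # one 2+ WAR season
    {"round": 5, "fwar_by_age": {24: -0.2}},          # reached MLB, no 2+ WAR season
    {"round": 5, "fwar_by_age": {}},
    {"round": 1, "fwar_by_age": {23: 4.1}},
]

print(share_with_season_at_least(players, 5, 2.0))  # 0.25
```

Draftees who never reached MLB stay in the denominator, which is what makes the result an estimate of the odds for any pick in that round rather than only for those who made it.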

The process of creating a database involves a lot of manual data gathering and is very time intensive. I’m scrubbing from two different websites and I’m tracking every draftee’s debut level, as well as affiliate level and, if applicable, MLB season WAR by age through age 29. I have finished 1.5 drafts at this point, and I’ve drafted a plan to get through all of the remaining drafts by May, which is a brisk pace but will allow time for some other projects.

Another motivation for a draft database is to give myself a data pool I have familiarity with, which I think would make learning database query and programming languages easier to tackle if I follow through with trying to master those skills over the next couple years. A language you know will help you with a language you don’t. I’m not 100% sure that I want to dedicate a very significant part of my life to living in databases, but exploring that avenue is my happily vague plan.

Once I have finished the database and acquired advanced data set skills, I would be equipped to develop a projection system based on the probabilistic outputs I’m developing. I can see returning to these draft classes in the future to scrub additional data, such as BB%, K% and ISO, three factors I lean on heavily for projecting minor league players forward. Minor league performance data would improve the viability of any model, but I think there will nonetheless be quite a lot of helpful information to look at from age, minor league seasons and draft position alone. I like this as a jumping-off point for eventually growing the database into a projection model that incorporates position and minor league performance data.

There are a lot of questions I’d like to dig into using this data, but I’ve got to collect all of it before I can do anything with it. Hopefully I’ll have some interesting draft articles for you to chew on in the weeks leading up to this year’s draft.