I was talking with a friend of mine late last year about baseball and the Ohtani signing. He idly speculated that the ratio of foreign-born to domestic players in MLB had stabilized a while ago, perhaps around the 1960s. This was mostly a gut feeling on his part, and I wasn’t convinced, so I went digging for more info.
To investigate this properly, I needed comprehensive biographical information for MLB players. There’s a huge amount of freely available data that is regularly audited and considered canonical for all practical purposes. I ended up using the Retrosheet Biofile. To simplify the analysis, I only considered the country of birth for players who debuted in a given year, starting in 1930. My Python code is at the end of this post. I used pandas to build the table I needed, then used Google Sheets to build the rolling averages and create charts.
The first figure shows the total number of players who debuted each year as a stacked bar chart, split by whether they are domestic or international players. I show both the raw numbers and a 3-year rolling average.
The second figure shows the proportion of players who debuted each year that were born in the United States. Again, I have charts for both the raw numbers and the 3-year rolling average.
We can clearly see that my friend’s intuition was off. It does appear that the proportion of domestic players who debut each year has leveled off, but only as of the late 1990s. Here are a few other observations:
- The proportion of US-born debuts has decreased overall since 1947, when Jackie Robinson debuted and broke the color barrier in MLB.
- The number of players debuting each year has increased overall. With baseball optimizing itself as a sport, having less patience for poor performance, and having more money to throw around, there’s a burgeoning minor league system of professional players who could be called up at any moment to debut. There’s also a higher rate of September debuts, as contending teams rest their key players ahead of the postseason and teams with no hope of reaching it give top prospects some exposure to the league.
- You can track some key internal and external events across history with these graphs. For example, there’s a huge jump in the number of players who debuted in 1944–45, late in World War II, presumably because so many established players were away in military service. There’s also a significant dip in 1994, which was a strike-shortened season.
This was a fun little project. I hope to dive into other random questions about baseball over time, since the data is available and it keeps my data analysis skills sharp.
Python Code
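import pandas as pd

# Load the Retrosheet Biofile and keep only people who have a playing debut date.
people = pd.read_csv('biofile/biofile.csv')
people = people[people['PLAY.DEBUT'].notnull()].copy()
# The code treats PLAY.DEBUT as a month/day/year string and splits it to get the debut year.
people[['DEBUT.MONTH','DEBUT.DAY','DEBUT.YEAR']] = people['PLAY.DEBUT'].astype('str').str.split('/',expand=True)
people['DEBUT.YEAR'] = people['DEBUT.YEAR'].astype(int)
# Take a copy of the two columns needed, so later column assignments do not raise SettingWithCopyWarning.
people_smaller = people[['DEBUT.YEAR','BIRTH.COUNTRY']].copy()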
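
# Collapse birth country into two groups: born in the USA, or anything else.
# Note that a missing birth country also ends up in 'Not USA'.
def country_compare(x):
    if x == 'USA':
        return 'USA'
    else:
        return 'Not USA'

people_smaller['COUNTRY.COMPARE'] = people_smaller['BIRTH.COUNTRY'].apply(country_compare)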
# Keep only debuts from 1930 onward, and only the two columns needed for counting.
people_smaller = people_smaller[people_smaller['DEBUT.YEAR'] >= 1930]
people_smaller = people_smaller[['DEBUT.YEAR','COUNTRY.COMPARE']]
# Count debuts per (year, origin) pair. Naming the count column explicitly avoids
# depending on the default name, which differs between pandas versions.
df = people_smaller.value_counts().reset_index(name='count')
df.sort_values('DEBUT.YEAR',inplace=True)
df['DEBUT.YEAR'] = df['DEBUT.YEAR'].astype(str)
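# Split the counts by origin, then join them side by side on debut year.
usa = df[df['COUNTRY.COMPARE']=='USA'][['DEBUT.YEAR','count']]
not_usa = df[df['COUNTRY.COMPARE']!='USA'][['DEBUT.YEAR','count']]
# An outer join keeps any year in which only one of the two groups debuted; its missing count becomes 0.
total = pd.merge(usa,not_usa,on="DEBUT.YEAR",how="outer",suffixes=(' USA',' Not USA')).fillna(0)
total.to_csv('debut_origins.csv')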
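I built the proportions and rolling averages in Google Sheets, but the same step could be done in pandas. The following is only a rough sketch, not the spreadsheet logic itself: it assumes the table has the `count USA` and `count Not USA` columns that the merge above produces, that the rolling average is a trailing 3-year window (the charts may well use a different alignment), and that each series is averaged independently. The function name `add_share_and_rolling` and the sample numbers are made up for illustration.

```python
import pandas as pd

def add_share_and_rolling(total: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Add the US-born share of debuts plus trailing rolling averages (hypothetical helper)."""
    out = total.copy()
    debuts = out['count USA'] + out['count Not USA']
    # Proportion of each year's debuting players who were born in the USA.
    out['share USA'] = out['count USA'] / debuts
    # Trailing rolling means; the first (window - 1) rows are NaN because the window is not full yet.
    for col in ['count USA', 'count Not USA', 'share USA']:
        out[col + ' rolling'] = out[col].rolling(window).mean()
    return out

# Made-up numbers purely for illustration; these are not taken from the Biofile.
example = pd.DataFrame({
    'DEBUT.YEAR': ['2001', '2002', '2003', '2004'],
    'count USA': [150, 160, 140, 150],
    'count Not USA': [50, 40, 60, 50],
})
result = add_share_and_rolling(example)
print(result)
```

To run this on the real output, `pd.read_csv('debut_origins.csv')` could be passed in place of `example`.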