Making MLB Team Scatter Plots

You may have seen any number of scatter plots on the internet that show data comparisons among players or teams in a given league. These are part of my daily experience on the /r/baseball community, and I finally decided to scratch my statistical presentation itch by making my own. This post doesn’t cover which statistics to compare, just the process I’ve settled on for now to turn a table of comparisons into precisely designed charts suitable for sharing on the internet.

We’ll assume you have your data from a source like Baseball Reference, FanGraphs, or Baseball Savant. You’re making a scatter plot after all, so you likely already know which two statistics are being compared.

So, open up that spreadsheet and get your data ready. I tend to list team data alphabetically by a team’s abbreviation, like this.

[Image: Three columns of a spreadsheet showing MLB team abbreviations, Whiff Rate, and Home Runs. Caption: Initial data downloaded from Baseball Savant.]

Now it’s time to figure out where the points of our plot need to go on a canvas to make the final result accurate. I believe Adobe Illustrator has a CSV import tool that allows for some automatic creation of scatter plots, but I don’t pay for the Adobe Suite. I use the Affinity suite of apps, and their Designer program doesn’t have this functionality. So, I find the locations “by hand”.

First, I find the maximum, minimum, range, and average of each statistic. This lets me plan out the rough scale of my plot. In the screenshot above, you can see Whiffs per Swing has a range of 7.91%, or 0.0791, from 0.1956 to 0.2747. Home runs has a range of 40, from 35 to 75.
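If you'd rather compute these summary numbers outside the spreadsheet, here is a minimal Python sketch. The `summarize` helper is a name made up for this example, and the lists hold placeholder values rather than real team data; only the extremes (0.1956, 0.2747, 35, 75) come from the numbers quoted above.

```python
from statistics import mean


def summarize(values):
    """Return the minimum, maximum, range, and average of a list of numbers."""
    low, high = min(values), max(values)
    return {"min": low, "max": high, "range": high - low, "mean": mean(values)}


# Placeholder values, NOT real team data. Only the extremes match the text.
whiff_rates = [0.1956, 0.2310, 0.2480, 0.2747]
home_runs = [35, 52, 61, 75]

whiff_summary = summarize(whiff_rates)
hr_summary = summarize(home_runs)

print(whiff_summary)
print(hr_summary)
```

The ranges are what matter for planning the canvas; the averages come back later when placing the axes.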

I can now plan out the bounds of my data. I don’t want any data points to be on the actual edge of the plot, so I’ll have Whiffs range from 0.19 to 0.28, and Home Runs range from 30 to 80. Everything should fit comfortably in there.
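I picked those bounds by eye, but one possible rule is to round the minimum down and the maximum up to a convenient step. The sketch below assumes that rule; `padded_bounds` and its `step` parameter are hypothetical names for illustration.

```python
import math


def padded_bounds(low, high, step):
    """Round low down and high up to the nearest multiple of step."""
    lower = math.floor(low / step) * step
    upper = math.ceil(high / step) * step
    # Rounding trims floating-point noise such as 0.28000000000000003.
    return round(lower, 10), round(upper, 10)


whiff_bounds = padded_bounds(0.1956, 0.2747, 0.01)
hr_bounds = padded_bounds(35, 75, 10)

print(whiff_bounds)
print(hr_bounds)
```

One caveat: a value that already sits exactly on a multiple of the step would land on the edge of the plot, so in that case you would want to widen the bound by one more step.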

The benefit of working in a vector graphics tool like Affinity Designer or Illustrator is that you can pick whatever starting numbers are most convenient, and then arbitrarily scale everything from there to create the final design. I’ll make nice round-number decisions about how the data maps to pixels on the design canvas. Let’s say every change of 0.01 in Whiffs corresponds to 100 horizontal pixels, while every increase of 10 home runs maps to 200 vertical pixels. The Whiff axis spans 0.09, or nine steps of 0.01, and the Home Run axis spans 50, or five steps of 10, so this gives me a 900-by-1000 canvas to start, and I can run a formula to place my data points on said canvas.

Let’s now rephrase our data-to-pixel conversions: A change of 1 in Whiffs corresponds to 10,000 pixels, and a change of 1 in home runs corresponds to 20 pixels. This’ll make it more intuitive to write our formulas.
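The arithmetic behind the canvas size and the per-unit scales can be checked in a few lines of Python. This is just a sketch; the constant names are made up for the example, while the numbers are the ones chosen above.

```python
# Axis bounds chosen earlier.
X_MIN, X_MAX = 0.19, 0.28   # Whiff rate
Y_MIN, Y_MAX = 30, 80       # Home runs

# Pixels per one unit of each statistic.
X_SCALE = 100 / 0.01   # 100 px per 0.01 of Whiff rate, about 10,000 px per 1
Y_SCALE = 200 / 10     # 200 px per 10 home runs, 20 px per 1

# round() absorbs tiny floating-point errors in the subtraction.
canvas_width = round((X_MAX - X_MIN) * X_SCALE)
canvas_height = round((Y_MAX - Y_MIN) * Y_SCALE)

print(canvas_width, canvas_height)
```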

Whiff rate is easy, since it’s the horizontal axis. As we move to the right on the canvas, both Whiff rate and the x-coordinate of the pixels increase. So, we find the difference between the Whiff rate data point and 0.19, then multiply by 10000 to get the corresponding horizontal pixel coordinate.

For my spreadsheet, in T2 I would enter =10000*(P2-0.19).

The vertical axis is a smidge more annoying because typically 0 pixels is the top of the canvas, so as Home Runs increase on our chart, the y-coordinate of the pixels decreases. We subtract 20 times the difference between our data point and 30 from the maximum pixel value of 1000.

For my spreadsheet, in U2 I would enter =1000 - 20*(S2 - 30).

Complete your formulas down the columns, and you have a nice spreadsheet with pixel coordinates that correspond to your canvas. Great!
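For anyone who prefers code to spreadsheet cells, the two formulas translate directly into Python. This is a sketch that mirrors `=10000*(P2-0.19)` and `=1000 - 20*(S2 - 30)`; the function names are made up here, and the sample rows are placeholder points with hypothetical labels, not real teams.

```python
def to_pixel_x(whiff_rate, x_min=0.19, scale=10_000):
    """Horizontal pixel position: grows to the right as Whiff rate grows."""
    return scale * (whiff_rate - x_min)


def to_pixel_y(home_runs, y_min=30, scale=20, canvas_height=1000):
    """Vertical pixel position: flipped, because y = 0 is the top of the canvas."""
    return canvas_height - scale * (home_runs - y_min)


# Placeholder rows (label, Whiff rate, home runs), not real team data.
rows = [("AAA", 0.1956, 35), ("BBB", 0.2747, 75)]

for label, whiff, hr in rows:
    x = round(to_pixel_x(whiff), 1)
    y = round(to_pixel_y(hr), 1)
    print(label, x, y)
```

Note how the point with more home runs gets the smaller y value, which is exactly the flip described above.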

[Image: A spreadsheet the same as the previous one, but with two additional columns: the x and y pixel coordinates for placing these points on a canvas.]

Once this is ready, you can open up a canvas with your required dimensions (again, 900-by-1000 pixels for me). I add axes at the average of each statistic, which you can see I’ve also put in at row 32 of the sheet above. Once the axes are there, I start entering the points.
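The axes use the same conversion as the data points: the average Whiff rate gives the x position of a vertical line, and the average Home Runs gives the y position of a horizontal line. Here is a small sketch, assuming the 900-by-1000 canvas and the scales above. The `axis_lines` name and the averages passed in (0.235 and 55) are placeholders, not the real averages from my sheet.

```python
CANVAS_WIDTH, CANVAS_HEIGHT = 900, 1000


def axis_lines(mean_whiff, mean_hr):
    """Return endpoints for the vertical and horizontal average lines."""
    x = round(10_000 * (mean_whiff - 0.19), 1)
    y = round(CANVAS_HEIGHT - 20 * (mean_hr - 30), 1)
    vertical = ((x, 0), (x, CANVAS_HEIGHT))      # top to bottom at x
    horizontal = ((0, y), (CANVAS_WIDTH, y))     # left to right at y
    return vertical, horizontal


# Placeholder averages for illustration only.
vertical, horizontal = axis_lines(0.235, 55)
print(vertical)
print(horizontal)
```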

But wait, what points? Isn’t the point to have a fun design that contains team logos? That’s right. I discovered this page that lists all 30 MLB teams, and right-clicking to save the logos gives me files that (a) are already named with the teams’ abbreviations, and (b) are .svg so they can be resized.

I’ve already saved these to a folder, so I drag each logo onto the canvas, resize it with a locked aspect ratio to be 50 pixels tall,[1] and adjust the coordinates according to my spreadsheet.

You can see in the top-right the pixel coordinates for BAL, and that the logo is locked at a height of 50 pixels.
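Locking the aspect ratio just means the width is scaled by the same factor as the height. The design app does this for you, but the arithmetic is simple enough to sketch. The `scale_to_height` name and the 300-by-200 logo size are hypothetical, chosen only to show the calculation.

```python
def scale_to_height(width, height, target_height=50):
    """Scale a logo to a target height while keeping its aspect ratio."""
    factor = target_height / height
    return width * factor, target_height


# A hypothetical 300-by-200 logo becomes 75 wide at 50 tall.
new_width, new_height = scale_to_height(300, 200)
print(new_width, new_height)
```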

Once all the logos are in place, I select the entire canvas and group the items. I then resize the canvas to be a bit larger so I can add in axis labels, a title, and all that good stuff. That leaves me with a finished product like so!

While there are probably slicker ways to do this,[2] entering 30 data points is pretty quick work and I enjoy seeing the plot build up as I move along. I always make a scatter plot in Google Sheets based on the original data though, so I can compare it to the final design and ensure I didn’t make a mistake.[3]
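Another way to catch formula mistakes, besides eyeballing a reference plot, is to recompute each pixel coordinate from the raw data and flag any row that disagrees with the table. The sketch below assumes the bounds and scales used in this post; `find_mismatches`, the half-pixel tolerance, and the sample rows are all made up for illustration. The second row deliberately forgets the vertical flip.

```python
def expected_pixels(whiff, hr):
    """Recompute canvas coordinates from the raw statistics."""
    return 10_000 * (whiff - 0.19), 1000 - 20 * (hr - 30)


def find_mismatches(rows, tolerance=0.5):
    """Return labels whose recorded pixels differ from the recomputed ones."""
    bad = []
    for label, whiff, hr, x, y in rows:
        expected_x, expected_y = expected_pixels(whiff, hr)
        if abs(x - expected_x) > tolerance or abs(y - expected_y) > tolerance:
            bad.append(label)
    return bad


# Placeholder rows: (label, Whiff rate, home runs, recorded x, recorded y).
rows = [
    ("AAA", 0.1956, 35, 56, 900),   # correct
    ("BBB", 0.2747, 75, 847, 900),  # y computed without the flip (should be 100)
]

mismatches = find_mismatches(rows)
print(mismatches)
```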

I enjoyed building out this process for myself on a whim. It only took two tries before I settled on something fairly efficient and repeatable, and it’s scratched a data presentation itch I didn’t quite know I had.

  • 1
    I discovered this after a bit of experimentation. Ensuring the height is the same felt most important to making all the logos “feel” the same size, and 50 pixels seems appropriate for the approximately 1000-by-1000 pixel canvases I’ve settled on as a default.
  • 2
    I’m fascinated by the idea of writing a Python script to do some or most of this for me, but haven’t yet dived into that.
  • 3
    The first time I did this I really messed up the vertical coordinate formula. The second time, I did all the scaling on Whiff rate incorrectly, though I figured that out just by looking at the table of coordinates.
