import React from 'react';
import './Article.css';  // Ensure you have a CSS file for styling
import HelmetName from '../components/HelmetName';

const SequencingPlusArticle = () => {
  return (
    <div className="article-container">
      <HelmetName title="Creating Sequencing+ (Part 1: Pitch Sequence Run Values)" description="An in-depth exploration of pitch sequencing and the creation of the Sequencing+ metric." />
      <h1>Creating Sequencing+ (Part 1: Pitch Sequence Run Values)</h1>
      <p><em>By Logan Anthony | July 16, 2023</em></p>

      <h2>Intro</h2>
      <h3>A Brief Review of Pitch Sequencing Research</h3>
      <p>
        In my opinion, the most intriguing areas of baseball analytics are the ones that challenge our intuition about the game and force us to reassess our most deeply held beliefs about it. The topic of pitch sequencing is one of these special areas because its effects are widely acknowledged, but measuring their magnitude largely remains a work in progress. Not only is the breadth of problems surrounding pitch sequencing complex, but the number of angles from which one can approach these problems adds another layer of complexity.
      </p>
      <p>
        I’ve compiled a list of articles that demonstrate, with varying kinds of evidence, that changing speed/location/movement (or changing pitch types more broadly) is a valuable strategy in pitching:
      </p>
      <ul>
        <li>The Art and Science of Sequencing: Sabermetrics’ Undiscovered Country (Robert Arthur, 2014)</li>
        <li>Defining the Pitch Sequence Question (Peter Bonney, 2015)</li>
        <li>Finding Value in Fastballs Mixing (Max Weinstein, 2015)</li>
        <li>Entropy and Pitch Sequencing (Patrick Brennan, 2021)</li>
        <li>A Game Theoretical Approach to Optimal Pitch Sequencing (William Melville, Jesse Melville, Theodore Dawson, Delma Nieves-Rivera, Christopher Archibald, David Grimsman, 2022)</li>
        <li>Pitch Type Sequence Similarity Ratio: Understanding the Role Pitch Sequencing Plays in the MLB (Ajay Patel and Sean Sullivan, 2023)</li>
      </ul>
      <p>
        Some past efforts have cleverly analyzed pitch sequencing by measuring pitch-type randomness via entropy, mutual information, and Sequence Similarity Ratios. Other efforts have focused on measuring the effects of differences in velocity/location/movement between pitches. These are all great efforts by great minds in the field of baseball analytics. However, I believe past research on pitch sequencing is missing two things that Peter Bonney outlined in his 2015 article listed above:
      </p>
      <ol>
        <li>Finding evidence that throwing Pitch A is better/worse than Pitch B.</li>
        <li>Measuring how much better/worse throwing Pitch A is than Pitch B.</li>
      </ol>

      <h2>Pitch Sequence Run Expectancy Matrix</h2>
      <p>
        A noteworthy exception is the 2022 research paper titled “A Game Theoretical Approach to Optimal Pitch Sequencing” by William Melville, Jesse Melville, Theodore Dawson, Delma Nieves-Rivera, Christopher Archibald, and David Grimsman. Using sophisticated neural network models, this group successfully evaluated pitch type decision-making by assigning expected run values to each pitch. In my opinion, this is the right approach to pitch sequencing research. I believe cracking the pitch sequence code has to involve run values on a pitch-by-pitch level, and this is my attempt to do exactly that.
      </p>
      <p>
        Similar to how current pitch grading models (Stuff+, PitchingBot, etc.) predict run values from a pitcher’s stuff, the goal is to predict “sequence” run values using various aspects of a pitcher’s arsenal and behavior – the result being a pitch sequence version of Stuff+, or what I’m calling Sequencing+. To achieve this, however, a new run expectancy matrix is required to capture relationships between run expectancies and pitch sequences – something that traditional run expectancy matrices (RE24 or RE288) aren’t designed to do. This pitch sequence run expectancy matrix will form the foundation for calculating sequence run values, which will (in theory) be used to create a Sequencing+ model and resulting metric.
      </p>
      <p>
        If you need a refresher on how a traditional run expectancy matrix is calculated, it’s very straightforward. We begin with a specific game state, such as 2 outs, no runners on base, and a 0-2 count, and calculate the average number of runs scored from that state until the end of the inning.
      </p>
      <p>
        When it comes to the pitch sequence run expectancy matrix, there are a few modifications. Instead of using columns to represent count states and rows to represent base states, I replaced the columns with pitch sequences and the rows with pitch numbers. It’s important to note that unlike the traditional run expectancy matrix, which calculates the average number of runs scored from a particular state until the end of the inning, the pitch sequence run expectancy matrix calculates the average number of runs scored from a specific state until the end of the plate appearance.
      </p>
      <p>
        For each state, we take the average number of runs scored after the last pitch in the sequence has been thrown. Take the state [FF – SL – FF] as an example: it represents the average number of runs scored after the third pitch of a Fastball-Slider-Fastball sequence, which happens to be a Fastball. Compare that to another state, [FF – SL – SL], which is the average number of runs scored after the third pitch of a Fastball-Slider-Slider sequence, which happens to be a Slider. It looks like a Slider is the best pitch to throw after a Fastball-Slider sequence. The gain from throwing a Slider instead of a Fastball looks marginal at first glance: only about 0.036 runs are prevented on average. But remember, this is on a pitch-by-pitch level, which means there are far more gains to be had throughout a single game, let alone an entire season – more on this later.
      </p>
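      <p>
        As a rough sketch of that calculation (not the exact code behind this article), each state can be keyed by the pitch types thrown so far, and its run expectancy is the average number of runs scored from that pitch through the end of the plate appearance. The field names pitchTypes and runsOnPitch, the helper buildSequenceRunExpectancy, and the toy data are all made up for illustration, and counting runs scored on the pitch itself is an assumption.
      </p>
      <pre><code>{`// Toy plate appearances: pitch types in order, and runs scored on each pitch.
// Field names and numbers are hypothetical, for illustration only.
const plateAppearances = [
  { pitchTypes: ['FF', 'SL', 'FF'], runsOnPitch: [0, 0, 1] },
  { pitchTypes: ['FF', 'SL', 'SL'], runsOnPitch: [0, 0, 0] },
  { pitchTypes: ['FF', 'SL', 'FF', 'CH'], runsOnPitch: [0, 0, 0, 0] },
];

function buildSequenceRunExpectancy(pas) {
  const totals = new Map(); // state key -> { runs, n }
  for (const pa of pas) {
    for (let i = 0; i < pa.pitchTypes.length; i++) {
      // State = every pitch type thrown so far, e.g. "FF-SL-FF".
      const key = pa.pitchTypes.slice(0, i + 1).join('-');
      // Assumption: runs scored from pitch i through the end of the plate appearance.
      const runs = pa.runsOnPitch.slice(i).reduce((a, b) => a + b, 0);
      const t = totals.get(key) || { runs: 0, n: 0 };
      t.runs += runs;
      t.n += 1;
      totals.set(key, t);
    }
  }
  // Average the accumulated runs for each state.
  const matrix = {};
  for (const [key, t] of totals) matrix[key] = t.runs / t.n;
  return matrix;
}

const re = buildSequenceRunExpectancy(plateAppearances);
console.log(re['FF-SL-FF']); // 0.5 (one run across two occurrences)
`}</code></pre>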

      <p>
        To ensure the reliability of the pitch sequence run expectancy matrix, I imposed certain limitations on its construction. Specifically, I focused on plate appearances that consisted of six pitches or fewer and included only six pitch types: Four-Seam Fastball, Slider, Changeup, Curveball, Sinker, and Cutter. Restricting the analysis this way maintains a reasonable sample size for each state. Incorporating additional pitch types or allowing more pitches per plate appearance would worsen the sample size limitations, which are already a concern in the current dataset.
      </p>
      <p>
        To overcome small sample sizes, I chose to use pitcher xwOBA to regress these run expectancies, given the existing relationship between the two. These regressed run expectancy values will then be attached to pitch-by-pitch level data. It’s certainly not a perfect approach – Sinkers and Cutters are thrown significantly less often than other pitch types, and sequences that feature these pitches have noticeably less accurate run expectancies, even after they’ve been regressed. It’s worth considering taking them out entirely, but for now, I’ve decided to include them.
      </p>

      <h2>Sequence Run Values</h2>
      <p>
        Now that we have a new run expectancy matrix, we can derive sequence run values for each pitch by subtracting the run expectancy of the previous state from the run expectancy of the current state. Here’s an example: let’s say a pitcher has thrown a [FF – SL – FF] sequence and then throws a Changeup – what’s the run value of that Changeup?
      </p>
      <p>
        RE<sub>[FF – SL – FF – CH]</sub> – RE<sub>[FF – SL – FF]</sub> = 0.081 – 0.117 = -0.036
      </p>
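      <p>
        In code, the same subtraction might look like the sketch below. The runExpectancy lookup and the sequenceRunValue helper are hypothetical names, and the two run expectancies are the ones from the example above, assigned to states according to the current-minus-previous definition.
      </p>
      <pre><code>{`// Run expectancies from the worked example; keys are pitch sequences.
const runExpectancy = {
  'FF-SL-FF': 0.117,
  'FF-SL-FF-CH': 0.081,
};

// Sequence run value of the latest pitch: RE(current state) - RE(previous state).
// The first pitch of a plate appearance has no previous state and is not handled here.
function sequenceRunValue(re, pitchTypes) {
  const current = re[pitchTypes.join('-')];
  const previous = re[pitchTypes.slice(0, -1).join('-')];
  return current - previous;
}

const changeupValue = sequenceRunValue(runExpectancy, ['FF', 'SL', 'FF', 'CH']);
console.log(changeupValue.toFixed(3)); // "-0.036"
`}</code></pre>
      <p>
        A negative value means the pitch lowered the run expectancy, which favors the pitcher.
      </p>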
      <p>
        These run values give us one way to measure who the best pitch sequencers are, at least in terms of pitch selection. I’ve gathered a list of the pitchers who have saved the most runs via pitch selection/sequencing in the last five seasons, measured by the sum of their sequence run values (SRV). I’ve also included a per-100-pitch version of SRV called SRV/100.
      </p>
      <p>
        To put these numbers into perspective, the best pitch sequencers are capable of preventing nearly half a run every 100 pitches they throw. Throughout a season, this can translate to saving a significant number of runs – potentially as many as a dozen for the most elite pitch sequencers. Going forward, I’ll refer to the total runs prevented as just “SRV” (the sum of the sequence run values) and the runs prevented per 100 pitches as SRV/100.
      </p>
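      <p>
        For concreteness, here is a minimal sketch of how the two totals relate, assuming SRV/100 is simply SRV divided by the number of pitches and multiplied by 100. The per-pitch values are invented for illustration and do not belong to any real pitcher.
      </p>
      <pre><code>{`// Hypothetical per-pitch sequence run values for one pitcher.
const pitchRunValues = [-0.036, 0.012, -0.02, 0.004, -0.01];

// SRV: the sum of the sequence run values.
const srv = pitchRunValues.reduce((sum, v) => sum + v, 0);

// SRV/100: the same total scaled to a per-100-pitch rate.
const srvPer100 = (srv / pitchRunValues.length) * 100;

console.log(srv.toFixed(3));       // "-0.050"
console.log(srvPer100.toFixed(2)); // "-1.00"
`}</code></pre>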

      <h2>SRV Leaders</h2>
      <p>Here are some of the top SRV leaders over the last five seasons:</p>
      <ul>
        <li>Joe Musgrove (2021): -13.7 SRV, -0.47 SRV/100</li>
        <li>Corbin Burnes (2022): -12.6 SRV, -0.38 SRV/100</li>
        <li>Justin Verlander (2019): -12.5 SRV, -0.36 SRV/100</li>
        <li>Miles Mikolas (2022): -12.4 SRV, -0.39 SRV/100</li>
        <li>Stephen Strasburg (2019): -12.1 SRV, -0.36 SRV/100</li>
        <li>Adam Wainwright (2021): -12.1 SRV, -0.39 SRV/100</li>
        <li>Zack Greinke (2019): -11.9 SRV, -0.38 SRV/100</li>
        <li>Madison Bumgarner (2019): -11.8 SRV, -0.36 SRV/100</li>
        <li>Jon Lester (2017): -11.8 SRV, -0.39 SRV/100</li>
        <li>Adam Wainwright (2019): -11.7 SRV, -0.41 SRV/100</li>
        <li>Corey Kluber (2018): -11.5 SRV, -0.36 SRV/100</li>
        <li>José Quintana (2017): -11.4 SRV, -0.36 SRV/100</li>
        <li>Adam Wainwright (2022): -11.3 SRV, -0.36 SRV/100</li>
        <li>Jon Lester (2019): -11.3 SRV, -0.37 SRV/100</li>
      </ul>

      <h2>SRV Correlations with Pitching Metrics</h2>
      <p>
        Another interesting discovery emerged when analyzing the distribution of SRV and SRV/100. It became apparent that nearly every pitcher has a negative SRV. In other words, almost every pitcher benefits from their pitch sequencing, albeit to varying degrees. This might not be a revelation, but it’s further evidence that pitch sequencing effects are a matter of magnitude, not existence.
      </p>
      <p>
        Next, let’s explore some evidence that SRV and SRV/100 are not only stable metrics but also ones that correlate with important pitching stats. To begin, let’s examine the “stickiness” of SRV and SRV/100, which refers to their year-to-year predictability. Promisingly, SRV and SRV/100 exhibit year-to-year correlations of 0.68 and 0.80, respectively (minimum 800 pitches thrown). Granted, we can expect some degree of year-to-year relationship since we used xwOBA as a regression factor, which itself has a year-to-year correlation of approximately 0.4. However, the fact that SRV and SRV/100 display roughly twice the stickiness of xwOBA suggests that the signal embedded in these stats extends well beyond the influence of xwOBA alone.
      </p>
      <p>
        Next, I aggregated the pitch-by-pitch level data by sequence run value bins and calculated xwOBA and SwStr% for these bins. You’ll notice a pattern emerge as we ascend the SRV bin ladder; pitches with the lowest SRVs demonstrate a remarkable ability to generate a high swinging strike rate while maintaining a low xwOBA – conversely, pitches with high SRVs exhibit the opposite pattern.
      </p>
      <table>
        <thead>
          <tr>
            <th>SRV Bin</th>
            <th>xwOBA</th>
            <th>SwStr%</th>
            <th>n</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>-0.079 – -0.042</td>
            <td>0.222</td>
            <td>14.3%</td>
            <td>40,876</td>
          </tr>
          <tr>
            <td>-0.042 – 0</td>
            <td>0.293</td>
            <td>11.7%</td>
            <td>236,978</td>
          </tr>
          <tr>
            <td>0 – 0.031</td>
            <td>0.326</td>
            <td>9.5%</td>
            <td>595,642</td>
          </tr>
          <tr>
            <td>0.031 – 0.067</td>
            <td>0.363</td>
            <td>7.5%</td>
            <td>31,219</td>
          </tr>
          <tr>
            <td>0.067 – 0.105</td>
            <td>0.409</td>
            <td>6.7%</td>
            <td>3,132</td>
          </tr>
        </tbody>
      </table>
      <p>
        Finally, I wanted to find any relationships between SRV and SRV/100 and other important pitching stats. Here are the highest correlations I found:
      </p>
      <table>
        <thead>
          <tr>
            <th>Metric</th>
            <th>SRV</th>
            <th>SRV/100</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>ERA</td>
            <td>0.22</td>
            <td>0.06</td>
          </tr>
          <tr>
            <td>FIP</td>
            <td>0.18</td>
            <td>0.04</td>
          </tr>
          <tr>
            <td>HardHit%</td>
            <td>0.18</td>
            <td>0.07</td>
          </tr>
          <tr>
            <td>Contact%</td>
            <td>0.17</td>
            <td>0.06</td>
          </tr>
          <tr>
            <td>xFIP</td>
            <td>0.16</td>
            <td>0.04</td>
          </tr>
          <tr>
            <td>SIERA</td>
            <td>0.15</td>
            <td>0.03</td>
          </tr>
          <tr>
            <td>BB/9</td>
            <td>0.13</td>
            <td>0.07</td>
          </tr>
          <tr>
            <td>O-Contact%</td>
            <td>0.12</td>
            <td>0.04</td>
          </tr>
          <tr>
            <td>Zone%</td>
            <td>0.11</td>
            <td>0.11</td>
          </tr>
          <tr>
            <td>Z-Contact%</td>
            <td>0.11</td>
            <td>0.02</td>
          </tr>
          <tr>
            <td>Pitching+</td>
            <td>-0.11</td>
            <td>0.06</td>
          </tr>
          <tr>
            <td>Stuff+</td>
            <td>-0.11</td>
            <td>0.05</td>
          </tr>
          <tr>
            <td>K/9</td>
            <td>-0.14</td>
            <td>-0.02</td>
          </tr>
          <tr>
            <td>O-Swing%</td>
            <td>-0.17</td>
            <td>-0.10</td>
          </tr>
          <tr>
            <td>SwStr%</td>
            <td>-0.17</td>
            <td>-0.07</td>
          </tr>
          <tr>
            <td>Age</td>
            <td>-0.21</td>
            <td>-0.14</td>
          </tr>
          <tr>
            <td>WPA</td>
            <td>-0.25</td>
            <td>-0.05</td>
          </tr>
          <tr>
            <td>WAR</td>
            <td>-0.36</td>
            <td>-0.02</td>
          </tr>
        </tbody>
      </table>

      <h2>What’s Next?</h2>
      <p>
        The point of this article is to share a process with you. I’ve intentionally refrained from going into the specific details of building the Sequencing+ model because, well, I’m still creating the plan. But my blueprint for the Sequencing+ model is based on current pitch grading models that use variables describing a pitcher’s stuff (e.g., spin rate, average velocity/movement differences, approach/release angles, release points, extension) to predict run values. It’s plausible that the signal these variables hold in predicting a pitcher’s stuff also translates to predicting how effective a pitch sequence is. Take, for instance, the average difference in velocity/movement between two of a pitcher’s pitches, something that has been shown to have a significant impact on Stuff+.
      </p>
      <p>
        But considering the significance of pitch order within sequences, additional inputs will be required to capture the dynamics between pitches within the same sequence. For instance, we might need a variable called “pitch_2_velo_diff_pitch_1” to describe the velocity difference between the second and first pitch. We may have to get even more complicated and create variables like “pitch_4_velo_diff_pitch_2” to capture the velocity difference between the fourth and second pitch thrown in the plate appearance – if such an effect exists. I’m just brainstorming here – nothing too serious. Scaling this approach across all permutations of pitch types, pitch numbers, and variables, I’ve quickly realized how complex this project could become.
      </p>
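      <p>
        To give a feel for how quickly these inputs multiply, here is a small sketch that generates one velocity-difference variable for every ordered pair of pitches in a plate appearance. The velocityDiffFeatures helper, the later-minus-earlier sign convention, and the velocities are assumptions for illustration, not part of any finished model.
      </p>
      <pre><code>{`// Velocities (mph) of the pitches in one plate appearance, in order. Toy numbers.
const velocities = [95.1, 86.4, 94.8, 84.0];

// Build one feature per pair of pitches: later pitch minus earlier pitch.
function velocityDiffFeatures(velos) {
  const features = {};
  for (let later = 1; later < velos.length; later++) {
    for (let earlier = 0; earlier < later; earlier++) {
      const name = 'pitch_' + (later + 1) + '_velo_diff_pitch_' + (earlier + 1);
      features[name] = velos[later] - velos[earlier];
    }
  }
  return features;
}

const features = velocityDiffFeatures(velocities);
console.log(Object.keys(features).length); // 6 features from just four pitches
console.log('pitch_4_velo_diff_pitch_2' in features); // true
`}</code></pre>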
      <p>
        All of this could be completely unrealistic for reasons that I can’t know yet. There’s also the chance that I’m not the right person for the job, which is partly why I’m putting this into the public space. The harsh reality of baseball analytics is that many ideas never materialize the way we want them to. Regardless, if nothing else materializes from this project, I’m happy with what SRV and SRV/100 provide. It’s not a perfect metric, but it certainly captures some level of pitch sequencing ability concerning pitch selection, and I’m interested in exploring more of what it has to offer in the future.
      </p>
      <p>Thanks for reading! If you have any questions or comments, feel free to reach out to me via email or Twitter.</p>
    </div>
  );
};

export default SequencingPlusArticle;
