Disk No 1186
Program Title: BASEBALL version 2.0 (Disk 1 of 2)
PC-SIG version 1
SIMBASE is a baseball simulation program. It is a limited version,
however: there is data for only two teams. This program gives you a
tool to analyze the value of a particular player or a group of players
to the team over the course of one or more full seasons. It lets you
review the pluses and minuses of different player lineups. And it
provides a framework where you can forecast the likely outcome for a
full baseball season. To really use SIMBASE productively, you must run
a large number of games and look at the averages. Once you are familiar
with the program, you will want to amend the LINEUPS.DAT file by
shuffling or adding and subtracting players. This is easily done with
any word processor that reads and writes ASCII files. This disk is disk
one of a two-disk set. The second disk of the set is disk #1187.
Usage: Sports/Entertainment.
Special Requirements: None.
How to Start: Type SIMBASE (press enter).
Suggested Registration: $25.00
File Descriptions:
SIMBASE DOC Documentation for SIMBASE.
SIMBASE EXE Main program.
GO BAT Batch file to start program
READ ME Introductory text file
SETUP BAT Batch file to configure program
PC-SIG
1030D E Duane Avenue
Sunnyvale Ca. 94086
(408) 730-9291
(c) Copyright 1987,88 PC-SIG Inc.
Baseball Simulation Program
October 10, 1988
Version 2.0
Philip Smith
109 Tripp Crescent
Nepean, Ontario
Canada K2J 1E2
TABLE OF CONTENTS
I. INTRODUCTION
II. RUNNING THE PROGRAM
III. THE FUNCTION KEYS
IV. HOW THE PROGRAM WORKS
V. SETTING THE PROBABILITIES THAT DRIVE THE SIMULATION
VI. COMMAND LINE OPTIONS
    VI.1 RANDOM NUMBER GENERATORS
    VI.2 DISPLAY MONITORS AND VIDEO MODES
    VI.3 THE DATA SUB-DIRECTORY
    VI.4 THE PITCHER POWER
VII. ILLUSTRATIVE PRACTICAL RESULTS FROM THE PROGRAM
VIII. FUTURE IMPROVEMENTS
OVERVIEW OF THE BASEBALL SIMULATION PROGRAM
CONDITIONS ON THE USE OF THE BASEBALL SIMULATION PROGRAM
REGISTRATION FORM
I. INTRODUCTION
This is a program to simulate baseball games. And baseball is a
great game indeed. Millions of fans testify to this fact every year by
spending enormous amounts of time and money on it. The game is fun as an
athletic challenge, as a casual diversion and as a subject for serious study.
Personally I love it. If you are reading this, you probably do too.
The game of baseball lends itself quite naturally to probabilistic
analysis. This BASEBALL SIMULATION PROGRAM is intended as a tool for use in
such analysis. This program is NOT intended as a "game" itself. It does NOT
make little graphic images of baseball players run about the screen and it
does NOT cast you in the role of team manager for a particular game, at least
not in the sense of letting you decide when to bunt, when to pinch hit or
when to bring in a relief pitcher. What the program DOES do is this: it
provides you with a tool for analyzing the value of a particular player or a
group of players to the team over the course of one or more full seasons. It
allows you to review the pluses and minuses of different player lineups. And
it provides a framework within which you can forecast the likely outcome for
a full baseball season. You can use the program to help you assess current
or historical won-loss records in the major leagues, or you can use it to
help improve the prospects of your own local team.
This is the second major version of the SIMBASE program. The first
version was released in August 1987. There are several improvements in this
new version including:
- an improved user interface;
- more efficient coding, in the C language
instead of in BASIC;
- a new database with statistics for all
26 major league teams;
- the ability to select teams and lineups
interactively;
- display of player performance statistics
during the simulations;
- a choice of random number generator
functions;
- a "pitcher importance" factor; and
- detailed printed output.
Baseball is a complex game. It is safe to say that no one will ever
succeed in creating a program that can simulate all of its features
faithfully. Certainly there is much room for improvement in my program. I
am continuing to work on it, making it both more realistic and more
efficient. But the program does simulate the outcomes of real world
major league baseball games moderately well. I hope you find this new
version of SIMBASE both useful and fun.
II. RUNNING THE PROGRAM
Running the BASEBALL SIMULATION PROGRAM is easy. You just type its
name SIMBASE at the DOS prompt, wait while several supporting files are read
in from the disk and proceed by choosing options using the function keys.
The program is distributed on two diskettes. Diskette #1 contains
the program itself (SIMBASE.EXE), this documentation file (SIMBASE.DOC), a
batch file to copy the SIMBASE files to your hard disk if you have one
(SETUP.BAT), a quick start batch file (GO.BAT) and a small text file with
basic information about the program (README). Diskette #2 contains six
PREBAT and POSTBAT files (PREBAT.DAT, PREBAT1.DAT, PREBAT2.DAT, POSTBAT.DAT,
POSTBAT1.DAT and POSTBAT2.DAT) and twenty-six TEAM files (TEAM1, TEAM2, ...
TEAM26).
You can run the program by putting diskette #1 in your a: drive,
putting diskette #2 in your b: drive and issuing the command:
SIMBASE f=b:\
at the DOS prompt. (The "f=" option and other command line options are
explained in Section VI below). If you have a hard disk drive you can
create a sub-directory with a name such as SIMBASE, copy all of the files
from the two distribution diskettes onto the hard drive and then simply
issue the command:
SIMBASE
at the DOS prompt. In this case the "f=" option is not needed because the
default case assumes all files are in the same sub-directory on the same disk
drive. However, if you wanted to use the command line option anyway the
proper form would be:
SIMBASE f=c:\SIMBASE\
assuming that your hard drive is designated c: and the sub-directory where
the files are stored is named SIMBASE.
When the program is loading you will see a message displayed on the
bottom line of the screen that says: "Now reading the baseball statistics.
Please be patient." When the statistics have been completely read in and the
program is ready to respond to your commands a second message will be
displayed on this line that says: "The baseball statistics are now read in.
Press F1 for help." Alternatively, if the program is unable to find the
statistics it is looking for, an error alert will be displayed on the message
line. These examples illustrate an important feature of the program. The
bottom line of the screen always acts as a message line. It provides
important information about what the program is doing and what kind of inputs
it expects from you, if any. You should keep your eye on the message line.
After the program is loaded you can begin by pressing the F1 key.
This will give you a small "help screen" that summarizes the separate actions
performed by each of the function keys. The program uses thirteen such keys:
the ten function keys themselves (F1, F2, ... F10), the first two control-
shifted keys (<ctrl>-F1 and <ctrl>-F2) and the escape key (ESC). Section III
below provides a detailed account of what each function key is used for.
If you want to get a quick idea of what the program is all about,
you should begin with the F5 key. You will see the baseball diamond and the
scoreboard appear. A player will come up to bat. A single game simulation
will then begin, taking you step-by-step through the probabilistic outcomes
for each AT BAT phase, using default teams and lineups. You proceed from one
step to the next by pressing a key -- any key except the ESC key, which takes
you out of the single game simulation and back to the main menu.
Stepping through a single game in this manner is a useful way
to learn how the program works. But it is not what the program is really
intended for. To really use SIMBASE productively you must run a large number
of games and look at the averages. Random factors can easily dominate in any
one ballgame. The results of one or two games alone mean almost nothing. To
see what's really going on, you must run a large number of games and look at
the central tendency. This is true in real life baseball and it's also true
in this simulation program. The difference however is that in real life it
takes six months to complete a full season's worth of games. With this
program it takes only five minutes or so! (It could take more or less,
depending on what kind of computer you are using.)
To run a multi-game simulation, just press the F6 key. The more
games you run the better, from the point of view of getting a reliable
statistical sample. But there are tradeoffs here. You don't want to sit
waiting in front of your computer screen for too long. And anyway, there are
diminishing returns: after a point, adding more games doesn't make the
sample much more reliable. You'll have to experiment with the program to
arrive at your own judgement on how large your samples should be. If you've
got a very powerful machine, say an 80386 or an 80286 with a stepped up clock
speed, you can afford to use larger samples. Personally I like to run three
full seasons of 162 games -- 486 games in total -- for my sample. But this
takes fifteen minutes on my 10 MHz 80286 based machine. You may want to
settle for less. The maximum number of games you can do at one time is 650.
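The diminishing returns described above can be made concrete with a rough
calculation. The sketch below is in Python purely for illustration; it is
not part of SIMBASE. It assumes that games are independent and that a team's
true chance of winning each game is 0.5, and it uses the usual binomial
standard error, sqrt(p * (1 - p) / n). The function name is hypothetical.

```python
import math

def win_pct_standard_error(games, p=0.5):
    """Binomial standard error of a winning percentage over `games` games.

    Assumes independent games with a constant win probability p.
    This is an idealization, not a figure that SIMBASE reports.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    return math.sqrt(p * (1.0 - p) / games)

if __name__ == "__main__":
    # One season, three seasons, and the 650-game maximum.
    for n in (162, 486, 650):
        se = win_pct_standard_error(n)
        print(f"{n:4d} games: standard error about {se:.3f}")
```

Under these assumptions, tripling the sample from 162 to 486 games shrinks
the standard error by a factor of the square root of 3 (about 1.73), not by
a factor of 3, which is one way to see why extra games pay off more slowly.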
Once you are familiar with the program and are using it seriously,
you will want to amend the player lineups by shuffling players. Rearranging
the lineups between simulations is the name of the game here. It's by doing
this that you can learn the net advantage of using batter X in the 3rd
batting spot instead of in the 4th one, for example. You do this with the F3
key. You will also want to work with a variety of different teams. You
choose your teams with the F2 key. And you will want to see lots of detailed
output from the simulations, on your screen and possibly on your printer.
For this you use the F7, F8 and F10 keys. You may also want to examine the
input statistics (which come from the 1987 season) and the associated
probabilities and PREBAT/POSTBAT plays that drive the simulations. You do
this using the F4, F9, <ctrl>-F1 and <ctrl>-F2 keys.
Eventually you may want to set up your own database, perhaps using
statistics from your local amateur and semi-professional baseball teams.
This is easily done with any word processor that reads and writes pure ASCII
files. (PC-WRITE (C) and NEW YORK WORD (C) are two excellent shareware
options for this.) When you create new TEAM files or make changes to the
existing National and American League files be sure to stick exactly to the
current format. The program expects certain information to be in certain
columns, although it is tolerant about the order in which you put the lines
within any particular file.
III. THE FUNCTION KEYS
The program's operation is based on 13 major function keys: the
keys F1, F2, ... F10, the control-shifted function keys <ctrl>-F1 and
<ctrl>-F2 and the escape key ESC. If the program is expecting one of these
keys and you press some other key, you will hear a sharp beep. The following
is an account of how each of the 13 function keys is used.
Function key F1 is the help key. You can press this key whenever
the program expects a major function key and it will display a short
explanation of what each of the 13 function keys is used for.
Function key F2 is used to select the teams that will play. If you
do not press this key to select teams explicitly, then the two default teams
Toronto Blue Jays and New York Yankees will be used. (These two teams are
the defaults because they are named TEAM1 and TEAM2. If you want two other
teams to be the defaults, all you have to do is reorder the team names
accordingly.) When you press the F2 key, the 26 National and American League
team names will be displayed. You use the cursor control keys (the arrow
keys) to move to the first team you want, and then press the return key.
This will select the visiting team and display its name at the top of the
screen. Then you use the cursor control keys again to move to the second
team you want, and select it by pressing the return key a second time. You
can change your mind if you wish, by simply repeating the procedure. You can
even have the same team play against itself! When your selections are ready
you press the ESC key. The program proceeds to read the statistics for the
players on the two selected teams. It then returns to the main level. You
are ready to execute another function by pressing another function key.
Function key F3 is used to select the player lineups. Each team
has a default lineup and these defaults are displayed initially. You change
the lineups in the following manner. First move the cursor, using the arrow
keys, to select a player. Press the return key to select the player and it
will then be shown as "selected" at the bottom of the screen. Then move the
cursor to another player and press the return key again. The previously
selected player will be swapped for this second player. The players not
currently in the 10-player lineup are listed as hitters and pitchers on the
bench. These players are "stacked" on top of one another and only one of
them can be seen at a time. You must move the cursor to the first one and
then page through the list one by one. When you complete the list of hitters
on the bench the cursor will jump to the list of pitchers. The "select and
swap" procedure can be used either to switch the positions of two players
already in the lineup, or to move players to and from the bench.
The game can play with or without the designated hitter rule. The
lineups include nine places for batters and a separate, tenth place for the
pitcher. If you wish to play by National League rules, the pitcher must be
listed twice, once as a hitter and again as a pitcher. The database has all
pitchers' pitching records, but only a few pitchers' batting records are
included at this time. If you want to play without the designated hitter
rule and the batting record for the pitcher you have chosen is not in the
database, then I suggest you use the batting record for another pitcher.
Most pitchers are poor hitters, so this approximation may work adequately in
many cases. If you simply must have the proper statistics for your favorite
pitcher, then you can add them to the database yourself: the statistics are
stored in a set of ASCII files called TEAM1, TEAM2, ... TEAM26 that can be
easily modified by virtually any word processing program.
Function key F4 is used to display the player statistics for the
two teams that will play or have just played. These are the statistics that
drive the simulations. They are drawn from the full 1987 season and include
all major players on each team. The statistics displayed for hitters are the
number of at bats (AB), the number of singles (1B), doubles (2B), triples
(3B) and home runs (HR), the total number of hits (H), the number of bases on
balls (BB), the number of strikeouts (SO), the number of other outs (OO), the
batting average (BA) and the slugging average (SA). For pitchers the
statistics displayed are the number of wins (W) and losses (L), the number of
at bats by opponents (AB), the number of one-to-three-base hits by opponents
(1-3B), the number of home runs by opponents (HR), the total number of hits
by opponents (H), the number of bases on balls given up to opponents (BB),
the number of strikeouts of opponents (SO), the batting average of opponents
(BA), the on-base average of opponents (OBA) and each pitcher's earned run
average (ERA). There are four screens of data and you page from one to the
next by pressing any key.
Function key F5 is used to play a single game in step-by-step mode.
This function is included in the program to demonstrate how the simulations
are conducted. When you press F5 a baseball diamond is displayed with an
accompanying scoresheet. You cause the simulations to proceed by pressing
any key. Three keypresses are required for each batter, since there are
three phases to each at bat (more on this in Section IV below). A box on the
screen contains text describing what is happening. The F5 function has no
real use other than to demonstrate, in slow motion, how the simulations are
performed. It is not intended as a "game" function, as no provision has been
made to allow player substitutions or other decisions by the manager during
the course of the game. Such game features could easily be included,
however, and I may add some to the F5 function in future if users indicate
they would like to have them built in.
When you use the F5 function to demonstrate how a ballgame is
simulated you will see a brief textual description of what is happening in a
box in the lower left hand portion of the screen. You will also see a play
number and a probability, enclosed in square brackets, within this same box.
The play number identifies the outcome taken from the PREBAT.DAT and
POSTBAT.DAT files (and displayed with the <ctrl>-F1 and <ctrl>-F2 keys --
more on this below). When you use the "r=" command line option to select a
random number generator explicitly, the random numbers themselves and the
associated seed value will also be displayed.
Function key F6 is the most important function key in the program.
It causes a number of game simulations to be performed, using the teams and
lineups that have been previously selected. When you press F6 a small box
appears on the screen asking how many game simulations you wish to compute.
You must select a number greater than 0 and no more than 650. It takes about
1.75 seconds to compute each game simulation on my 80286 based PC. It could
take more or less time than this on your machine, depending on what kind you
have. The simulation time per game is also affected by the random number
generator used. When the simulations are complete a short summary of the
results will be displayed on the screen. This summary includes the total
runs, hits and errors accounted for by each team plus the averages, standard
deviations and variances for each of these variables. The number of wins and
the winning percentage for each team are also displayed. You can see more
detailed results by using the F7, F8 and F10 keys. When you use the F6
function to conduct multiple simulations, exactly the same process is being
followed as when you conduct a single simulation using the F5 function,
although the play-by-play results are no longer displayed on the screen.
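The summary figures described above can be sketched as follows. This is an
illustrative Python fragment, not SIMBASE code. The function names and the
sample run totals are made up, and the variance divides by the number of
games (the population form); whether the program divides by n or n - 1 is
not stated, so that choice is an assumption. The timing helper uses the 1.75
seconds per game quoted above for an 80286 machine.

```python
import math

def summarize(values):
    """Return (total, mean, variance, standard deviation) of per-game counts.

    Uses the population variance (divide by n). This is an assumption;
    the manual does not say which form SIMBASE uses.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one game")
    total = sum(values)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return total, mean, variance, math.sqrt(variance)

def winning_percentage(wins, games):
    """Fraction of games won."""
    return wins / games

def estimated_minutes(games, seconds_per_game=1.75):
    """Rough running time, using the per-game figure quoted for an 80286."""
    return games * seconds_per_game / 60.0

if __name__ == "__main__":
    runs = [3, 5, 2, 7, 4]  # made-up runs per game for one team
    total, mean, var, sd = summarize(runs)
    print(f"runs: total={total} mean={mean:.2f} var={var:.2f} sd={sd:.2f}")
    print(f"winning percentage: {winning_percentage(3, 5):.3f}")
    print(f"486 games: about {estimated_minutes(486):.1f} minutes")
```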
Function key F7 displays the player statistics for the simulated
games. The display is similar to that obtained with the F4 key, except that
the statistics are calculated from the simulations rather than from the 1987
season. Four screens are displayed, showing the hitters and pitcher from
each team. You display these screens sequentially by pressing a key. If the
number of games simulated is very large, the simulated player statistics
should be close to the actual statistics from 1987. For a small number of
games however, the statistics could diverge quite markedly from the full
season benchmark numbers.
Function key F8 displays the detailed runs, hits and errors table
for the previous set of simulated games. This display shows the number of
runs, hits and errors accounted for by each team for each game simulated. It
can go on for several screens if a large number of games have been simulated,
since only 15 games can be shown on a single screen.
Function key F9 displays the hitting probabilities that drive the
simulations. These probabilities are calculated from the player statistics
for the 1987 season. Three sets of probabilities are displayed. The first
is the set of probabilities based solely on the hitters' overall batting
records (that is, without any allowance for the particular pitcher they will
be facing). Each hitter has a separate set of probabilities in this display,
based on his own record. The second set of probabilities is based solely on
the opposing pitcher's statistical record. All hitters on a particular team
have the same set of probabilities in this case, as they all face the same
opposing pitcher. Finally, the third set of probabilities is a combination
of the first two. In fact, this third set of probabilities is the one
actually used in the simulations. In combining the hitters' probabilities
with the pitchers' it is possible to put relatively more or relatively less
weight on the pitcher's record. Baseball lore has it that "good pitching
beats good hitting", suggesting that the pitchers' probabilities should get
the greater weight. But you may not agree with this. The program
incorporates a parameter (the "pitching power") that will allow you to put
more or less weight on the pitchers' statistics relative to the hitters',
according to your preference (more on this below). The default value for
this parameter is one, which implies equal weight for pitchers and hitters.
Function key F10 generates simulation output for the printer. Two
kinds of output can be obtained: a short summary of the results or a longer,
more detailed summary. The short summary includes:
- the number of games played;
- the "pitching power" used;
- the type of random number generator used;
- the lineups for each team;
- the runs, hits and errors summary table
(same as displayed automatically on the
screen following a set of simulations); and
- the player performance statistics (same
as displayed on the screen with the F7
key).
The longer printout includes all of these results plus the detailed game-by-
game runs, hits and errors results (the same as displayed on the screen with
the F8 key).
The <ctrl>-F1 key displays the PREBAT plays that drive the
simulations. There are 137 of these plays. Their role in the simulations is
described in more detail in Section IV below. Some impossible (zero
probability) plays have been included in the list for completeness. You can
choose to see these too or to pass them by, as you prefer. You can also
choose the play number (between 1 and 137) that you wish to begin with.
The <ctrl>-F2 key is similar to the <ctrl>-F1 key and displays the
POSTBAT plays that drive the simulations. There are 607 of these plays.
Their role too is described in more detail in Section IV below. Again, some
impossible (zero probability) plays have been included in the list for
completeness. You can choose to see these too or to pass them by, as you
prefer. You can also choose the play number (between 1 and 607) that you
wish to begin with.
Finally, the ESC key is used to exit the program. Whenever you
press this key a warning will appear on the message line at the bottom of the
screen, asking if you REALLY want to quit the program. If you do, then press
the y key for yes. Pressing any other key will tell the program to continue
executing.
IV. HOW THE PROGRAM WORKS
The program generates probabilistic outcomes for a sequence of
AT BATs, based on individual batter and pitcher statistics from real world
games. Each AT BAT is broken down into three PHASEs, called the PREBAT
phase, the BATTING phase and the POSTBAT phase. The program generates
random numbers between zero and one for each PHASE of each AT BAT, and
then looks up the associated outcome in a probability table. Several
alternative random number generators are available within the program,
although the results are not greatly affected by the particular choice (see
Section VI on command line options for more information about alternative
random number generators). The program keeps generating AT BATs until three
outs are accumulated. Full nine-inning games are simulated. The possibility
of extra innings is allowed for. The program can simulate several games in
rapid succession and then generate a summary table displaying the results.
To print the results, use F10.
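The "random number, then table lookup" step described above can be sketched
in a few lines. This is an illustrative Python fragment, not the program's
own code: the table contents, the function name and the seed are all made
up, and a real run would use a different table for each PHASE. The lookup
walks the cumulative probabilities, so a zero-probability play can never be
selected.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def pick_outcome(table, u):
    """Map a random number u in [0, 1) to an outcome.

    `table` is a list of (outcome, probability) pairs whose
    probabilities are assumed to sum to 1.
    """
    outcomes = [name for name, _ in table]
    cumulative = list(accumulate(prob for _, prob in table))
    index = bisect_right(cumulative, u)
    # Guard against floating-point round-off in the last cumulative value.
    return outcomes[min(index, len(outcomes) - 1)]

# Made-up probabilities, for illustration only.
EXAMPLE_TABLE = [
    ("single", 0.17),
    ("walk", 0.06),
    ("impossible play", 0.00),
    ("out", 0.77),
]

if __name__ == "__main__":
    rng = random.Random(1988)  # fixed seed so the run is repeatable
    for phase in ("PREBAT", "BATTING", "POSTBAT"):
        u = rng.random()
        # The same table is reused here only to keep the sketch short.
        print(phase, round(u, 3), pick_outcome(EXAMPLE_TABLE, u))
```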
The three PHASEs of each AT BAT are as follows:
(1) The PREBAT phase.
This is the period while the batter is at the
plate. During this period runners may attempt
to steal or may be picked off, balks may occur,
wild pitches may be thrown, etcetera. In other
words, any runners who happen to be on base may
advance or be thrown out in various ways.
(2) The BATTING phase.
This is the PHASE where the direct outcome of
the AT BAT itself is determined. It can be one
of eight possibilities:
(i) a hit that is normally a single;
(ii) a hit that is normally a double;
(iii) a hit that is normally a triple;
(iv) a home run;
(v) a walk;
(vi) a strikeout;
(vii) a ground ball that is normally an out;
(viii) a fly ball that is normally an out.
(3) The POSTBAT phase.
This is the period when runners, possibly
including the hitter, are running the bases
after the BATTING outcome. It is a time when
fielding, throwing and running errors, double
plays, etcetera can occur.
The net outcome from the AT BAT is the combined result of the interdependent
outcomes from these three PHASEs.
The simulation is organized around the eight possible states of the
bases that may occur:
(1) All bases clear.
(2) One player on first.
(3) One player on second.
(4) One player on third.
(5) Players on first and second.
(6) Players on second and third.
(7) Players on first and third.
(8) Bases loaded.
All innings start with state (1) and proceed through a series of
transformations to other states as the inning progresses. The probabilities
of different outcomes under each of the three sequential PHASEs of an AT BAT
depend upon the initial state. For example, the probability of a ground ball
double play at first and second in the POSTBAT phase is zero if the initial
state is (1) or (3) or (4) or (6), but it will be a positive number with any
other initial state.
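One way to picture the eight base states and the double play example is the
small Python sketch below. It is an illustration only: the data layout and
function names are hypothetical and say nothing about how SIMBASE stores its
states internally. The rule encoded here is just the one from the example
above, namely that the double play needs a runner on first.

```python
# The eight base states, numbered as in the list above.
# Each tuple is (runner on first, runner on second, runner on third).
BASE_STATES = {
    1: (False, False, False),
    2: (True, False, False),
    3: (False, True, False),
    4: (False, False, True),
    5: (True, True, False),
    6: (False, True, True),
    7: (True, False, True),
    8: (True, True, True),
}

def runner_on_first(state):
    """True if the numbered state has a runner on first base."""
    return BASE_STATES[state][0]

def double_play_possible(state):
    """A ground ball double play at first and second needs a runner on first."""
    return runner_on_first(state)

if __name__ == "__main__":
    impossible = [s for s in sorted(BASE_STATES) if not double_play_possible(s)]
    print("zero-probability states:", impossible)
```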
V. SETTING THE PROBABILITIES THAT DRIVE THE SIMULATION
In the present version of the program, the probabilities for the
outcomes under phases (1) and (3) are determined independently of the player
statistics. These probabilities can be examined by using the <ctrl>-F1 and
<ctrl>-F2 function keys. I have been intending to revise the PREBAT and
POSTBAT plays, deriving better estimates of the probabilities based on the
results of Project Scoresheet (see below), but this still lies in the future.
I also intend to improve the program so that these probabilities get adjusted
on the basis of the individual player statistics, rather than being
predetermined. The probability of a successful steal, for example, should
depend crucially on who is on base and who are pitching and catching.
By the way, for those not familiar with Project Scoresheet, this is
an informal non-profit organization of baseball fans scattered across North
America that keeps detailed computerized records on all baseball games played
in the National and American leagues. The organization is based in Chicago.
It has accumulated extremely detailed records on every major league baseball
game played since (and including) 1984. The database is available to the
public both in printed form and on MS-DOS formatted floppies. For more
information see the publication "The Great American Baseball Stat Book,"
published by Ballantine Books in New York in 1987. This book presents a
large volume of statistics derived from the Project Scoresheet database.
Alternatively, you can write to Project Scoresheet, P.O. Box 46074, Chicago,
Illinois, 60646.
Developments during PHASE (2) of the simulation really constitute
the heart of the game. It is here that batter meets pitcher and the real
action takes place. In this PHASE of the game the individual player batting
and pitching statistics are used to determine the probabilities of the
different outcomes. The probabilities are recomputed separately for each
batter-pitcher combination.
The computed BATTING probabilities are based on performance
statistics from previous baseball games -- the 1987 season in the case of the
distribution diskettes. The player statistics themselves are recorded in the
ASCII files TEAM1, TEAM2, ... TEAM26. You may want to update the statistics
with numbers for the current season when they are available. Or you may want
to add data for your own local amateur or semi-professional teams. There are
many sources of data you can draw on to update or augment the database,
including your daily newspaper. I used "The Great American Baseball Stat
Book," referred to above, "The 1988 Elias Baseball Analyst" (Macmillan
Publishing Company, New York, 1988), "Gravengood's Baseball Today" (PPC
Publications, 1988) and "The Sporting News Official Baseball Guide" (The
Sporting News Publishing Company, 1988) as my primary sources.
How are the BATTING probabilities for PHASE (2) calculated? Let's
suppose Tim Wallach comes up to bat and Dwight Gooden is pitching for the
Mets. The question then is, which of the eight possible outcomes listed
under the BATTING phase above will be the outcome? Let's look first at
Wallach's stats. He had 593 official AT BATS last year plus 37 bases on
balls, for a total of 630 plate appearances. (Free trips to first base are
not counted in official "at bats" but they are counted in "plate
appearances".) That means his total ex post probability of walking in any
given plate appearance was 5.9 per cent (100*37/630). Wallach got 177 hits
in the 1987 season, of which 105 were singles, 42 were doubles, 4 were
triples and 26 were home runs. Consequently the probabilities of his getting
a single, double, triple or home run can be calculated as 16.7 per cent, 6.7
per cent, 0.6 per cent and 4.1 per cent respectively. He struck out 98 times
and hit ground ball or fly ball outs 318 times. So his overall ex post
probability vector, based on his 630 plate appearances against a wide range
of pitchers in 1987, was quite impressive and looked this way:
     Single                         16.7%
     Double                          6.7%
     Triple                          0.6%
     Homer                           4.1%
        Hit, total       ----->            28.1%
     Base on balls                   5.9%
        On base, total   ----->            34.0%
     Strike out                     15.6%
     Ground or fly ball out         50.5%
        Out, total       ----->            66.0%
        Total            ----->           100.0%
(Figures are rounded to one decimal place.)
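The arithmetic behind a probability vector of this kind is simply each count
divided by total plate appearances. The Python sketch below is an
illustration, not SIMBASE code, and its function name is hypothetical. It
recomputes the shares from the raw counts quoted above for Wallach rather
than copying them from the table.

```python
def batting_probabilities(singles, doubles, triples, homers,
                          walks, strikeouts, other_outs):
    """Return each outcome's share of total plate appearances.

    Plate appearances are taken to be the sum of the seven counts,
    as in the worked example in the text.
    """
    counts = {
        "single": singles,
        "double": doubles,
        "triple": triples,
        "homer": homers,
        "walk": walks,
        "strikeout": strikeouts,
        "other out": other_outs,
    }
    plate_appearances = sum(counts.values())
    if plate_appearances == 0:
        raise ValueError("no plate appearances")
    return {name: n / plate_appearances for name, n in counts.items()}

if __name__ == "__main__":
    # Counts quoted in the text for Tim Wallach's 1987 season.
    wallach = batting_probabilities(105, 42, 4, 26, 37, 98, 318)
    for name, p in wallach.items():
        print(f"{name:10s} {100 * p:5.1f}%")
```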
But what about the influence of Met ace Dwight Gooden? These
averages for Wallach are based on the wide variety of pitchers he faced
through the course of the 1987 season. Gooden is substantially above average
in pitching ability, a real "good-un" as Duke Snider likes to say. How do we
take this into account? One possibility would be to recompute Wallach's
averages using only his plate appearances against Gooden. But this would
give us a very small sample and would be statistically unreliable. It would
gauge neither Wallach's hitting ability nor Gooden's pitching ability
adequately.
As an alternative, I thought I might use a weighted average of the
full season probabilities for the hitter and the pitcher. For the hitter, I
calculated probabilities as described for Wallach's case in the table above.
For the pitcher, I did much the same thing for hits and walks allowed. For
example, Dwight Gooden faced 756 batters last year, walking 53, allowing 162
hits and getting strikeouts, groundouts or flyouts on the other 541 hitters.
Looking only at Gooden's stats, the estimated probabilities are 21.4 per cent
for a hit, 7.0 per cent for a walk, 28.4 per cent for getting on base and
71.6 per cent for an out. To compute the final probabilities when Wallach
faces Gooden, I thought I might use weighted averages of their respective
probabilities in each category.
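
As a rough illustration of the idea, the sketch below derives Gooden's
probabilities from the totals quoted above and blends one category with
Wallach's. The function name "blend" and the 0.5 weight are placeholders
chosen for the example only; the text has not yet settled on any weights.

```python
# Dwight Gooden's 1987 totals, as quoted in the text.
batters_faced = 756
walks, hits, outs = 53, 162, 541

gooden = {
    "hit": hits / batters_faced,
    "walk": walks / batters_faced,
    "out": outs / batters_faced,
}


def blend(hitter_p, pitcher_p, hitter_weight):
    """Weighted average of the hitter's and pitcher's probability."""
    return hitter_weight * hitter_p + (1.0 - hitter_weight) * pitcher_p


# Wallach's hit probability: 177 hits in 630 plate appearances.
wallach_hit = 177 / 630

# The 0.5 weight is an arbitrary placeholder, not a value from the text.
blended_hit = blend(wallach_hit, gooden["hit"], 0.5)
print(f"Gooden hit probability:  {100 * gooden['hit']:.1f}%")
print(f"Blended hit probability: {100 * blended_hit:.1f}%")
```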
What weights to use? To examine this question I decided to try some
simple regression analysis. I assembled a sample of hitter-pitcher matchups
and tried to estimate the ex post batting average in each matchup (the
dependent variable) as a function of the hitter's overall batting average and
the pitcher's overall batting average allowed. I used career batting
statistics for several players and only considered cases where 40 or more
matchups between the particular hitter and pitcher were available. One more
or less plausible result was the following:

        Overall BA   =   0.43 * Hitter's BA   +   0.57 * Pitcher's BA allowed
        in matchup      (0.16)  for career       (0.16)  for career

        Standard error of estimate = 6.6%          Observations = 68

More work is needed on this aspect of the simulation. A larger sample must
be assembled and more sophisticated models should be considered.
I decided to use a simpler approach than this in the program itself,
for a couple of reasons. First, the regression results were not robust
enough to warrant much confidence. Second, a simple weighted average
approach as just described has a rather bizarre implication: it pulls up the
hitting probabilities for abysmally bad hitters, even when they face good
quality pitchers. For the program I decided to use a more ad hoc formula:
the final BATTING probabilities are computed using the hitter's overall
averages, multiplied by the ratios of the pitcher's probabilities to the
overall league average probabilities, with the latter ratios being raised to
an arbitrary "pitcher power". The default value for this pitcher power
exponent is one, but you can experiment with other values by using the "s="
command line option (see Section VI below). If the pitcher is better than
average the hitter's probabilities deteriorate as they should, regardless of
the pitcher power value. A pitcher power of one implies equal weighting for
the pitcher's and hitter's statistics. A value slightly greater than one
puts greater weight on the pitcher's statistics while a value slightly less
than one puts greater weight on the hitter's statistics. This multiplicative
formula, while very simple, appears to be quite effective.
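
The contrast between the two approaches can be sketched for a single outcome
category. This is an illustration under stated assumptions, not the
program's actual code: the function name "adjust" and all three input
probabilities are hypothetical, and the sketch leaves aside how the
remaining categories are rescaled so that the vector still sums to one.

```python
def adjust(hitter_p, pitcher_p, league_p, pitcher_power=1.0):
    """Scale the hitter's probability by (pitcher / league) ** pitcher_power."""
    return hitter_p * (pitcher_p / league_p) ** pitcher_power


# Hypothetical inputs, for illustration only.
weak_hitter_hit = 0.150    # a very poor hitter's chance of a hit
good_pitcher_hit = 0.214   # chance of a hit against a strong pitcher
league_hit = 0.230         # assumed league-average chance of a hit

# Weighted average using the regression weights quoted earlier.
weighted = 0.43 * weak_hitter_hit + 0.57 * good_pitcher_hit

# Multiplicative formula with the default pitcher power of one.
multiplicative = adjust(weak_hitter_hit, good_pitcher_hit, league_hit)

print(f"hitter alone:     {weak_hitter_hit:.3f}")
print(f"weighted average: {weighted:.3f}")
print(f"multiplicative:   {multiplicative:.3f}")
```

With these inputs the weighted average lifts the weak hitter above his own
season figure even though he faces a strong pitcher, while the
multiplicative formula lowers it.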
All the raw statistics as well as the hitters' unweighted and
pitching-weighted BATTING phase probabilities are displayed by the program if
you ask for them via the F4 and F9 functions.
VI. COMMAND LINE OPTIONS
The program contains a few parameters which, while they have
default values, can also be specified via command line options. You do not
need to make use of these command line options if you are satisfied with the
default values. But the command line options are available to experiment
with if you need them. To use a command line option you simply add one or
more arguments to the DOS command line when you start up SIMBASE. There are
four such options available. You can specify none, one, two, three or all
four of the options on the command line after the program name SIMBASE. No
spaces or quotes are allowed WITHIN any of these arguments, although the
arguments themselves are separated from one another by spaces.
VI.1 RANDOM NUMBER GENERATORS
The first command line option, "r=", allows you to specify which
random number generator function you wish to use. The default generator,
generator number 1, is the RAND function that comes with the Microsoft C
compiler, Version 5.0. Generator number 2 uses the linear congruential
method. Random numbers R are generated using the formula:
R = {(double)[(R*125)%2796203]}/2796203.000000
where % is the modulus operator. Numbers are calculated sequentially, after
the process is started up with a seed value. Generator number 3 uses the
linear congruential method with an added intercept. Random numbers R are
generated using the formula:
R = {(double)[(R*32719+3)%32749]}/32749.000000
Again, numbers are calculated sequentially after the process is started up
with a seed value. Finally, generator number 4 is a combination of the other
three generators. With this generator a random number is first generated
using generator number 1. If the resulting number is greater than 0.500000
then the final random number is calculated with generator number 2.
Otherwise it is calculated with generator number 3.
Each of these generators requires a seed value. The program gets
one automatically by reading the system clock.
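
The generators described above can be sketched as follows. This is one
reading of the formulas, written in Python rather than C, and it rests on
several assumptions: the R inside the brackets is taken to be the integer
state carried from one step to the next, Python's random.random() stands in
for the Microsoft C RAND function, generator 4 is given separate states for
generators 2 and 3, and the clock-based seed is reduced to a non-zero value.
All function names are illustrative.

```python
import random
import time

M2, A2 = 2796203, 125           # generator 2: modulus and multiplier
M3, A3, C3 = 32749, 32719, 3    # generator 3: modulus, multiplier, intercept


def gen2(state):
    """One step of generator 2; returns (new integer state, value in [0, 1))."""
    state = (state * A2) % M2
    return state, state / M2


def gen3(state):
    """One step of generator 3; returns (new integer state, value in [0, 1))."""
    state = (state * A3 + C3) % M3
    return state, state / M3


def gen4(state2, state3, rng=random.random):
    """Generator 4: a draw from generator 1 selects generator 2 or 3."""
    if rng() > 0.5:
        state2, value = gen2(state2)
    else:
        state3, value = gen3(state3)
    return state2, state3, value


# Clock-based seed (an assumption); zero is avoided because generator 2
# would then produce nothing but zeros.
seed = int(time.time()) % M3 or 1
s2 = s3 = seed
for _ in range(3):
    s2, s3, r = gen4(s2, s3)
    print(f"{r:.6f}")
```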
You select one of these four random number generators by including
an "r=" argument on the command line. For example, if you want to use
generator number 3 you would start up SIMBASE by issuing the following
command at the DOS prompt:
SIMBASE r=3
When you choose a random number generator explicitly using the "r="
command line option, the step-by-step ballgame display requested with the F5
function key will contain two additional pieces of information: each random
number generated will be displayed along with the associated seed value, in
the description box alongside the play number and probability.
VI.2 DISPLAY MONITORS AND VIDEO MODES
The program has been designed to automatically sense the video mode
and act accordingly. If you have a colour monitor and are using video mode
3, you should get colour results. If you have a black and white monitor the
results should be clearly visible in black and white. However in some cases
you may wish to tell the program explicitly what kind of monitor to assume.
For example, you may prefer to use the program in black and white mode even
though you have a colour screen. The command line option "d=" is designed
for this purpose.
There are three sub-options here. You can set the video mode to 2
if you have a black and white graphics or a colour graphics monitor. This
will give you a black and white video display. You can set the video mode to
3 if you have a colour graphics monitor. This will give you a multi-colour
display. Or you can set it to 7 if you have a standard monochrome monitor.
This will give you a black and white display. But be careful. If you tell
the program to use a video mode that is not compatible with your equipment
you may see nothing at all.
By way of example, if you want to tell the program to use a colour
display (video mode 3) you would start up SIMBASE by issuing the following
command at the DOS prompt:
SIMBASE d=3
VI.3 THE DATA SUB-DIRECTORY
While the SIMBASE program itself is contained in a single file
called SIMBASE.EXE, the program requires several supporting files if it is to
work properly. These supporting files include three files with PREBAT play
information (PREBAT.DAT, PREBAT1.DAT, PREBAT2.DAT), three files with POSTBAT
play information (POSTBAT.DAT, POSTBAT1.DAT, POSTBAT2.DAT) and twenty-six
files with TEAM statistics (TEAM1, TEAM2, ... TEAM26). The program must know
how to find these files. By default, it assumes that the files are located
in the same directory that the SIMBASE.EXE program is being executed from.
Thus, if you have a hard disk on your system you will probably find it most
convenient to copy all of the files into a single sub-directory and work
directly out of that sub-directory. However, for one reason or another you
may want to keep the supporting data files in a sub-directory that is
different from the one containing SIMBASE.EXE. If you do not have a hard
disk on your computer system and work with floppy disks only, for example,
you will fall into this category since it is not possible to fit all the
required files onto a single 360K double density floppy diskette. In cases
like this a way is needed to inform the program about where to find the
supporting data files.
The way you tell the program about the data sub-directory is via
the "f=" command line option. For example, suppose you are working with a
two floppy diskette system, you are using the a: drive to hold the diskette
with the SIMBASE.EXE program on it (distribution disk #1) and you are using
the b: drive to hold the diskette with the supporting data files on it
(distribution disk #2). Then you would start up the program by issuing the
following command at the DOS prompt:
SIMBASE f=b:\
You would of course issue this command from within the a:\ directory where
the SIMBASE.EXE program itself would, by assumption, be located. As another
example, suppose you have a hard disk and you have copied the SIMBASE.EXE
program into a directory called c:\SIMBASE\ and you have copied the
supporting data files into a sub-directory of this directory called
c:\SIMBASE\DATA\. Then you would start up SIMBASE by issuing the following
command at the DOS prompt:
SIMBASE f=c:\SIMBASE\DATA\
You would of course issue this command from within the c:\SIMBASE\ directory
where the SIMBASE.EXE program itself would be located.
VI.4 THE PITCHER POWER
In Section V above the role of the "pitcher power" in the
simulations was explained. The default value of this parameter is 1.000000
but you can experiment with different values slightly greater than or less
than one. You do this with the "s=" option. Values greater than one will
put more weight on the pitchers' statistics and less on the batters' while
values less than one will have the opposite effect. For example, if you
wanted the pitcher power to be 1.010000 you would start up SIMBASE by issuing
the following command at the DOS prompt:
SIMBASE s=1.010000
It is recommended that six decimals of accuracy be included when specifying a
value for the pitcher power.
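
To summarize the four options, the sketch below shows one way arguments of
this form could be read and checked. It is written in Python for
illustration and is not the program's own parsing code; the function name
"parse_options" and the use of None for "not specified" are assumptions.

```python
def parse_options(args):
    """Parse SIMBASE-style "key=value" arguments into a settings dict."""
    # r=1 and s=1.0 are the documented defaults; None stands in here for
    # "sense the video mode" and "look in the program's own directory".
    options = {"r": 1, "d": None, "f": None, "s": 1.0}
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep != "=" or key not in options:
            raise ValueError(f"unrecognised option: {arg!r}")
        if key == "r":
            value = int(value)
            if value not in (1, 2, 3, 4):
                raise ValueError("r= must be 1, 2, 3 or 4")
        elif key == "d":
            value = int(value)
            if value not in (2, 3, 7):
                raise ValueError("d= must be 2, 3 or 7")
        elif key == "s":
            value = float(value)
        options[key] = value
    return options


# Equivalent to typing: SIMBASE r=3 d=3 f=b:\ s=1.010000
print(parse_options(["r=3", "d=3", "f=b:\\", "s=1.010000"]))
```

Splitting on the first "=" only means a path given with "f=" is passed
through unchanged.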
VII. ILLUSTRATIVE PRACTICAL RESULTS FROM THE PROGRAM
It takes about 1.75 seconds on my AT-compatible (scoring 7.7 on the
Norton SysInfo scale) to simulate one ballgame and it takes somewhat less
than 5 minutes to simulate a full 162 game season. I also tried the program
on an older but souped-up IBM-PC (1.8 SysInfo) and it took 2.65 seconds per
game on that machine. Version 1 of the SIMBASE program was much slower,
requiring about 10 seconds per game on the AT-compatible and about 30 seconds
per game on the enhanced IBM-PC. The improved speed of operation was
achieved by reprogramming in the C language. These time estimates were
derived using the default random number generator and different times would
result for the other random number generators.
As I've been emphasizing, the program is really intended to run many
games at one time. As in real world baseball, the outcome of a single game
does not tell you very much. Even the World Series Champion Minnesota Twins
of 1987 lost 48 per cent of their games in that season. To really learn
something about a team you must look at its track record over a large number
of games. This is true in the real world and it's equally true in computer
simulation.
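
A rough way to see why many games are needed is the standard error of a
winning percentage. The sketch below, in Python, assumes that games are
independent and that the chance of winning is the same in every game; this
is a textbook simplification added for illustration, not something the
program computes, and the function name is hypothetical.

```python
import math


def win_pct_standard_error(p, games):
    """Standard error of an observed winning fraction over independent games."""
    return math.sqrt(p * (1.0 - p) / games)


# An evenly matched pair of teams (p = 0.5) over series of various lengths.
for games in (1, 162, 486, 1620):
    se = win_pct_standard_error(0.5, games)
    print(f"{games:5d} games: +/- {100 * se:.1f} percentage points")
```

The uncertainty shrinks only with the square root of the number of games, so
tripling the length of a series does not cut it by a factor of three.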
To test the normal properties of the simulation as it is currently
calibrated I ran two 162-game series and one 486-game series between the Mets
and the Expos. I used the same lineups for every game, with Gooden and
Martinez pitching. The results were as follows:

                                   Mets                Expos
                              R     H     E        R     H     E
     First 162 games:
        Average              3.2   7.8   1.1      2.8   7.8   1.4
        Standard deviation   2.3   3.0   1.1      2.3   2.9   1.2
     Second 162 games:
        Average              3.6   8.2   1.3      3.5   8.0   1.4
        Standard deviation   2.4   3.0   1.3      2.6   3.1   1.2
     Third 486 games:
        Average              3.4   8.0   1.2      2.9   7.6   1.3
        Standard deviation   2.6   3.3   1.2      2.2   2.9   1.1

For purposes of comparison, the corresponding National League actual average
figures for all league teams during the 1987 season were as follows:

                                Nat. League
                              R     H     E
     For 971 games, 12 teams:
        Average              4.5   8.9   0.8

Taking account of the fact that Gooden and Martinez were two of the better
pitchers in the league last year, it is entirely appropriate that the average
number of hits and runs scored during the Mets-Expos simulations should be
somewhat less than the league averages. A concern that existed in version
1.0 of the program and continues in version 2.0 is that the average number
of errors in the simulations exceeds the league average by 0.4-0.6 errors per
game, suggesting the need for further refinement of the PREBAT.DAT and
POSTBAT.DAT plays and probabilities.
The 1987 Expos did not do as well as the Mets in these simulations.
The Expo record was 70 wins and 92 losses (a 43.2% winning percentage) in the
first set of simulations, 72 and 90 (a 44.4% winning percentage) in the
second set and 231 and 255 (a 47.5% winning percentage) in the third set.
The Expos' actual season record against all National League teams was 91 and
71 while the Mets' overall season record was 92 and 70. Against the Mets the
Expos were 7 and 9 during the season.
These results do not really tell us much about what the overall
performance of the two teams should have been during the season. This is
because the identical nine-player rosters were used for each game. Of
course, Gooden and Martinez could not really pitch all 162 games during the
regular season, so these simulations are not realistic as structured.
Moreover, the hitting lineups also will change during the course of the
season due to injuries, player fatigue, managerial decisions, etcetera. If
you seriously wished to simulate the entire season outcome, you would need to
play different games with different mixes of players.
To illustrate the sensitivity of the simulations to the pitcher, a
further 486 games were run with Bob Sebra pitching for the Expos instead of
Dennis Martinez. Sebra was one of the Expos' poorer pitchers in 1987. All
other players in the two teams' lineups were kept the same in the simulations.
The results of this simulation were quite different. The Mets got 9.2 hits
per game and 4.4 runs, much better than with Martinez pitching. With Gooden
still pitching for the Mets and the Expos maintaining the same batting lineup,
the Expo hits and runs performance was comparable to the previous simulations
(7.7 hits per game and 3.1 runs). The Mets won 303 of the 486 games and the
Expos won just 183 (37.7%).
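
The Expo winning percentages quoted for the four sets of simulations follow
directly from the win and loss totals. A short Python sketch of the
arithmetic is given here; the labels and the function name are illustrative
only.

```python
# Expo wins and losses in each set of simulations, as quoted in the text.
records = {
    "First 162 games (Martinez)": (70, 92),
    "Second 162 games (Martinez)": (72, 90),
    "Third 486 games (Martinez)": (231, 255),
    "Further 486 games (Sebra)": (183, 303),
}


def winning_pct(wins, losses):
    """Wins as a percentage of games played."""
    return 100.0 * wins / (wins + losses)


for label, (wins, losses) in records.items():
    print(f"{label:28s} {wins:3d}-{losses:3d}  {winning_pct(wins, losses):.1f}%")
```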
These two exercises demonstrate a key point about this program: it
is most useful as a tool for estimating the differential impact of one player
when he is substituted for another, either by shuffling the batting lineup or
by bringing in a new player from the bench. The program as it presently
stands is more limited when used as a tool for forecasting full seasonal
outcomes when opposition teams vary, rosters evolve and pitchers rotate,
although these sorts of simulations could also be done if desired.
VIII. FUTURE IMPROVEMENTS
There is no perfect way to simulate a baseball game. This program
provides a workable approach, but there are many improvements that can be
made to it. I have referred to some of these above. The PREBAT and POSTBAT
play lists can be improved and their probabilities made dependent, within the
program, on the players' stats. The determination of the BATTING outcome
probabilities, as a function of the hitters' and pitchers' stats, can be
researched more thoroughly. The players' defensive stats can be taken into
account as well as their offensive stats. The F5 function can be enhanced
to make it more "game oriented", allowing player substitutions for pinch
hitting and running and pitching changes as the game proceeds, for example.
These and other enhancements are projects on my future agenda. I hope to get
further ideas from you, the user.
Overview of
The Baseball Simulation Program
(C) Copyright 1987, 1988 by Philip M. Smith
- All Rights Reserved -
The program (SIMBASE.EXE) and its documentation (SIMBASE.DOC) together make up
the Baseball Simulation Program, which is user-supported software. The
following is a brief summary of the program.
The Baseball Simulation Program is an MS-DOS program providing an environment
for simulating or forecasting the outcomes of major or minor league baseball.
Using SIMBASE you can do things such as:
- study the impact of rearranging the batting lineup for your
favorite ball team;
- investigate the costs and benefits of a player trade in the
major leagues;
- forecast the likely outcome of a baseball season;
- analyze who the most valuable player on a team really is; and
- proceed through a simulated baseball game step by step, watching
the performance of the individual players on each team.
The program uses actual player hitting and pitching performance statistics,
tabulated in an ASCII-format database, to calculate batting outcome
probabilities. The database includes performance statistics for the players
on all 26 National and American League teams during the 1987 season. You can
rapidly run a full season's worth of games and tabulate the results, or you
can slowly step your way through a single game. You could even enter the
statistics for the teams of your own local amateur baseball league and use
them as the basis for the analysis.
The Baseball Simulation Program runs on the IBM PC, XT, AT and compatibles
under MS-DOS versions 3.0 and higher. It works with either a monochrome or a
color monitor and requires no special peripheral equipment. The complete
program and associated documentation comes on two MS-DOS formatted 360K floppy
diskettes. Copies of the diskettes are available from the author for $20 at
the address below, or from several Public Domain and Shareware software
libraries and bulletin boards. Full registration is encouraged for those who
find the program useful, at a cost of $50. Registration entitles the user to
the latest version of the program, a copy of the source code, notice of future
program and database updates and telephone support. Contact Philip Smith at
109 Tripp Crescent, Nepean, Ontario, CANADA, K2J 1E2.
Thank you for your interest in the Baseball Simulation Program.
Sincerely,
Philip M. Smith
Conditions on the Use of
The Baseball Simulation Program
(C) Copyright 1987, 1988 by Philip M. Smith
- All Rights Reserved -
SIMBASE is Copyright (C) 1987, 1988 by Philip M. Smith.
WARNING: This program and documentation could include technical inaccuracies
or typographical errors. They are provided without warranty of any kind.
I will in no event be liable for any damages arising out of the use of this
program.
Individuals are granted permission to freely copy the SIMBASE diskettes for
their own use or for others to use, so long as no price is charged and no
changes are made to the program, documentation and other files on the
diskettes. Computer and other clubs are encouraged to copy the diskettes and
share them with their members, under the same conditions. A distribution
fee may be charged to cover the cost of the diskettes, shipping and handling,
up to $10 per diskette.
If you use the program, please help underwrite the cost of its development by
sending a $50 contribution. A contribution entitles you to the C language
source code, telephone support, plus notices of future program and database
updates. Your support now will help make future releases possible.
REGISTRATION FORM
Please register me as a user/supporter of the Baseball Simulation Program. I
understand I will receive the latest version of the program and its complete
C source code. In addition, I will receive notice of future new versions of
the program and of database updates when they become available.
NAME_________________________________________________DATE_____________________
STREET________________________________________________________________________
CITY_________________________________STATE/PROVINCE___________________________
COUNTRY______________________________ZIP/POSTAL CODE__________________________
TELEPHONE_____________________________________________________________________

                          Cost       Number      Total
     Use License         $50.00      ______      ________.____

Please make payment by bank or postal money order, or by personal cheque,
payable to Philip M. Smith.
MAIL TO: Philip M. Smith
109 Tripp Crescent
Nepean, Ontario
CANADA, K2J 1E2
What features would you like to see added to the Baseball Simulation Program?
______________________________________________________________________________
______________________________________________________________________________
What features need most to be improved in the current version (2.0)?
______________________________________________________________________________
______________________________________________________________________________
Where did you hear about the Baseball Simulation Program?
______________________________________________________________________________
Volume in drive A has no label
Directory of A:\
FILE1186 TXT 1416 10-26-88 9:06a
GO BAT 558 10-10-88 1:00p
READ ME 1893 10-10-88 1:00p
SETUP BAT 1103 10-10-88 1:00p
SIMBASE DOC 57993 10-10-88 1:00p
SIMBASE EXE 212690 10-10-88 1:00p
6 file(s) 275653 bytes
44032 bytes free