Commit 7fcfd6de authored by O'Reilly Media, Inc.

Initial commit

9780596009427
## Example files for the title:
# Baseball Hacks, by Joseph Adler
[![Baseball Hacks, by Joseph Adler](http://akamaicovers.oreilly.com/images/9780596009427/cat.gif)](https://www.safaribooksonline.com/library/view/title/0596009429//)
The following applies to example files from material published by O’Reilly Media, Inc. Content from other publishers may include different rules of usage. Please refer to any additional usage rights explained in the actual example files or refer to the publisher’s website.
O'Reilly books are here to help you get your job done. In general, you may use the code in O'Reilly books in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from our books does not require permission. Answering a question by citing our books and quoting example code does not require permission. On the other hand, selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Incorporating a significant amount of example code from our books into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN.
If you think your use of code examples falls outside fair use or the permission given here, feel free to contact us at <permissions@oreilly.com>.
Please note that the examples are not production code and have not been carefully tested. They are provided "as-is" and come with no warranty of any kind.
To help you try out the code in this book, we are providing a copy of the
Baseball Databank database. This file is based on the 12-30-2005 release
of the database, and includes data through the end of the 2005 season.
You can get more information about this database from
http://www.baseball-databank.org
For reference, I have also posted the script that I used to create
this file. You do not need to run this script, unless you want to get
a copy of the database directly from the Baseball Databank web site.
Here are some simple instructions on how to load this file into your computer:
1. Install MySQL
2. Download the file bbdb.sql.gz
3. Decompress the file.
a. On Microsoft Windows, I recommend using WinZip to unpack this file.
You can get a copy of this program from http://www.winzip.com
b. On Linux, MacOS, and other Unix-like platforms, use gunzip to
unpack this file. You can use a command like this to unpack the
file:
> gunzip bbdb.sql.gz
4. Create a new MySQL database. You can do this with a set of commands
like this (note that I called the database "example"):
$ mysql -u root
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 190 to server version: 5.0.15-nt
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> grant ALL on example.* to 'you'@'localhost';
Query OK, 0 rows affected (0.03 sec)
mysql> create database example;
Query OK, 1 row affected (0.00 sec)
mysql> quit;
5. Load the file into the database. You can do this with a command like
this:
$ mysql -s example < bbdb.sql
6. Check that everything is there:
mysql> show tables;
+-------------------+
| Tables_in_example |
+-------------------+
| allstar |
| awardsvotes |
| awardswinners |
| batting |
| fielding |
| fieldingof |
| managers |
| managershalf |
| master |
| pitching |
| salaries |
| teams |
| teamsfranchises |
| teamshalf |
| transactions |
+-------------------+
15 rows in set (0.00 sec)
mysql> select count(*) from allstar;
+----------+
| count(*) |
+----------+
| 4115 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from awardsvotes;
+----------+
| count(*) |
+----------+
| 6211 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from awardswinners;
+----------+
| count(*) |
+----------+
| 2430 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from batting;
+----------+
| count(*) |
+----------+
| 87308 |
+----------+
1 row in set (0.37 sec)
mysql> select count(*) from fielding;
+----------+
| count(*) |
+----------+
| 126130 |
+----------+
1 row in set (0.49 sec)
mysql> select count(*) from fieldingof;
+----------+
| count(*) |
+----------+
| 21602 |
+----------+
1 row in set (0.15 sec)
mysql> select count(*) from managers;
+----------+
| count(*) |
+----------+
| 3067 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from managershalf;
+----------+
| count(*) |
+----------+
| 95 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from master;
+----------+
| count(*) |
+----------+
| 16566 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from pitching;
+----------+
| count(*) |
+----------+
| 36898 |
+----------+
1 row in set (0.20 sec)
mysql> select count(*) from salaries;
+----------+
| count(*) |
+----------+
| 17277 |
+----------+
1 row in set (0.14 sec)
mysql> select count(*) from teams;
+----------+
| count(*) |
+----------+
| 2505 |
+----------+
1 row in set (0.06 sec)
mysql> select count(*) from teamsfranchises;
+----------+
| count(*) |
+----------+
| 120 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from teamshalf;
+----------+
| count(*) |
+----------+
| 52 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from transactions;
+----------+
| count(*) |
+----------+
| 0 |
+----------+
1 row in set (0.00 sec)
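If typing fifteen separate queries is tedious, the counts above could also be requested in one pass. The following is a minimal sketch, not part of the original files: the function name `count_queries` and the column aliases `tbl` and `n` are made up for illustration, the table list is copied from the "show tables" output above, and the database is assumed to be named "example" as in step 4.

```shell
#!/bin/sh
# Table names copied from the "show tables" listing above.
TABLES="allstar awardsvotes awardswinners batting fielding fieldingof managers managershalf master pitching salaries teams teamsfranchises teamshalf transactions"

# Print one labelled row-count query per table (hypothetical helper).
count_queries() {
    for t in $TABLES; do
        printf "select '%s' as tbl, count(*) as n from %s;\n" "$t" "$t"
    done
}

# To run the queries against the database from step 4
# (this needs a running MySQL server, so it is left commented out):
#   count_queries | mysql example
```

The function only prints SQL text, so you can inspect the generated queries before piping them to mysql.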
And that's it! You should be ready to start calculating baseball statistics.
You should now be able to run queries like this:
$ mysql example
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 195 to server version: 5.0.15-nt
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> create temporary table h
-> as select idxLahman, sum(hr) as hr
-> from batting
-> group by idxLahman;
Query OK, 16416 rows affected (1.32 sec)
Records: 16416 Duplicates: 0 Warnings: 0
mysql> create index h_idx on h(idxLahman);
Query OK, 16416 rows affected (0.51 sec)
Records: 16416 Duplicates: 0 Warnings: 0
mysql> select m.nameLast, m.nameFirst, h.hr
-> from master m inner join h
-> on m.idxLahman=h.idxLahman
-> where hr > 500;
+-----------+-----------+------+
| nameLast | nameFirst | hr |
+-----------+-----------+------+
| Aaron | Hank | 755 |
| Banks | Ernie | 512 |
| Bonds | Barry | 708 |
| Foxx | Jimmie | 534 |
| Griffey | Ken | 536 |
| Jackson | Reggie | 563 |
| Killebrew | Harmon | 573 |
| Mantle | Mickey | 536 |
| Mathews | Eddie | 512 |
| Mays | Willie | 660 |
| McCovey | Willie | 521 |
| McGwire | Mark | 583 |
| Murray | Eddie | 504 |
| Ott | Mel | 511 |
| Palmeiro | Rafael | 569 |
| Robinson | Frank | 586 |
| Ruth | Babe | 714 |
| Schmidt | Mike | 548 |
| Sosa | Sammy | 588 |
| Williams | Ted | 521 |
+-----------+-----------+------+
20 rows in set (0.11 sec)
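The temporary table and index are one way to get this list. A single grouped query with a HAVING clause should give the same players, assuming that each idxLahman value appears only once in the master table. The sketch below has not been run against the database, and the file name `hr500.sql` is only an illustration.

```shell
#!/bin/sh
# Write the query to a file (hr500.sql is a hypothetical name).
cat > hr500.sql <<'EOF'
select m.nameLast, m.nameFirst, sum(b.hr) as hr
from master m inner join batting b
  on m.idxLahman = b.idxLahman
group by m.idxLahman, m.nameLast, m.nameFirst
having sum(b.hr) > 500;
EOF

# To run it against the database from step 4
# (this needs a running MySQL server, so it is left commented out):
#   mysql example < hr500.sql
```

HAVING filters on the summed home runs after grouping, which is the role the "where hr > 500" clause plays against the temporary table.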
game_id,visiting_team,inning,batting_team,outs,balls,strikes,pitch_sequence,vis_score,home_score,batter,batter_hand,res_batter,res_batter_hand,pitcher,pitcher_hand,res_pitcher,res_pitcher_hand,catcher,first_base,second_base,third_base,shortstop,left_field,center_field,right_field,first_runner,second_runner,third_runner,event_text,leadoff_flag,pinchhit_flag,defensive_position,lineup_position,event_type,batter_event_flag,ab_flag,hit_value,SH_flag,SF_flag,outs_on_play,double_play_flag,triple_play_flag,RBI_on_play,wild_pitch_flag,passed_ball_flag,fielded_by,batted_ball_type,bunt_flag,foul_flag,hit_location,num_errors,1st_error_player,1st_error_type,2nd_error_player,2nd_error_type,3rd_error_player,3rd_error_type,batter_dest,runner_on_1st_dest,runner_on_2nd_dest,runner_on_3rd_dest,play_on_batter,play_on_runner_on_1st,play_on_runner_on_2nd,play_on_runner_on_3rd,SB_for_runner_on_1st_flag,SB_for_runner_on_2nd_flag,SB_for_runner_on_3rd_flag,CS_for_runner_on_1st_flag,CS_for_runner_on_2nd_flag,CS_for_runner_on_3rd_flag,PO_for_runner_on_1st_flag,PO_for_runner_on_2nd_flag,PO_for_runner_on_3rd_flag,Responsible_pitcher_for_runner_on_1st,Responsible_pitcher_for_runner_on_2nd,Responsible_pitcher_for_runner_on_3rd,New_Game_Flag,End_Game_Flag,Pinch_runner_on_1st,Pinch_runner_on_2nd,Pinch_runner_on_3rd,Runner_removed_for_pinch_runner_on_1st,Runner_removed_for_pinch_runner_on_2nd,Runner_removed_for_pinch_runner_on_3rd,Batter_removed_for_pinch_hitter,Position_of_batter_removed_for_pinch_hitter,Fielder_with_First_Putout,Fielder_with_Second_Putout,Fielder_with_Third_Putout,Fielder_with_First_Assist,Fielder_with_Second_Assist,Fielder_with_Third_Assist,Fielder_with_Fourth_Assist,Fielder_with_Fifth_Assist,event_num
# check to make sure that there are the right number of arguments
unless ($#ARGV == 1) {die "usage $0 <input file> <output file>\n";}
# open the input and output files
open INFILE, "<$ARGV[0]" or die "couldn't open input file $ARGV[0]: $!\n";
open OUTFILE, ">$ARGV[1]" or die "couldn't open output file $ARGV[1]: $!\n";
$lineno = 1;
# loop over each line in the input file
while (<INFILE>) {
    print OUTFILE "$lineno: ", $_;
    $lineno++;
}
close INFILE;
close OUTFILE;
use FileHandle;
use LWP::UserAgent;
$ua = LWP::UserAgent->new;
$baseurl = "http://www.retrosheet.org/";
for ($year = 60; $year <= 92; $year++) {
foreach $league ("al", "nl") {
my $filename = '19' . $year . $league . '.zip';
my $url = $baseurl . '19' . $year . '/19' . $year . $league . '.zip';
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);
print STDERR "fetching $filename\n";
if ($res->is_success) {
my $fh = new FileHandle ">$filename";
if (defined $fh) {
binmode $fh; # zip archives are binary; avoid newline translation on Windows
print $fh $res->content;
$fh->close;
} else {
print STDERR "could not open file $filename: $!\n";
}
}
else {
print STDERR $res->status_line, "\n";
}
}
}
$league = 'ml';
for ($year = 00; $year <= 04; $year++) {
my $filename = '200' . $year . $league . '.zip';
my $url = $baseurl . '200' . $year . '/200' . $year . $league . '.zip';
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);
print STDERR "fetching $filename\n";
if ($res->is_success) {
my $fh = new FileHandle ">$filename";
if (defined $fh) {
binmode $fh; # zip archives are binary; avoid newline translation on Windows
print $fh $res->content;
$fh->close;
} else {
print STDERR "could not open file $filename: $!\n";
}
}
else {
print STDERR $res->status_line, "\n";
}
}
#!/usr/bin/perl
use Getopt::Std;
getopts('dhs:i:') or die "bad options: $!";
if ($opt_s) {
$sep = $opt_s;
} else {
$sep = "\t";
}
$useheader = $opt_h;
if ($opt_i) {
open INFILE, "<$opt_i" or die "can't open input file: $!\n";
} else {
die "must specify file name\n";
}
if ($useheader) {
# read in the first line
$headerline = <INFILE>;
# split it into an array on the separator
$headerline =~ s/[\"\n\r\f]//g;
@header = split /$sep/, $headerline;
#read in the second line
$second = <INFILE>;
# split it into an array
$second =~ s/[\"\r\n\f]//g;
@types = split /$sep/, $second;
@terms = split /$sep/, $second;
} else {
# read in the first line
$first = <INFILE>;
# split it into an array on the separator
$first =~ s/[\"\r\n\f]//g;
@types = split /$sep/, $first;
@terms = split /$sep/, $first;
}
# count the number of fields
$fields = $#terms + 1;
# check if each element is numerical or character
for ($i = 0; $i < $fields; $i++) {
# print $i, " ", $header[$i], "\n";
if ($types[$i] =~ /^\s*\d*\s*$/) {
$field_type[$i] = "I";
} elsif ($types[$i] =~ /^\s*\d*\.(\d+)\s*$/) {
$field_type[$i] = length($1);
} else {
$field_type[$i] = "V";
}
# print $i, " ", $header[$i], " ", $types[$i], " ", $field_type[$i], "\n";
}
# measure the length of each field
@maxlengths = map length, @terms;
# start looping through file:
while(<INFILE>) {
$_ =~ s/B\,R\"/B\;R/g;
$_ =~ s/[\"\r\n\f]//g;
@terms = split /$sep/, $_;
@lengths = map length, @terms;
for ($i = 0; $i < $fields; $i++) {
if ($lengths[$i] > $maxlengths[$i]) {
if ($opt_d) {print "new max pos $i: $terms[$i]\n";}
$maxlengths[$i] = $lengths[$i];
}
if ($terms[$i] !~ /^\s*[\d,\.]*\s*$/) {
if (($opt_d) and ($field_type[$i] ne "V")) {
print "changing field type for $header[$i]: \"$terms[$i]\"\n";
print "\t$_\n\t";
print join ",", @terms;
print "\n";
}
$field_type[$i] = "V";
}
}
}
close(INFILE);
if ($opt_i =~ /([\w|\.]*)\.\w+$/) {
$tablename = $1;
$tablename =~ s/\.//;
} elsif ($opt_i =~ /.*\/([\w|\.]*)\.\w+$/) {
$tablename = $1;
$tablename =~ s/\.//;
} else {
$tablename = "NONAMETABLE";
}
print "CREATE TABLE $tablename (\n";
for ($i = 0; $i < $fields; $i++) {
if ($useheader) {
$varname = $header[$i];
} else {
$varname = "variable$i";
}
if ($field_type[$i] =~ /I/ and $maxlengths[$i] < 6) {
print "\t$varname SMALLINT($maxlengths[$i])";
} elsif ($field_type[$i] =~ /I/) {
print "\t$varname INTEGER($maxlengths[$i])";
} elsif ($field_type[$i] =~ /V/ and $maxlengths[$i] > 255) {
print "\t$varname TEXT";
} elsif ($field_type[$i] =~ /V/) {
print "\t$varname VARCHAR($maxlengths[$i])";
} else {
print "\t$varname DECIMAL($maxlengths[$i],$field_type[$i])";
}
if ($i + 1 == $fields) {
print "\n);\n";
} else {
print ",\n";
}
}
print "LOAD DATA LOCAL INFILE '$opt_i'\n";
print "INTO TABLE $tablename\n";
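# use $sep so that the default tab separator is written when -s is omitted;
# a tab is escaped as \t for SQL
$sqlsep = $sep;
if ($sqlsep =~ /\t/) {$sqlsep = "\\t";}
print "FIELDS TERMINATED BY '$sqlsep'\n";
print "OPTIONALLY ENCLOSED BY '\"'\n";
print "LINES TERMINATED BY '\\n'";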
if ($opt_h) {
print "\nIGNORE 1 LINES;\n";
} else {
print ";\n";
}
Date,DoubleHeader,DayOfWeek,VisitingTeam,VisitingTeamLeague,VisitingTeamGameNumber,HomeTeam,HomeTeamLeague,HomeTeamGameNumber,VisitorRunsScored,HomeRunsScore,LengthInOuts,DayNight,CompletionInfo,ForfeitInfo,ProtestInfo,ParkID,Attendence,Duration,VisitorLineScore,HomeLineScore,VisitorAB,VisitorH,VisitorD,VisitorT,VisitorHR,VisitorRBI,VisitorSH,VisitorSF,VisitorHBP,VisitorBB,VisitorIBB,VisitorK,VisitorSB,VisitorCS,VisitorGDP,VisitorCI,VisitorLOB,VisitorPitchers,VisitorER,VisitorTER,VisitorWP,VisitorBalks,VisitorPO,VisitorA,VisitorE,VisitorPassed,VisitorDB,VisitorTP,HomeAB,HomeH,HomeD,HomeT,HomeHR,HomeRBI,HomeSH,HomeSF,HomeHBP,HomeBB,HomeIBB,HomeK,HomeSB,HomeCS,HomeGDP,HomeCI,HomeLOB,HomePitchers,HomeER,HomeTER,HomeWP,HomeBalks,HomePO,HomeA,HomeE,HomePassed,HomeDB,HomeTP,UmpireHID,UmpireHName,Umpire1BID,Umpire1BName,Umpire2BID,Umpire2BName,Umpire3BID,Umpire3BName,UmpireLFID,UmpireLFName,UmpireRFID,UmpireRFName,VisitorManagerID,VisitorManagerName,HomeManagerID,HomeManagerName,WinningPitcherID,WinningPitcherName,LosingPitcherID,LosingPitcherNAme,SavingPitcherID,SavingPitcherName,GameWinningRBIID,GameWinningRBIName,VisitorStartingPitcherID,VisitorStartingPitcherName,HomeStartingPitcherID,HomeStartingPitcherName,VisitorBatting1PlayerID,VisitorBatting1Name,VisitorBatting1Position,VisitorBatting2PlayerID,VisitorBatting2Name,VisitorBatting2Position,VisitorBatting3PlayerID,VisitorBatting3Name,VisitorBatting3Position,VisitorBatting4PlayerID,VisitorBatting4Name,VisitorBatting4Position,VisitorBatting5PlayerID,VisitorBatting5Name,VisitorBatting5Position,VisitorBatting6PlayerID,VisitorBatting6Name,VisitorBatting6Position,VisitorBatting7PlayerID,VisitorBatting7Name,VisitorBatting7Position,VisitorBatting8PlayerID,VisitorBatting8Name,VisitorBatting8Position,VisitorBatting9PlayerID,VisitorBatting9Name,VisitorBatting9Position,HomeBatting1PlayerID,HomeBatting1Name,HomeBatting1Position,HomeBatting2PlayerID,HomeBatting2Name,HomeBatting2Position,HomeBatting3PlayerID,HomeBatting3Name,HomeBatting3Position,HomeBatting4PlayerID,HomeBatting4Name,HomeBatting4Position,HomeBatting5PlayerID,HomeBatting5Name,HomeBatting5Position,HomeBatting6PlayerID,HomeBatting6Name,HomeBatting6Position,HomeBatting7PlayerID,HomeBatting7Name,HomeBatting7Position,HomeBatting8PlayerID,HomeBatting8Name,HomeBatting8Position,HomeBatting9PlayerID,HomeBatting9Name,HomeBatting9Position,AdditionalInfo,AcquisitionInfo
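# read the names of all files in the current directory
opendir DIR, "." or die "can't open directory .: $!\n";
@files = readdir DIR;
closedir DIR;
print "retroID,lastName,firstName,bats,throws,team,pos\n";
LOOP:
foreach $file (@files) {
    # only process roster files: a three-character team code,
    # a four-digit year, and a .ROS extension
    unless ($file =~ /(\w{3})(\d{4})\.ROS/) {
        next LOOP;
    }
    $team = $1;
    $year = $2;
    open FILE, "<$file" or die "couldn't open file $file: $!\n";
    while (<FILE>) {
        # strip the line ending and any double quotes
        s/\n//;
        s/\cM//;
        s/\"//g;
        if (/[a-z]{5}\d{3}/) {
            if ($year >= 2002) {
                # from 2002 on, these files include a team and position
                print "$year,$_\n";
            } else {
                # before 2002, no team or position
                print "$year,$_,$team,\n";
            }
        }
    }
    close FILE;
}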
#!/usr/bin/perl
# print "opening .\n";
$outfile = "pbp1960-1992.csv";
$outfile2k = "pbp2000-2004.csv";
# start each output file with the contents of all_hdr.txt
print `cat all_hdr.txt > $outfile`;
print `cat all_hdr.txt > $outfile2k`;
opendir INFDIR, "." or die "can't open directory .: $!\n";
@archives = readdir INFDIR;
closedir INFDIR;
LOOP: foreach $archive (@archives) {
    unless ($archive =~ /(\d\d\d\d[anm]l)\.zip$/) {
        # print "skipping $archive\n";
        next LOOP;
    }
    print STDERR "uncompressing $archive\n";
    print `unzip -qq -o $archive`;
    # re-read the directory to pick up the files that were just unpacked
    opendir INFDIR, "." or die "can't open directory .: $!\n";
    @files = readdir INFDIR;
    closedir INFDIR;
    ILOOP: foreach $file (@files) {
        unless ($file =~ /(\d\d)(\d\d)(\w\w\w)\.EV[AN]$/) {
            print STDERR "not processing $file\n";
            next ILOOP;
        }
        $century = $1; $year = $2; $team = $3;
        print STDERR "processing $file\n";
        if ($century eq "19") {
            print `./BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile`;
        } else {
            print `./BEVENT.EXE -y $century$year -f 0-96 $file >> $outfile2k`;
        }
        print `rm $file`;
    }
}
#!/usr/bin/perl
use LWP;
use Data::Dump;
use FileHandle;
$browser = LWP::UserAgent->new;
my $fh = new FileHandle ">fielding.html";
print $fh "<html><body>\n";
print $fh "<table><th>\n";
print $fh " <td>checkbox</td>\n";
print $fh " <td>Player</td>\n";
print $fh " <td>ID</td>\n";
print $fh " <td>TEAM</td>\n";
print $fh " <td>POS</td>\n";
print $fh " <td>G</td>\n";
print $fh " <td>GS</td>\n";
print $fh " <td>INN</td>\n";
print $fh " <td>TC</td>\n";
print $fh " <td>PO</td>\n";
print $fh " <td>A</td>\n";
print $fh " <td>E</td>\n";
print $fh " <td>DP</td>\n";
print $fh " <td>PB</td>\n";
print $fh " <td>SB</td>\n";
print $fh " <td>CS</td>\n";
print $fh " <td>RF</td>\n";
print $fh " <td>FPCT</td>\n";
print $fh "</th>\n";
for $i (1 .. 35) { # should go to 35
print $fh "<tr><td><img src=\"/images/trans.gif\" width=\"1\" height=\"1\" border=\"0\" /></td><td><input type=\"checkbox\" name=\"box2\" value=XXXX408307tbaO onClick=\"countChoices3(this, 'box2');\"></td><td>\n";
$url = "http://mlb.mlb.com/NASApp/mlb/stats/sortable_player_stats.jsp?c_id=mlb&section3=$i&statSet3=1&sortByStat=G&statType=3&timeFrame=1&timeSubFrame=2005&baseballScope=mlb&prevPage3=1&readBoxes=true&sitSplit=&venueID=&teamPosCode=all&print=true";
$response = $browser->get($url);
die "Couldn't get $url:", $response->status_line, "\n"
unless $response->is_success;
$html = $response->content;
@lines = split "\n", $html;
my $out = -2;
foreach $line (@lines) {
if ($line =~ /<table/) {$out++;}
if ($line =~ /<INPUT TYPE="IMAGE" NAME="compare"/) {$out = 0;}
if ($out > 0) { print $fh "$line\n";}
if ($line =~ /playerID=(\d+)\&/) {
print $fh "</tr></table></td><td><table><tr><td>$1</td>\n";
}
}
}
$fh->close();
-- create a new database schema for the box score information,
-- and a user to access the database
GRANT ALL ON boxes.* to 'boxer'@'localhost'
IDENTIFIED BY 'boxers password';
CREATE DATABASE boxes;
USE boxes;
-- Create three tables containing information about players:
-- batting, fielding, and pitching. Fields are sized to
-- minimize storage space for this database
CREATE TABLE batting (
eliasID INT(6),
teamID CHAR(3),
gameID VARCHAR(32),
gameDate DATE,
h SMALLINT(2), -- hits
hr SMALLINT(2), -- home runs
bb SMALLINT(2), -- walks
so SMALLINT(2), -- strikeouts