stats tools...
Just curious for the stat heads out there: what tools do you use for your analyses? I see a lot of Excel and/or FanGraphs looking charts online, but I'm wondering if people aren't using more sophisticated statistics-specific software, like Stata or Matlab.
Also, where does your data come from? I can't imagine people are just hand copying numbers from baseball-reference.com. Is there some way to retrieve and parse large amounts of data from the mlb site?
Comments
My Favorite Tool

Apologies to Xanthan
I was THE GREATEST OF ALL TIME (for 3 days in 1995).
by Mike Benjamin Hit King on Aug 27, 2008 2:57 PM PDT reply actions 0 recs
xanthan is a STATS tool
Oh no I didn’t.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 27, 2008 3:01 PM PDT up reply actions 0 recs
You did!
I seent it.
And I applauded it.
Billy Hayes: His job is better than yours.
by delorean on Aug 27, 2008 3:08 PM PDT up reply actions 0 recs
I’ll make it up to you with a fabulous opportunity to make you rich and famous. Or part of the subject of a college media project. I’m doing a thing on baseball stats websites for my journalism class and I’d like to get BayCityBall in on that.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 28, 2008 8:02 AM PDT up reply actions 0 recs
Super. I’ll shoot you an email in a few weeks and we’ll get started on things.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 28, 2008 9:14 AM PDT up reply actions 0 recs
SPSS and Excel with VBA.
Not that there aren’t better things out there, but these are what I know.
you can't block the Bocock
by oldjacket on Aug 27, 2008 3:51 PM PDT reply actions 0 recs
Can you program?
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
by rfloh on Aug 27, 2008 3:53 PM PDT reply actions 0 recs
a lil'
I run other people’s Matlab scripts at work, but I’ve done some rudimentary stuff in R. As an undergrad, I used Stata in my econometrics classes and I also took an introductory course for SAS.
by FPTV on Aug 27, 2008 4:52 PM PDT up reply actions 0 recs
I don't think that's programming
..that’s scripting. Not that an R or python+scipy script can’t do pretty impressive things, mind you.
by wcw on Aug 27, 2008 5:33 PM PDT up reply actions 0 recs
right, but...
when you’re in a social sciences department at a state school, you just get in the habit of calling it “programming”.
by FPTV on Aug 27, 2008 5:48 PM PDT up reply actions 0 recs
If you can do Perl
it’s pretty much the best language for data parsing / manipulation. Though I do love Python and it is great.
Of course, other languages, like Java, can do data manipulation too.
Don’t worry, you don’t have to know how to do a Quicksort, or about the nuances of different algorithms. You don’t need to be a CS major to do this stuff.
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
by rfloh on Aug 27, 2008 11:35 PM PDT up reply actions 0 recs
Matlab is most certainly programming
And is way overkill for any statistical analysis I’ve done.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:33 AM PDT up reply actions 0 recs
I was surprised to learn that most of the newer professors in my undergrad program were doing all of their econometrics work in Matlab (rather than something more specific, like Stata).
Actually, when I showed up at my current job using R for stats, my managers all gave me grief along the lines of, “You’re an engineer now. REAL engineers use MATLAB!” ;-)
by FPTV on Aug 28, 2008 10:14 AM PDT up reply actions 0 recs
There is programming and then there is programming
Matlab isn’t really programming.
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
by rfloh on Aug 28, 2008 10:58 AM PDT up reply actions 0 recs
That statement
Makes me wonder what experience you have with matlab.
I think if I can write (well, help write) a flight control computer for an autonomous vehicle with it, it’s programming.
Ever used simulink ?
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 2:52 PM PDT up reply actions 0 recs
scripts can work *wonders*
I recently was on a project at a reasonably successful internet shop whose systems are built almost entirely on python scripts. My work there was mostly on R scripts, which didn’t run the company, but did a whole lot with a few lines, where a C programmer would have spent weeks. There is nothing wrong with Matlab scripts, and if you used them to control a robot vehicle, bully for you.
But, you know, they’re scripts. Someone else has done the heavy lifting behind the scenes. You just make it go.
by wcw on Aug 28, 2008 5:30 PM PDT up reply actions 0 recs
hey wow distinctions that don’t make a difference
you can't block the Bocock
by oldjacket on Aug 28, 2008 7:32 PM PDT up reply actions 0 recs
in this taxonomy, Matlab is 'scripting'
To me, high-level is scripting, low-level is programming.
If you never have to allocate memory, you’re probably scripting.
by wcw on Aug 28, 2008 5:27 PM PDT up reply actions 0 recs
right now i use a combination of sed and whatever scripting language I have handy to clean up my data… I was gonna learn Perl until my coworkers shamed me into switching to Python.
by FPTV on Aug 28, 2008 10:10 AM PDT up reply actions 0 recs
retrosheet can get you the raw data
but you’ll need to know how to program to sort through it
by NeifiChicken on Aug 27, 2008 4:04 PM PDT reply actions 0 recs
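To give a rough idea of what "sorting through" Retrosheet data can involve, here is a minimal Python sketch. It assumes comma-separated event-file records where a `play` line carries inning, side (0 = visitor, 1 = home), batter ID, count, pitch sequence and event. The sample text, the player IDs and the helper name `parse_plays` are made up for illustration, so check the field layout against Retrosheet's own documentation before relying on it.

```python
import csv
import io
from collections import Counter

# Made-up sample in the style of a Retrosheet event file (not real game data).
SAMPLE = """\
id,XXX200804010
info,visteam,AAA
info,hometeam,BBB
play,1,0,playa001,12,CBFX,S7
play,1,0,playb001,00,X,63
play,1,1,playc001,32,BBCFBX,HR/F7
play,2,0,playa001,02,CSS,K
"""


def parse_plays(text):
    """Return one dict per 'play' record, skipping every other record type."""
    plays = []
    for row in csv.reader(io.StringIO(text)):
        # Assumption: a play record has exactly seven comma-separated fields.
        if len(row) == 7 and row[0] == "play":
            _, inning, side, batter, count, pitches, event = row
            plays.append({
                "inning": int(inning),
                "home": side == "1",
                "batter": batter,
                "count": count,
                "pitches": pitches,
                "event": event,
            })
    return plays


plays = parse_plays(SAMPLE)
# Plate appearances per batter in the sample.
appearances = Counter(p["batter"] for p in plays)
print(appearances)
```

From here the list of dicts can be filtered, counted, or loaded into a database. Decoding the event strings themselves is the harder part and is not attempted here.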
ok, I take it back
if parsing text is ‘programming’, you can program.
by wcw on Aug 27, 2008 5:33 PM PDT up reply actions 0 recs
databuilding is programming
you can't block the Bocock
by oldjacket on Aug 27, 2008 8:40 PM PDT up reply actions 0 recs
Writing a set of instructions for a computer
So that the computer does something for you is programming.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:34 AM PDT up reply actions 0 recs
prompt> while (fork)
FIRE BRIAN SABEAN
by zenbitz on Aug 28, 2008 8:41 PM PDT up reply actions 0 recs
Baseball Prospectus
Ran across this Fora TV video featuring Kevin Goldstein and Christina Kahrl of Baseball Prospectus. I hadn’t known until I’d checked out the vid that BP’s Ms. Kahrl is transgendered. Not that there’s anything wrong with that…
The Fora spot is part of a Baseball Prospectus book tour, and offers some interesting insights on the gathering of baseball data.
by biff pocoroba on Aug 27, 2008 4:16 PM PDT reply actions 0 recs
I like to use spreadsheets.
BB-Ref’s Play Index is awesome.
Other good stat sites: First Inning, Statcorner, Fangraphs, The Baseball Cube
I’m playing with the Lahman database at work, I don’t have Access at home.
by xanthan on Aug 27, 2008 4:17 PM PDT reply actions 0 recs
how can you have access at work but not home?
out of curiosity
by NeifiChicken on Aug 27, 2008 4:58 PM PDT up reply actions 0 recs
I assume he means MS Access
The program
Billy Hayes: His job is better than yours.
by delorean on Aug 27, 2008 5:59 PM PDT up reply actions 0 recs
woops...
that’s another way to read that comment. And here I was thinking he was just telling him to steal it from work. :-)
by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions 0 recs
incidentally...
why don’t you just steal it from work?
by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions 0 recs
When it comes to “stats” and “tool”, I’d recommend using either a ruler or tape measure.
Zooperstars, they quack me up!
by Goofus on Aug 27, 2008 4:54 PM PDT reply actions 0 recs
But those devices don’t measure a curve well.
by chilibean_3 on Aug 27, 2008 7:08 PM PDT up reply actions 0 recs
baseball-reference.com
"But if he's swinging at real flies, well, in that case there are two definite solutions: 1) Fresno 2) Ritalin." - howtheyscored
by CPGiant756 on Aug 27, 2008 5:28 PM PDT reply actions 0 recs
stats
Me, I just let smarter folks with time on their hands run the numbers (cf. firstinning.com, minorleaguesplits.com and statcorner.com). But when I have done stats work for actual money, I’ve tended to work in R by preference, in spss (ugh, ptui) by necessity and occasionally in octave (a Matlab clone, much as R is an S-plus clone), python or really, anything but Excel.
by wcw on Aug 27, 2008 5:35 PM PDT reply actions 0 recs
Sites I go to:
-First Inning
-The Hardball Times
-Minor League Equivalency Calculator
-Minor League Splits
-Fan Graphs
-Sean Smith’s Stats Site (which includes Colin Wyers’ WAR calculator)
-StatCorner
-UZR Spreadsheet from 2003-2007
-Baseball Reference
I’m learning how to use Excel, so hopefully I can start doing my own stuff soon.
Proud adoptive parent of Tim Alderson.
by Anticon23 on Aug 27, 2008 5:43 PM PDT reply actions 0 recs
please don't use Excel
..it’s for quick spreadsheets, not analysis.
by wcw on Aug 27, 2008 5:55 PM PDT up reply actions 0 recs
Right, but...
…the guy’s got to start somewhere, and Excel is a lot less intimidating than, say, Matlab, especially if you’re also learning the statistics concurrently.
So what is it that you do in the daytime that you’re such a software snob, anyway? :-)
by FPTV on Aug 27, 2008 6:02 PM PDT up reply actions 0 recs
I'm an investment geek
blah blah CFA blah blah data blah blah quant blah. Though I learned to despise spss on a project in the survey world, where the python shop I noted above was (and is, though I am onto something new) trying to sell a survey-based product into the buy-side and hedge-fund worlds.
by wcw on Aug 28, 2008 5:39 PM PDT up reply actions 0 recs
bah.
You can use Excel to do real analysis, just stay away from the spreadsheet functions, if possible. You can use VBA to do anything that a statistical package can do. Its pivot tables are also a legit and quick way to do some kinds of analysis.
Granted, I think the best use of Excel is for making reports and doing summary statistics, but there’s nothing there to turn your nose up at.
you can't block the Bocock
by oldjacket on Aug 27, 2008 8:50 PM PDT up reply actions 0 recs
in my experience, excel is nice for small data sets, and quick summary stats like averages and standard deviations, but anything more complicated (like drawing a histogram) or involving more than a few hundred data points gets to be a pain in the ass pretty quickly. Then again, I don’t know how to use pivot tables or VBA.
by FPTV on Aug 27, 2008 9:35 PM PDT up reply actions 0 recs
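For comparison, the summary stats and histogram mentioned above take only a few lines in a scripting language. This is a minimal Python sketch using just the standard library. The home-run totals are invented for illustration, and the `histogram` helper is a hypothetical name, not something from any stats package.

```python
import statistics
from collections import Counter

# Hypothetical home-run totals for 12 hitters (made up for illustration).
home_runs = [3, 7, 12, 12, 15, 18, 21, 24, 24, 27, 33, 41]

mean_hr = statistics.mean(home_runs)
sd_hr = statistics.stdev(home_runs)  # sample standard deviation


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((v // bin_width) * bin_width for v in values)


bin_width = 10
bins = histogram(home_runs, bin_width)

# Text histogram: one '#' per observation in each bin.
for lower in sorted(bins):
    upper = lower + bin_width - 1
    print(f"{lower:>2}-{upper:<2} {'#' * bins[lower]}")
print(f"mean={mean_hr:.2f} sd={sd_hr:.2f}")
```

The same approach scales to far more than a few hundred data points without changes, which is roughly the complaint about doing this by hand in a spreadsheet.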
I can see why you would be down on Excel without them. It’s a very limited analytical tool without VB and PT.
you can't block the Bocock
by oldjacket on Aug 28, 2008 8:22 AM PDT up reply actions 0 recs
um, no
You. Can. Not. Use. VBA. To. Do. Anything. That. A. Stats. Package. Can. Do.
But please, convince me: in VBA, how would you do this (from R):
library(lme4)
fit <- lmer(Reaction ~ 1 + (1|Days) + (1|Subject), data = sleepstudy)
[fill in anything you might do with a multilevel fit..]
by wcw on Aug 28, 2008 5:37 PM PDT up reply actions 0 recs
and for that matter, how does Excel/VBA handle something like the ACS?
R is a memory hog, but if you have 64G into which to bang the ACS and its 3m rows and its hundreds of variables, you can. I don’t even want to think what Excel/VBA would make of millions of rows and hundreds of columns.
by wcw on Aug 28, 2008 5:53 PM PDT up reply actions 0 recs
Well, you certainly wouldn’t want to store such data in Excel. But VBA can handle very large amounts of data in its arrays.
If I’m dealing with a huge dataset and need to do an analysis that isn’t available in the in-house stats package (SPSS), I usually use some combination of VBA and ADO.
BTW, I’m not claiming that VBA or VB are the BEST things to use for most kinds of analysis. But you CAN use it for a lot of things (I have one coworker who, quite irrationally, insists on using it for every project he is assigned). And if you live in a specialized corner of the data analysis universe where the options offered by a stats package aren’t always what you want, it can be quite useful.
you can't block the Bocock
by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions 0 recs
PS
what’s the ACS?
Actuarial something?
you can't block the Bocock
by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions 0 recs
American Community Survey
The 1% survey of the US now published annually by the Census. As noted, ~3m rows for people, plus a table of all the attached households. It’s not a large data set in the R-and-Stata-would-choke sense, but it strikes me as way too big for Excel + VBA. Having recently, by necessity, built an Excel + VBA sheet with a piddling few thousand rows by few score columns that runs Very Slowly When It Recalculates, I have my doubts.
by wcw on Aug 28, 2008 8:29 PM PDT up reply actions 0 recs
I don’t know R, so I can’t tell you. I would assume that your line there is some sort of model evaluation.
I can tell you that any mathematical or statistical method can be turned into a VBA function. Personally, I’ve never done anything more complicated than logistic regression and most of the time, it would be reinventing the wheel to use VBA to copy stat package functions.
you can't block the Bocock
by oldjacket on Aug 28, 2008 7:47 PM PDT up reply actions 0 recs
lmer == 'linear mixed effects in R'
Here, “can be” is the operative term. Here’s Doug Bates’s code for lmer.c: http://lme4.r-forge.r-project.org/doxygen/lmer_8c-source.html I’m sure it can be VBAized, but I defy you to implement lmer (or any other interesting R function) and actually make it work.
For that matter, why were you even wasting your time with logistic regressions in VBA? Even spss should be able to spit those out, and R’s glm() will handle them in one line.
by wcw on Aug 28, 2008 8:36 PM PDT up reply actions 0 recs
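To make the "reinventing the logit wheel" point concrete, here is a rough Python sketch of a one-predictor logistic regression written from scratch. It is not what R's glm() does internally (glm() uses iteratively reweighted least squares). This swaps in plain gradient ascent on the log-likelihood because it is shorter to write. The data, the variable names and the `fit_logistic` helper are all made up for illustration.

```python
import math

# Made-up data: x is years of service, y is 1 if promoted (illustration only).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual drives the gradient
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


b0, b1 = fit_logistic(xs, ys)


def predict(x):
    """Predicted probability of y=1 at x under the fitted model."""
    return sigmoid(b0 + b1 * x)


print(f"intercept={b0:.3f} slope={b1:.3f}")
print(f"P(promoted | x=1)={predict(1.0):.3f}  P(promoted | x=5)={predict(5.0):.3f}")
```

Even this toy version leaves out standard errors, convergence checks and multiple predictors, which is a fair illustration of why a one-line call to an existing stats package is usually the better deal.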
had to make a user-form program for someone else to do their own (very specific) LRs.
you can't block the Bocock
by oldjacket on Aug 28, 2008 9:04 PM PDT up reply actions 0 recs
'very specific'?
Douglas M. Bates is not some kid cooking up oddball stats in his basement.
I mean, his name is on the spine of textbooks for a reason.
by wcw on Aug 28, 2008 9:16 PM PDT up reply actions 0 recs
by very specific, I mean these LRs were predicting chance of promotion from similarly formatted datasets.
you can't block the Bocock
by oldjacket on Aug 28, 2008 9:24 PM PDT up reply actions 0 recs
my mistake
..I misread “had” as “hard”.
I’d still have scripted it in R, but if your enduser just can’t handle anything without the MSFT brand on it, I see the utility of reinventing the logit wheel.
by wcw on Aug 28, 2008 9:30 PM PDT up reply actions 0 recs
That's bullshit
I’ve built some incredibly complex AND powerful spreadsheets in excel.
Nothing to do with baseball, of course.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:36 AM PDT up reply actions 0 recs
yes, and so have I
..and it is always a terrible idea.
Always.
But it happens over and over again, because Excel is there, and familiar, and a sunk cost.
by wcw on Aug 28, 2008 5:38 PM PDT up reply actions 0 recs
This is a good explanation of how to download the pitch f/x data.
R is pretty awesome for statistical analysis.
by Scottsdale on Aug 27, 2008 6:05 PM PDT reply actions 0 recs
R
so i’ve seen a couple references to R now…if not by actual users, then by people who are at least curious about it. Is there a baseball-specific R group or web site?
by FPTV on Aug 28, 2008 10:13 AM PDT up reply actions 0 recs
I think there’s a whole book about R and baseball stats: Baseball Hacks
you can't block the Bocock
by oldjacket on Aug 28, 2008 11:24 AM PDT up reply actions 0 recs
I have never heard of any baseball-oriented R group. However, the awesome thing about R is that, being an open source project, you can usually find someone who has previously done about the same task that you are looking to do and just do what they did. It has also been my experience (at least with the Bioconductor packages) that the authors of packages are often more than happy to correspond with you if you are struggling with something.
by Scottsdale on Aug 28, 2008 11:25 AM PDT up reply actions 0 recs
I’ve been using the Lahman database for Microsoft Access for a long while. Still one of my favorites for looking up a simple query quickly. I’ve been using the Retrosheet play by play stuff here and there and it is awesome. I’ve been slowly moving Retrosheet data into a MySQL database. I’ve also been messing around with MLB’s pitch f/x stuff. And lots and lots of websites, most of the ones mentioned above.
Python and Perl in the scripting department. SQL for queries. Excel for making the data look pretty. I really don’t do anything too fancy so the above has met my needs, but I need to mess around with some of the fancier analytic tools like R.
Keiichi Yabu: Leading your San Francisco Giants in triple plays induced
by BaysideBaller on Aug 28, 2008 1:48 AM PDT reply actions 0 recs
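The "load it into a database, then query with SQL" workflow described above can be sketched in a few lines of Python. This uses the standard library's sqlite3 module with an in-memory database instead of MySQL or Access, purely so the example is self-contained. The `batting` table, its columns, the rows and the `career_totals` helper are invented for illustration and are not the actual Lahman or Retrosheet schema.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL/Access (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting ("
    "player_id TEXT, year INTEGER, ab INTEGER, h INTEGER, hr INTEGER)"
)

# Made-up season lines for two hypothetical players.
rows = [
    ("playa001", 2007, 500, 150, 20),
    ("playa001", 2008, 480, 120, 25),
    ("playb001", 2008, 300, 90, 5),
]
conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()


def career_totals(conn):
    """Sum at-bats, hits and home runs per player and add batting average."""
    cur = conn.execute(
        "SELECT player_id, SUM(ab), SUM(h), SUM(hr) "
        "FROM batting GROUP BY player_id ORDER BY player_id"
    )
    return [(pid, ab, h, hr, round(h / ab, 3)) for pid, ab, h, hr in cur]


for line in career_totals(conn):
    print(line)
```

The SQL itself is generic enough that it should carry over to MySQL or Postgres with little change. Only the connection setup would differ.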
not to be the resident software bigot yet again, but..
..Postgres, not mysql. Postgres, not mysql. Postgres, not mysql. Not mysql.
If you need an inducement, postgres has pl/r: http://www.joeconway.com/plr/
by wcw on Aug 28, 2008 5:47 PM PDT up reply actions 0 recs
All my poker database softwares use postgres
but the pitchfx database Mike Fast set up (I think I saw you post over there in the comments) is mysql. (I have very limited knowledge of scripts and parsing but I was able to get his database instructions followed fine)
Should I be looking to convert my mysql database to postgresql? How difficult/time consuming do you think it’d be (a quick google search suggests it’s not overly hard)? I’m probably just going to wait until the offseason either way, as I imagine it’s not that easy.
Can't get enough of the Oakland A's? Visit Oaktown Awesomer's
by iamawesomer on Aug 31, 2008 7:39 PM PDT up reply actions 0 recs
I don’t know why I keep clicking on this topic when I don’t remotely understand a thing that anyone is saying.
Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.
by groug on Aug 28, 2008 6:12 PM PDT reply actions 0 recs
Yeah, that’s probably it.
Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.
by groug on Aug 28, 2008 9:10 PM PDT up reply actions 0 recs
I should have just skipped down to this comment.
I kept reading things like "Scripting that’s programming scripty VCP MORBRBBL" and being like "yeah, I agree with that," and then realizing it meant nothing to me and being like "why am I mindlessly agreeing with this?" and then being like "why am I even reading this?" and then being like "oh, groug said something I can actually agree with" and then being like all writing this comment.
My Dave Righetti is better than your Dave Righetti.
by howtheyscored on Aug 29, 2008 3:05 AM PDT up reply actions 0 recs
Being? You can’t program on BeOS! (But you can script!)
Fred Lewis can stand under my umbrella.
31 May 2007, 21:38 EST - the last time Matteh's career W-L wasn't below .500
by S.F. Giangst on Aug 29, 2008 4:35 AM PDT up reply actions 0 recs
I just assumed everyone was talking about Pokemon again.
by Evan on Aug 29, 2008 11:39 AM PDT up reply actions 0 recs
Nah, you can’t script with Pokemon. I mean, at least not without Inscripteon (Eevee’s evolved state when you feed it a desktop PC).
My Dave Righetti is better than your Dave Righetti.
by howtheyscored on Aug 29, 2008 9:49 PM PDT up reply actions 0 recs
