stats tools...
Just curious for the stat heads out there: what tools do you use for your analyses? I see a lot of Excel and/or FanGraphs-looking charts online, but I'm wondering whether people are using more sophisticated statistics-specific software, like Stata or Matlab.
Also, where does your data come from? I can't imagine people are just hand-copying numbers from baseball-reference.com. Is there some way to retrieve and parse large amounts of data from the MLB site?
Comments
My Favorite Tool

Apologies to Xanthan
I was THE GREATEST OF ALL TIME (for 3 days in 1995).
by Mike Benjamin Hit King on Aug 27, 2008 2:57 PM PDT
xanthan is a STATS tool
Oh no I didn’t.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 27, 2008 3:01 PM PDT
I’ll make it up to you with a fabulous opportunity to make you rich and famous. Or part of the subject of a college media project. I’m doing a thing on baseball stats websites for my journalism class and I’d like to get BayCityBall in on that.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 28, 2008 8:02 AM PDT
Super. I’ll shoot you an email in a few weeks and we’ll get started on things.
"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.
by BaronVonCurrentEvents on Aug 28, 2008 9:14 AM PDT
Can you program?
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
a lil'
I run other people’s Matlab scripts at work, but I’ve done some rudimentary stuff in R. As an undergrad, I used Stata in my econometrics classes and I also took an introductory course for SAS.
I don't think that's programming
..that’s scripting. Not that an R or python+scipy script can’t do pretty impressive things, mind you.
right, but...
when you’re in a social sciences department at a state school, you just get in the habit of calling it “programming”.
If you can do Perl
it’s pretty much the best language for data parsing / manipulation. Though I do love Python and it is great.
Of course, other languages like Java can do data manipulation too.
Don’t worry, you don’t have to know how to do a Quicksort, or about the nuances of different algorithms. You don’t need to be a CS major to do this stuff.
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
Matlab is most certainly programming
And is way overkill for any statistical analysis I’ve done.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:33 AM PDT
I was surprised to learn that most of the newer professors in my undergrad program were doing all of their econometrics work in Matlab (rather than something more specific, like Stata).
Actually, when I showed up at my current job using R for stats, my managers all gave me grief along the lines of, “You’re an engineer now. REAL engineers use MATLAB!” ;-)
There is programming and then there is programming
Matlab isn’t really programming.
ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524
That statement
Makes me wonder what experience you have with Matlab.
I think if I can write (well, help write) a flight control computer for an autonomous vehicle with it, it’s programming.
Ever used Simulink?
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 2:52 PM PDT
scripts can work *wonders*
I recently was on a project at a reasonably successful internet shop whose systems are built almost entirely on python scripts. My work there was mostly on R scripts, which didn’t run the company, but did a whole lot with a few lines, where a C programmer would have spent weeks. There is nothing wrong with Matlab scripts, and if you used them to control a robot vehicle, bully for you.
But, you know, they’re scripts. Someone else has done the heavy lifting behind the scenes. You just make it go.
in this taxonomy, Matlab is 'scripting'
To me, high-level is scripting, low-level is programming.
If you never have to allocate memory, you’re probably scripting.
ok, I take it back
if parsing text is ‘programming’, you can program.
Writing a set of instructions for a computer
So that the computer does something for you is programming.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:34 AM PDT
Baseball Prospectus
Ran across this Fora TV video featuring Kevin Goldstein and Christina Kahrl of Baseball Prospectus. I hadn’t known until I’d checked out the vid that BP’s Ms. Kahrl is transgendered. Not that there’s anything wrong with that…
The Fora spot is part of a Baseball Prospectus book tour, and offers some interesting insights on the gathering of baseball data.
I like to use spreadsheets.
BB-Ref’s Play Index is awesome.
Other good stat sites: First Inning, Statcorner, Fangraphs, The Baseball Cube
I’m playing with the Lahman database at work; I don’t have Access at home.
how can you have access at work but not home?
out of curiosity
by NeifiChicken on Aug 27, 2008 4:58 PM PDT
woops...
that’s another way to read that comment. And here I was thinking he was just telling him to steal it from work. :-)
When it comes to “stats” and “tool”, I’d recommend using either a ruler or tape measure.
Zooperstars, they quack me up!
baseball-reference.com
"But if he's swinging at real flies, well, in that case there are two definite solutions: 1) Fresno 2) Ritalin." - howtheyscored
stats
Me, I just let smarter folks with time on their hands run the numbers (cf. firstinning.com, minorleaguesplits.com and statcorner.com). But when I have done stats work for actual money, I’ve tended to work in R by preference, in SPSS (ugh, ptui) by necessity, and occasionally in Octave (a Matlab clone, much as R is an S-PLUS clone), Python or really, anything but Excel.
Sites I go to:
-First Inning
-The Hardball Times
-Minor League Equivalency Calculator
-Minor League Splits
-Fan Graphs
-Sean Smith’s Stats Site (which includes Colin Wyers’ WAR calculator)
-StatCorner
-UZR Spreadsheet from 2003-2007
-Baseball Reference
I’m learning how to use Excel, so hopefully I can start doing my own stuff soon.
Proud adoptive parent of Tim Alderson.
please don't use Excel
..it’s for quick spreadsheets, not analysis.
Right, but...
…the guy’s got to start somewhere, and Excel is a lot less intimidating than, say, Matlab, especially if you’re also learning the statistics concurrently.
So what is it that you do in the daytime that you’re such a software snob, anyway? :-)
I'm an investment geek
blah blah CFA blah blah data blah blah quant blah. Though I learned to despise spss on a project in the survey world, where the python shop I noted above was (and is, though I am onto something new) trying to sell a survey-based product into the buy-side and hedge-fund worlds.
bah.
You can use Excel to do real analysis; just stay away from the spreadsheet functions, if possible. You can use VBA to do anything that a statistical package can do. Its pivot tables are also a legit and quick way to do some kinds of analysis.
Granted, I think the best use of Excel is for making reports and doing summary statistics, but there’s nothing there to turn your nose up at.
you can't block the Bocock
in my experience, excel is nice for small data sets, and quick summary stats like averages and standard deviations, but anything more complicated (like drawing a histogram) or involving more than a few hundred data points gets to be a pain in the ass pretty quickly. Then again, I don’t know how to use pivot tables or VBA.
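For comparison, the quick summary stats and histogram mentioned above take only a few lines in a scripting language. This is a minimal sketch in Python using only the standard library; the home run totals are made-up numbers and the `histogram` helper is a hypothetical name, not anything from the thread.

```python
import statistics
from collections import Counter

# Made-up season home run totals for ten hypothetical hitters.
home_runs = [12, 25, 31, 8, 17, 22, 40, 15, 19, 27]

def histogram(values, bin_width):
    """Count integer values into fixed-width bins, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)

mean_hr = statistics.mean(home_runs)
stdev_hr = statistics.stdev(home_runs)  # sample standard deviation (n - 1)
bins = histogram(home_runs, 10)

print(f"mean={mean_hr:.1f} stdev={stdev_hr:.2f}")
for lower in sorted(bins):
    # One '#' per hitter in the bin, e.g. "10-19 ####"
    print(f"{lower:>2}-{lower + 9:<2} {'#' * bins[lower]}")
```

The same approach scales to thousands of rows without the recalculation pain of a spreadsheet, though a real analysis would probably read the values from a file rather than a hard-coded list.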
um, no
You. Can. Not. Use. VBA. To. Do. Anything. That. A. Stats. Package. Can. Do.
But please, convince me: in VBA, how would you do this (from R):
library(lme4)  # lmer() lives in the lme4 package, which also ships sleepstudy
fit <- lmer(Reaction ~ 1 + (1|Days) + (1|Subject), data = sleepstudy)
[fill in anything you might do with a multilevel fit..]
and for that matter, how does Excel/VBA handle something like the ACS?
R is a memory hog, but if you have 64G into which to bang the ACS and its 3m rows and its hundreds of variables, you can. I don’t even want to think what Excel/VBA would make of millions of rows and hundreds of columns.
Well, you certainly wouldn’t want to store such data in Excel. But VBA can handle very large amounts of data in its arrays.
If I’m dealing with a huge dataset and need to do an analysis that isn’t available in the in-house stats package (SPSS), I usually use some combination of VBA and ADO.
BTW, I’m not claiming that VBA or VB are the BEST things to use for most kinds of analysis. But you CAN use it for a lot of things (I have one coworker who, quite irrationally, insists on using it for every project he is assigned). And if you live in a specialized corner of the data analysis universe where the options offered by a stats package aren’t always what you want, it can be quite useful.
you can't block the Bocock
American Community Survey
The 1% survey of the US now published annually by the Census. As noted, ~3m rows for people, plus a table of all the attached households. It’s not a large data set in the R-and-Stata-would-choke sense, but it strikes me as way too big for Excel + VBA. Having recently, by necessity, built an Excel + VBA sheet with a piddling few thousand rows by few score columns that runs Very Slowly When It Recalculates, I have my doubts.
I don’t know R, so I can’t tell you. I would assume that your line there is some sort of model evaluation.
I can tell you that any mathematical or statistical method can be turned into a VBA function. Personally, I’ve never done anything more complicated than logistic regression and most of the time, it would be reinventing the wheel to use VBA to copy stat package functions.
you can't block the Bocock
lmer == 'linear mixed effects in R'
Here, “can be” is the operative term. Here’s Doug Bates’s code for lmer.c: http://lme4.r-forge.r-project.org/doxygen/lmer_8c-source.html I’m sure it can be VBAized, but I defy you to implement lmer (or any other interesting R function) and actually make it work.
For that matter, why were you even wasting your time with logistic regressions in VBA? Even spss should be able to spit those out, and R’s glm() will handle them in one line.
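To make "reinventing the wheel" concrete, here is roughly what a hand-rolled logistic regression involves. This is a minimal sketch in Python, not anyone's actual VBA or R code: it fits a single predictor by plain gradient ascent on the log-likelihood, whereas R's glm() uses iteratively reweighted least squares. The data, the function names, and the learning-rate and iteration settings are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad0 = 0.0
        grad1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            grad0 += error
            grad1 += error * x
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1

# Made-up data: one numeric predictor and a 0/1 outcome (not perfectly separable).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"intercept={b0:.3f} slope={b1:.3f}")
print(f"P(y=1 | x=8) = {sigmoid(b0 + b1 * 8):.3f}")
```

Even this toy version leaves out standard errors, convergence checks, and multiple predictors, which is most of what a stats package gives you for free.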
had to make a user-form program for someone else to do their own (very specific) LRs.
you can't block the Bocock
'very specific'?
Douglas M. Bates is not some kid cooking up oddball stats in his basement.
I mean, his name is on the spine of textbooks for a reason.
by very specific, I mean these LRs were predicting chance of promotion from similarly formatted datasets.
you can't block the Bocock
my mistake
..I misread “had” as “hard”.
I’d still have scripted it in R, but if your end user just can’t handle anything without the MSFT brand on it, I see the utility of reinventing the logit wheel.
That's bullshit
I’ve built some incredibly complex AND powerful spreadsheets in excel.
Nothing to do with baseball, of course.
Eugeniooooooo!!!!
by FairweatherFan on Aug 28, 2008 7:36 AM PDT
yes, and so have I
..and it is always a terrible idea.
Always.
But it happens over and over again, because Excel is there, and familiar, and a sunk cost.
R
so i’ve seen a couple references to R now…if not by actual users, then by people who are at least curious about it. Is there a baseball-specific R group or web site?
I have never heard of any baseball-oriented R group. However, the awesome thing about R is that, since it’s an open source project, you can usually find someone who has already done about the same task you are looking to do, and just do what they did. It has also been my experience (at least with the Bioconductor packages) that the authors of packages are often more than happy to correspond with you if you are struggling with something.
I’ve been using the Lahman database for Microsoft Access for a long while. Still one of my favorites for looking up a simple query quickly. I’ve been using the Retrosheet play by play stuff here and there and it is awesome. I’ve been slowly moving Retrosheet data into a MySQL database. I’ve also been messing around with MLB’s pitch f/x stuff. And lots and lots of websites, most of the ones mentioned above.
Python and Perl in the scripting department. SQL for queries. Excel for making the data look pretty. I really don’t do anything too fancy so the above has met my needs, but I need to mess around with some of the fancier analytic tools like R.
Keiichi Yabu: Leading your San Francisco Giants in triple plays induced
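The "scripting plus SQL for queries" workflow described in the comment above might look something like the following. It is a minimal sketch in Python using the built-in sqlite3 module with an in-memory database, so it runs without MySQL or Postgres installed. The `batting` table, its columns, and the rows are hypothetical and far simpler than the real Lahman or Retrosheet schemas.

```python
import sqlite3

# In-memory database so the example needs no server or file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, year INTEGER, hr INTEGER)")

# Made-up rows standing in for a real season-by-season batting table.
rows = [
    ("player_a", 2006, 20),
    ("player_a", 2007, 25),
    ("player_b", 2006, 31),
    ("player_b", 2007, 10),
    ("player_c", 2007, 5),
]
conn.executemany("INSERT INTO batting VALUES (?, ?, ?)", rows)

# Career home run totals per player, highest first.
totals = conn.execute(
    "SELECT player, SUM(hr) AS total_hr "
    "FROM batting GROUP BY player ORDER BY total_hr DESC"
).fetchall()

for player, total_hr in totals:
    print(player, total_hr)

conn.close()
```

The SQL itself is plain enough that it should carry over to MySQL or Postgres with little or no change; only the connection code would differ.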
not to be the resident software bigot yet again, but..
..Postgres, not mysql. Postgres, not mysql. Postgres, not mysql. Not mysql.
If you need an inducement, postgres has pl/r: http://www.joeconway.com/plr/
All my poker database softwares use postgres
but the pitchfx database Mike Fast set up (I think I saw you post over there in the comments) is mysql. (I have very limited knowledge of scripts and parsing but I was able to get his database instructions followed fine)
Should I be looking to convert my mysql database to postgresql? How difficult/time consuming do you think it’d be (a quick Google search suggests it’s not overly hard)? I’m probably just going to wait until the offseason either way, as I imagine it’s not that easy.
Can't get enough of the Oakland A's? Visit Oaktown Awesomer's
I don’t know why I keep clicking on this topic when I don’t remotely understand a thing that anyone is saying.
Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.
Yeah, that’s probably it.
Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.
I should have just skipped down to this comment.
I kept reading things like "Scripting that’s programming scripty VCP MORBRBBL" and being like "yeah, I agree with that," and then realizing it meant nothing to me and being like "why am I mindlessly agreeing with this?" and then being like "why am I even reading this?" and then being like "oh, groug said something I can actually agree with" and then being like all writing this comment.
My Dave Righetti is better than your Dave Righetti.
by howtheyscored on Aug 29, 2008 3:05 AM PDT
Being? You can’t program on BeOS! (But you can script!)
Fred Lewis can stand under my umbrella.
31 May 2007, 21:38 EST - the last time Matteh's career W-L wasn't below .500
by S.F. Giangst on Aug 29, 2008 4:35 AM PDT
Nah, you can’t script with Pokemon. I mean, at least not without Inscripteon (Eevee’s evolved state when you feed it a desktop PC).
My Dave Righetti is better than your Dave Righetti.
by howtheyscored on Aug 29, 2008 9:49 PM PDT
