
stats tools...

Just curious for the stat heads out there: what tools do you use for your analyses? I see a lot of Excel and/or FanGraphs-looking charts online, but I'm wondering whether people are using more sophisticated statistics-specific software, like Stata or Matlab.

Also, where does your data come from? I can't imagine people are just hand-copying numbers from baseball-reference.com. Is there some way to retrieve and parse large amounts of data from the MLB site?

 

This FanPost is reader-generated, and it does not necessarily reflect the views of McCovey Chronicles. If the author uses filler to achieve the minimum word requirement, a moderator may edit the FanPost for his or her own amusement.


Comments


My Favorite Tool

Apologies to Xanthan

I was THE GREATEST OF ALL TIME (for 3 days in 1995).

by Mike Benjamin Hit King on Aug 27, 2008 2:57 PM PDT reply actions  

xanthan is a STATS tool

Oh no I didn’t.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 27, 2008 3:01 PM PDT up reply actions  

You did!

I seent it.

And I applauded it.

Billy Hayes: His job is better than yours.

by delorean on Aug 27, 2008 3:08 PM PDT up reply actions  

I’ll make it up to you with a fabulous opportunity to make you rich and famous. Or part of the subject of a college media project. I’m doing a thing on baseball stats websites for my journalism class and I’d like to get BayCityBall in on that.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 28, 2008 8:02 AM PDT up reply actions  

Super. I’ll shoot you an email in a few weeks and we’ll get started on things.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 28, 2008 9:14 AM PDT up reply actions  

SPSS and Excel with VBA.

Not that there aren’t better things out there, but these are what I know.

you can't block the Bocock

by oldjacket on Aug 27, 2008 3:51 PM PDT reply actions  

Can you program?

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 27, 2008 3:53 PM PDT reply actions  

a lil'

I run other people’s Matlab scripts at work, but I’ve done some rudimentary stuff in R. As an undergrad, I used Stata in my econometrics classes and I also took an introductory course for SAS.

by FPTV on Aug 27, 2008 4:52 PM PDT up reply actions  

I don't think that's programming

..that’s scripting. Not that an R or python+scipy script can’t do pretty impressive things, mind you.

by wcw on Aug 27, 2008 5:33 PM PDT up reply actions  

right, but...

when you’re in a social sciences department at a state school, you just get in the habit of calling it “programming”.

by FPTV on Aug 27, 2008 5:48 PM PDT up reply actions  

If you can do Perl

it’s pretty much the best language for data parsing / manipulation. Though I do love Python and it is great.

Of course, other languages, like Java, can do data manipulation too.

Don’t worry, you don’t have to know how to do a Quicksort, or about the nuances of different algorithms. You don’t need to be a CS major to do this stuff.

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 27, 2008 11:35 PM PDT up reply actions  

Matlab is most certainly programming

And is way overkill for any statistical analysis I’ve done.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:33 AM PDT up reply actions  

I was surprised to learn that most of the newer professors in my undergrad program were doing all of their econometrics work in Matlab (rather than something more specific, like Stata).

Actually, when I showed up at my current job using R for stats, my managers all gave me grief along the lines of, “You’re an engineer now. REAL engineers use MATLAB!” ;-)

by FPTV on Aug 28, 2008 10:14 AM PDT up reply actions  

There is programming and then there is programming

Matlab isn’t really programming.

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 28, 2008 10:58 AM PDT up reply actions  

That statement

Makes me wonder what experience you have with matlab.

I think if I can write (well, help write) a flight control computer for an autonomous vehicle with it, it’s programming.

Ever used Simulink?

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 2:52 PM PDT up reply actions  

scripts can work *wonders*

I recently was on a project at a reasonably successful internet shop whose systems are built almost entirely on python scripts. My work there was mostly on R scripts, which didn’t run the company, but did a whole lot with a few lines, where a C programmer would have spent weeks. There is nothing wrong with Matlab scripts, and if you used them to control a robot vehicle, bully for you.

But, you know, they’re scripts. Someone else has done the heavy lifting behind the scenes. You just make it go.

by wcw on Aug 28, 2008 5:30 PM PDT up reply actions  

hey wow distinctions that don’t make a difference

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:32 PM PDT up reply actions  

in this taxonomy, Matlab is 'scripting'

To me, high-level is scripting, low-level is programming.

If you never have to allocate memory, you’re probably scripting.

by wcw on Aug 28, 2008 5:27 PM PDT up reply actions  

right now i use a combination of sed and whatever scripting language I have handy to clean up my data… I was gonna learn Perl until my coworkers shamed me into switching to Python.

by FPTV on Aug 28, 2008 10:10 AM PDT up reply actions  

retrosheet can get you the raw data

but you’ll need to know how to program to sort through it

by NeifiChicken on Aug 27, 2008 4:04 PM PDT reply actions  
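For anyone wondering what "sorting through" the raw data might look like, here is a minimal sketch in Python. It assumes the comma-separated layout of Retrosheet event files, where each `play` record carries the inning, a home/visitor flag, the batter ID, the count, the pitch sequence and the event. The sample records, the batter IDs and the function name `count_plays_by_batter` are made up for illustration. Note that `play` records also cover some non-batting events, so this count is not the same thing as plate appearances.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical sample in the Retrosheet event-file layout; real files
# contain more record types (start, sub, com, data, ...) and real IDs.
SAMPLE_EVENTS = """id,SFN200808270
info,visteam,COL
play,1,0,batter01,11,BCX,S7
play,1,0,batter02,00,X,63
play,1,1,batter03,32,BBCBFX,HR/9
play,2,0,batter01,02,CSS,K
"""


def count_plays_by_batter(lines):
    """Count 'play' records per batter ID, skipping other record types."""
    counts = Counter()
    for row in csv.reader(lines):
        # Field order: play, inning, home(1)/visitor(0), batter, count, pitches, event
        if row and row[0] == "play":
            counts[row[3]] += 1
    return counts


if __name__ == "__main__":
    plays = count_plays_by_batter(StringIO(SAMPLE_EVENTS))
    for batter, n in plays.most_common():
        print(batter, n)
```

The same function should work on a real event file by passing an open file handle instead of the `StringIO` sample.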

ok, I take it back

if parsing text is ‘programming’, you can program.

by wcw on Aug 27, 2008 5:33 PM PDT up reply actions  

databuilding is programming

you can't block the Bocock

by oldjacket on Aug 27, 2008 8:40 PM PDT up reply actions  

Writing a set of instructions for a computer

So that the computer does something for you is programming.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:34 AM PDT up reply actions  

Baseball Prospectus

Ran across this Fora TV video featuring Kevin Goldstein and Christina Kahrl of Baseball Prospectus. I hadn’t known until I’d checked out the vid that BP’s Ms. Kahrl is transgendered. Not that there’s anything wrong with that…

The Fora spot is part of a Baseball Prospectus book tour, and offers some interesting insights on the gathering of baseball data.

by biff pocoroba on Aug 27, 2008 4:16 PM PDT reply actions  

I like to use spreadsheets.

BB-Ref’s Play Index is awesome.

Other good stat sites: First Inning, Statcorner, Fangraphs, The Baseball Cube

I’m playing with the Lahman database at work; I don’t have Access at home.

by xanthan on Aug 27, 2008 4:17 PM PDT reply actions  

I assume he means MS Access

The program

Billy Hayes: His job is better than yours.

by delorean on Aug 27, 2008 5:59 PM PDT up reply actions  

woops...

that’s another way to read that comment. And here I was thinking he was just telling him to steal it from work. :-)

by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions  

incidentally...

why don’t you just steal it from work?

by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions  

When it comes to “stats” and “tool”, I’d recommend using either a ruler or tape measure.

Zooperstars, they quack me up!

by Goofus on Aug 27, 2008 4:54 PM PDT reply actions  

But those devices don’t measure a curve well.

by chilibean_3 on Aug 27, 2008 7:08 PM PDT up reply actions  

baseball-reference.com

"But if he's swinging at real flies, well, in that case there are two definite solutions: 1) Fresno 2) Ritalin." - howtheyscored

by CPGiant756 on Aug 27, 2008 5:28 PM PDT reply actions  

stats

Me, I just let smarter folks with time on their hands run the numbers (cf. firstinning.com, minorleaguesplits.com and statcorner.com). But when I have done stats work for actual money, I’ve tended to work in R by preference, in SPSS (ugh, ptui) by necessity, and occasionally in Octave (a Matlab clone, much as R is an S-PLUS clone), Python or really, anything but Excel.

by wcw on Aug 27, 2008 5:35 PM PDT reply actions  

please don't use Excel

..it’s for quick spreadsheets, not analysis.

by wcw on Aug 27, 2008 5:55 PM PDT up reply actions  

Right, but...

…the guy’s got to start somewhere, and Excel is a lot less intimidating than, say, Matlab, especially if you’re also learning the statistics concurrently.

So what is it that you do in the daytime that makes you such a software snob, anyway? :-)

by FPTV on Aug 27, 2008 6:02 PM PDT up reply actions  

I'm an investment geek

blah blah CFA blah blah data blah blah quant blah. Though I learned to despise spss on a project in the survey world, where the python shop I noted above was (and is, though I am onto something new) trying to sell a survey-based product into the buy-side and hedge-fund worlds.

by wcw on Aug 28, 2008 5:39 PM PDT up reply actions  

bah.

You can use Excel to do real analysis, just stay away from the spreadsheet functions, if possible. You can use VBA to do anything that a statistical package can do. Its pivot tables are also a legit and quick way to do some kinds of analysis.

Granted, I think the best use of Excel is for making reports and doing summary statistics, but there’s nothing there to turn your nose up at.

you can't block the Bocock

by oldjacket on Aug 27, 2008 8:50 PM PDT up reply actions  

in my experience, excel is nice for small data sets, and quick summary stats like averages and standard deviations, but anything more complicated (like drawing a histogram) or involving more than a few hundred data points gets to be a pain in the ass pretty quickly. Then again, I don’t know how to use pivot tables or VBA.

by FPTV on Aug 27, 2008 9:35 PM PDT up reply actions  
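As a rough illustration of the summary stats and histogram mentioned above, here is a small sketch in a scripting language, using only Python's standard library. The home-run totals are invented, and `text_histogram` is a hypothetical helper, not part of any package.

```python
import statistics
from collections import Counter

# Hypothetical season home-run totals, made up for illustration.
home_runs = [5, 12, 18, 21, 23, 27, 31, 33, 36, 45]


def text_histogram(values, bin_width):
    """Return {bin_start: count} for fixed-width bins, sorted by bin start."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    print("mean: ", statistics.mean(home_runs))
    print("stdev:", round(statistics.stdev(home_runs), 2))
    # One row per bin, one '#' per observation in that bin.
    for start, count in text_histogram(home_runs, 10).items():
        print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

A plotting library would draw a nicer chart, but the binning step is the same idea.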

I can see why you would be down on Excel without them. It’s a very limited analytical tool without VB and PT.

you can't block the Bocock

by oldjacket on Aug 28, 2008 8:22 AM PDT up reply actions  

um, no

You. Can. Not. Use. VBA. To. Do. Anything. That. A. Stats. Package. Can. Do.

But please, convince me: in VBA, how would you do this (from R):

library(lme4)  # lmer() and the sleepstudy data both come from the lme4 package
fit <- lmer(Reaction ~ 1 + (1|Days) + (1|Subject), data = sleepstudy)
# [fill in anything you might do with a multilevel fit..]

by wcw on Aug 28, 2008 5:37 PM PDT up reply actions  

and for that matter, how does Excel/VBA handle something like the ACS?

R is a memory hog, but if you have 64G into which to bang the ACS and its 3m rows and its hundreds of variables, you can. I don’t even want to think what Excel/VBA would make of millions of rows and hundreds of columns.

by wcw on Aug 28, 2008 5:53 PM PDT up reply actions  

Well, you certainly wouldn’t want to store such data in Excel. But VBA can handle very large amounts of data in its arrays.

If I’m dealing with a huge dataset and need to do an analysis that isn’t available in the in-house stats package (SPSS), I usually use some combination of VBA and ADO.

BTW, I’m not claiming that VBA or VB are the BEST things to use for most kinds of analysis. But you CAN use it for a lot of things (I have one coworker who, quite irrationally, insists on using it for every project he is assigned). And if you live in a specialized corner of the data analysis universe where the options offered by a stats package aren’t always what you want, it can be quite useful.

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions  

PS

what’s the ACS?

Actuarial something?

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions  

American Community Survey

The 1% survey of the US now published annually by the Census. As noted, ~3m rows for people, plus a table of all the attached households. It’s not a large data set in the R-and-Stata-would-choke sense, but it strikes me as way too big for Excel + VBA. Having recently, by necessity, built an Excel + VBA sheet with a piddling few thousand rows by few score columns that runs Very Slowly When It Recalculates, I have my doubts.

by wcw on Aug 28, 2008 8:29 PM PDT up reply actions  

I don’t know R, so I can’t tell you. I would assume that your line there is some sort of model evaluation.

I can tell you that any mathematical or statistical method can be turned into a VBA function. Personally, I’ve never done anything more complicated than logistic regression and most of the time, it would be reinventing the wheel to use VBA to copy stat package functions.

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:47 PM PDT up reply actions  

lmer == 'linear mixed effects in R'

Here, “can be” is the operative term. Here’s Doug Bates’s code for lmer.c: http://lme4.r-forge.r-project.org/doxygen/lmer_8c-source.html I’m sure it can be VBAized, but I defy you to implement lmer (or any other interesting R function) and actually make it work.

For that matter, why were you even wasting your time with logistic regressions in VBA? Even spss should be able to spit those out, and R’s glm() will handle them in one line.

by wcw on Aug 28, 2008 8:36 PM PDT up reply actions  

had to make a user-form program for someone else to do their own (very specific) LRs.

you can't block the Bocock

by oldjacket on Aug 28, 2008 9:04 PM PDT up reply actions  

'very specific'?

Douglas M. Bates is not some kid cooking up oddball stats in his basement.

I mean, his name is on the spine of textbooks for a reason.

by wcw on Aug 28, 2008 9:16 PM PDT up reply actions  

by very specific, I mean these LRs were predicting chance of promotion from similarly formatted datasets.

you can't block the Bocock

by oldjacket on Aug 28, 2008 9:24 PM PDT up reply actions  

my mistake

..I misread “had” as “hard”.

I’d still have scripted it in R, but if your end user just can’t handle anything without the MSFT brand on it, I see the utility of reinventing the logit wheel.

by wcw on Aug 28, 2008 9:30 PM PDT up reply actions  

That's bullshit

I’ve built some incredibly complex AND powerful spreadsheets in excel.

Nothing to do with baseball, of course.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:36 AM PDT up reply actions  

yes, and so have I

..and it is always a terrible idea.

Always.

But it happens over and over again, because Excel is there, and familiar, and a sunk cost.

by wcw on Aug 28, 2008 5:38 PM PDT up reply actions  

This is a good explanation of how to download the pitch f/x data.

R is pretty awesome for statistical analysis.

by Scottsdale on Aug 27, 2008 6:05 PM PDT reply actions  

R

so i’ve seen a couple references to R now…if not by actual users, then by people who are at least curious about it. Is there a baseball-specific R group or web site?

by FPTV on Aug 28, 2008 10:13 AM PDT up reply actions  

I think there’s a whole book about R and baseball stats: Baseball Hacks

you can't block the Bocock

by oldjacket on Aug 28, 2008 11:24 AM PDT up reply actions  

I LIKE IT! thanks!

by FPTV on Aug 28, 2008 1:37 PM PDT up reply actions  

I have never heard of any baseball-oriented R group. However, the awesome thing about R is that, being an open source project, you can usually find someone who has previously done about the same task as you are looking to do and just do what they did. It has also been my experience (at least with the Bioconductor packages) that the authors of packages are often more than happy to correspond with you if you are struggling with something.

by Scottsdale on Aug 28, 2008 11:25 AM PDT up reply actions  

I’ve been using the Lahman database for Microsoft Access for a long while. Still one of my favorites for looking up a simple query quickly. I’ve been using the Retrosheet play by play stuff here and there and it is awesome. I’ve been slowly moving Retrosheet data into a MySQL database. I’ve also been messing around with MLB’s pitch f/x stuff. And lots and lots of websites, most of the ones mentioned above.

Python and Perl in the scripting department. SQL for queries. Excel for making the data look pretty. I really don’t do anything too fancy so the above has met my needs, but I need to mess around with some of the fancier analytic tools like R.

Keiichi Yabu: Leading your San Francisco Giants in triple plays induced

by BaysideBaller on Aug 28, 2008 1:48 AM PDT reply actions  
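The "SQL for queries" step above might look something like the sketch below. It assumes a `Batting` table with `playerID`, `yearID` and `HR` columns, as in the Lahman database. The rows are invented, and an in-memory SQLite database stands in for Access or MySQL so the example is self-contained. The function names are hypothetical.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with a tiny Lahman-style Batting table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER)"
    )
    # Hypothetical rows; the real table has many more columns and players.
    conn.executemany(
        "INSERT INTO Batting VALUES (?, ?, ?)",
        [
            ("player_a", 2006, 20),
            ("player_a", 2007, 25),
            ("player_b", 2007, 11),
            ("player_b", 2008, 9),
        ],
    )
    return conn


def career_home_runs(conn):
    """Sum HR per playerID, highest total first."""
    query = """
        SELECT playerID, SUM(HR) AS career_hr
        FROM Batting
        GROUP BY playerID
        ORDER BY career_hr DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for player_id, total in career_home_runs(build_sample_db()):
        print(player_id, total)
```

The `SELECT ... GROUP BY` statement is plain SQL, so it should carry over to MySQL or Postgres with little or no change.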

not to be the resident software bigot yet again, but..

..Postgres, not mysql. Postgres, not mysql. Postgres, not mysql. Not mysql.

If you need an inducement, postgres has pl/r: http://www.joeconway.com/plr/

by wcw on Aug 28, 2008 5:47 PM PDT up reply actions  

All my poker database softwares use postgres

but the pitchfx database Mike Fast set up (I think I saw you post over there in the comments) is mysql. (I have very limited knowledge of scripts and parsing but I was able to get his database instructions followed fine)

Should I be looking to convert my mysql database to postgresql? How difficult/time consuming do you think it’d be (a quick Google search suggests it’s not overly hard)? I’m probably just going to wait until the offseason either way, as I imagine it’s not that easy.

Can't get enough of the Oakland A's? Visit Oaktown Awesomer's

by iamawesomer on Aug 31, 2008 7:39 PM PDT up reply actions  

I don’t know why I keep clicking on this topic when I don’t remotely understand a thing that anyone is saying.

Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.

by groug on Aug 28, 2008 6:12 PM PDT reply actions  

OCD?

Farewell, Ray. We'll miss your smile and your sugar. Welcome, Steve Hammond "Eggs". Throw strikes.
comics | cartoons | Nattowear

by Natto on Aug 28, 2008 7:19 PM PDT up reply actions  

Yeah, that’s probably it.

Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.

by groug on Aug 28, 2008 9:10 PM PDT up reply actions  

I should have just skipped down to this comment.

I kept reading things like "Scripting that’s programming scripty VCP MORBRBBL" and being like "yeah, I agree with that," and then realizing it meant nothing to me and being like "why am I mindlessly agreeing with this?" and then being like "why am I even reading this?" and then being like "oh, groug said something I can actually agree with" and then being like all writing this comment.

My Dave Righetti is better than your Dave Righetti.

by howtheyscored on Aug 29, 2008 3:05 AM PDT up reply actions  

Being? You can’t program on BeOS! (But you can script!)

Fred Lewis can stand under my umbrella.
31 May 2007, 21:38 EST - the last time Matteh's career W-L wasn't below .500

by S.F. Giangst on Aug 29, 2008 4:35 AM PDT up reply actions  

I just assumed everyone was talking about Pokemon again.

by Evan on Aug 29, 2008 11:39 AM PDT up reply actions  

Nah, you can’t script with Pokemon. I mean, at least not without Inscripteon (Eevee’s evolved state when you feed it a desktop PC).

My Dave Righetti is better than your Dave Righetti.

by howtheyscored on Aug 29, 2008 9:49 PM PDT up reply actions  


