McCovey Chronicles: An SB Nation Community


stats tools...

Just curious for the stat heads out there: what tools do you use for your analyses? I see a lot of Excel- and/or FanGraphs-looking charts online, but I'm wondering whether people are using more sophisticated statistics-specific software, like Stata or Matlab.

Also, where does your data come from? I can't imagine people are just hand-copying numbers from baseball-reference.com. Is there some way to retrieve and parse large amounts of data from the MLB site?

 

This FanPost is reader-generated, and it does not necessarily reflect the views of McCovey Chronicles. If the author uses filler to achieve the minimum word requirement, a moderator may edit the FanPost for his or her own amusement.

Comments

My Favorite Tool

Apologies to Xanthan

I was THE GREATEST OF ALL TIME (for 3 days in 1995).

by Mike Benjamin Hit King on Aug 27, 2008 2:57 PM PDT reply actions   0 recs

xanthan is a STATS tool

Oh no I didn’t.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 27, 2008 3:01 PM PDT up reply actions   0 recs

You did!

I seent it.

And I applauded it.

Billy Hayes: His job is better than yours.

by delorean on Aug 27, 2008 3:08 PM PDT up reply actions   0 recs

I’ll make it up to you with a fabulous opportunity to make you rich and famous. Or part of the subject of a college media project. I’m doing a thing on baseball stats websites for my journalism class and I’d like to get BayCityBall in on that.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 28, 2008 8:02 AM PDT up reply actions   0 recs

I’d be honored!

by xanthan on Aug 28, 2008 9:02 AM PDT up reply actions   0 recs

Super. I’ll shoot you an email in a few weeks and we’ll get started on things.

"While conservatives tell you 'leave things alone and no one will lose,' and liberals tell you 'interfere a lot and no one will lose,' baseball says 'someone will lose.' Not only says it - but insists upon it! ... Democracy is lovely, but baseball's more mature." BVCE supports SF Dugout and Manny Burriss.

by BaronVonCurrentEvents on Aug 28, 2008 9:14 AM PDT up reply actions   0 recs

SPSS and Excel with VBA.

Not that there aren’t better things out there, but these are what I know.

you can't block the Bocock

by oldjacket on Aug 27, 2008 3:51 PM PDT reply actions   0 recs

Can you program?

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 27, 2008 3:53 PM PDT reply actions   0 recs

a lil'

I run other people’s Matlab scripts at work, but I’ve done some rudimentary stuff in R. As an undergrad, I used Stata in my econometrics classes and I also took an introductory course for SAS.

by FPTV on Aug 27, 2008 4:52 PM PDT up reply actions   0 recs

I don't think that's programming

..that’s scripting. Not that an R or python+scipy script can’t do pretty impressive things, mind you.

by wcw on Aug 27, 2008 5:33 PM PDT up reply actions   0 recs

right, but...

when you’re in a social sciences department at a state school, you just get in the habit of calling it “programming”.

by FPTV on Aug 27, 2008 5:48 PM PDT up reply actions   0 recs

If you can do Perl

it’s pretty much the best language for data parsing / manipulation. Though I do love Python and it is great.

Of course, other languages, like Java, can do data manipulation too.

Don’t worry, you don’t have to know how to do a Quicksort, or about the nuances of different algorithms. You don’t need to be a CS major to do this stuff.

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 27, 2008 11:35 PM PDT up reply actions   0 recs

Matlab is most certainly programming

And is way overkill for any statistical analysis I’ve done.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:33 AM PDT up reply actions   0 recs

I was surprised to learn that most of the newer professors in my undergrad program were doing all of their econometrics work in Matlab (rather than something more specific, like Stata).

Actually, when I showed up at my current job using R for stats, my managers all gave me grief along the lines of, “You’re an engineer now. REAL engineers use MATLAB!” ;-)

by FPTV on Aug 28, 2008 10:14 AM PDT up reply actions   0 recs

There is programming and then there is programming

Matlab isn’t really programming.

ZIPS: Milledge: 466 HR, 485 2B, 2282 hits, 278-379-524

by rfloh on Aug 28, 2008 10:58 AM PDT up reply actions   0 recs

That statement

Makes me wonder what experience you have with matlab.

I think if I can write (well, help write) a flight control computer for an autonomous vehicle with it, it’s programming.

Ever used Simulink?

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 2:52 PM PDT up reply actions   0 recs

scripts can work *wonders*

I recently was on a project at a reasonably successful internet shop whose systems are built almost entirely on python scripts. My work there was mostly on R scripts, which didn’t run the company, but did a whole lot with a few lines, where a C programmer would have spent weeks. There is nothing wrong with Matlab scripts, and if you used them to control a robot vehicle, bully for you.

But, you know, they’re scripts. Someone else has done the heavy lifting behind the scenes. You just make it go.

by wcw on Aug 28, 2008 5:30 PM PDT up reply actions   0 recs

hey wow distinctions that don’t make a difference

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:32 PM PDT up reply actions   0 recs

in this taxonomy, Matlab is 'scripting'

To me, high-level is scripting, low-level is programming.

If you never have to allocate memory, you’re probably scripting.

by wcw on Aug 28, 2008 5:27 PM PDT up reply actions   0 recs

right now i use a combination of sed and whatever scripting language I have handy to clean up my data… I was gonna learn Perl until my coworkers shamed me into switching to Python.

by FPTV on Aug 28, 2008 10:10 AM PDT up reply actions   0 recs
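
[Editor's sketch] For anyone wondering what that kind of sed-style cleanup might look like in Python alone, here is a minimal example. The sample lines, the footnote markers, and the `clean_line` helper are all made up for illustration; real scraped data will need its own rules.

```python
import re

def clean_line(line):
    """Normalize one raw text line: drop footnote marks, collapse whitespace, trim."""
    line = re.sub(r"[*#]", "", line)   # remove footnote markers such as '*' or '#'
    line = re.sub(r"\s+", " ", line)   # collapse runs of spaces/tabs into one space
    return line.strip()                # trim leading and trailing whitespace

# Made-up sample rows: a name followed by two numeric columns.
raw = ["  PlayerA*   \t .250  9 ", "PlayerB\t.300   10"]
cleaned = [clean_line(r).split(" ") for r in raw]
print(cleaned)
```

Each regex substitution plays roughly the role of one sed `s/.../.../g` command, and the final `split` turns the cleaned line into columns.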

retrosheet can get you the raw data

but you’ll need to know how to program to sort through it

by NeifiChicken on Aug 27, 2008 4:04 PM PDT reply actions   0 recs
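
[Editor's sketch] As a rough idea of the programming involved, the snippet below counts strikeouts per batter from comma-separated play-by-play lines. It assumes the Retrosheet event-file layout for `play` records (inning, home/visitor flag, batter id, count, pitch sequence, event text); the sample lines and batter ids are invented, and `strikeouts_by_batter` is a hypothetical helper, not part of any Retrosheet tool.

```python
import csv
from collections import Counter

# Invented lines in the assumed layout:
# play,inning,visitor(0)/home(1),batter id,count,pitches,event
SAMPLE = """\
id,XXX200801010
play,1,0,battr001,12,CBS,K
play,1,0,battr002,00,X,S8
play,1,1,battr003,32,BBCFBB,W
play,2,0,battr001,01,CX,HR/7
"""

def strikeouts_by_batter(lines):
    """Count play records whose event text starts with 'K' (a strikeout)."""
    counts = Counter()
    for row in csv.reader(lines):
        # Skip non-play records (id, info, start, ...) and malformed rows.
        if len(row) >= 7 and row[0] == "play" and row[6].startswith("K"):
            counts[row[3]] += 1
    return counts

print(strikeouts_by_batter(SAMPLE.splitlines()))
```

The same loop can be pointed at an open file object instead of `SAMPLE.splitlines()`, since `csv.reader` accepts any iterable of lines.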

ok, I take it back

if parsing text is ‘programming’, you can program.

by wcw on Aug 27, 2008 5:33 PM PDT up reply actions   0 recs

databuilding is programming

you can't block the Bocock

by oldjacket on Aug 27, 2008 8:40 PM PDT up reply actions   0 recs

Writing a set of instructions for a computer

So that the computer does something for you is programming.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:34 AM PDT up reply actions   0 recs

#!/bin/sh

sudo rm -rf /

by wcw on Aug 28, 2008 5:31 PM PDT up reply actions   0 recs

prompt> while (fork)

FIRE BRIAN SABEAN

by zenbitz on Aug 28, 2008 8:41 PM PDT up reply actions   0 recs

Baseball Prospectus

Ran across this Fora TV video featuring Kevin Goldstein and Christina Kahrl of Baseball Prospectus. I hadn’t known until I’d checked out the vid that BP’s Ms. Kahrl is transgendered. Not that there’s anything wrong with that…

The Fora spot is part of a Baseball Prospectus book tour, and offers some interesting insights on the gathering of baseball data.

by biff pocoroba on Aug 27, 2008 4:16 PM PDT reply actions   0 recs

I like to use spreadsheets.

BB-Ref’s Play Index is awesome.

Other good stat sites: First Inning, Statcorner, Fangraphs, The Baseball Cube

I’m playing with the Lahman database at work, I don’t have Access at home.

by xanthan on Aug 27, 2008 4:17 PM PDT reply actions   0 recs

I assume he means MS Access

The program

Billy Hayes: His job is better than yours.

by delorean on Aug 27, 2008 5:59 PM PDT up reply actions   0 recs

woops...

that’s another way to read that comment. And here I was thinking he was just telling him to steal it from work. :-)

by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions   0 recs

incidentally...

why don’t you just steal it from work?

by FPTV on Aug 27, 2008 6:07 PM PDT up reply actions   0 recs

Stealing is bad, mmmk!

by xanthan on Aug 27, 2008 6:33 PM PDT up reply actions   0 recs

When it comes to “stats” and “tool”, I’d recommend using either a ruler or tape measure.

Zooperstars, they quack me up!

by Goofus on Aug 27, 2008 4:54 PM PDT reply actions   0 recs

But those devices don’t measure a curve well.

by chilibean_3 on Aug 27, 2008 7:08 PM PDT up reply actions   0 recs

baseball-reference.com

"But if he's swinging at real flies, well, in that case there are two definite solutions: 1) Fresno 2) Ritalin." - howtheyscored

by CPGiant756 on Aug 27, 2008 5:28 PM PDT reply actions   0 recs

stats

Me, I just let smarter folks with time on their hands run the numbers (cf. firstinning.com, minorleaguesplits.com and statcorner.com). But when I have done stats work for actual money, I’ve tended to work in R by preference, in SPSS (ugh, ptui) by necessity, and occasionally in Octave (a Matlab clone, the way R is an S-plus clone), Python or really, anything but Excel.

by wcw on Aug 27, 2008 5:35 PM PDT reply actions   0 recs

please don't use Excel

..it’s for quick spreadsheets, not analysis.

by wcw on Aug 27, 2008 5:55 PM PDT up reply actions   0 recs

Right, but...

…the guy’s got to start somewhere, and Excel is a lot less intimidating than, say, Matlab, especially if you’re also learning the statistics concurrently.

So what is it that you do in the daytime that you’re such a software snob, anyway? :-)

by FPTV on Aug 27, 2008 6:02 PM PDT up reply actions   0 recs

I'm an investment geek

blah blah CFA blah blah data blah blah quant blah. Though I learned to despise spss on a project in the survey world, where the python shop I noted above was (and is, though I am onto something new) trying to sell a survey-based product into the buy-side and hedge-fund worlds.

by wcw on Aug 28, 2008 5:39 PM PDT up reply actions   0 recs

bah.

You can use Excel to do real analysis, just stay away from the spreadsheet functions, if possible. You can use VBA to do anything that a statistical package can do. Excel’s pivot tables are also a legit and quick way to do some kinds of analysis.

Granted, I think the best use of Excel is for making reports and doing summary statistics, but there’s nothing there to turn your nose up at.

you can't block the Bocock

by oldjacket on Aug 27, 2008 8:50 PM PDT up reply actions   0 recs
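
[Editor's sketch] For readers who have not used pivot tables: the core operation is grouping rows by one field, spreading a second field across columns, and summing a third. A minimal Python version, with made-up rows and a hypothetical `pivot_sum` helper, might look like this:

```python
from collections import defaultdict

def pivot_sum(rows, row_key, col_key, value_key):
    """Return {row value: {column value: summed value}}, like a SUM pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    # Convert nested defaultdicts to plain dicts for clean printing.
    return {k: dict(v) for k, v in table.items()}

# Made-up monthly hit totals, with one player split across two April rows.
rows = [
    {"player": "A", "month": "Apr", "hits": 20},
    {"player": "A", "month": "May", "hits": 25},
    {"player": "B", "month": "Apr", "hits": 18},
    {"player": "A", "month": "Apr", "hits": 3},
]
print(pivot_sum(rows, "player", "month", "hits"))
```

This is only the summing case; Excel's pivot tables also offer counts, averages, and filters, which would each need a few more lines here.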

in my experience, excel is nice for small data sets, and quick summary stats like averages and standard deviations, but anything more complicated (like drawing a histogram) or involving more than a few hundred data points gets to be a pain in the ass pretty quickly. Then again, I don’t know how to use pivot tables or VBA.

by FPTV on Aug 27, 2008 9:35 PM PDT up reply actions   0 recs
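
[Editor's sketch] The summary statistics and histogram mentioned here take only a few lines in a scripting language. The example below uses Python's standard library; the numbers are made up and `text_histogram` is a hypothetical helper that buckets values into fixed-width bins.

```python
import statistics
from collections import Counter

def text_histogram(values, bin_width):
    """Bucket values into fixed-width bins and return {bin start: count}."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

values = [3, 7, 12, 15, 18, 21, 22, 29]  # made-up numbers

print("mean:", statistics.mean(values))
print("stdev:", statistics.stdev(values))  # sample standard deviation

# Print one row of '#' marks per bin as a crude text histogram.
width = 10
for start, count in text_histogram(values, width).items():
    print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
```

The same code works unchanged on a few hundred thousand values, which is where a spreadsheet starts to get painful.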

I can see why you would be down on Excel without them. It’s a very limited analytical tool without VB and PT.

you can't block the Bocock

by oldjacket on Aug 28, 2008 8:22 AM PDT up reply actions   0 recs

um, no

You. Can. Not. Use. VBA. To. Do. Anything. That. A. Stats. Package. Can. Do.

But please, convince me: in VBA, how would you do this (from R):

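library(lme4)  # lmer() and the sleepstudy data ship in the lme4 package; older releases also exposed lmer2()
fit <- lmer(Reaction ~ 1 + (1 | Days) + (1 | Subject), data = sleepstudy)
# [fill in anything you might do with a multilevel fit..]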

by wcw on Aug 28, 2008 5:37 PM PDT up reply actions   0 recs

and for that matter, how does Excel/VBA handle something like the ACS?

R is a memory hog, but if you have 64G into which to bang the ACS and its 3m rows and its hundreds of variables, you can. I don’t even want to think what Excel/VBA would make of millions of rows and hundreds of columns.

by wcw on Aug 28, 2008 5:53 PM PDT up reply actions   0 recs

Well, you certainly wouldn’t want to store such data in Excel. But VBA can handle very large amounts of data in its arrays.

If I’m dealing with a huge dataset and need to do an analysis that isn’t available in the in-house stats package (SPSS), I usually use some combination of VBA and ADO.

BTW, I’m not claiming that VBA or VB are the BEST things to use for most kinds of analysis. But you CAN use it for a lot of things (I have one coworker who, quite irrationally, insists on using it for every project he is assigned). And if you live in a specialized corner of the data analysis universe where the options offered by a stats package aren’t always what you want, it can be quite useful.

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions   0 recs

PS

what’s the ACS?

Actuarial something?

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:59 PM PDT up reply actions   0 recs

American Community Survey

The 1% survey of the US now published annually by the Census. As noted, ~3m rows for people, plus a table of all the attached households. It’s not a large data set in the R-and-Stata-would-choke sense, but it strikes me as way too big for Excel + VBA. Having recently, by necessity, built an Excel + VBA sheet with a piddling few thousand rows by few score columns that runs Very Slowly When It Recalculates, I have my doubts.

by wcw on Aug 28, 2008 8:29 PM PDT up reply actions   0 recs

I don’t know R, so I can’t tell you. I would assume that your line there is some sort of model evaluation.

I can tell you that any mathematical or statistical method can be turned into a VBA function. Personally, I’ve never done anything more complicated than logistic regression and most of the time, it would be reinventing the wheel to use VBA to copy stat package functions.

you can't block the Bocock

by oldjacket on Aug 28, 2008 7:47 PM PDT up reply actions   0 recs

lmer == 'linear mixed effects in R'

Here, “can be” is the operative term. Here’s Doug Bates’s code for lmer.c: http://lme4.r-forge.r-project.org/doxygen/lmer_8c-source.html I’m sure it can be VBAized, but I defy you to implement lmer (or any other interesting R function) and actually make it work.

For that matter, why were you even wasting your time with logistic regressions in VBA? Even spss should be able to spit those out, and R’s glm() will handle them in one line.

by wcw on Aug 28, 2008 8:36 PM PDT up reply actions   0 recs
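
[Editor's sketch] To give a sense of what "reinventing the logit wheel" involves, here is a bare-bones logistic regression with one predictor, fit by plain gradient ascent on the log-likelihood. This is a swapped-in teaching technique, not how R's glm() works internally (glm() uses iteratively reweighted least squares), and the data, learning rate, and step count are all arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p          # gradient with respect to the intercept
            g1 += (y - p) * x    # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(b0, b1, x):
    """Predicted probability that y = 1 at predictor value x."""
    return sigmoid(b0 + b1 * x)

# Made-up data, deliberately not perfectly separable so the fit stays finite.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print("low x:", round(predict(b0, b1, 2), 2), "high x:", round(predict(b0, b1, 7), 2))
```

Even this toy version leaves out standard errors, convergence checks, and multiple predictors, which is much of what a stats package provides for free.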

had to make a user-form program for someone else to do their own (very specific) LRs.

you can't block the Bocock

by oldjacket on Aug 28, 2008 9:04 PM PDT up reply actions   0 recs

'very specific'?

Douglas M. Bates is not some kid cooking up oddball stats in his basement.

I mean, his name is on the spine of textbooks for a reason.

by wcw on Aug 28, 2008 9:16 PM PDT up reply actions   0 recs

by very specific, I mean these LRs were predicting chance of promotion from similarly formatted datasets.

you can't block the Bocock

by oldjacket on Aug 28, 2008 9:24 PM PDT up reply actions   0 recs

my mistake

..I misread “had” as “hard”.

I’d still have scripted it in R, but if your enduser just can’t handle anything without the MSFT brand on it, I see the utility of reinventing the logit wheel.

by wcw on Aug 28, 2008 9:30 PM PDT up reply actions   0 recs

That's bullshit

I’ve built some incredibly complex AND powerful spreadsheets in excel.

Nothing to do with baseball, of course.

Eugeniooooooo!!!!

by FairweatherFan on Aug 28, 2008 7:36 AM PDT up reply actions   0 recs

yes, and so have I

..and it is always a terrible idea.

Always.

But it happens over and over again, because Excel is there, and familiar, and a sunk cost.

by wcw on Aug 28, 2008 5:38 PM PDT up reply actions   0 recs

This is a good explanation of how to download the pitch f/x data.

R is pretty awesome for statistical analysis.

by Scottsdale on Aug 27, 2008 6:05 PM PDT reply actions   0 recs

R

so i’ve seen a couple references to R now…if not by actual users, then by people who are at least curious about it. Is there a baseball-specific R group or web site?

by FPTV on Aug 28, 2008 10:13 AM PDT up reply actions   0 recs

I think there’s a whole book about R and baseball stats: Baseball Hacks

you can't block the Bocock

by oldjacket on Aug 28, 2008 11:24 AM PDT up reply actions   0 recs

I LIKE IT! thanks!

by FPTV on Aug 28, 2008 1:37 PM PDT up reply actions   0 recs

I have never heard of any baseball-oriented R group. However, the awesome thing about R is that, being an open source project, you can usually find someone who has previously done about the same task that you are looking to do and just do what they did. It has also been my experience (at least with the Bioconductor packages) that the authors of packages are often more than happy to correspond with you if you are struggling with something.

by Scottsdale on Aug 28, 2008 11:25 AM PDT up reply actions   0 recs

I’ve been using the Lahman database for Microsoft Access for a long while. Still one of my favorites for looking up a simple query quickly. I’ve been using the Retrosheet play by play stuff here and there and it is awesome. I’ve been slowly moving Retrosheet data into a MySQL database. I’ve also been messing around with MLB’s pitch f/x stuff. And lots and lots of websites, most of the ones mentioned above.

Python and Perl in the scripting department. SQL for queries. Excel for making the data look pretty. I really don’t do anything too fancy so the above has met my needs, but I need to mess around with some of the fancier analytic tools like R.

Keiichi Yabu: Leading your San Francisco Giants in triple plays induced

by BaysideBaller on Aug 28, 2008 1:48 AM PDT reply actions   0 recs
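
[Editor's sketch] A small example of the "database plus SQL queries" workflow described above. It uses SQLite (via Python's built-in sqlite3 module) purely as a stand-in for Access or MySQL, and the `batting` table, its columns, and its rows are hypothetical rather than the actual Lahman or Retrosheet schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE batting (player_id TEXT, year INTEGER, ab INTEGER, h INTEGER)"
)
# Made-up season lines for two fictional players.
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?)",
    [("playera", 2007, 400, 100), ("playera", 2008, 500, 150), ("playerb", 2008, 300, 90)],
)

def career_average(conn, min_ab):
    """Career batting average per player, keeping players with at least min_ab at-bats."""
    sql = """
        SELECT player_id, ROUND(1.0 * SUM(h) / SUM(ab), 3) AS bavg
        FROM batting
        GROUP BY player_id
        HAVING SUM(ab) >= ?
        ORDER BY bavg DESC
    """
    return conn.execute(sql, (min_ab,)).fetchall()

print(career_average(conn, 250))
```

The query itself is ordinary SQL, so it should carry over to MySQL or Postgres with little or no change; only the connection code differs.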

not to be the resident software bigot yet again, but..

..Postgres, not mysql. Postgres, not mysql. Postgres, not mysql. Not mysql.

If you need an inducement, postgres has pl/r: http://www.joeconway.com/plr/

by wcw on Aug 28, 2008 5:47 PM PDT up reply actions   0 recs

All my poker database softwares use postgres

but the pitchfx database Mike Fast set up (I think I saw you post over there in the comments) is mysql. (I have very limited knowledge of scripts and parsing but I was able to get his database instructions followed fine)

Should I be looking to convert my mysql database to postgresql? How difficult/time consuming do you think it’d be (a quick google search suggests it is not overly hard)? I’m probably just going to wait until the offseason either way, as I imagine it’s not that easy.

Can't get enough of the Oakland A's? Visit Oaktown Awesomer's

by iamawesomer on Aug 31, 2008 7:39 PM PDT up reply actions   0 recs

I don’t know why I keep clicking on this topic when I don’t remotely understand a thing that anyone is saying.

Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.

by groug on Aug 28, 2008 6:12 PM PDT reply actions   0 recs

OCD?

Farewell, Ray. We'll miss your smile and your sugar. Welcome, Steve Hammond "Eggs". Throw strikes.
comics | cartoons | Nattowear

by Natto on Aug 28, 2008 7:19 PM PDT up reply actions   0 recs

Yeah, that’s probably it.

Trent Kline: Decentish. Also, my website is called ChatterBalks Dot Com and on it I make jokes about things.

by groug on Aug 28, 2008 9:10 PM PDT up reply actions   0 recs

I should have just skipped down to this comment.

I kept reading things like "Scripting that’s programming scripty VCP MORBRBBL" and being like "yeah, I agree with that," and then realizing it meant nothing to me and being like "why am I mindlessly agreeing with this?" and then being like "why am I even reading this?" and then being like "oh, groug said something I can actually agree with" and then being like all writing this comment.

My Dave Righetti is better than your Dave Righetti.

by howtheyscored on Aug 29, 2008 3:05 AM PDT up reply actions   0 recs

Being? You can’t program on BeOS! (But you can script!)

Fred Lewis can stand under my umbrella.
31 May 2007, 21:38 EST - the last time Matteh's career W-L wasn't below .500

by S.F. Giangst on Aug 29, 2008 4:35 AM PDT up reply actions   0 recs

I just assumed everyone was talking about Pokemon again.

by Evan on Aug 29, 2008 11:39 AM PDT up reply actions   0 recs

Nah, you can’t script with Pokemon. I mean, at least not without Inscripteon (Eevee’s evolved state when you feed it a desktop PC).

My Dave Righetti is better than your Dave Righetti.

by howtheyscored on Aug 29, 2008 9:49 PM PDT up reply actions   0 recs
