I'm a huge fan of pandas (http://pandas.pydata.org/) for data analysis. It offers a lot of the basic functionality of R, but everything is in Python. The original author of pandas, Wes McKinney, even wrote a book about it: Python for Data Analysis (http://shop.oreilly.com/product/0636920023784.do).
One caveat I would mention about data analysis would be that statistics is not just number crunching. There is really a bit of an art to making sure you are looking at the right sample of data in the right way and accounting for all potential biases. Surprisingly, I have noticed that as I've gotten more experience doing data analysis, it takes me longer to do and I make less confident assertions. On the other hand, I now very rarely make assertions that turn out to be incorrect, which is extremely important. I believe that incorrect data analysis is significantly worse than no data analysis.
So, the advice I would give to people getting started is: whenever you come to a conclusion by analyzing a particular piece of data, ask yourself, "If I look at the data differently, can I come to the opposite conclusion?" You would be surprised how often the answer to this question is yes, and that is a good indicator that you a) need more data or b) cannot make a significant conclusion. This can be especially difficult when you are already sure you know the answer to a question even before you do the data analysis, but you really have to be disciplined about it.
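To make the "opposite conclusion" point concrete, here is a minimal sketch in plain Python. The conversion counts, the variant names A and B, and the mobile/desktop split are all made up for illustration; nothing here comes from real data. It shows one way the same numbers can support opposite conclusions depending on whether you look at them pooled or by segment (this is usually called Simpson's paradox).

```python
# Made-up (conversions, visitors) counts for two hypothetical page variants,
# split by a hypothetical device segment.
counts = {
    "A": {"mobile": (20, 200), "desktop": (120, 400)},
    "B": {"mobile": (60, 500), "desktop": (35, 100)},
}


def segment_rate(variant, segment):
    """Conversion rate of one variant within one segment."""
    conversions, visitors = counts[variant][segment]
    return conversions / visitors


def overall_rate(variant):
    """Conversion rate of one variant with all segments pooled together."""
    conversions = sum(c for c, _ in counts[variant].values())
    visitors = sum(v for _, v in counts[variant].values())
    return conversions / visitors


def winner_overall():
    """Variant with the higher pooled conversion rate."""
    return max(counts, key=overall_rate)


def winner_in_segment(segment):
    """Variant with the higher conversion rate inside a single segment."""
    return max(counts, key=lambda variant: segment_rate(variant, segment))


print("pooled winner: ", winner_overall())
print("mobile winner: ", winner_in_segment("mobile"))
print("desktop winner:", winner_in_segment("desktop"))
```

With these made-up counts the pooled view and the per-segment view disagree, because the two variants received very different mixes of mobile and desktop traffic. Which view is the "right" one depends on how the traffic was assigned, which is exactly the kind of question the numbers alone cannot answer.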
> One caveat I would mention about data analysis would be that statistics is not just number crunching
Agreed. Analysis that is well presented tells a story. The story begins with where the data comes from, then describes how the data was analyzed, and finally how the results relate back to the real world. If any of these pieces are missing, or unconvincing, it's a good sign that something is off, like you say, with either a) the data or b) the significance of the conclusion.
I emphasize the story aspect because the act of laying out all of the assumptions in a semi-narrative form goes a long way towards deobfuscating the potentially confusing statistics. It is also the sort of discipline that can help to lay bare all of the assumptions that could make you sure that you know the answer.
Thanks. I'll check out Pandas. R syntax was a bit of a steep learning curve. This would be a nice bridge. I'm curious about the depth of statistical libraries and graphing libraries. ggplot is pretty great.
My goal in writing the article is to address the concern that you raise about finding the right conclusion. Only if most people are on the same page about analysis can the team see through the kind of twisting you mention.
I definitely have to agree. There is nothing more dangerous than plugging some numbers into R, typing t.test(some_numbers) and proclaiming "Hey look at this great p-value!" without understanding what a p-value is actually telling you.
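For anyone unsure what a p-value is actually telling you, a permutation test spells it out more directly than a formula does. The sketch below is not what R's t.test does (that is a t-test); it is a swapped-in technique chosen because the logic is visible. The function name permutation_p_value, its parameters, and the two small samples are hypothetical and made up for illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    The p-value answers one narrow question: if the group labels were
    meaningless, how often would randomly relabelling the data produce a
    difference in means at least as large as the one we observed?
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # pretend the labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point round-off on ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Made-up measurements for two hypothetical groups.
control = [12, 15, 11, 14, 13, 16, 12, 14]
treated = [13, 16, 12, 15, 15, 17, 13, 14]
print(permutation_p_value(control, treated))
```

Note what the number does not say: it is not the probability that the effect is real, and it says nothing about whether the sample was biased or whether the difference is large enough to matter.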
The most important part of doing statistics right is actually understanding what you're doing, and you're far better off using less powerful tools that you understand than pumping data into fancy functions that you can't really explain.
And more important than just raw number crunching: understanding statistics gives you an improved intuition regarding data. The real value in statistics is the new tools it provides for thinking about problems.
A better solution than "learn R" would be to read through something like Head First Statistics and then move on from there to more advanced stuff, and only then start hacking around in R.
Could not agree more. Most stats tools are so complicated that one spends more time learning/thinking about the logistics of running an analysis ("Should I run a chi-squared or a Fisher's Exact Test? And how do I get it to run?") than about the part that demands real human creativity: thinking through the right questions to ask, biases in the sample, etc.
Disclosure/shameless-promotion, I'm the cofounder of https://www.statwing.com, an easy to use stats tool.
Statistics is incredibly important since every scientific publication and public policy decision these days refers to statistical results to justify their conclusion.
Given this, a class in "bad statistics" would be even more useful: how numbers are presented under a veneer of statistical analysis to fraudulently imply incorrect conclusions that benefit the interest group or organization publishing or financing the study. Such a class would have no shortage of case studies, for example the more than 50% of peer-reviewed journal articles whose findings were actually false (per Ioannidis, http://www.plosmedicine.org/article/info:doi/10.1371/journal...).
I say teach them the business. Statistics is a fine tool, but you need to know what you are doing. You can build a regression line to predict sales in a given month, but you might forget important drivers, or forget to just ask people with experience.
On top of that, remember that looking at the past to predict the future does not always work. First, because you might not have a past (if you are a start up...); second, because new tech and facts might screw up your regression analysis very easily, especially in a fast moving sector (and if you are a start up...).
That said, I find simple descriptive statistics very useful. I wish people in business knew at least what a variance is.
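As a quick illustration of why variance is worth knowing, here is a small sketch using Python's standard statistics module. The two monthly sales series are made up for illustration: they have the same average but very different spread, so a report that only quotes the mean would make them look identical.

```python
from statistics import mean, stdev, variance

# Made-up monthly sales figures for two hypothetical products.
steady = [100, 102, 98, 101, 99, 100]
spiky = [40, 160, 70, 130, 20, 180]

for name, sales in (("steady", steady), ("spiky", spiky)):
    # variance() and stdev() are the sample versions (divide by n - 1).
    print(name, "mean:", mean(sales),
          "variance:", variance(sales),
          "stdev:", round(stdev(sales), 1))
```

The variance is the average squared distance from the mean, and the standard deviation is its square root, which puts the spread back in the same units as the sales figures themselves.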
One of the most difficult things to do is tell the numerically illiterate how your conclusions are useful. Many people have a huge distrust of math and believe intuition trumps all considerations. I think it's a cultural thing. Don't hire people that are afraid of quantification. They don't need to know the exact procedures, but they shouldn't be violently striking against quantification. Your sales team doesn't really need to understand statistics beyond that it serves a certain function, focusing on certain measurable targets.
birken has made this point already, but to make it more general: if you are already using Python in any way, then use that instead of R. You will be much happier.
If you are not already using Python, it still might be the right tool over R. R has great functionality, but my god are there ever warts in the R language and ecosystem.
Also: I'll add my support to the opinion that you need to be very careful that you understand the statistics you are doing or you will be asking your tools to lie to you and not know it. A whole lot of the statistics you are going to want to do will be more advanced than stats 101.
Problem: then they will realize they are getting a shitty deal with 20,000 shares out of 80 million and want a real slice. Better to leave them numerically illiterate.