Best products from r/statistics

We found 115 comments on r/statistics discussing the most recommended products. We ran sentiment analysis on each of these comments to determine how redditors feel about different products. We found 465 products and ranked them based on the number of positive reactions they received. Here are the top 20.

Top comments mentioning products on r/statistics:

u/k0wzking · 2 points · r/statistics

My understanding is that this is more of a traditional statistics problem than a data science problem (though the two overlap substantially). It sounds like you will be dealing with (at most) a few hundred “squads” and maybe 10-20 pertinent variables (e.g., workload, time taken to complete workload, size of team, age of team members, educational qualifications of team members). There is no strict border separating data science and traditional statistics, but generally speaking traditional statistics aims to analyse 100s-1000s of data points with 10s of variables, whereas data science or machine learning procedures aim to assess millions of data points with 1000s of variables available. Again, this is not a strict definition and you can almost always apply a data science procedure to a traditional statistics problem (and vice versa).
This being said, this sub is an okay place to seek resources. I would highly recommend checking out Stack Exchange and the machine learning sub. You may want to purchase a textbook to facilitate your learning. My favourites include Applied Linear Statistical Models by Kutner and friends, and Python Machine Learning by Sebastian Raschka. The former is a traditional stats textbook and the latter is a data science/machine learning textbook. You may be able to find a good portion of these books on Google Books for free. This might help you decide which one to buy and what direction to go in. Additionally, I have made some videos on particular data analysis procedures with the aim of facilitating application-oriented understanding rather than complete mathematical understanding; you may find some of these videos useful.

I would say that your proposed project is potentially a good one, but we’d need more information to gauge its feasibility. As a start, see what variables you have available to you and explore your data a bit (maybe look at all their bivariate relationships). Data analysis itself requires a lot of “looking at” and “interpreting what you see”. Just doing this basic task will give you a much better idea of what you are dealing with, what variables are possibly related to your outcome of interest, and how feasible it is to gain insight into the problem of interest. Explore and get to know your data, and if you are stuck after that then definitely come back and ask more questions. In the end, you may not be able to accurately “predict” anything, but you can definitely calculate probabilities of successfully completing a sprint based on workload + conditions.

Best of luck, sounds like a fun project :)

u/shujaa-g · 5 points · r/statistics

You're pretty good when it comes to linear vs. generalized linear models--and the comparison is the same regardless of whether you use mixed models or not. I don't agree at all with your "Part 3".

My favorite reference on the subject is Gelman & Hill. That book prefers the terminology of "pooling", and considers models that have "no pooling", "complete pooling", or "partial pooling".

One of the introductory datasets is on Radon levels in houses in Minnesota. The response is the (log) Radon level, the main explanatory variable is the floor of the house on which the measurement was made (0 for basement, 1 for first floor), and there's also a grouping variable for the county.

Radon comes out of the ground, so, in general, we expect (and see in the data) basement measurements to have higher Radon levels than ground floor measurements, and based on varied soil conditions, different overall levels in different counties.

We could fit 2 fixed effect linear models. Using R formula pseudocode, they are:

  1. radon ~ floor
  2. radon ~ floor + county (county as a fixed effect)

    The first is the "complete pooling" model. Everything is grouped together into one big pool. You estimate two coefficients. The intercept is the mean value for all the basement measurements, and your "slope", the floor coefficient, is the difference between the ground floor mean and the basement mean. This model completely ignores the differences between the counties.

    The second is the "no pooling" estimate, where each county is in its own little pool by itself. If there are k counties, you estimate k + 1 coefficients: one intercept--the mean value in your reference county, one "slope", and k - 1 county adjustments which are the differences between the mean basement measurements in each county and the reference county.

    Neither of these models is great. The complete pooling model ignores any information conveyed by the county variable, which is wasteful. A big problem with the second model is that there's a lot of variation in how sure we are about individual counties. Some counties have a lot of measurements, and we feel pretty good about their levels, but some of the counties only have 2 or 3 data points (or even just 1). What we're doing in the "no pooling" model is taking the average of however many measurements there are in each county, even if there are only 2, and declaring that to be the radon level for that county. Maybe Lincoln County has only two measurements, and they both happen to be pretty high, say 1.5 to 2 standard deviations above the grand mean. Do you really think that this is good evidence that Lincoln County has exceptionally high Radon levels? Your model does; its fitted line goes straight between the two Lincoln County points, 1.75 standard deviations above the grand mean. But maybe you're thinking "that could just be a fluke. Flipping a coin twice and seeing two heads doesn't mean the coin isn't fair, and having only two measurements from Lincoln County and they're both on the high side doesn't mean Radon levels there are twice the state average."

    Enter "partial pooling", aka mixed effects. We fit the model radon ~ floor + (1 | county). This means we'll keep the overall fixed effect for the floor difference, but we'll allow the intercept to vary with county as a random effect. We assume that the intercepts are normally distributed, with each county being a draw from that normal distribution. If a county is above the statewide mean and it has lots of data points, we're pretty confident that the county's Radon level is actually high, but if it's high and has only two data points, they won't have the weight to pull up the estimate much. In this way, the random effects model is a lot like a Bayesian model, where our prior is the statewide distribution, and our data is each county.

    The only parameters that are actually estimated are the floor coefficient, and then the mean and SD of the county-level intercept. Thus, unlike the complete pooling model, the partial pooling model takes the county info into account, but it is far more parsimonious than the no pooling model. If we really care about the effects of each county, this may not be the best model for us to use. But, if we care about general county-level variation, and we just want to control pretty well for county effects, then this is a great model!

    Of course, random effects can be extended to more than just intercepts. We could fit models where the floor coefficient varies by county, etc.
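As a rough illustration of the shrinkage described above, here is a minimal Python sketch. It is not how a mixed-model fit actually proceeds: the readings are made up (not the real Minnesota data), the floor effect is ignored, and the two variance components `SIGMA_Y2` and `SIGMA_A2` are assumed values that a real fit would estimate from the data. Given those, the partially pooled county intercept is approximately a precision-weighted average of the county mean and the statewide mean.

```python
from statistics import mean

# Made-up log-radon readings per county (illustrative only, not the real data).
readings = {
    "Lincoln": [2.6, 2.9],  # only two measurements, both high
    "Hennepin": [1.2, 1.5, 1.1, 1.6, 1.3, 1.4, 1.0, 1.7],
    "Ramsey": [0.9, 1.3, 1.1, 1.2, 1.0, 1.4],
}

# Assumed variance components; a mixed-model fit would estimate these.
SIGMA_Y2 = 0.5  # within-county (residual) variance
SIGMA_A2 = 0.1  # between-county variance of the intercepts

# Complete pooling: one mean for every measurement in the state.
grand_mean = mean(y for ys in readings.values() for y in ys)


def partial_pool(ys):
    """Precision-weighted average of the county mean and the grand mean."""
    w_county = len(ys) / SIGMA_Y2  # grows with the number of measurements
    w_state = 1 / SIGMA_A2         # fixed pull toward the statewide mean
    return (w_county * mean(ys) + w_state * grand_mean) / (w_county + w_state)


estimates = {
    county: {"no_pooling": mean(ys), "partial_pooling": partial_pool(ys)}
    for county, ys in readings.items()
}

for county, est in estimates.items():
    print(
        f"{county:9s} n={len(readings[county])}  "
        f"no pooling={est['no_pooling']:.2f}  "
        f"partial pooling={est['partial_pooling']:.2f}  "
        f"complete pooling={grand_mean:.2f}"
    )
```

The two-measurement county is pulled most of the way back toward the statewide mean, while the county with eight measurements barely moves, which is the behaviour the partial pooling paragraph describes.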

    Hope this helps! I strongly recommend checking out Gelman and Hill.

u/efrique · 2 points · r/statistics

> she didn't know how useful it would be

probably more employable than geography

> Do you guys have any recommendations on where to develop my knowledge and skills?

There's a bunch of free and inexpensive stuff around ... but there's also a lot of bad free/inexpensive stuff around; you have to be a bit discerning (which is hard when you're trying to learn it).

It might sound a bit old-school but I'd suggest going to a university library and finding some decent stats texts; you probably want to avoid the stuff that says "11th edition".

Find several that you like and work with those for a while.

Some you might look for:

Statistics, Freedman, Pisani & Purves (any edition)

Introduction to the Practice of Statistics, Moore & McCabe (5th edition or earlier)

For a bit of theory (you'll need a bit of mathematics for this but not a ton of it):

Introduction to the Theory of Statistics, Mood, Graybill & Boes

These are all old books. You should be able to get them second hand for cheap, or read them in a library. They'll be a good grounding, but you'll need to be able to ask questions as well.

Places like this one can be handy resources. I've seen determined people teach themselves a lot of statistics with only a bit of guidance so it can certainly be done.

> Do I need programming, if so, what would be the best programming language to learn?

It would be best to learn some, yes, because modern statistics relies on it heavily. You don't necessarily have to do it immediately but getting an early start (and using it to help with learning stats) will be better than leaving it really long.

Two main things are widely used ... R and Python. Both are free. The second is more of a mainstream programming language; the first is a statistics package as well as a more specialized language.

Learn one or both; my own suggestion would be to try R but other people may have different advice.

If you want to be a programmer rather than a statistician who uses code to solve statistical problems, Python would be the better choice.

u/COOLSerdash · 9 points · r/statistics

u/coffeecoffeecoffeee · 20 points · r/statistics

You absolutely will need R and/or SAS to do any work beyond basic statistics. You'll have to know how to do data munging and how to reshape your data to get it in the right form. It's 2018. You have to know how to clean your own data. Additionally, you'll be asked to repeat complicated analyses, or asked questions like "How did you calculate this number in this analysis from six months ago?" A point-and-click interface doesn't give you a record that makes it easy to do these things. In a programming language, you can rerun complex procedures at the press of a button. Programming can be a little scary at first, but once you get the hang of it, you'll wonder how you lived without it.

Fear not though! There are a ton of fantastic resources to learn how to code. If you've never programmed before, my recommendation would be to go through the Codecademy intro Python tutorial. Even if you never want to use Python after this, you'll learn about variables, conditions, loops, data types, functions, and language-specific features. These are ideas that exist in every programming language (well SAS has macros, not functions, but you get my point.)

I also recommend using R for all of your statistics homework, even if the professor doesn't require it. That's how I learned R. It'll put you in a position where you have to learn how the language works and where the functions you want to use are. Once you have the basics of R down, check out R for Data Science. It's a very modern book on R that encourages you to use user-friendly packages to do data analysis. As for SAS, it's a terrible language that's losing market share to R and is popular because it's popular. It can help to know the basics, but it's a language I leave off my resume and LinkedIn because I never want to touch it again.

I'd also recommend learning SQL at some point. Most datasets will be in databases you'll have to query for the data you want. My favorite book for this is SQL in 10 Minutes, which is a book of 10-minute lessons, where each one is on a SQL concept. Don't worry about the specific SQL dialect since they're virtually all the same. Once you're comfortable with basic queries and joins you're in good shape.

u/[deleted] · 10 points · r/statistics

"Doing Bayesian Data Analysis" by Kruschke. The instruction is really clear and there are code examples, and a lot of the mainstays of NHST are given a Bayesian analogue, so that should have some relevance to you.

"Bayesian Data Analysis" by Gelman. This one is more rigorous (notice the obvious lack of puppies on the cover) but also very good.

Free stuff:

"Think Bayes" by our own resident Bayesian apostle, Allen Downey. This book introduces Bayesian stats from a computational perspective, meaning it lays out problems and solves them by writing Python code. Very easy to follow, free, and just a great resource.

Lecture: "Bayesian Statistics Made (As) Simple (As Possible)" again by Prof. Downey. He's a great teacher.

u/El-Dopa · 1 point · r/statistics

If you are looking for something very calculus-based, this is the book I am familiar with that is most grounded in that. Though, you will need some serious probability knowledge, as well.

If you are looking for something somewhat less theoretical but still mathematical, I have to suggest my favorite. Statistics by William L. Hays is great. Look at the top couple of reviews on Amazon; they characterize it well. (And yes, the price is heavy for both books.... I think that is the cost of admission for such things. However, considering the comparable cost of much more vapid texts, it might be worth springing for it.)

u/WhenTheBitchesHearIt · 7 points · r/statistics

John Fox's book is great. It's mostly linear regression models for continuous variables, but the GLM section is very helpful. If I remember correctly, the second edition is way more helpful with GLM than the first.

For categorical variables Scott Long's book is wonderfully helpful.

Unfortunately both are expensive. Hopefully your library has them.

Any more specificity in what types of variables you might be working with or what your data is like? Knowing what type of link function you're looking for may give you better results from some of the uber statisticians here.

u/mikethechampion · 3 points · r/statistics

I would highly recommend the following book: Mostly Harmless Econometrics

It is a very problem-driven book and will help build up your knowledge base to know what models are appropriate for a given situation or dataset.

You will then need to start practicing in a statistical program to gain the practical skills of applying those models to real data. Excel works, but I don't know a good book to recommend to guide you through using Excel on real problems.

I recommend Stata to new data analysts and have them pick up "Microeconometrics Using Stata"; once they've worked through these two books they get excited and start grabbing data all over and begin running models. It's exciting to watch new data modellers apply tools they're learning. R is free and open source but is more difficult to learn; if you're willing to ask, there are tons of people willing to help you through R.

u/M_Bus · 2 points · r/statistics

Not long! For this purpose I highly highly recommend Richard McElreath's Statistical Rethinking (this one here). It's SO good. The math is exceptionally straightforward for someone familiar with regression, and it's huge on developing intuition. Bonus: he sets you up with all the tools you need to do your own analyses, and there are tons of examples that he works from a lot of different angles. He even does hierarchical regression.

It's an easy math book to read cover to cover by yourself, to be honest. He really holds your hand the whole way through.

Jesus, he should pay me to rep his book.

u/mathnstats · 2 points · r/statistics

Did any of your calc classes include multivariate/vector calculus? E.g. things dealing with double and triple integrals.

If not, take another calc class or two; calculus is very important for statistics. It shouldn't be too hard to pick up the rest of the necessary calc since you've already got a good calc background.

If so, start taking probability and statistics courses in your school's math department if you can. The mathematical way (read: the right way) of understanding probability and statistics is based on probability distributions (like the normal distribution), defined by their probability functions. As such, you can use calculus to obtain a myriad of information from them! For instance, among many other things, within the first one or two courses, you'd likely be able to answer at least the Spearman's coefficient question, the Bernoulli process question, and the MLE question.

If you don't have room in your schedule to do the stats course, you could get a textbook and try learning on your own. There are tons of excellent resources. Hogg, Tanis, and Zimmerman is pretty good for an introduction, though I'm sure there's better out there.

u/berf · 1 point · r/statistics

I don't understand the question. Isn't this easy? Just follow the KISS principle (keep it simple, stupid). They've presumably seen histograms somewhere. Just present the kernel density estimate, presumably with optimal bandwidth chosen by cross-validation or something, which is way too complicated to explain, as a better competitor to the histogram. Explain the kernel density estimate as a better estimate of the "theoretical histogram" (I get this terminology from Freedman, Pisani, and Purves, an excellent "statistics for poets" book), which is what the histogram would be if you had an infinite amount of data. No one believes the theoretical histogram actually has jumps like a histogram (estimate), so why not use a smooth estimate like the kernel density estimate? That's almost ELI5.
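To make the histogram-versus-kernel comparison concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption for illustration: the sample is drawn from a standard normal, the kernel is Gaussian, and the bandwidth uses the normal reference rule of thumb (1.06 · s · n^(-1/5)) rather than the cross-validation mentioned above, purely to keep the example short.

```python
import math
import random
from statistics import stdev

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]  # simulated sample


def histogram_density(data, lo, hi, bins):
    """Histogram heights scaled so the bars have total area <= 1.

    Points outside [lo, hi) are dropped.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if lo <= x < hi:
            idx = min(int((x - lo) / width), bins - 1)  # guard against rounding
            counts[idx] += 1
    return [c / (len(data) * width) for c in counts]


def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x: an average of bumps, one per point."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm


# Normal reference rule of thumb, standing in for a cross-validated bandwidth.
bandwidth = 1.06 * stdev(data) * len(data) ** (-1 / 5)

hist = histogram_density(data, -4.0, 4.0, bins=16)
for i, height in enumerate(hist):
    midpoint = -4.0 + 0.5 * i + 0.25  # centre of bin i
    print(f"{midpoint:+.2f}  histogram={height:.3f}  kde={kde(midpoint, data, bandwidth):.3f}")
```

The histogram column jumps from bin to bin, while the kernel estimate changes smoothly if you evaluate it on a finer grid, which is the selling point for a lay audience.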

u/kanak · 1 point · r/statistics

I'm in a similar situation (requiring to be proficient in statistics), and here's what I'm doing.

  1. Started with a really basic book that told me more about the ideas than the math. I used Andy Field's Statistics with SPSS book because it was recommended somewhere on reddit. The book definitely skimps on the maths, but it gives you a good idea about the different tests and concepts.

  2. Following Berkeley's Statistics Major/Minor route. Specifically:

    a. Stats 133 - Computing with Data: A course on using R, SQL, and other technologies useful in statistics.

    b. Stats 102 - Intro to Statistics: I found multiple versions of this course, but I'm going to pick this one because it uses this interesting book, which emphasizes case studies.

    c. Stats 135 - Concepts of Statistics : More advanced treatment of the same concepts from 102.

    d. If you want to brush up on probability, you should look at Stats 101 and Stats 134.

    e. After this level, they have a series of electives, such as Stochastic Processes (Stats 150), Linear Modelling Lab (151A and 151B), Sampling Surveys Lab (152), Time Series Lab (153), Game Theory (155), and seminars.

    The classes don't have videos or audios, but they have syllabuses, lecture notes and assignments. So far I've found them to be more than sufficient.

u/RemarkableSprinkles · 2 points · r/statistics

This book by Andy Field is by far my favorite. His writing style is really laid back and funny, which helps me concentrate as statistics can be pretty dry/boring. And he is good at explaining the statistical theories in an easy way. If you don't want to use SPSS while learning, he has a statistics book in which he doesn't use a statistics program as part of teaching (I haven't read that one though). He also has books on how to use Stata, R etc.

u/Bromskloss · 1 point · r/statistics

> There are some philosophical reasons and some practical reasons that being a "pure" Bayesian isn't really a thing as much as it used to be. But to get there, you first have to understand what a "pure" Bayesian is: you develop reasonable prior information based on your current state of knowledge about a parameter / research question. You codify that in terms of probability, and then you proceed with your analysis based on the data. When you look at the posterior distributions (or posterior predictive distribution), it should then correctly correspond to the rational "new" state of information about a problem because you've coded your prior information and the data, right?

Sounds good. I'm with you here.

> However, suppose you define a "prior" whereby a parameter must be greater than zero, but it turns out that your state of knowledge is wrong?

Isn't that prior then just an error like any other, like assuming that 2 + 2 = 5 and making calculations based on that?

> What if you cannot codify your state of knowledge as a prior?

Do you mean a state of knowledge that is impossible to encode as a prior, or one that we just don't know how to encode?

> What if your state of knowledge is correctly codified but makes up an "improper" prior distribution so that your posterior isn't defined?

Good question. Is it settled how one should construct the strictly correct priors? Do we know that the correct procedure ever leads to improper distributions? Personally, I'm not sure I know how to create priors for any problem other than the one where the prior is spread evenly over a finite set of indistinguishable hypotheses.

The thing about trying different priors, to see if it makes much of a difference, seems like a legitimate approximation technique that needn't shake any philosophical underpinnings. As far as I can see, it's akin to plugging in different values of an unknown parameter in a formula, to see if one needs to figure out the unknown parameter, or if the formula produces approximately the same result anyway.

> read this book. I promise it will only try to brainwash you a LITTLE.

I read it and I loved it so much for its uncompromising attitude. Jaynes made me a militant radical. ;-)

I have an uncomfortable feeling that Gelman sometimes strays from the straight and narrow. Nevertheless, I looked forward to reading the page about Prior Choice Recommendations that he links to in one of the posts you mention. In it, though, I find the puzzling "Some principles we don't like: invariance, Jeffreys, entropy". Do you know why they write that?

u/pgoetz · 1 point · r/statistics

I would try Mathematical Statistics and Data Analysis by Rice. The standard intro text for Mathematical Statistics (this is where you get the proofs) is Wackerly, Mendenhall, and Scheaffer but I find this book to be a bit too dry and theoretical (and I'm in math). Calculus is less important than a thorough understanding of how random variables work. Rice has a couple of pretty good chapters on this, but it will require some mathematical maturity to read this book. Good luck!

u/jmcq · 2 points · r/statistics

Depending on how strong your math/stats background is you might consider Statistical Inference by Casella and Berger. It's what we use for our first year PhD Mathematical Statistics course.

That might be a little too difficult if you're not very comfortable with probability theory and basic statistics. If you look at the first few chapters on Amazon and it seems like too much, I recommend Mathematical Statistics and Data Analysis by Rice, which I guess I would consider a "prequel" to the Casella text. I worked through this in an advanced statistics undergrad course (along with Mostly Harmless Econometrics and Goldberger's A Course in Econometrics).

Let's see, if you're interested in Stochastic Models (Random Walks, Markov Chains, Poisson Processes etc), I recommend Introduction to Stochastic Modeling by Taylor and Karlin. Also something I worked through as an undergrad.

u/beaverteeth92 · 2 points · r/statistics

The absolute best book I've found for someone with a frequentist background and undergraduate-level math skills is Doing Bayesian Data Analysis by John Kruschke. It's a fantastic book that goes into mathematical depth only when it needs to while also building your intuition.

The second edition is new and I'd recommend it over the first because of its improved code. It uses JAGS and Stan instead of BUGS, which is Windows-only now.

u/Here4TheCatPics · 2 points · r/statistics

I've used a book by Gelman for self study. Great author, very good at using meaningful graphics -- which may be an effective way to convey ideas to students.

u/RogerSmithII · 2 points · r/statistics

Thanks. The program is Data Science and prereqs are Calc, Lin Alg and basic stats.

I started my review using but the book assumes you have basic stats. I took these courses 5+ years ago so I only vaguely remember the material.

Good example with hetero/homoskedasticity. I want to make sure I understand things like random variables and different types of distributions.

u/Bomb3213 · 1 point · r/statistics

This imo is a good book for basic probability and mathematical statistics. Super easy read with a lot of examples. [You also mentioned PDFs for books and someone told you Library Genesis. I can promise this one is on there :)]

u/PM_ME_YOUR_WOMBATS · 1 point · r/statistics

Somewhat facetiously, I'd say the probability that an individual who has voted in X/12 of the last elections will vote in the next election is (X+1)/14. That would be my guess if I had no other information.
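The (X+1)/14 guess matches Laplace's rule of succession, (successes + 1)/(trials + 2), which is the posterior mean of a success probability under a uniform prior. A minimal Python sketch, where the function name `rule_of_succession` is just a label chosen for this illustration:

```python
from fractions import Fraction


def rule_of_succession(successes, trials):
    """Posterior mean of a success probability under a uniform Beta(1, 1) prior."""
    return Fraction(successes + 1, trials + 2)


# Someone who voted in 0, 6, or all 12 of the last 12 elections.
for voted in (0, 6, 12):
    p = rule_of_succession(voted, 12)
    print(f"voted in {voted}/12 -> estimated P(votes next time) = {p} (about {float(p):.3f})")
```

Note that the estimate never reaches 0 or 1, so even a perfect record of voting (or not voting) leaves some room for the other outcome.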

As the proverb goes: it's difficult to make predictions, especially about the future. We don't have any votes from the next election to try to discern what relationship those votes have to any of the data at hand. Of course that isn't going to stop people who need to make decisions. I'm not well-versed in predictive modeling (being more acquainted with the "make inference about the population from the sample" sort of statistics) but I wonder what would happen if you did logistic regression with the most recent election results as the response and all the other information you have as predictors. See how well you can predict the recent past using the further past, and suppose those patterns will carry forward into the future. Perhaps someone else can propose a more sophisticated solution.

I'm not sure how this data was collected, but keep in mind that a list of people who have voted before is not a complete list of people who might vote now, since there are some first-time voters in every election.

If you want to get serious about data modeling in social science, you might check out this book by statistician/political scientist Andrew Gelman.

u/marmle · 4 points · r/statistics

The short version is that in a Bayesian model your likelihood is how you're choosing to model the data, i.e. P(x|\theta) encodes how you think your data was generated. If you think your data comes from a binomial, e.g. you have something representing a series of success/failure trials like coin flips, you'd model your data with a binomial likelihood. There's no right or wrong way to choose the likelihood; it's entirely based on how you, the statistician, think the data should be modeled. The prior, P(\theta), is just a way to specify what you think \theta might be beforehand, e.g. if you have no clue in the binomial example what your rate of success might be, you put a uniform prior over the unit interval. Then, assuming you understand Bayes' theorem, we find that we can estimate the parameter \theta given the data by calculating P(\theta|x) = P(x|\theta)P(\theta)/P(x). That is the entire Bayesian model in a nutshell.

The problem, and where MCMC comes in, is that given real data, P(x) is usually intractable to calculate, as it amounts to integrating or summing P(x|\theta)P(\theta) over \theta, which isn't easy when you have multiple data points (since P(x|\theta) becomes \prod_{i} P(x_i|\theta)). You use MCMC (and other approximate inference methods) to get around calculating P(x) exactly.

I'm not sure where you've learned Bayesian stats from before, but for gaining intuition (which it seems is what you need) I've heard good things about Statistical Rethinking; the author's website includes more resources, including his lectures. Doing Bayesian Data Analysis also seems to be another beginner-friendly book.
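For the binomial example with a uniform prior, a minimal random-walk Metropolis sketch in Python shows how MCMC avoids P(x): the sampler only ever uses ratios of P(x|\theta)P(\theta), so the normalising constant cancels. The data (7 successes in 10 trials), the step size, and the burn-in length are all made-up choices for illustration. This toy case has an exact Beta(8, 4) posterior, which is only used here as a sanity check.

```python
import math
import random

successes, trials = 7, 10  # made-up data: 7 successes in 10 trials


def log_unnormalised_posterior(theta):
    """log[P(x|theta) P(theta)]: binomial likelihood, uniform prior on (0, 1).

    The binomial coefficient is a constant in theta, so it is left out.
    """
    if not 0.0 < theta < 1.0:
        return -math.inf  # the prior density is zero outside the unit interval
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis(n_samples, step=0.2, seed=1):
    """Random-walk Metropolis: only posterior ratios are needed, so P(x) cancels."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_unnormalised_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_unnormalised_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(20_000)[2_000:]         # drop a short burn-in
mcmc_mean = sum(samples) / len(samples)
exact_mean = (successes + 1) / (trials + 2)  # mean of the exact Beta(8, 4) posterior
print(f"MCMC posterior mean ~ {mcmc_mean:.3f}, exact = {exact_mean:.3f}")
```

In real models there is no closed-form answer to compare against, which is exactly when samplers like this (or the more sophisticated ones in JAGS and Stan) earn their keep.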

u/mez001 · 0 points · r/statistics

Hi, I see you have already read some good books. I would also recommend Bishop's pattern recognition book for ML.

However, I'm not sure whether they teach this in grad stat. Good luck!

u/siddboots · 9 points · r/statistics

It is hard to provide a "comprehensive" view, because there's so much disparate material in so many different fields that draw upon probability theory.

Feller is an approachable classic that covers all of the main results in traditional probability theory. It certainly feels a little dated, but it is full of the deep central limit insights that are rarely explained in full in other texts. Feller is rigorous, but keeps applications at the center of the discussion, and doesn't dwell too much on the measure-theoretical / axiomatic side of things. If you are more interested in the modern mathematical theory of probability, try Probability with Martingales.

On the other hand, if you don't care at all about abstract mathematical insights, and just want to be able to use probability theory directly for every-day applications, then I would skip both of the above, and look into Bayesian probabilistic modelling. Try Gelman et al.

Of course, there's also machine learning. It draws on a lot of probability theory, but often teaches it in a very different way to a traditional probability class. For a start, there is much more emphasis on multivariate models, so linear algebra is much more central. (Bishop is a good text).

u/blair_necessities · 3 points · r/statistics

If you're just looking for a concept overview, The Cartoon Guide to Statistics is great. It's easy to read and filled with great visuals and examples.

If you want to learn how to do intro statistics/practice, look no further than Khan Academy.

u/blossom271828 · 5 points · r/statistics

The book that you want the person to look up is Applied Linear Statistical Models. It is a great reference book and gets into the nitty gritty calculations for figuring out the appropriate degrees of freedom in some pretty ugly experimental designs.

u/CodeNameSly · 3 points · r/statistics

Casella and Berger is one of the go-to references. It is at the advanced undergraduate/first year graduate student level. It's more classical statistics than data science, though.

Good statistical texts for data science are Introduction to Statistical Learning and the more advanced Elements of Statistical Learning. Both of these have free pdfs available.

u/belarius · 3 points · r/statistics

Casella & Berger is the go-to reference (as Smartless has already pointed out), but you may also enjoy Jaynes. I'm not sure I'd say it's quick but if gaps are your concern, it's pretty drum-tight.

u/lenwood · 1 point · r/statistics

I'm doing the same. Here are a couple of resources that you may find helpful.

u/alex628 · 1 point · r/statistics

This is the best recent book I have read.

This is coming from someone who loves the Casella and Berger text.

u/Sarcuss · 6 points · r/statistics

I would say: Go for it as long as you are interested in the job :)

For study references for remembering R and Statistics, I think all you would need would be:

For R, data cleaning and such: and for basic statistics with R probably either Dalgaard for Applied Statistics with R and something like OpenIntroStats or Freedman for review of stats.

u/clarinetist001 · 4 points · r/statistics

If you really need it dumbed down, I would recommend Asimow and Maxwell. This text has a solutions manual. Note that this is specifically tailored toward actuarial exams - i.e., people that have to learn the material quickly but not necessarily for grad school. (And yes, the website is legit. I've done some contract work for them in the past and have ordered books through them.)

If you don't mind something more mathematical, I would recommend Wackerly et al.

u/mrdevlar · 2 points · r/statistics

If you want a math book with that perspective, I'd recommend E.T. Jaynes's "Probability Theory: The Logic of Science"; he delves into quite a lot of discussion about that topic.

If you want a popular science book on the subject, try "The Theory That Would Not Die".

Bayesian statistics has, in my opinion, been the force that has attempted to reverse this particular historical trend. However, that viewpoint is unlikely to be shared by all in this area. So take my viewpoint with a grain of salt.

u/gatherinfer · 2 pointsr/statistics

A lot of the recommendations in this thread are good, I'd like to add "Bayesian Data Analysis 3rd edition" by Gelman et al. Useful if you encounter Bayesian models, especially hierarchical/multilevel models.

u/marketfailure · 1 pointr/statistics

In my graduate econometrics course we used Mostly Harmless Econometrics. It's focused on the question of causal inference, and specifically how to do empirically rigorous studies when variables aren't exogenous. It covers a bunch of best practices in designing experiments. It's not focused on networks, which is a rapidly emerging field of study in the social sciences. However, it does a very good job of explaining possible sources of error in statistical inference and research design.

u/DiogenicOrder · 8 pointsr/statistics

How would you rather split beginner vs. intermediate/advanced?

My feeling was that Ben Lambert's book would be a good intro and that Bayesian Data Analysis would be a good next step?

u/hngryhngryhippo · 3 pointsr/statistics

Andy Field is both very informative and entertaining. I highly recommend this book.

u/dabomb4real · 1 pointr/statistics

I don't understand how my example of spurious correlation among randomly generated numbers doesn't already meet that burden. That's a data generating process that is not causal by design but produces your preferred observed signal.

Your additions of "repeated", "different times" and "different places" only reduce likelihood of finding a set with your preferred signal (or similarly require checking more pairs). There's literally a cottage industry around finding these funny noncausal relationships

If you're imagining something more elaborate about what it means to move "reliably" together, Mostly Harmless Econometrics walks through how every single thing you might be thinking of is really just trying to get back to Rubin style randomized treatment assignment
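
The point about randomly generated numbers can be checked with a small simulation. The sketch below is an illustration only, not the commenter's own code: it draws independent Gaussian series, which share no causal link by construction, and reports the most strongly correlated pair. The function names, the choice of 40 series of length 10, and the seed are all arbitrary assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_spurious_pair(n_series=40, length=10, seed=0):
    """Return (|r|, i, j) for the most correlated pair of independent series."""
    rng = random.Random(seed)
    # Every series is drawn independently, so any correlation is noise.
    series = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(n_series)]
    return max(
        (abs(pearson(series[i], series[j])), i, j)
        for i, j in combinations(range(n_series), 2)
    )


if __name__ == "__main__":
    r, i, j = strongest_spurious_pair()
    print(f"series {i} and {j} have |r| = {r:.2f} despite being independent")
```

With 40 series there are 780 pairs to search, so a large correlation shows up somewhere by chance alone; requiring the pattern to repeat in fresh data shrinks that chance but does not make the process causal.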

u/editorijsmi · 1 pointr/statistics

you can check the following books

  1. Bayesian Methodology: An overview with the help of R software: Tool for data science professionals

  2. Essentials of Bio-Statistics: An overview with the help of Software

u/ViewofDelft · 1 pointr/statistics

Surprisingly effective intro to probability

might be too informal for your purposes though...

u/clm100 · 2 pointsr/statistics

Honestly, ignore the "for engineering" part of "Statistics for Engineering." They're largely the same content.

How much calculus have you taken? Does the class use calculus?

First, the cartoon guide to statistics is surprisingly helpful for some people.

For a more traditional textbook, you might try Devore's main intro book.

Almost every student finds statistics confusing and it's either difficult to teach, or just difficult to learn. It's also a fractal discipline, since you can keep going deeper and deeper, but it's generally just going over the same few concepts with additional depth. If you end up in a class that's not well suited to your mathematical background it's especially frustrating.

Good luck.

u/Adamworks · 1 pointr/statistics

I'm assuming this is some sort of experimental psychology?

Probably everything in this book:

or this website:

Same guy, great book.

u/blimpy_stat · 11 pointsr/statistics

Applied Linear Statistical Models by Kutner is a far better reference for statistical modeling compared to ISLR/ESLR or any kind of "machine learning" text, but it sounds as though you did a stat masters since you're asking about stat modeling instead of the new buzzwords. The latter options are certainly more narrow.
Considered a cornerstone, of sorts.

u/NegativeNail · 1 pointr/statistics

PDF WARN: Introduction to Math Stat by Hogg

Not to be confused with Probability and Math Stat by Tanis and Hogg, which is a "first semester" course.

Good blend of theory and "talky-ness", good exercises that test your understanding, most should be do-able from just applying the basics.

u/statmama · 9 pointsr/statistics

Seconding /u/khanable_ -- most of statistical theory is built on matrix algebra, especially regression. Entry-level textbooks usually use simulations to explain concepts because it's really the only way to get around assuming your audience knows linear algebra.
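
To make the matrix-algebra point concrete, here is a minimal sketch of simple linear regression written as the normal equations, beta = (X'X)^-1 X'y. It is an illustration under stated assumptions rather than anything from the books discussed: the helper names are made up, it uses plain Python lists, and the closed-form 2x2 inverse only covers the one-predictor-plus-intercept case.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inv2(m):
    """Invert a 2x2 matrix using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def ols_simple(x, y):
    """Least-squares intercept and slope via the normal equations."""
    X = [[1.0, xi] for xi in x]   # design matrix with an intercept column
    Y = [[yi] for yi in y]        # response as a column vector
    Xt = transpose(X)
    beta = matmul(inv2(matmul(Xt, X)), matmul(Xt, Y))
    return beta[0][0], beta[1][0]


if __name__ == "__main__":
    intercept, slope = ols_simple([0, 1, 2, 3], [1, 3, 5, 7])
    print(intercept, slope)
```

Real software solves the same equations with more numerically stable decompositions, but the algebra is the one shown here.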

My Ph.D. program uses Casella and Berger as the main text for all intro classes. It's incredibly thorough, beginning with probability and providing rigorous proofs throughout, but you would need to be comfortable with linear algebra and at least the basic principles of real analysis. That said, this is THE book that I refer to whenever I have a question about statistical theory-- it's always on my desk.

u/AdActa · 2 pointsr/statistics

A good bet would be "Mostly Harmless Econometrics"

Not overly theoretic and very focussed on practical applications.

u/gwippold · 3 pointsr/statistics

You could read the IBM manual OR you could buy this much more user friendly book:

u/Slippery_Slope_Guy · 3 pointsr/statistics

It requires study so you might not have any sudden moments of clarity, but this is pretty much the Bible of regression.

Highly recommended.

u/josquindesprez · 1 pointr/statistics

If you want an extremely practical book to complement BDA3, try Statistical Rethinking.

It's got some of the clearest writing I've seen in a stats book, and there are some good R and Stan code examples.

u/TheLeaderIsGood · 1 pointr/statistics

This one? Damn, it's £40-ish. Any highlights or is it just a case of this book is the highlight?

It's on my wishlist anyway. Thanks.

u/lykonjl · 4 pointsr/statistics

Jaynes: Probability Theory. Perhaps 'rigorous' is not the first word I'd choose to describe it, but it certainly gives you a thorough understanding of what Bayesian methods actually mean.

u/RAPhisher · 4 pointsr/statistics

In addition to linear regression, do you need a reference for future use/other topics? Casella/Berger is a good one.

For linear regression, I really enjoyed A Modern Approach to Regression with R.

u/MortalitySalient · 3 pointsr/statistics

For research methods in behavioral and social sciences, you probably can't get better than Shadish, Cook, and Campbell's Experimental and Quasi-Experimental Designs for Generalized Causal Inference. As far as stats go, the Andy Field books are good and he has one for R, one for SAS, and one for SPSS. I prefer the John Fox book on applied regression analysis and the corresponding R book. Here are some links:

u/oz0509 · 2 pointsr/statistics

I agree with all of the above. Also, here's the Linear Models tome we used:

u/AllezCannes · 2 pointsr/statistics

They're not free, but Doing Bayesian Data Analysis and Statistical Rethinking are worth their weight in gold.

u/Jimmy_Goose · 1 pointr/statistics

If you want to go deep, try a math stats book. Although there would probably be disagreement in this subreddit on which is the best, they are all pretty much the same. Maybe try an early edition of the Wackerly book (I think that one is the most widely used one). A lot of people would suggest Casella and Berger, but I would suspect those people have never taught a course and forget that that book does require a bit of mathematical maturity. Go with an undergrad book.

For R, I would suggest either going through a tutorial (such as the swirl package), or, what I assume is how most people learned it, buying an applied stats book and just doing the problems in R. You have to get over the hump of learning it, but you learn programming by doing it. After you are done with math stats, a good next step in the applied direction is Regression and ANOVA/Design. For regression, there are a ton of books. But again, the first few chapters of most books are the same. I would try to find a cheapo book with a modern typeset. For ANOVA, you probably want to go with Montgomery. I don't know the others too well though.

u/kiwipete · 2 pointsr/statistics

An intermediate resource between the Downey book and the Gelman book is Doing Bayesian Data Analysis. It's a bit more grounded in mathematics and theory than the Downey, but a little less mathy than the Gelman.

u/SupportVectorMachine · 5 pointsr/statistics

A very user-friendly treatment that hits every criterion you mention is John Kruschke's Doing Bayesian Data Analysis, Second Edition.

u/CrazyStatistician · 10 pointsr/statistics

Bayesian Data Analysis and Hoff are both well-respected. The first is a much bigger book with lots of applications, the latter is more of an introduction to the theory and methods.

u/lrnz13 · 1 pointr/statistics

I’m finishing up my stats degree this summer. For math, I took 5 courses: single variable calculus, multivariable calculus, and linear algebra.

My stat courses are divided into three blocks.

First block, intro to probability, mathematical stats, and linear models.

Second block, computational stats with R, computation & optimization with R, and Monte Carlo Methods.

Third block, intro to regression analysis, design and analysis of experiments, and regression and data mining.

And two electives of my choice: survey sampling & statistical models in finance.

Here’s a book for intro to probability. There’s also lectures available on YouTube: search MIT intro to probability.

For a first course in calculus search on YouTube: UCLA Math 31A. You should also search for Berkeley’s calculus lectures; the professor is so good. Here’s the calc book I used.

For linear algebra, search MIT linear algebra. Here’s the book.

The probability book I listed covers two courses in probability. You’ll also want to check out this book.

If you want to go deeper into stats, for example, measure theory, you’re going to have to take real analysis & a more advanced course on linear algebra.

u/shaggorama · 1 pointr/statistics

Start here:


    If you have access to a decent academic library, I'd recommend reading or skimming the chapters on hypothesis testing in Casella & Berger or Hogg, McKean & Craig. I believe HMC also discusses risk and loss functions in the context of Bayesian statistics, although I'm not sure if C&B does. Unfortunately I don't have any good recommendations for you specific to Bayesian hypothesis testing; my Bayesian experience only lightly touched on hypothesis testing.

    Very, very generally, the idea is that you want to determine how likely each hypothesis is given your data. By taking a ratio of these probabilities, you can determine how likely each hypothesis is relative to the other one, given the data. If you set up your ratio (here called a Bayes factor; strictly speaking it is the posterior odds, which coincides with the Bayes factor when the two hypotheses have equal prior probability) as p(H0|X)/p(H1|X), then a BF close to or larger than 1 indicates you can't reject the null hypothesis, since it is about as likely as, or more likely than, the alternative. As the BF approaches 0, your confidence in a choice to reject the null hypothesis increases, since this indicates that the denominator is large and therefore the alternative hypothesis is more likely given your data.

    The question then becomes how do you establish the cutoff value: exactly how small does the BF need to be to definitively reject the null hypothesis? This is where the risk functions come in. The risk functions help you to decide on a decision function, which is really just a point estimator. You will use this result to determine the cutoff value for rejecting the null hypothesis and compare this value to your calculated BF, which here serves as the test statistic.
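
The ratio and the cutoff described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not a method taken from Casella & Berger or Hogg, McKean & Craig: both hypotheses are simple binomial success probabilities, the prior and the two misclassification costs are made-up inputs, and all function names are hypothetical. The ratio p(H0|X)/p(H1|X) is computed as prior odds times the likelihood ratio, and the cutoff comes from comparing the expected loss of rejecting H0 with that of keeping it.

```python
from math import comb


def binom_likelihood(k, n, p):
    """P(k successes in n trials) when the success probability is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_odds(k, n, p0, p1, prior_h0=0.5):
    """p(H0|X) / p(H1|X) for two simple hypotheses p = p0 versus p = p1."""
    prior_odds = prior_h0 / (1 - prior_h0)
    likelihood_ratio = binom_likelihood(k, n, p0) / binom_likelihood(k, n, p1)
    return prior_odds * likelihood_ratio


def reject_null(odds, cost_false_reject=1.0, cost_false_accept=1.0):
    """Reject H0 when rejecting has the lower expected loss.

    Rejecting costs cost_false_reject * p(H0|X); keeping H0 costs
    cost_false_accept * p(H1|X). Rejecting is cheaper exactly when
    odds < cost_false_accept / cost_false_reject.
    """
    return odds < cost_false_accept / cost_false_reject


if __name__ == "__main__":
    for k in (5, 9):
        odds = posterior_odds(k, 10, p0=0.5, p1=0.7)
        print(k, round(odds, 3), reject_null(odds))
```

With equal costs the cutoff is 1, so the decision simply picks the more probable hypothesis; making a false rejection more costly pushes the cutoff toward 0, which is the sense in which the loss function sets how small the ratio has to be.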

u/klaxion · 5 pointsr/statistics

Recommendation - don't learn statistics through "statistics for biology/ecology".

Go straight to statistics texts, the applied ones aren't that hard and they usually have fewer of the lost-in-translation errors (e.g. the abuse of p-values in all of biology).

Try Gelman and Hill -

Faraway - Practical Regression and Anova using R (free)

Categorical data analysis