Reddit mentions: The best data mining books
We found 160 Reddit comments discussing the best data mining books. We ran sentiment analysis on each of these comments to determine how redditors feel about different products. We found 58 products and ranked them based on the amount of positive reactions they received. Here are the top 20.
1. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | October 2013 |
Weight | 1.61 Pounds |
Width | 0.94 Inches |
2. Machine Learning: An Algorithmic Perspective (Chapman & Hall/Crc Machine Learning & Pattern Recognition)
- Used Book in Good Condition
Specs:
Height | 9.75 Inches |
Length | 6.75 Inches |
Number of items | 1 |
Weight | 1.4991433816 Pounds |
Width | 1 Inches |
3. Beautiful Visualization: Looking at Data through the Eyes of Experts (Theory in Practice)
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Weight | 1.46827866492 Pounds |
Width | 0.68 Inches |
4. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference (Addison-Wesley Data & Analytics)
- Addison-Wesley Professional
Specs:
Height | 8.9 Inches |
Length | 6.9 Inches |
Number of items | 1 |
Release date | October 2015 |
Weight | 0.77602716224 pounds |
Width | 0.5 Inches |
5. Introduction to Data Mining
Specs:
Height | 9.4 Inches |
Length | 7.7 Inches |
Number of items | 1 |
Weight | 2.9982867632 Pounds |
Width | 1.6 Inches |
6. Doing Data Science: Straight Talk from the Frontline
- O'Reilly Media
Specs:
Height | 9 Inches |
Length | 6 Inches |
Number of items | 1 |
Weight | 1.2 Pounds |
Width | 0.78 Inches |
7. R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics)
- Used Book in Good Condition
Specs:
Height | 9.25 Inches |
Length | 7.25 Inches |
Number of items | 1 |
Weight | 1.2786811196 Pounds |
Width | 0.5 Inches |
8. Cassandra High Availability
Specs:
Height | 9.25 Inches |
Length | 7.5 Inches |
Number of items | 1 |
Release date | December 2014 |
Weight | 0.73 Pounds |
Width | 0.42 Inches |
9. Guerrilla Analytics: A Practical Approach to Working with Data
Specs:
Height | 9 Inches |
Length | 6 Inches |
Number of items | 1 |
Release date | October 2014 |
Weight | 0.9038952742 Pounds |
Width | 0.63 Inches |
10. Pandas for Everyone: Python Data Analysis (Addison-Wesley Data & Analytics Series)
Specs:
Height | 8.9 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | January 2018 |
Weight | 1.4991433816 Pounds |
Width | 1 Inches |
11. Hadoop: The Definitive Guide
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | May 2012 |
Weight | 1 Pounds |
Width | 1.5 Inches |
12. Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)
- Used Book in Good Condition
Specs:
Height | 8.95 Inches |
Length | 7.45 Inches |
Number of items | 1 |
Weight | 0.96562470756 Pounds |
Width | 0.7 Inches |
13. Hadoop
- Used Book in Good Condition
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Weight | 1.6 Pounds |
Width | 1.04 Inches |
14. Machine Learning: An Algorithmic Perspective, Second Edition (Chapman & Hall/Crc Machine Learning & Pattern Recognition)
- CRC Press
Specs:
Height | 10 Inches |
Length | 6.9 Inches |
Number of items | 1 |
Weight | 0.00220462262 Pounds |
Width | 1 Inches |
15. Successful Business Intelligence, Second Edition: Unlock the Value of BI & Big Data
- McGraw-Hill Osborne Media
Specs:
Height | 9.1 Inches |
Length | 8.8 Inches |
Number of items | 1 |
Weight | 1.36466140178 Pounds |
Width | 1 Inches |
16. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications (Numerical Insights)
Specs:
Height | 9.21 Inches |
Length | 6.14 Inches |
Number of items | 1 |
Weight | 1.4991433816 Pounds |
Width | 0.88 Inches |
17. Patterns of Data Modeling (Emerging Directions in Database Systems and Applications)
- Wiley
Specs:
Height | 9.13 Inches |
Length | 6.13 Inches |
Number of items | 1 |
Release date | June 2010 |
Weight | 0.89948602896 Pounds |
Width | 0.6 Inches |
18. Google Compute Engine: Managing Secure and Scalable Cloud Computing
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | December 2014 |
Weight | 0.89948602896 Pounds |
Width | 0.56 Inches |
19. The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics
Specs:
Height | 9.299194 Inches |
Length | 6.149594 Inches |
Number of items | 1 |
Release date | May 2013 |
Weight | 0.65697754076 Pounds |
Width | 0.460629 Inches |
20. Hadoop Operations
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | October 2012 |
Weight | 1.1 Pounds |
Width | 0.8 Inches |
🎓 Reddit experts on data mining books
The comments and opinions expressed on this page are written exclusively by redditors. To provide you with the most relevant data, we sourced opinions from the most knowledgeable Reddit users, based on the total number of upvotes and downvotes received across comments on subreddits where data mining books are discussed. For your reference and for the sake of transparency, here are the specialists whose opinions mattered the most in our ranking.
Ok then, I'm going to assume that you're comfortable with linear algebra, basic probability/statistics and have some experience with optimization.
caret package in R, but is also supposed to be a great textbook for modeling in general). I'd start with one of those three books. If you're feeling really ambitious, pick up a copy of either:
Or get both of those books. They're both amazing, but they're not particularly easy reads.
If these book recommendations are a bit intense for you:
Additionally:
Pick up a numerical analysis book of some sort. I learned out of Sauer, "Numerical Analysis" which I wasn't incredibly happy with and so I can't recommend it, but it covered the basics with example code in Matlab. Someone else here can probably recommend a better numerical methods textbook.
Some great free resources:
Other than that I don't know of any resources that could help with implementing machine learning algorithms; it mostly draws from numerical linear algebra / optimization, and the implementation details are left to the reader to grapple with.
For me, it works best to buy a good theory book (Koller & Friedman!) or read a good paper, and look around for blog posts written about a topic of interest that come with examples and sample code. It's unfortunate that everything is so disorganized this way but that's how it seems to be. I'd be happy to learn that I'm wrong here, though :)
Some blogs I can recommend:
Hope this is helpful! This is basically how I've been learning ML on my own for work.
I used R for about 4 years before I moved to Python to use it for deep learning. I have been using Python for about 2 years now.
>Are R and Python considered redundant, or are there some situations where one will be preferred over the other? If I become proficient at using Python for data wrangling, analysis, and visualization, will I have any reason to continue using R?
It depends. I haven't really found anything that I can do in Python that I could not already do in R. I still use R because I like it better as a functional programming language and because it has a wide variety of more specific statistical packages (many for biology) that just aren't available for Python yet. There are some specific cases where I find it more intuitive and simpler to implement a solution in R. And generally, I just prefer ggplot2 over any of the various Python plotting packages. Also, R has a high-level API for things like TensorFlow, so it's not like you can't do deep learning in R.
The biggest advantage for Python is its speed and ability to work within a larger programming framework. A lot of companies tend to use Python because the models they build are integrated into a larger system that needs the capabilities of a fully-fledged programming language. Python is generally faster and has better management of big data sets in memory. R is actually moving more in the direction to fix these issues but there are still limitations.
>Where should I start? I'm looking for a resource that isn't aimed at complete beginners, since I've been using R for a few years, and took a C class before that. At the same time I wouldn't claim to be an experienced programmer. I'm interested in learning Python both for data analysis and for general programming.
I learned Python syntax using Learn Python 3 the Hard Way. I learned about Pandas and data wrangling etc using Pandas for Everyone and Pandas Cookbook. If I was to suggest just one book, it would be Pandas for Everyone. You can learn Python syntax from YouTube, MOOCs, or online tutorials. The Pandas Cookbook is just extra practice. To be honest though, the general conventions used by Pandas for data analysis and manipulation are very similar to R in many ways. Especially if you've used anything in Hadley Wickham's Tidyverse. Finally, I made a Pandas cheatsheet while I was learning and including equivalent R functions in some places. I would be happy to share this Google Sheets file with you if you are interested.
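To illustrate how pandas conventions line up with tidyverse-style R, here is a minimal sketch (the data frame and column names are invented for this example) of roughly what a dplyr group_by() %>% summarise() chain looks like in pandas:

```python
import pandas as pd

# Hypothetical data; columns are made up for illustration
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_len": [1.4, 1.3, 5.1, 5.9],
})

# Roughly the pandas analogue of
#   df %>% group_by(species) %>% summarise(mean_petal = mean(petal_len))
summary = (
    df.groupby("species", as_index=False)
      .agg(mean_petal=("petal_len", "mean"))
)
print(summary)
```

The method-chaining style above is the closest pandas idiom to a tidyverse pipe, which is part of why the transition between the two ecosystems feels familiar.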
>What IDE(s) should I use, and what are some must learn packages? I'm hoping to find something similar to RStudio.
I started off using PyCharm. I've heard good things about Spyder. But now, I actually still use RStudio! It is fully integrated with Python thanks to the Reticulate package. You can pass data structures between the languages and use both in RMarkdown. You can also use virtual environments which are popular with Python. Once you install the package:
library(reticulate)
use_virtualenv("path_to_my_virtual_env") # Start virtual environment
You can now run Python scripts directly in the RStudio console.
It's really easy to use and even comes with auto-complete and everything else.
Hope that helped.
Books (WARNING: these may be a bit dated since this field is changing quite rapidly):
But be careful with the loaded term "NoSQL". Don't get sucked up into the buzz. The main things the NoSQL movement popularized were:
And often, there's no reason to use Hadoop/Mongo/Cassandra/etc. when your dataset doesn't call for it. Use the right tool for the job.
The books are just starting points too. Look in the chapter notes to find a wealth of links to original papers and essays like so:
Oh, yeah. And take good notes in something like Evernote so you can cut/paste braindumps to NoSQL whitepapers/essays like the above.
And if you're wondering how someone could have time to go through the above, all I can say is STOP WATCHING TV--that's what I did. It's amazing what you can do when you stop caring about the latest reality TV buzz or who won what award in sports or entertainment.
OP, it's great that you're recognizing this need early! I was in the same position as you (0 programming experience), so perhaps I can offer you my strategy:
I started off learning R, because a lot of biological data analysis packages have already been written for it (see Bioconductor, http://www.bioconductor.org). R and its corresponding IDE (RStudio) are easy and free to download:
http://cran.us.r-project.org
https://www.rstudio.com
For a super basic introduction to R (this will take you only a few hours), see Code School's 'Try R' tutorial:
http://tryr.codeschool.com
All Bioconductor packages come with a reference manual and sample data that you can use to practice. Also, most have a corresponding publication that explains the algorithm/stats behind it. I just went through and picked a few packages that seemed like they would be useful for the type of data my lab analyzes.
When I finished going through those and running practice data sets, I decided I wanted to actually learn R from scratch so I could understand what each function does. To that end, I bought a few O'Reilly books:
Learning R (http://shop.oreilly.com/product/mobile/0636920028352.do)
R for Everyone (http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1398333875&robot_redir=1)
I've been dedicating my Mondays to going through the books, chapter by chapter, and doing all the exercises. It's been really helpful, and I'm finding it easier and easier to understand the Bioconductor packages.
Finally, I enlisted some other grad students that had no experience but also wanted to learn, and together, we started a weekly meetup in which we each select and demonstrate a Bioconductor package. This basically forces us to keep up with learning and master the packages, since we don't want to lose face :)
Next, I'm planning on diving into Python, but for now, learning R has proven very useful.
Hope this helps a bit, and good luck!
Window functions, not windows function. The idea is that it applies the function over a "window" of the rows in the dataset. They are very useful for solving complicated source data problems. Check out this book for many examples (ignore the "SQL Server 2012" in the title; it's highly useful and relevant on any version of SQL Server that supports them). I highly recommend anything by Itzik Ben-Gan for SQL, anything by Andy Leonard for SSIS, and Marco Russo, Alberto Ferrari, and Chris Webb for DAX / SSAS (even OLAP).
https://www.amazon.com/dp/0735658366/ref=cm_sw_r_cp_awdb_t1_-NtwCbKA67Y7W
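The "window of rows" idea is easy to see outside SQL Server too. Here is a small sketch using Python's built-in sqlite3 (assuming an SQLite build new enough, 3.25+, to support window functions); the table and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('east', 100), ('east', 200), ('west', 50), ('west', 75);
""")

# A running total per region: SUM() applied over a "window" of rows,
# without collapsing them the way GROUP BY would.
rows = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM sales
""").fetchall()
for r in rows:
    print(r)
```

Note how every source row survives in the output; the aggregate just rides along, which is exactly what makes window functions so handy for messy source data problems.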
10 years of BI experience here... I would say that BI covers a huge range of topics, and the questions they asked may be relevant, but I would focus more on how the interviewee goes about finding solutions rather than rote knowledge. You can be book smart but not know how to think and how to break down complex problems into solvable pieces.
I recently hired a guy that knew all about dimensional modeling and a good bit about SSIS. He interviewed well and seemed like a good fit. Two days in, I regretted the hire because his attention to detail was non-existent and his ability to solve problems matched that of a goldfish. Lesson learned, and now I ask different types of interview questions: more of a scenario-based interview rather than a technical pop quiz.
Keep learning and don't let this interview get you down. If you show that you have a passion for solving difficult puzzles then you will find a place in BI.
Also... It sounds like a DBA gave you the technical interview. Brush it off and keep moving. ;)
This is a fairly useful review that I believe is available via google scholar for free: Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual review of psychology, 32, 191–241.
Tufte is useful for a historical overview and for inspiration, but he has a particular style that doesn't necessarily match up with the way that you or your audience think.
Hadley Wickham developed ggplot2 and his site is a good place to start browsing for guides to using it.
There's a pretty good o'reilly book on visualization as well, and Stephen Few's book does a really good job of enumerating the various ways you can express trends in data.
Without a warehouse there will be no data governance over your reporting. So everyone does something different and no numbers match. Not only is this confusing to business, they are less likely to "buy-in" to the reporting you provide.
I strongly recommend Successful Business Intelligence for reading. It's light (non-technical) reading, and the author provides a lot of use cases and examples.
Your best BI tool will be spreadsheets. It's not sexy by any means compared to Power BI or Tableau, but it's the #1 BI tool. You can build a multi-million dollar portal with impressive visualizations, but if the user wants a spreadsheet, and can't get one, then your investment is squandered.
So get good at Excel: power pivots, VLOOKUP, MATCH, INDEX, all that.
I also see other good resources here for you to check out.
Good luck!
I’ll try to explain. The basic layout of a genetic algorithm is rather simple:
Some forms of genetic algorithms may differ from that basic layout, but that’s basically how they work. If it has selection, mutation and crossover, it’s a genetic algorithm. Without crossover, it’s an evolutionary method, but not a genetic algorithm.
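The selection / crossover / mutation loop described above can be sketched in a few lines of Python. This is a generic toy (maximizing the number of 1-bits, the classic "OneMax" problem), not an example from any particular book:

```python
import random

random.seed(0)

GENES, POP, GENERATIONS = 20, 30, 40

def fitness(ind):            # OneMax: count of 1-bits in the genome
    return sum(ind)

def tournament(pop):         # selection: keep the fitter of a random pair
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):       # single-point crossover of two parents
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):  # flip each bit with small probability
    return [g ^ 1 if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))
```

Swap in a different fitness function and genome encoding and the same skeleton solves a very different problem, which is much of the appeal of GAs.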
Source: The genetic algorithms class I’m taking this semester. We’re using this book. I can certainly recommend it, but it’s probably quite geared towards people with a CS background (as are a lot of books on the subject).
Also, this is an interesting Scientific American article on the subject, if you can get hold of a PDF or something.
The pymc3 documentation is a good place to start if you enjoy reading through mini-tutorials: pymc3 docs
Also these books are pretty good, the first is a nice soft introduction to programming with pymc & bayesian methods, and the second is quite nice too, albeit targeted at R/STAN.
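For very simple models, the posterior that PyMC-style samplers approximate can be written down in closed form, which is a nice sanity check when learning. A sketch of the standard beta-binomial coin example (the prior and the flip counts here are invented for illustration):

```python
# Beta-binomial conjugate update: with a Beta(a, b) prior on a coin's
# heads probability and k heads observed in n flips, the posterior is
# Beta(a + k, b + n - k). A sampler would recover the same distribution.
a, b = 1, 1          # Beta(1, 1) prior (i.e. uniform on [0, 1])
k, n = 7, 10         # observed: 7 heads in 10 flips

post_a, post_b = a + k, b + n - k
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)
```

Once the model stops being conjugate like this, that is exactly where probabilistic programming tools earn their keep.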
For your particular application I would look at OpenStreetMaps. Otherwise...
David Hay's
Len Silverston's
Michael Blaha's [Patterns of Data Modeling][7]. This one has some interesting temporal, graph, and tree models.
Martin Fowler's [Analysis Patterns][8]. This one skims some of the other patterns, but gives accounting a solid treatment.
They are all well-rated; I have read all but one, and they are all very good. Several of them are available on [safaribooksonline][9].
Also, OASIS's [Universal Business Language][10], schemas
[1]: http://www.amazon.com/Enterprise-Model-Patterns-Describing-Version/dp/1935504053/ref=sr_1_1?s=books&ie=UTF8&qid=1346950468&sr=1-1&keywords=enterprise%20model%20patterns
[2]: http://www.amazon.com/Data-Model-Patterns-David-Hay/dp/0932633749/ref=pd_sim_b_2?ie=UTF8&refRID=1PQPGE4E6T2RPR2XTN80
[3]: http://www.amazon.com/Data-Model-Patterns-Metadata-Management/dp/0120887983/ref=pd_sim_b_3?ie=UTF8&refRID=1PQPGE4E6T2RPR2XTN80
[4]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237/ref=pd_sim_b_3?ie=UTF8&refRID=08T9TEZJNZM2EMKZV3AB
[5]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471353485/ref=pd_sim_b_3?ie=UTF8&refRID=1D5TDG7479G7TQMBPNWF
[6]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0470178450/ref=pd_sim_b_4?ie=UTF8&refRID=08T9TEZJNZM2EMKZV3AB
[7]: http://www.amazon.com/Patterns-Modeling-Emerging-Directions-Applications/dp/1439819890/ref=sr_1_1?s=books&ie=UTF8&qid=1346950554&sr=1-1&keywords=patterns%20of%20data%20modeling
[8]: http://www.amazon.com/Analysis-Patterns-Reusable-Object-Models/dp/0201895420/ref=sr_1_1?s=books&ie=UTF8&qid=1346961699&sr=1-1&keywords=analysis+patterns
[9]: http://my.safaribooksonline.com/search?q=data%20model
[10]: http://en.wikipedia.org/wiki/Universal_Business_Language
Yo, I'm not getting that image, but at a base level I can tell you this -
Here's a good book for Python: http://www.nltk.org/book/
A link to some more: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/DH2011-Manning.pdf
And for R, there's http://www.springer.com/us/book/9783319207018
and
https://www.amazon.com/Analysis-Students-Literature-Quantitative-Humanities-ebook/dp/B00PUM0DAA/ref=sr_1_9?ie=UTF8&qid=1483316118&sr=8-9&keywords=humanities+r
There's also this https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/ref=asap_bc?ie=UTF8 for web scraping with Python
I know the R context better, and using R, you'd want to do something like this:
And that's that!
Don't be. Basically everything here is either about conditional probability (naive Bayes) or weights*inputs + bias > threshold calculations. The weights*inputs stuff is basic neural networks. Support vector machines, perceptrons, and kernel transformations are all about finding linearly separable classes (in some appropriate dimension).
For more info this will help http://en.m.wikipedia.org/wiki/Perceptron
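The weights*inputs + bias > threshold calculation really is that small. Here is a toy perceptron, not taken from any of the linked posts, that learns the AND function with the classic perceptron update rule:

```python
# A perceptron fires when weights·inputs + bias > 0
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Training data for logical AND: inputs and target outputs
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):                      # perceptron update rule
    for x, target in data:
        err = target - predict(w, b, x)
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        b += lr * err

print([predict(w, b, x) for x, _ in data])
```

Because AND is linearly separable, the loop converges after a handful of passes; XOR, famously, would not converge with a single perceptron.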
If you are interested I really have enjoyed reading http://www.amazon.com/gp/aw/d/1420067184/ref=redir_mdp_mobile which explains really well a lot of machine learning stuff and demystifies the math
For some practical examples here are some blog posts I've written:
K means step by step in f#: http://onoffswitch.net/k-means-step-by-step-in-f/
Automatic fogbugz triage with naive bayes: http://onoffswitch.net/fogbugz-priority-prediction-naive-bayes/
I share the links mostly because it's nice to see a worked-through, practical example with code. In the end, a lot of these algorithms aren't that hard to implement with some matrix math.
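In the same spirit as the k-means post linked above, here is a minimal one-dimensional k-means loop (the data points and starting centroids are made up): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, repeat.

```python
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: bucket each point under its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its bucket
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

print(kmeans([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

With well-separated data like this it stabilizes after a couple of iterations; real implementations add a convergence check and smarter initialization.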
Anyways, hope this helps!
SDK reference - this book. Yes, it's old, but it tells you how to talk to the API in Python.
For the SDK you basically only need to learn GCloud SDK i.e. (gcloud, gsutil, bq and kubectl). However what you need to learn is dependent on whether you need compute (in which case gcloud + gsutil) or kubernetes (kubectl)
Save yourself some effort - make sure everything logs to stackdriver, so you can trace/debug as necessary.
Where practical, use Google's managed services, and look at the Marketplace options rather than configuring your own application instances. Go onto YouTube and watch Cloud Next sessions - they have lots of useful information. If you want to practice, get a Qwiklabs account and run through the many lab-based exercises on how to use GCP.
Great replacement for Statistics 101: The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics by Kristin H. Jarman.
The book was hilarious. She uses data analysis for a bunch of hilarious tasks, for example running a frequency distribution on "Yo Momma" jokes from the net, or using probability to jokingly track Bigfoot.
Frequency Distribution of Yo Momma Jokes
Back to Bigfoot: she uses probability to figure out everything in her Bigfoot search, from buying the right camera to the best places to look. It's hilarious.
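A frequency distribution like the one she runs on joke text is, at bottom, just word counting. A minimal Python sketch with invented sample lines standing in for the joke corpus:

```python
from collections import Counter

# Made-up lines standing in for the book's "Yo Momma" joke data set
jokes = [
    "yo momma is so old she knew Burger King when he was a prince",
    "yo momma is so nice she bakes cookies for everyone",
]

# Tokenize crudely and count word frequencies
words = " ".join(jokes).lower().split()
freq = Counter(words)
print(freq.most_common(3))
```

Real text analysis adds stopword removal and better tokenization, but the frequency table underneath is the same idea.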
>I quit my job and head off in search of the creature. With visions of fame and fortune running through my head, I cash in my savings, say goodbye to my family, and drive away in my newly purchased vintage mini-bus.
>
>As I leave the city limits, my thoughts turn to the task ahead. Bigfoot exists, there’s no doubt about it. He’s out there, waiting to be discovered. And who better than a statistician-turned-monster-hunter to discover him? I’ve got scientific objectivity, some newly acquired free time, and a really good GPS from Sergeant Bub’s Army Surplus store.
>
>It’s too late to get my job back, and my husband isn’t taking my calls, so it seems I have no choice but to continue my search. I decide I’m going to do it right. I may never find the proof I’m looking for, but I’ll give it my best, most scientific effort. Whatever evidence I find will stand up to the scrutiny of my ex-boss, my family, and all those newspaper reporters who’ll be pounding on my door, begging for interviews.
I'm pretty sure this book talked about how easy it was to scrape facebook before they locked down their API.
https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/
A lot of people probably did this. I remember a talk given in my city, the guy had a few thousand people signup to his app and got millions of entries to his graph database
https://maxdemarzi.com/2013/01/28/facebook-graph-search-with-cypher-and-neo4j/
Popular game devs probably got oodles of data. Must have been awesome having a social graph of the US
"Cassandra High Availability" by Robbie Strickland has been highly recommended by others. It's on the top of my list of tech books to check out.
I'd recommend reading the Dynamo and Bigtable papers if you haven't already as C* uses a lot of ideas from those papers.
The Last Pickle has a great blog that has a lot of nice deep dive articles.
The Cassandra wiki has some nice detailed pages describing the design and read/write paths.
If you have an O'Reilly subscription, there is a pretty good video with Ben and a few other Cloudera guys that's probably a good quickstart on cluster security: A Practitioner's Guide to Securing Hadoop.
Hadoop Operations is a good book but getting a bit outdated. I'd say just start setting up a cluster, and maybe read it if you have some free time.
I would check out Ben Fry's book first.
Then Beautiful Visualization.
There is also some good McCandless eye candy.
Manuel Lima did an amazing book on network visualization with excellent essays from other people: Visual Complexity. Network vis is very difficult, and if you want to "game up", understanding the taxonomy he built for network vis will give you a real perspective on the taxonomy in other types of vis.
There are things outside of the "take data and render visualization" world that are critical to data vis, imo. For moving data vis, start with the godfather, Muybridge
And look way way back for the long human history of data vis in cartography with stuff like Cartographia.
Hope to see some more books and discover a reading list on this thread! Great idea for a post.
If you want to be valuable to companies post graduation you should learn more about programming (design templates, how to write tests, how to go from a paper to code). I recommend this book as a good starting place. Once you're comfortable with how the different methods work, pick up this book.
How far in are you? I've been wanting to do more stuff with PyMC3. I recently bought Bayesian Methods for Hackers to do so but a Slack study group sounds very helpful if I were to order this book as well.
Edit: I'd just like to confirm this group does plenty of Python despite the book only using R and whatever Stan is.
I haven't personally read it, but the table of contents of this book looks like it has some potential to be a more modern version of what the Mitchell book aimed to provide.
I may be teaching a course in ML next year; I may order a copy to see if it's any good. Has anyone else looked at it?
You might look at the O'Reilly book "Mining the Social Web". I've found it very helpful and the author has even responded to my questions about how to get the virtual machine with the ipython notebooks up and running. Has code examples, explanations about the underlying technologies, and intros to things like natural language processing included.
http://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615
I am a book guy myself, ebooks to be specific. I really liked R for Everyone: http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1405401911&robot_redir=1
Pick one and stick with it. I really just recommend R for a number of reasons, but Python is also a good choice; R just works for me.
You need a personal project to work on. Mine was an end of the year report.
Hoowhee, how did this text get so... wally.
Bots are usually (fairly) simple programs, so Python will make it easy to get at all the common functionality you want (maybe looking for pattern matches in a piece of text, some math/analytics, saving files to your hard drive, converting images...) and in practice you'll mostly only be limited by what you can figure out how to do in your language of choice, regardless of the bot you're writing.
Wherever possible, you should use official APIs (which will often support Python these days), or at least third-party APIs that are built on top of the official ones. The APIs are sort of like a mediator between your bot and the service, or a menu of remotely-accessible functionality -- for twitter it might include things like "get the list of tweet IDs posted by this user ID in the last month" and "get the full text and metadata for this tweet ID." The set of functionality in that API determines what is and isn't possible for your bot to do (and depending on the service, it might actually hide a lot of complexity around sending messages to multiple servers, authenticating the request, etc)
When there's no API (or if the official API doesn't let you do something that you know should be possible) you usually have to switch to scraping. It's error-prone (could break any time) and frowned on by a lot of services (which is why you have to think about rate limiting and bans -- you may well be violating their terms of service, and either way you're using the service in unintended ways that might interfere with its normal functioning). "Unofficial APIs" are often just scrapers under the hood, tidied up into something that looks more like a normal API. I've written a ton of little scrapers in Python -- it really is a great tool for the job.
I suppose the other case is that some services can be built in standardized ways, so you don't need an official API from that particular company, because anyone else's API for that standard should be interoperable. That's common for databases, for example, but probably not the services you're talking about -- the popular web services are usually either proprietary, or a "standard" they invented that no one else actually uses, so you're basically stuck with the official API anyways.
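A minimal sketch of the scraping approach described above, using only the standard library; the HTML here is an inline snippet standing in for a fetched page, so the example stays self-contained (a real scraper would pull it over HTTP, respect rate limits, and expect the markup to change under it):

```python
from html.parser import HTMLParser

# Inline HTML standing in for an HTTP response body
PAGE = """
<html><body>
  <a href="/post/1">First post</a>
  <a href="/post/2">Second post</a>
</body></html>
"""

class LinkScraper(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

scraper = LinkScraper()
scraper.feed(PAGE)
print(scraper.links)
```

This is the fragile part of scraping in miniature: the scraper encodes assumptions about the page's structure, and any redesign of the page silently breaks them, which is why official APIs are preferable whenever they exist.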
For a lot of examples of integrating with public APIs, you can try Mining the Social Web from O'Reilly. I didn't actually spend a lot of time with that book personally (I wasn't expecting the sort of "cookbook" format with lots of examples and code) but it might cover some of the APIs that you're interested in.
I teach applied math, stats, and computation courses to B.S. degree seeking students. Two observations:
Okay, rant concluded, book recommendations! First, try Doing Data Science by O'Neil and Schutt. This assumes some knowledge of linear algebra, stats, and programming. Examples are given in R. I think this book is very good at bringing out the idea that data science involves both theory and experience, and at conveying the "feel" of working on data science problems.
Second, if you're looking for plenty of math, you might take a look at Machine Learning: A Probabilistic Perspective. This takes a closer look at the data modeling process, which is somewhat lacking in more CS-oriented texts on ML. Requires good knowledge of probability, obviously.
I'm afraid I can't answer your question specifically, but Stephen Marsland has written a great introductory level machine learning book that has all of the example code written in python.
I'm reading a book called Beautiful Visualization that I strongly recommend.
http://www.amazon.com/Beautiful-Visualization-Looking-through-Practice/dp/1449379869
Think about any time someone tells a story with stats. Is it misleading? How could it be more objective? What are the trade-offs? How could people misinterpret your visual?
It really depends on the book. Some books, like Automate the Boring Stuff with Python, are broken up into a series of projects that get progressively more advanced. On the other hand, O'Reilly's Hadoop: The Definitive Guide is more of a general walkthrough of principles and fundamentals. u/Updatebjarni gave really good advice.
I think this book is helpful.
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=sr_1_1?ie=UTF8&qid=1418697151&sr=8-1
Cloudera usually has some good docs/blogs. Try poking around there.
http://blog.cloudera.com/blog/category/hadoop/
I'd recommend Guerrilla Analytics: A Practical Approach to Working with Data. I bought it on a recommendation from somewhere on reddit (possibly this sub). It provides good software-neutral guidance for building data capabilities within an organization.
You can find it here on amazon.com:
http://www.amazon.com/gp/product/1420067184?ie=UTF8&tag=oliviergrisel-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1420067184
(Please feel free to strip the reference to oliviergrisel-20 in the URL should you want not to tip me through the amazon affiliates program)
The book Doing Data Science, cowritten by Cathy O'Neil (of Weapons of Math Destruction), may be of interest to you.
> In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
I haven't read the whole thing yet, but it's well-written and has a nice survey of topics.
Doing Data Science by Cathy O'Neil
https://www.amazon.com/Doing-Data-Science-Straight-Frontline/dp/1449358659/ref=mp_s_a_1_3?crid=32WET5G295L42&keywords=doing+data+science&qid=1556807847&s=gateway&sprefix=doing+data+sci&sr=8-3
She does a great job explaining the process of doing data science and the principles behind it.
Mining the Social Web
Not exactly what you're looking for but it's very helpful, imo
Cassandra High Availability is a solid new book I'd recommend. It's very up to date with a section on CQL and Spark connector.
Beautiful Visualization Here
O'Reilly Designing Interfaces Here
From my experience this is an excellent book and I'm eagerly awaiting the second edition--which comes out in Feb 2017? :(
http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367
I picked this up on the recommendation of a colleague. Very useful.
https://www.amazon.co.uk/Guerrilla-Analytics-Practical-Approach-Working/dp/0128002182
Python and NumPy.
I will link some tutorials soon (Using a phone)
There's a book called:
Machine Learning: An Algorithmic Perspective
https://www.amazon.co.uk/Machine-Learning-Algorithmic-Perspective-Recognition/dp/1466583282
A useful book to have.
Marsland's ML book sounds like what you're looking for. https://smile.amazon.com/gp/product/1466583282/ref=oh_aui_search_detailpage?ie=UTF8&psc=1
I highly recommend Pandas for Everyone: Python Data Analysis by Daniel Chen. It's extremely practical and well-organized.
In terms of AI, I used this text for one of my AI/machine learning courses. Would recommend Artificial Intelligence: A Modern Approach
Some other suggested readings from that topic:
Introduction to Data Mining
Artificial Intelligence
Also check out fast.ai; they have 4 very good courses. Then this one is also good. Hugo Larochelle had a neural networks course, a bit older.
For books I'd also add The Hundred-Page Machine Learning Book and this one, probably the best practical book, but wait for the 2nd edition with TensorFlow 2.0: it has tf.keras.layers and the Sequential model; TF 2 basically includes Keras, so you're rid of all that sessions junk. There would also be this, this, and this. Don't waste time on Bengio's deep learning book; it's a superficial mess. Happy learning, and let's see more Romanians with ML and DL papers!
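The tf.keras Sequential API mentioned above looks roughly like this; the layer sizes and activations are arbitrary, chosen only to show the shape of the API (no sessions involved, just Python objects).

```python
# Minimal sketch of the tf.keras Sequential API in TensorFlow 2.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                        # 10 input features
    tf.keras.layers.Dense(32, activation="relu"),       # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(len(model.layers))  # two Dense layers
```

From here, training is a single `model.fit(x, y)` call rather than the explicit graph/session bookkeeping of TF 1.x.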
Correct. I felt this would be a fairly accessible recent example of data mining I could reference for sake of brevity, rather than going into detail on classification and prediction algorithms. The Target story was close enough to examples in the Introduction to Data Mining textbook we used at my University.
Two Python ML resources below. The former mixes math with working code. The latter is new and appears to be more of a guide to applying scikit-learn and/or milk specifically.
EDIT: All of the code from the first one is here: http://www-ist.massey.ac.nz/smarsland/MLBook.html