Reddit mentions: The best data mining books
We found 160 Reddit comments discussing the best data mining books. We ran sentiment analysis on each of these comments to determine how redditors feel about different products. We found 58 products and ranked them based on the amount of positive reactions they received. Here are the top 20.
1. Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | October 2013 |
Weight | 1.61 Pounds |
Width | 0.94 Inches |
2. Machine Learning: An Algorithmic Perspective (Chapman & Hall/Crc Machine Learning & Pattern Recognition)
- Used Book in Good Condition
Specs:
Height | 9.75 Inches |
Length | 6.75 Inches |
Number of items | 1 |
Weight | 1.4991433816 Pounds |
Width | 1 Inches |
3. Beautiful Visualization: Looking at Data through the Eyes of Experts (Theory in Practice)
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Weight | 1.46827866492 Pounds |
Width | 0.68 Inches |
4. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference (Addison-Wesley Data & Analytics)
- Addison-Wesley Professional
Specs:
Height | 8.9 Inches |
Length | 6.9 Inches |
Number of items | 1 |
Release date | October 2015 |
Weight | 0.77602716224 pounds |
Width | 0.5 Inches |
5. Introduction to Data Mining
Specs:
Height | 9.4 Inches |
Length | 7.7 Inches |
Number of items | 1 |
Weight | 2.9982867632 Pounds |
Width | 1.6 Inches |
6. Doing Data Science: Straight Talk from the Frontline
- O'Reilly Media
Specs:
Height | 9 Inches |
Length | 6 Inches |
Number of items | 1 |
Weight | 1.2 Pounds |
Width | 0.78 Inches |
7. R for Everyone: Advanced Analytics and Graphics (Addison-Wesley Data and Analytics)
- Used Book in Good Condition
Specs:
Height | 9.25 Inches |
Length | 7.25 Inches |
Number of items | 1 |
Weight | 1.2786811196 Pounds |
Width | 0.5 Inches |
8. Cassandra High Availability
Specs:
Height | 9.25 Inches |
Length | 7.5 Inches |
Number of items | 1 |
Release date | December 2014 |
Weight | 0.73 Pounds |
Width | 0.42 Inches |
9. Guerrilla Analytics: A Practical Approach to Working with Data
Specs:
Height | 9 Inches |
Length | 6 Inches |
Number of items | 1 |
Release date | October 2014 |
Weight | 0.9038952742 Pounds |
Width | 0.63 Inches |
10. Pandas for Everyone: Python Data Analysis (Addison-Wesley Data & Analytics Series)
Specs:
Height | 8.9 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | January 2018 |
Weight | 1.4991433816 Pounds |
Width | 1 Inches |
11. Hadoop: The Definitive Guide
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | May 2012 |
Weight | 1 Pounds |
Width | 1.5 Inches |
12. Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions (Developer Reference)
- Used Book in Good Condition
Specs:
Height | 8.95 Inches |
Length | 7.45 Inches |
Number of items | 1 |
Weight | 0.96562470756 Pounds |
Width | 0.7 Inches |
13. Hadoop
- Used Book in Good Condition
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Weight | 1.6 Pounds |
Width | 1.04 Inches |
14. Machine Learning: An Algorithmic Perspective, Second Edition (Chapman & Hall/Crc Machine Learning & Pattern Recognition)
- CRC Press
Specs:
Height | 10 Inches |
Length | 6.9 Inches |
Number of items | 1 |
Weight | 0.00220462262 Pounds |
Width | 1 Inches |
15. Successful Business Intelligence, Second Edition: Unlock the Value of BI & Big Data
- McGraw-Hill Osborne Media
Specs:
Height | 9.1 Inches |
Length | 8.8 Inches |
Number of items | 1 |
Weight | 1.36466140178 Pounds |
Width | 1 Inches |
16. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications (Numerical Insights)
Specs:
Height | 9.21 Inches |
Length | 6.14 Inches |
Number of items | 1 |
Weight | 1.4991433816 Pounds |
Width | 0.88 Inches |
17. Patterns of Data Modeling (Emerging Directions in Database Systems and Applications)
- Wiley
Specs:
Height | 9.13 Inches |
Length | 6.13 Inches |
Number of items | 1 |
Release date | June 2010 |
Weight | 0.89948602896 Pounds |
Width | 0.6 Inches |
18. Google Compute Engine: Managing Secure and Scalable Cloud Computing
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | December 2014 |
Weight | 0.89948602896 Pounds |
Width | 0.56 Inches |
19. The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics
Specs:
Height | 9.299194 Inches |
Length | 6.149594 Inches |
Number of items | 1 |
Release date | May 2013 |
Weight | 0.65697754076 Pounds |
Width | 0.460629 Inches |
20. Hadoop Operations
- O'Reilly Media
Specs:
Height | 9.19 Inches |
Length | 7 Inches |
Number of items | 1 |
Release date | October 2012 |
Weight | 1.1 Pounds |
Width | 0.8 Inches |
🎓 Reddit experts on data mining books
The comments and opinions expressed on this page are written exclusively by redditors. To provide you with the most relevant data, we sourced opinions from the most knowledgeable Reddit users, based on the total number of upvotes and downvotes received across comments on subreddits where data mining books are discussed. For your reference and for the sake of transparency, here are the specialists whose opinions mattered the most in our ranking.
Ok then, I'm going to assume that you're comfortable with linear algebra, basic probability/statistics and have some experience with optimization.
caret package in R, but is also supposed to be a great textbook for modeling in general). I'd start with one of those three books. If you're feeling really ambitious, pick up a copy of either:
Or get both of those books. They're both amazing, but they're not particularly easy reads.
If these book recommendations are a bit intense for you:
Additionally:
Pick up a numerical analysis book of some sort. I learned out of Sauer, "Numerical Analysis" which I wasn't incredibly happy with and so I can't recommend it, but it covered the basics with example code in Matlab. Someone else here can probably recommend a better numerical methods textbook.
Some great free resources:
Other than that I don't know of any resources that could help with implementing machine learning algorithms; it mostly draws from numerical linear algebra / optimization, and the implementation details are left to the reader to grapple with.
For me, it works best to buy a good theory book (Koller & Friedman!) or read a good paper, and look around for blog posts written about a topic of interest that come with examples and sample code. It's unfortunate that everything is so disorganized this way but that's how it seems to be. I'd be happy to learn that I'm wrong here, though :)
Some blogs I can recommend:
Hope this is helpful! This is basically how I've been learning ML on my own for work.
I used R for about 4 years before I moved to Python to use it for deep learning. I have been using Python for about 2 years now.
>Are R and Python considered redundant, or are there some situations where one will be preferred over the other? If I become proficient at using Python for data wrangling, analysis, and visualization, will I have any reason to continue using R?
It depends. I haven't really found anything that I can do in Python that I could not already do in R. I still use R because I like it better as a functional programming language and because it has a wide variety of more specific statistical packages (many for biology) that just aren't available for Python yet. There are some specific cases where I find it more intuitive and simpler to implement a solution in R. And generally, I just prefer ggplot2 over any of the various Python plotting packages. Also, R has a high-level API for things like TensorFlow, so it's not like you can't do deep learning in R.
The biggest advantage for Python is its speed and ability to work within a larger programming framework. A lot of companies tend to use Python because the models they build are integrated into a larger system that needs the capabilities of a fully-fledged programming language. Python is generally faster and has better management of big data sets in memory. R is actually moving more in the direction to fix these issues but there are still limitations.
>Where should I start? I'm looking for a resource that isn't aimed at complete beginners, since I've been using R for a few years, and took a C class before that. At the same time I wouldn't claim to be an experienced programmer. I'm interested in learning Python both for data analysis and for general programming.
I learned Python syntax using Learn Python 3 the Hard Way. I learned about Pandas and data wrangling etc using Pandas for Everyone and Pandas Cookbook. If I was to suggest just one book, it would be Pandas for Everyone. You can learn Python syntax from YouTube, MOOCs, or online tutorials. The Pandas Cookbook is just extra practice. To be honest though, the general conventions used by Pandas for data analysis and manipulation are very similar to R in many ways. Especially if you've used anything in Hadley Wickham's Tidyverse. Finally, I made a Pandas cheatsheet while I was learning and including equivalent R functions in some places. I would be happy to share this Google Sheets file with you if you are interested.
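To illustrate how pandas conventions line up with tidyverse-style R, here is a minimal sketch (the data frame and column names are invented for this example) of roughly what a dplyr group_by() %>% summarise() chain looks like in pandas:

```python
import pandas as pd

# Hypothetical data; columns are made up for illustration
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_len": [1.4, 1.3, 5.1, 5.9],
})

# Roughly the pandas analogue of
#   df %>% group_by(species) %>% summarise(mean_petal = mean(petal_len))
summary = (
    df.groupby("species", as_index=False)
      .agg(mean_petal=("petal_len", "mean"))
)
print(summary)
```

The method-chaining style above is the closest pandas idiom to a tidyverse pipe, which is part of why the transition between the two ecosystems feels familiar.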
>What IDE(s) should I use, and what are some must learn packages? I'm hoping to find something similar to RStudio.
I started off using PyCharm. I've heard good things about Spyder. But now, I actually still use RStudio! It is fully integrated with Python thanks to the Reticulate package. You can pass data structures between the languages and use both in RMarkdown. You can also use virtual environments which are popular with Python. Once you install the package:
library(reticulate)
use_virtualenv("path_to_my_virtual_env") # Start virtual environment
You can now run Python scripts directly in the RStudio console.
It's really easy to use and even comes with auto-complete and everything else.
Hope that helped.
Books (WARNING: these may be a bit dated since this field is changing quite rapidly):
But be careful with the loaded term "NoSQL". Don't get sucked up into the buzz. The main things the NoSQL movement popularized were:
And often, there's no reason to use Hadoop/Mongo/Cassandra/etc. when your dataset doesn't call for it. Use the right tool for the job.
The books are just starting points too. Look in the chapter notes to find a wealth of links to original papers and essays like so:
Oh, yeah. And take good notes in something like Evernote so you can cut/paste braindumps to NoSQL whitepapers/essays like the above.
And if you're wondering how someone could have time to go through the above, all I can say is STOP WATCHING TV--that's what I did. It's amazing what you can do when you stop caring about the latest reality TV buzz or who won what award in sports or entertainment.
OP, it's great that you're recognizing this need early! I was in the same position as you (0 programming experience), so perhaps I can offer you my strategy:
I started off learning R, because a lot of biological data analysis packages have already been written for it (see Bioconductor, http://www.bioconductor.org). R and its corresponding IDE (RStudio) are easy and free to download:
http://cran.us.r-project.org
https://www.rstudio.com
For a super basic introduction to R (this will take you only a few hours), see Code School's 'Try R' tutorial:
http://tryr.codeschool.com
All Bioconductor packages come with a reference manual and sample data that you can use to practice. Also, most have a corresponding publication that explains the algorithm/stats behind it. I just went through and picked a few packages that seemed like they would be useful for the type of data my lab analyzes.
When I finished going through those and running practice data sets, I decided I wanted to actually learn R from scratch so I could understand what each function does. To that end, I bought a few O'Reilly books:
Learning R (http://shop.oreilly.com/product/mobile/0636920028352.do)
R for Everyone (http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1398333875&robot_redir=1)
I've been dedicating my Mondays to going through the books, chapter by chapter, and doing all the exercises. It's been really helpful, and I'm finding it easier and easier to understand the Bioconductor packages.
Finally, I enlisted some other grad students that had no experience but also wanted to learn, and together, we started a weekly meetup in which we each select and demonstrate a Bioconductor package. This basically forces us to keep up with learning and master the packages, since we don't want to lose face :)
Next, I'm planning on diving into Python, but for now, learning R has proven very useful.
Hope this helps a bit, and good luck!
Window functions, not windows function. The idea is that it applies the function over a "window" of the rows in the dataset. They are very useful for solving complicated source data problems. Check out this book for many examples (ignore the "SQL Server 2012" in the title; it's highly useful and relevant on any version of SQL Server that supports them). I highly recommend anything by Itzik Ben-Gan for SQL, anything by Andy Leonard for SSIS, and Marco Russo, Alberto Ferrari, and Chris Webb for DAX / SSAS (even OLAP).
https://www.amazon.com/dp/0735658366/ref=cm_sw_r_cp_awdb_t1_-NtwCbKA67Y7W
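The "window of rows" idea is easy to see outside SQL Server too. Here is a small sketch using Python's built-in sqlite3 (assuming an SQLite build new enough, 3.25+, to support window functions); the table and values are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('east', 100), ('east', 200), ('west', 50), ('west', 75);
""")

# A running total per region: SUM() applied over a "window" of rows,
# without collapsing them the way GROUP BY would.
rows = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM sales
""").fetchall()
for r in rows:
    print(r)
```

Note how every source row survives in the output; the aggregate just rides along, which is exactly what makes window functions so handy for messy source data problems.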
10 years of BI experience here... I would say that BI covers a huge range of topics, and the questions they asked may be relevant, but I would focus more on how the interviewee goes about finding solutions rather than rote knowledge. You can be book smart but not know how to think and how to break down complex problems into solvable pieces.
I recently hired a guy that knew all about dimensional modeling and a good bit about SSIS. He interviewed well and seemed like a good fit. Two days in, I regretted the hire because his attention to detail was non-existent and his ability to solve problems matched that of a goldfish. Lesson learned, and now I ask different types of interview questions: more of a scenario-based interview rather than a technical pop quiz.
Keep learning and don't let this interview get you down. If you show that you have a passion for solving difficult puzzles then you will find a place in BI.
Also... It sounds like a DBA gave you the technical interview. Brush it off and keep moving. ;)
This is a fairly useful review that I believe is available via google scholar for free: Wainer, H., & Thissen, D. (1981). Graphical data analysis. Annual review of psychology, 32, 191–241.
Tufte is useful for a historical overview and for inspiration, but he has a particular style that doesn't necessarily match up with the way that you or your audience think.
Hadley Wickham developed ggplot2 and his site is a good place to start browsing for guides to using it.
There's a pretty good o'reilly book on visualization as well, and Stephen Few's book does a really good job of enumerating the various ways you can express trends in data.
Without a warehouse there will be no data governance over your reporting. So everyone does something different and no numbers match. Not only is this confusing to business, they are less likely to "buy-in" to the reporting you provide.
I strongly recommend Successful Business Intelligence for reading. It's light (non-technical) reading, and the author provides a lot of use cases and examples.
Your best BI tool will be spreadsheets. It's not sexy by any means compared to Power BI or Tableau, but it's the #1 BI tool. You can build a multi-million dollar portal with impressive visualizations, but if the user wants a spreadsheet, and can't get one, then your investment is squandered.
So get good at Excel: power pivots, VLOOKUP, MATCH, INDEX, all that.
I also see other good resources here for you to check out.
Good luck!
I’ll try to explain. The basic layout of a genetic algorithm is rather simple:
Some forms of genetic algorithms may differ from that basic layout, but that’s basically how they work. If it has selection, mutation and crossover, it’s a genetic algorithm. Without crossover, it’s an evolutionary method, but not a genetic algorithm.
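The selection / crossover / mutation loop described above can be sketched in a few lines of Python. This is a generic toy (maximizing the number of 1-bits, the classic "OneMax" problem), not an example from any particular book:

```python
import random

random.seed(0)

GENES, POP, GENERATIONS = 20, 30, 40

def fitness(ind):            # OneMax: count of 1-bits in the genome
    return sum(ind)

def tournament(pop):         # selection: keep the fitter of a random pair
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):       # single-point crossover of two parents
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):  # flip each bit with small probability
    return [g ^ 1 if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(tournament(pop), tournament(pop)))
           for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))
```

Swap in a different fitness function and genome encoding and the same skeleton solves a very different problem, which is much of the appeal of GAs.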
Source: The genetic algorithms class I’m taking this semester. We’re using this book. I can certainly recommend it, but it’s probably quite geared towards people with a CS background (as are a lot of books on the subject).
Also, this is an interesting Scientific American article on the subject, if you can get hold of a PDF or something.
The pymc3 documentation is a good place to start if you enjoy reading through mini-tutorials: pymc3 docs
Also these books are pretty good, the first is a nice soft introduction to programming with pymc & bayesian methods, and the second is quite nice too, albeit targeted at R/STAN.
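For very simple models, the posterior that PyMC-style samplers approximate can be written down in closed form, which is a nice sanity check when learning. A sketch of the standard beta-binomial coin example (the prior and the flip counts here are invented for illustration):

```python
# Beta-binomial conjugate update: with a Beta(a, b) prior on a coin's
# heads probability and k heads observed in n flips, the posterior is
# Beta(a + k, b + n - k). A sampler would recover the same distribution.
a, b = 1, 1          # Beta(1, 1) prior (i.e. uniform on [0, 1])
k, n = 7, 10         # observed: 7 heads in 10 flips

post_a, post_b = a + k, b + n - k
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)
```

Once the model stops being conjugate like this, that is exactly where probabilistic programming tools earn their keep.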
For your particular application I would look at OpenStreetMaps. Otherwise...
David Hay's
Len Silverston's
Michael Blaha's [Patterns of Data Modeling][7]. This one has some interesting temporal, graph, and tree models.
Martin Fowler's [Analysis Patterns][8]. This one skims some of the other patterns, but gives accounting a solid treatment.
They are all well-rated; I have read all but one, and they are all very good. Several of them are available on [safaribooksonline][9].
Also, OASIS's [Universal Business Language][10], schemas
[1]: http://www.amazon.com/Enterprise-Model-Patterns-Describing-Version/dp/1935504053/ref=sr_1_1?s=books&ie=UTF8&qid=1346950468&sr=1-1&keywords=enterprise%20model%20patterns
[2]: http://www.amazon.com/Data-Model-Patterns-David-Hay/dp/0932633749/ref=pd_sim_b_2?ie=UTF8&refRID=1PQPGE4E6T2RPR2XTN80
[3]: http://www.amazon.com/Data-Model-Patterns-Metadata-Management/dp/0120887983/ref=pd_sim_b_3?ie=UTF8&refRID=1PQPGE4E6T2RPR2XTN80
[4]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237/ref=pd_sim_b_3?ie=UTF8&refRID=08T9TEZJNZM2EMKZV3AB
[5]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471353485/ref=pd_sim_b_3?ie=UTF8&refRID=1D5TDG7479G7TQMBPNWF
[6]: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0470178450/ref=pd_sim_b_4?ie=UTF8&refRID=08T9TEZJNZM2EMKZV3AB
[7]: http://www.amazon.com/Patterns-Modeling-Emerging-Directions-Applications/dp/1439819890/ref=sr_1_1?s=books&ie=UTF8&qid=1346950554&sr=1-1&keywords=patterns%20of%20data%20modeling
[8]: http://www.amazon.com/Analysis-Patterns-Reusable-Object-Models/dp/0201895420/ref=sr_1_1?s=books&ie=UTF8&qid=1346961699&sr=1-1&keywords=analysis+patterns
[9]: http://my.safaribooksonline.com/search?q=data%20model
[10]: http://en.wikipedia.org/wiki/Universal_Business_Language
Yo, I'm not getting that image, but at a base level I can tell you this -
Here's a good book for Python: http://www.nltk.org/book/
A link to some more: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/DH2011-Manning.pdf
And for R, there's http://www.springer.com/us/book/9783319207018
and
https://www.amazon.com/Analysis-Students-Literature-Quantitative-Humanities-ebook/dp/B00PUM0DAA/ref=sr_1_9?ie=UTF8&qid=1483316118&sr=8-9&keywords=humanities+r
There's also this https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/ref=asap_bc?ie=UTF8 for web scraping with Python
I know the R context better, and using R, you'd want to do something like this:
And that's that!
Don't be. Basically everything here is either about conditional probability (naive Bayes) or weights*inputs + bias > threshold calculations. The weights*inputs stuff is basic neural networks. Support vector machines, perceptrons, and kernel transformations are all about finding linearly separable classes (in some appropriate dimension).
For more info this will help http://en.m.wikipedia.org/wiki/Perceptron
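The weights*inputs + bias > threshold calculation really is that small. Here is a toy perceptron, not taken from any of the linked posts, that learns the AND function with the classic perceptron update rule:

```python
# A perceptron fires when weights·inputs + bias > 0
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Training data for logical AND: inputs and target outputs
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for _ in range(20):                      # perceptron update rule
    for x, target in data:
        err = target - predict(w, b, x)
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]
        b += lr * err

print([predict(w, b, x) for x, _ in data])
```

Because AND is linearly separable, the loop converges after a handful of passes; XOR, famously, would not converge with a single perceptron.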
If you are interested I really have enjoyed reading http://www.amazon.com/gp/aw/d/1420067184/ref=redir_mdp_mobile which explains really well a lot of machine learning stuff and demystifies the math
For some practical examples here are some blog posts I've written:
K means step by step in f#: http://onoffswitch.net/k-means-step-by-step-in-f/
Automatic fogbugz triage with naive bayes: http://onoffswitch.net/fogbugz-priority-prediction-naive-bayes/
I share the links mostly because it's nice to see a worked-through, practical example with code. In the end, a lot of these algorithms aren't that hard to implement with some matrix math.
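In the same spirit as the k-means post linked above, here is a minimal one-dimensional k-means loop (the data points and starting centroids are made up): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, repeat.

```python
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: bucket each point under its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its bucket
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

print(kmeans([1.0, 1.5, 2.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

With well-separated data like this it stabilizes after a couple of iterations; real implementations add a convergence check and smarter initialization.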
Anyways, hope this helps!
SDK reference - this book. Yes, it's old, but it tells you how to talk to the API in Python.
For the SDK you basically only need to learn GCloud SDK i.e. (gcloud, gsutil, bq and kubectl). However what you need to learn is dependent on whether you need compute (in which case gcloud + gsutil) or kubernetes (kubectl)
Save yourself some effort - make sure everything logs to stackdriver, so you can trace/debug as necessary.
Where practical, use Google's managed services, and look at the Marketplace options rather than configuring your own application instances. Go onto YouTube and watch Cloud Next sessions - they have lots of useful information. If you want to practice, get a Qwiklabs account and run through the many lab-based exercises on how to use GCP.
Great replacement for Statistics 101: The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics by Kristin H. Jarman.
The book was hilarious. She uses data analysis for a bunch of hilarious tasks, for example running a frequency distribution on "Yo Momma" jokes from the net, or using probability to jokingly track Bigfoot.
Frequency Distribution of Yo Momma Jokes
Back to Bigfoot: she uses probability to figure out everything in her Bigfoot search, from buying the right camera to the best places to look. It's hilarious.
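A frequency distribution like the one she runs on joke text is, at bottom, just word counting. A minimal Python sketch with invented sample lines standing in for the joke corpus:

```python
from collections import Counter

# Made-up lines standing in for the book's "Yo Momma" joke data set
jokes = [
    "yo momma is so old she knew Burger King when he was a prince",
    "yo momma is so nice she bakes cookies for everyone",
]

# Tokenize crudely and count word frequencies
words = " ".join(jokes).lower().split()
freq = Counter(words)
print(freq.most_common(3))
```

Real text analysis adds stopword removal and better tokenization, but the frequency table underneath is the same idea.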
>I quit my job and head off in search of the creature. With visions of fame and fortune running through my head, I cash in my savings, say goodbye to my family, and drive away in my newly purchased vintage mini-bus.
>
>As I leave the city limits, my thoughts turn to the task ahead. Bigfoot exists, there’s no doubt about it. He’s out there, waiting to be discovered. And who better than a statistician-turned-monster-hunter to discover him? I’ve got scientific objectivity, some newly acquired free time, and a really good GPS from Sergeant Bub’s Army Surplus store.
>
>It’s too late to get my job back, and my husband isn’t taking my calls, so it seems I have no choice but to continue my search. I decide I’m going to do it right. I may never find the proof I’m looking for, but I’ll give it my best, most scientific effort. Whatever evidence I find will stand up to the scrutiny of my ex-boss, my family, and all those newspaper reporters who’ll be pounding on my door, begging for interviews.
I'm pretty sure this book talked about how easy it was to scrape facebook before they locked down their API.
https://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615/
A lot of people probably did this. I remember a talk given in my city, the guy had a few thousand people signup to his app and got millions of entries to his graph database
https://maxdemarzi.com/2013/01/28/facebook-graph-search-with-cypher-and-neo4j/
Popular game devs probably got oodles of data. Must have been awesome having a social graph of the US
"Cassandra High Availability" by Robbie Strickland has been highly recommended by others. It's on the top of my list of tech books to check out.
I'd recommend reading the Dynamo and Bigtable papers if you haven't already as C* uses a lot of ideas from those papers.
The Last Pickle has a great blog that has a lot of nice deep dive articles.
The Cassandra wiki has some nice detailed pages describing the design and read/write paths.
If you have an O'Reilly subscription, there is a pretty good video with Ben and a few other Cloudera guys that's probably a good quickstart on cluster security: A Practitioner's Guide to Securing Hadoop.
Hadoop Operations is a good book but getting a bit outdated. I'd say just start setting up a cluster, and maybe read it if you have some free time.
I would check out Ben Fry's book first.
Then Beautiful Visualization.
There is also some good McCandless eye candy.
Manuel Lima did an amazing book on network visualization with excellent essays from other people: Visual Complexity. Network vis is very difficult, and if you want to "game up", understanding the taxonomy he built for network vis will give you a real perspective on the taxonomy in other types of vis.
There are things outside of the "take data and render visualization" world that are critical to data vis, imo. For moving data vis, start with the godfather, Muybridge
And look way way back for the long human history of data vis in cartography with stuff like Cartographia.
Hope to see some more books and discover a reading list on this thread! Great idea for a post.
If you want to be valuable to companies post graduation you should learn more about programming (design templates, how to write tests, how to go from a paper to code). I recommend this book as a good starting place. Once you're comfortable with how the different methods work, pick up this book.
How far in are you? I've been wanting to do more stuff with PyMC3. I recently bought Bayesian Methods for Hackers to do so but a Slack study group sounds very helpful if I were to order this book as well.
Edit: I'd just like to confirm this group does plenty of Python despite the book only using R and whatever Stan is.
I haven't personally read it, but the table of contents of this book looks like it has some potential to be a more modern version of what the Mitchell book aimed to provide.
I may be teaching a course in ML next year; I may order a copy to see if it's any good. Has anyone else looked at it?
You might look at the O'Reilly book "Mining the Social Web". I've found it very helpful and the author has even responded to my questions about how to get the virtual machine with the ipython notebooks up and running. Has code examples, explanations about the underlying technologies, and intros to things like natural language processing included.
http://www.amazon.com/Mining-Social-Web-Facebook-LinkedIn/dp/1449367615
I am a book guy myself, ebooks to be specific. I really liked R for Everyone: http://www.amazon.com/gp/aw/d/0321888030?pc_redir=1405401911&robot_redir=1
Pick one and stick with it. I really just recommend R for a number of reasons, but Python is also a good choice; R just works for me.
You need a personal project to work on. Mine was an end of the year report.
Hoowhee, how did this text get so... wally.
Bots are usually (fairly) simple programs, so Python will make it easy to get at all the common functionality you want (maybe looking for pattern matches in a piece of text, some math/analytics, saving files to your hard drive, converting images...) and in practice you'll mostly only be limited by what you can figure out how to do in your language of choice, regardless of the bot you're writing.
Wherever possible, you should use official APIs (which will often support Python these days), or at least third-party APIs that are built on top of the official ones. The APIs are sort of like a mediator between your bot and the service, or a menu of remotely-accessible functionality -- for twitter it might include things like "get the list of tweet IDs posted by this user ID in the last month" and "get the full text and metadata for this tweet ID." The set of functionality in that API determines what is and isn't possible for your bot to do (and depending on the service, it might actually hide a lot of complexity around sending messages to multiple servers, authenticating the request, etc)
When there's no API (or if the official API doesn't let you do something that you know should be possible) you usually have to switch to scraping. It's error-prone (could break any time) and frowned on by a lot of services (which is why you have to think about rate limiting and bans -- you may well be violating their terms of service, and either way you're using the service in unintended ways that might interfere with its normal functioning). "Unofficial APIs" are often just scrapers under the hood, tidied up into something that looks more like a normal API. I've written a ton of little scrapers in Python -- it really is a great tool for the job.
I suppose the other case is that some services can be built in standardized ways, so you don't need an official API from that particular company, because anyone else's API for that standard should be interoperable. That's common for databases, for example, but probably not the services you're talking about -- the popular web services are usually either proprietary, or a "standard" they invented that no one else actually uses, so you're basically stuck with the official API anyways.
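A minimal sketch of the scraping approach described above, using only the standard library; the HTML here is an inline snippet standing in for a fetched page, so the example stays self-contained (a real scraper would pull it over HTTP, respect rate limits, and expect the markup to change under it):

```python
from html.parser import HTMLParser

# Inline HTML standing in for an HTTP response body
PAGE = """
<html><body>
  <a href="/post/1">First post</a>
  <a href="/post/2">Second post</a>
</body></html>
"""

class LinkScraper(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

scraper = LinkScraper()
scraper.feed(PAGE)
print(scraper.links)
```

This is the fragile part of scraping in miniature: the scraper encodes assumptions about the page's structure, and any redesign of the page silently breaks them, which is why official APIs are preferable whenever they exist.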
For a lot of examples of integrating with public APIs, you can try Mining the Social Web from O'Reilly. I didn't actually spend a lot of time with that book personally (I wasn't expecting the sort of "cookbook" format with lots of examples and code) but it might cover some of the APIs that you're interested in.
I teach applied math, stats, and computation courses to B.S. degree seeking students. Two observations:
Okay, rant concluded, book recommendations! First, try Doing Data Science by O'Neil and Schutt. This assumes some knowledge of linear algebra, stats, and programming. Examples are given in R. I think this book is very good at bringing out the idea that data science involves both theory and experience, and at conveying the "feel" of working on data science problems.
Second, if you're looking for plenty of math, you might take a look at Machine Learning: A Probabilistic Perspective. This takes a closer look at the data modeling process, which is somewhat lacking in more CS-oriented texts on ML. Requires good knowledge of probability, obviously.
I'm afraid I can't answer your question specifically, but Stephen Marsland has written a great introductory level machine learning book that has all of the example code written in python.
I'm reading a book called Beautiful Visualization that I strongly recommend.
http://www.amazon.com/Beautiful-Visualization-Looking-through-Practice/dp/1449379869
Think about any time someone tells a story with stats. Is it misleading? How could it be more objective? What are the trade-offs? How could people misinterpret your visual?
It really depends on the book. Some books, like Automate the Boring Stuff with Python, are broken up into a series of projects that get progressively more advanced. On the other hand, O'Reilly's Hadoop: The Definitive Guide is more of a general walkthrough of principles and fundamentals. u/Updatebjarni gave really good advice.
I think this book is helpful.
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/ref=sr_1_1?ie=UTF8&qid=1418697151&sr=8-1
Cloudera usually has some good docs/blogs. Try poking around there.
http://blog.cloudera.com/blog/category/hadoop/
I'd recommend Guerrilla Analytics: A Practical Approach to Working with Data. I bought it on a recommendation from somewhere on reddit (possibly this sub). It provides good software-neutral guidance for building data capabilities within an organization.
You can find it here on amazon.com:
http://www.amazon.com/gp/product/1420067184?ie=UTF8&tag=oliviergrisel-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1420067184
(Please feel free to strip the reference to oliviergrisel-20 in the URL should you want not to tip me through the amazon affiliates program)
The book Doing Data Science, cowritten by Cathy O'Neil (of Weapons of Math Destruction), may be of interest to you.
> In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
I haven't read the whole thing yet, but it's well-written and has a nice survey of topics.
Doing Data Science by Cathy O'Neil
https://www.amazon.com/Doing-Data-Science-Straight-Frontline/dp/1449358659/ref=mp_s_a_1_3?crid=32WET5G295L42&keywords=doing+data+science&qid=1556807847&s=gateway&sprefix=doing+data+sci&sr=8-3
She does a great job explaining the process of doing data science and the principles behind it.
Mining the Social Web
Not exactly what you're looking for but it's very helpful, imo
Cassandra High Availability is a solid new book I'd recommend. It's very up to date with a section on CQL and Spark connector.
Beautiful Visualization Here
O'Reilly Designing Interfaces Here
From my experience this is an excellent book and I'm eagerly awaiting the second edition--which comes out in Feb 2017? :(
http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367
I picked this up on the recommendation of a colleague. Very useful.
https://www.amazon.co.uk/Guerrilla-Analytics-Practical-Approach-Working/dp/0128002182
Python and NumPy.
I will link some tutorials soon (Using a phone)
There's a book called:
Machine Learning: An Algorithmic Perspective
https://www.amazon.co.uk/Machine-Learning-Algorithmic-Perspective-Recognition/dp/1466583282
A useful book to have.
Marsland's ML book sounds like what you're looking for. https://smile.amazon.com/gp/product/1466583282/ref=oh_aui_search_detailpage?ie=UTF8&psc=1
I highly recommend Pandas for Everyone: Python Data Analysis by Daniel Chen. It's extremely practical and well-organized.
In terms of AI, I used this text for one of my AI/machine learning courses. Would recommend Artificial Intelligence: A Modern Approach
Some other suggested readings from that topic:
Introduction to Data Mining
Artificial Intelligence
Also check out fast.ai; they have 4 very good courses. Then this one is also good. Hugo Larochelle had a neural networks course, a bit older.
For books I'd also add The Hundred-Page Machine Learning Book and this one, probably the best practical book, but wait for the 2nd edition with TensorFlow 2.0: it has tf.keras.layers and the Sequential model; TF 2 basically includes Keras, so you're rid of all that sessions junk. There would also be this, this, and this. Don't waste time on Bengio's deep learning book; it's a superficial mess. Happy learning, and let's see more Romanians with ML and DL papers!
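The tf.keras Sequential API mentioned above looks roughly like this; the layer sizes and activations are arbitrary, chosen only to show the shape of the API (no sessions involved, just Python objects).

```python
# Minimal sketch of the tf.keras Sequential API in TensorFlow 2.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                        # 10 input features
    tf.keras.layers.Dense(32, activation="relu"),       # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(len(model.layers))  # two Dense layers
```

From here, training is a single `model.fit(x, y)` call rather than the explicit graph/session bookkeeping of TF 1.x.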
Correct. I felt this would be a fairly accessible recent example of data mining I could reference for sake of brevity, rather than going into detail on classification and prediction algorithms. The Target story was close enough to examples in the Introduction to Data Mining textbook we used at my University.
Two Python ML resources below. The former mixes math with working code. The latter is new and appears to be more of a guide to applying scikit-learn and/or milk specifically.
EDIT: All of the code from the first one is here: http://www-ist.massey.ac.nz/smarsland/MLBook.html