Best products from r/bigdata

We found 11 comments on r/bigdata discussing the most recommended products. We ran sentiment analysis on each of these comments to determine how redditors feel about different products. We found 10 products and ranked them based on the amount of positive reactions they received. Here are the top 20.

Top comments mentioning products on r/bigdata:

u/admiralwaffles · 1 point · r/bigdata

Glad it's useful! I'll copy some of a reply I gave to somebody who PM'd me about advice for a data science career, because it's pertinent to you:

You need to understand where you want to go -- more science-y or more business-y. See, science-y analytics is heavy on the stats, applying really advanced methods to glean some counterintuitive and/or non-obvious insight. Business-y type stuff is digging through the data to understand what it's telling you and to build a bit of a story to figure out what the business is doing, and then measuring success after something changes. Both have their value. Essentially: the science side tells you about the data, but the business side tells you how to make decisions based on the data. You'll fall somewhere on that spectrum, so just play to your strengths.

Once you've determined this, you need to learn a few things:

  1. Excel. Excel is the greatest data tool ever invented and I'll fight anybody who says different. Learn all about formulas, pivot tables, and whatnot. Excel is so deep, but really understand that Data tab. There is no better tool to connect to your data and just play around with it to figure out what you're looking at.
  2. Python, specifically NumPy and Pandas. Those are the two modules that will let you play with data very quickly. Pandas puts data in tables and allows you to operate on them. NumPy handles very complex calculations very quickly. Also, learn about Jupyter notebooks -- they're wonderful when doing analysis.
  3. Business! You may already know this, but you need to understand what the analytics are looking for. Data's only value is to use it to make better decisions. That's it. Data has no inherent value, and you need to understand how it can be leveraged. Even if you're more interested in the science of it, you still should have some grounding in how data is used.
  4. Bonus: GIS stuff. QGIS is a good tool and it's free. You can also do a ton of GIS stuff in Python with Shapely and Matplotlib (honestly, like 85% of my GIS work is in Python). This is especially helpful if your data has some really interesting geographic aspects. Just be cognizant that not every geographic insight is useful.

As for some resources, here are some courses I think would be good from MIT OpenCourseWare (full disclosure: I haven't sat through these specific courses, but these are the topics that are important):

  1. Statistical Thinking and Data Analysis
  2. Data, Models, and Decisions
  3. Communicating with Data (a little dated, but still valuable)

You may also want to read up on machine learning. I like the O'Reilly book on it, but there are tons of books out there about it now.

Hope that helps!

u/thunderdome · 3 points · r/bigdata

First of all, I don't know how large your company is or how much data exactly you are dealing with, but I'll give some general advice based on what you've said.

It sounds like your company operates the way many do: with individual data marts that were created organically for different needs as they arose. What you need is a centralized data warehouse that brings these data marts together into some type of star-schema setup. Pick up this book. It's basically an introduction to dimensional modeling, but more importantly it lays out how to navigate the politics of a large organization such as yours to get a data warehouse created. You will most likely need someone at the C-level to make the push, but the benefits are well worth it. It's not a small project: you'll need people to admit that a major overhaul is needed and to be willing to invest in it.
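
To make the star-schema idea a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`dim_customer`, `dim_date`, `fact_transaction`) and the sample rows are made up for illustration and are not taken from the book or from any real system. The point is only the shape: one fact table holding measurements, joined to small dimension tables that describe them.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    segment      TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    iso_date TEXT,
    year     INTEGER
);
CREATE TABLE fact_transaction (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);
""")

# Illustrative sample data (invented for this sketch).
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "Corporate"), (2, "Jane Doe", "Retail")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20150101, "2015-01-01", 2015), (20160101, "2016-01-01", 2016)],
)
conn.executemany(
    "INSERT INTO fact_transaction VALUES (?, ?, ?)",
    [(1, 20150101, 100.0), (1, 20160101, 250.0), (2, 20160101, 40.0)],
)


def total_by_segment_and_year(conn):
    """Join the fact table to its dimensions and aggregate the measure."""
    return conn.execute("""
        SELECT c.segment, d.year, SUM(f.amount)
        FROM fact_transaction AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_date     AS d ON d.date_key     = f.date_key
        GROUP BY c.segment, d.year
        ORDER BY c.segment, d.year
    """).fetchall()


rows = total_by_segment_and_year(conn)
for segment, year, total in rows:
    print(segment, year, total)
```

Every report then becomes a variation on the same query: pick the dimensions to group by and the measure to sum, instead of stitching separate data marts together by hand.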

>but getting these people to allow me to connect to the data is the most difficult thing (why is that?!).

Probably because whoever is managing the data realizes that there is no way an end user can make useful queries against the database, given its disjointed and poorly maintained state.

>In fact, I think they just hired some external enterprise data company, which I feel like is the wrong approach. Information is the most important thing. I feel like this is one thing a company should be managing in-house and not outsourcing. Is that completely wrong?

I would tend to agree; it's important to have people in the organization who understand the data structure fully. It's not unusual to bring in consultants to provide expertise on specific things (like the ETL process), but if I were a large financial company I would want our in-house team to have a good handle on the data coming in and being stored.

>Fields are constantly being overwritten so that history isn't maintained. People aren't notified of the overwritten name changes so existing reports aren't capturing all information properly.

At the very least convince them to create expiration fields so you can expire old records without losing information.
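
A rough sketch of what expiration fields could look like, again with Python's built-in `sqlite3`. The `account` table, its columns (`valid_from`, `expired_at`), and the helper functions are hypothetical names chosen for this example. Instead of overwriting a name, the current row is expired and a new row is inserted, so older reports can still look up the value that applied on a given date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: expired_at is NULL for the current version of a record.
conn.execute("""
CREATE TABLE account (
    account_id INTEGER,
    name       TEXT,
    valid_from TEXT,
    expired_at TEXT
)
""")


def rename_account(conn, account_id, new_name, changed_on):
    """Expire the current row and insert a new one instead of overwriting."""
    conn.execute(
        "UPDATE account SET expired_at = ? "
        "WHERE account_id = ? AND expired_at IS NULL",
        (changed_on, account_id),
    )
    conn.execute(
        "INSERT INTO account (account_id, name, valid_from, expired_at) "
        "VALUES (?, ?, ?, NULL)",
        (account_id, new_name, changed_on),
    )


def name_as_of(conn, account_id, as_of):
    """Return the name that was in effect on the given ISO date, if any."""
    row = conn.execute(
        "SELECT name FROM account "
        "WHERE account_id = ? AND valid_from <= ? "
        "AND (expired_at IS NULL OR expired_at > ?)",
        (account_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


# Invented sample data: one account that gets renamed once.
conn.execute(
    "INSERT INTO account VALUES (1, 'Old Fund Name', '2014-01-01', NULL)"
)
rename_account(conn, 1, "New Fund Name", "2015-06-01")

print(name_as_of(conn, 1, "2015-01-01"))
print(name_as_of(conn, 1, "2016-01-01"))
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly; a real warehouse would likely use proper date types and surrogate keys.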

u/Temujin_123 · 3 points · r/bigdata

I'm partial to Cloudera or Hortonworks. Both have training courses.

  • Cloudera (note they have a course tailored specifically for data analysts)
  • Hortonworks

I personally like good ol' books. I've taken the Coursera intro and Hive/Pig training courses and while they were invaluable, nothing quite replaces sitting down and working your way through books like Hadoop: The Definitive Guide or MongoDB: The Definitive Guide. I highly recommend Safari Books Online if you enjoy online reading. Perhaps some of your professional development money could go to paying for an account for that. For those who don't have the money for that, don't underestimate the usefulness of your public library. I currently have 3 books out from my local library on graph/network science (Linked is awesome and a great start for anyone interested in networks/graphs).

One thing I'll mention is that Hadoop has really become more of an ecosystem than a product: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase, Storm, etc. Just saying "Hadoop" is like just saying jQuery. Half the battle with jQuery is knowing how to use the best plugins. It's the same with Hadoop.

u/wtf1sh · 6 points · r/bigdata

This book has no math or programming. It concentrates on existing real-world applications of Big Data technologies and how they are helping to make life better. I recommend it a lot: http://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544227751/ref=sr_1_1?ie=UTF8&qid=1422216718&sr=8-1&keywords=big+data

u/doubleocherry · 1 point · r/bigdata

I wrote a book on this very subject. You can read the first chapter here for free.

u/pmrr · 2 points · r/bigdata

I wouldn't be scared of functional programming or Scala. If you've been writing PySpark jobs then you're probably using Python in a functional way itself. From the testing I've done with Spark and Scala it's almost impossible to not write functional Spark jobs as that's how Spark is designed. I would equally say a lot of Scala devs probably aren't using Scala in a purely functional way anyway.
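
As a small illustration of what "using Python in a functional way" can mean, here is a word count written as a chain of transformations in plain Python. This is only a stand-in: it does not use Spark at all, and the comments naming `flatMap`, `map`, and `reduceByKey` just point to the PySpark RDD methods that play a similar role. The sample `lines` are made up.

```python
from functools import reduce

# Invented input; in a Spark job this would be a distributed dataset.
lines = ["spark is functional", "scala is functional too"]

# Split every line into words and flatten (similar in spirit to flatMap).
words = [word for line in lines for word in line.split()]

# Turn each word into a (word, 1) pair (similar in spirit to map).
pairs = map(lambda word: (word, 1), words)


def merge_counts(acc, pair):
    """Pure function: returns a new dict rather than mutating the old one."""
    word, n = pair
    return {**acc, word: acc.get(word, 0) + n}


# Fold the pairs into per-word totals (similar in spirit to reduceByKey).
counts = reduce(merge_counts, pairs, {})
print(counts)
```

Each step takes a collection and returns a new one without side effects, which is the same habit Spark jobs encourage whether they are written in Python or Scala.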

I say just get stuck in using Spark/Scala based on your PySpark knowledge and see how far you get. If you get stuck and feel you need a cleaner understanding of functional programming / Scala, try this book:

http://www.amazon.co.uk/Functional-Programming-Scala-Paul-Chiusano/dp/1617290653/ref=sr_1_3?ie=UTF8&qid=1451992580&sr=8-3&keywords=scala

u/KuroSaru · 2 points · r/bigdata

All good books, but I'd recommend getting the In Action series of books:

http://www.amazon.co.uk/Hadoop-Action-Chuck-Lam/dp/1935182196

Just a little sad at the lack of a book on HBase in your collection :(!