#2 in Data warehousing books

Reddit mentions of The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data

Sentiment score: 4
Reddit mentions: 7

We found 7 Reddit mentions of The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Here are the top ones.

The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data
Buying options: View on Amazon.com
Features:
  • Wiley
Specs:
  • Height: 9.200769 inches
  • Length: 7.299198 inches
  • Number of items: 1
  • Weight: 2.00179733896 pounds
  • Width: 1.200785 inches


Found 7 comments on The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data:

u/thibaut_barrere · 6 pointsr/rails

It's hard to provide a full answer based only on the available information, but roughly there are many different ways to achieve what you have in mind.

Some families of ways to handle this:

  • ETL (Extract, Transform, Load) -> extract from the CSV; during the process, transform / clean / remap any field (if you don't do incremental reloads, you can also dedupe in the transform step); then load a "clean dataset" into Postgres, Elasticsearch, etc. (see the sketch after this list)
  • ELT (Extract, Load, Transform) -> extract from the CSV, dump it mostly unchanged straight into ES or PG, then modify it there (or query a "temporary dataset" to do a sort of in-database ETL: clean, filter, etc., and pour the data into a more "final" destination in the same datastore)
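To make the ETL family concrete, here is a minimal sketch in plain Ruby, assuming the `csv` stdlib and the `sequel` gem (the connection string, table, and field names are illustrative, not from the thread):

```ruby
require 'csv'
require 'sequel'

# Hypothetical connection string and table; adjust to your setup.
DB = Sequel.connect('postgres://localhost/myapp_development')

# Extract: stream rows from the CSV instead of loading it all into memory.
CSV.foreach('contacts.csv', headers: true, header_converters: :symbol) do |row|
  record = row.to_h

  # Transform: clean / remap fields.
  record[:email] = record[:email]&.strip&.downcase
  next if record[:email].nil? || record[:email].empty?

  # Load: insert the cleaned row into the destination table.
  DB[:contacts].insert(record)
end
```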

The most adequate way to do this depends on various factors:

  • Do you want to deduplicate inside a single CSV (which can be done in memory before loading), or across CSVs? In the latter case you need a business key with a unique constraint and "upserts", or at least a check on the key's presence before inserting rows (see the sketch after this list)
  • Do you have many different CSV formats, or are they all quite similar? (If they differ a lot, it's often easier to go ETL, so you have a very flexible & well-tested way to verify the mappings, conversions, etc.)
  • Are the outputs mostly similar with a few differing fields, or completely different?
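For the cross-CSV case, a hedged sketch of the "business key + upsert" approach, using Sequel's PostgreSQL `insert_conflict` support (the table and key names are invented for the example):

```ruby
require 'sequel'

DB = Sequel.connect('postgres://localhost/myapp_development')  # hypothetical

# Assumes a unique constraint on the business key, e.g.:
#   ALTER TABLE contacts ADD CONSTRAINT contacts_email_key UNIQUE (email);
def upsert_contact(record)
  DB[:contacts]
    .insert_conflict(
      target: :email,                               # the business key
      update: { name: Sequel[:excluded][:name] }    # refresh non-key fields on conflict
    )
    .insert(record)
end

upsert_contact(email: 'ada@example.com', name: 'Ada')
upsert_contact(email: 'ada@example.com', name: 'Ada L.')  # updates, does not duplicate
```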

Finally, here are some tools which can help:

  • My own gem https://www.kiba-etl.org (which I use both for ETL & ELT scenarios)
  • Ruby Sequel https://github.com/jeremyevans/sequel (which is actually used by my upcoming Kiba Pro offer for database related tasks)
  • pgloader https://github.com/dimitri/pgloader
  • embulk http://www.embulk.org/docs/
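For illustration, a minimal Kiba job following its documented source / transform / destination model (the source and destination classes, file name, and cleaning steps here are invented for the example):

```ruby
require 'csv'
require 'kiba'

# A Kiba source only needs #each, yielding one row (a hash) at a time.
class CsvSource
  def initialize(filename)
    @filename = filename
  end

  def each
    CSV.foreach(@filename, headers: true, header_converters: :symbol) do |row|
      yield row.to_h
    end
  end
end

# A Kiba destination only needs #write(row).
class ArrayDestination
  def initialize(rows)
    @rows = rows
  end

  def write(row)
    @rows << row
  end
end

output = []

job = Kiba.parse do
  source CsvSource, 'contacts.csv'
  transform { |row| row.transform_values { |v| v.is_a?(String) ? v.strip : v } }
  transform { |row| row[:email] ? row : nil }  # returning nil dismisses the row
  destination ArrayDestination, output
end

Kiba.run(job)
```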

If you are in this for the long term, it can also be worth reading a book I often mention: the ETL book by Ralph Kimball. While quite old, it provides interesting patterns.

Happy to detail this more if you provide more input!
u/johngabbradley · 4 pointsr/BusinessIntelligence

Dimensional modeling is undervalued in today's climate. Any complex model at large scale will be more effective when modeled optimally.

https://www.amazon.com/Data-Warehouse-ETL-Toolkit-Techniques-Extracting/dp/0764567578

u/welshfargo · 2 pointsr/Database

Informatica is a widely used ETL tool, but more important is understanding the challenges: how to design staging tables, what update strategy to use, how to design restartable jobs, and so on. This book is about data warehousing, but the techniques are applicable to most scenarios.
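As a hedged sketch of the staging-table / restartable-job idea mentioned above (Ruby with the Sequel gem; the tables and columns are invented for illustration):

```ruby
require 'sequel'

DB = Sequel.connect('postgres://localhost/warehouse')  # hypothetical

# Restartable load: stage the raw batch first, then swap it into the
# target inside one transaction. If the job dies mid-run, rerunning it
# clears the staging table and starts over; the target table is never
# left half-updated.
def load_batch(batch_id, rows)
  DB[:staging_orders].delete               # idempotent: discard any partial prior run
  DB[:staging_orders].multi_insert(rows)   # bulk-load the raw batch

  DB.transaction do
    DB[:orders].where(batch_id: batch_id).delete
    DB[:orders].insert([:batch_id, :order_id, :amount],
                       DB[:staging_orders].select(:batch_id, :order_id, :amount))
  end
end
```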

u/sathley90 · 2 pointsr/databases

Also The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data https://www.amazon.com/dp/0764567578/ref=cm_sw_r_cp_apa_lnUSAbDC5NK4X

u/[deleted] · 1 pointr/Database

Beautiful. Thank you. Everything you just described here falls in the realm of ETL? I just ordered a book on the subject. The logic does make sense explained this way.

http://www.amazon.com/The-Data-Warehouse-ETL-Toolkit/dp/0764567578/ref=sr_1_1?ie=UTF8&qid=1331798466&sr=8-1

u/hagemajr · 1 pointr/AskReddit

Awesome! I kind of fell into the job. I was initially hired as a web developer, and didn't even know what BI was, and then got recruited by one of the BI managers and fell in love. To me, it is one of the few places in IT where what you create will directly impact the choices a business will make.

Most of what I do is ETL work (taking data from multiple systems and loading it into a single warehouse), with a few cubes (multidimensional data analysis) and SSRS report models (a logical data model built on top of a relational data store, used for ad hoc report creation). I also do a bit of report design, and lots of InfoPath 2010 + SharePoint 2010 custom development.

We use the entire Microsoft BI stack here, so SQL Server Integration (SSIS), Analysis (SSAS), and Reporting Services (SSRS). Microsoft is definitely up and coming in the BI world, but you might want to try to familiarize yourself with Oracle BI, Business Objects, or Cognos. Unfortunately, most of these tools are very expensive and not easy to get up and running. I would suggest you familiarize yourself with the concepts, and then you will be able to use any tool to apply them.

For data warehousing, check out the Kimball books:

Here and here and here

For reporting, get good with data visualizations, anything by Few or Tufte, like:

Here and here

For integration, check these out:

Here and here

Also, if you're interested in Microsoft BI (SSIS, SSAS, SSRS) check out this site. It has some awesome videos around SSAS that are easy to follow along with.

Also, check out the MSDN BI Blog: http://blogs.msdn.com/b/bi/

Currently at work, but if you have more questions, feel free to shoot me a message!