#96 in Computers & technology books
Reddit mentions of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Sentiment score: 20
Reddit mentions: 30
We found 30 Reddit mentions of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Here are the top ones.
Specs:
| Spec | Value |
| --- | --- |
| Height | 9.17321 inches |
| Length | 7.00786 inches |
| Number of items | 1 |
| Weight | 2.15 pounds |
| Width | 1.2389739 inches |
Here's my list of the classics:
General Computing
Computer Science
Software Development
Case Studies
Employment
Language-Specific
C
Python
C#
C++
Java
Linux Shell Scripts
Web Development
Ruby and Rails
Assembly
> When is it okay to get complacent in your job and when is it not?
That's 100% up to you. Different strokes for different folks and all that.
> How important is it to constantly be working on or learning new stuff?
Extremely important. So much so that I give almost no pushback if my people wanna spend a few days per month at a conference/training. Company will even pay for most of it. Find a company that has a line-item in the budget for professional development -- dollars that are specifically intended to be spent by the end of the year on training, conferences, etc.
And that's not exclusive to software/data/compsci. Any skilled labor is changing constantly. Professional development is important.
> For the data engineers out there what skills should I perfect that will make me employable / desirable anywhere?
Become familiar with a variety of query languages and syntax. SQL, Elastic, AQL, N1QL, a time series DB -- the specific one doesn't really matter, just know more than "basic SQL joins" that you'll see in an undergrad database course.
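As a concrete bar for "more than basic SQL joins": window functions. Here is a minimal sketch using Python's bundled sqlite3 module (the table and data are invented for illustration; SQLite has supported window functions since version 3.25):

```python
import sqlite3

# Hypothetical orders table: rank each customer's orders by amount
# with a window function -- the kind of SQL beyond basic joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30), ('alice', 50), ('bob', 20), ('bob', 80);
""")
rows = conn.execute("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
""").fetchall()
```

The same `RANK() OVER (PARTITION BY ...)` idea carries over almost verbatim to Postgres, Spark SQL, and most analytic engines.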
Recommended reading: Designing Data Intensive Applications.
Designing Data-Intensive Applications seems to be the industry standard, although it's not Go specific.
Much of this stuff is learnable outside of work, too, at least at a superficially-passable level. Trust me.
Pick up a few seminal books and read them with vigor. That's all you need to do.
Here are some books I can personally recommend from my library:
Software Design
Software Industry
https://landing.google.com/sre/books/
SRE book is free, workbook is not.
https://cloud.google.com/solutions/best-practices-for-operating-containers
https://cloud.google.com/solutions/about-capacity-optimization-with-global-lb
Some of this is google cloud specific but the principles are the same with on-prem or a different provider. "State-of-the-art" deployments are usually learned by using best practices since each distributed app's deployment will vary. These books will help with best practices:
https://www.amazon.com/Microservices-Patterns-examples-Chris-Richardson/dp/1617294543/
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
https://www.amazon.com/Designing-Distributed-Systems-Patterns-Paradigms/dp/1491983647/
Designing Data Intensive Applications is your ticket here. It takes you through a lot of the algorithms and architecture present in the distributed technologies out there.
In a data engineering role you will probably just be munging data through a pipeline, making it useful for the analysts/scientists, so a book recommendation for that depends on the technology you will be using. Here are some of my favorite resources for the various tools I used in my experience as a Data Engineer:
Good luck in your new position!
i've been reading Designing Data-Intensive Applications by Martin Kleppmann and i would recommend it to all backend developers out there who want to step up their game.
(i also love that it's a language agnostic book)
Hey, DE here with lots of experience, and I was self-taught. I can be pretty specific about the subfield and what is and isn't necessary to know. In an inversion of the normal path, I did a mid-career M.Sc. in CS, so it was kind of amusing to see what was and was not relevant from traditional CS. Prestigious CS programs prepare you for an academic career in CS theory, but the down-and-dirty of moving and processing data uses only a specific subset. You can also get a lot done without the theory for a while.
If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering; at least look at their syllabus. They put you in front of employers and force you to finish a demo project. In terms of CS fundamentals, https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks.
Data Engineering is more fundamentally operational in nature than most software engineering. You care a lot about things happening reliably across multiple systems, and when using many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks. (Also I'm doing the inverse transition as you... trying to understand multivariate time series right now.)
I have trained jr coders to become data engineers and I focus a lot on operating system fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code; it's often much more I/O-centric. It's very useful to be quick on the command line too, as you are often shelling in to diagnose what's happening on this computer or that: checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 Linux sysadmin.
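A toy illustration of that I/O-centric triage style: rather than stepping through code, you scan logs and count symptoms. The log lines and the `peer=` field format below are made up for the example:

```python
import re
from collections import Counter

# Hypothetical pipeline logs: find which remote host is causing
# the most errors, the way you would with grep | sort | uniq -c.
log_lines = [
    "2024-01-01T00:00:01 INFO  task=ingest ok",
    "2024-01-01T00:00:02 ERROR task=ingest ConnectionResetError peer=10.0.0.5",
    "2024-01-01T00:00:03 ERROR task=load   Timeout peer=10.0.0.7",
    "2024-01-01T00:00:04 ERROR task=ingest ConnectionResetError peer=10.0.0.5",
]
errors = [ln for ln in log_lines if " ERROR " in ln]
by_peer = Counter(m.group(1) for ln in errors
                  if (m := re.search(r"peer=(\S+)", ln)))
worst_peer = by_peer.most_common(1)[0][0]  # the noisiest remote host
```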
It's good to be a language polyglot. (python, bash commands, SQL, Java)
Those massive Java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat-and-potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need for Spark is small compared to the full extent of the language. Mostly you are programmatically configuring a computation graph.
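That "configuring a computation graph, then executing it" style can be sketched in a few lines. The `Node` class here is a hypothetical stand-in to show the shape of the idea, not Spark's actual API:

```python
# Minimal lazy computation graph: building nodes does no work;
# evaluation only happens when you ask for a result (like Spark
# transformations vs. actions).
class Node:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps

    def run(self):
        # Recursively evaluate dependencies, then apply this node's fn.
        return self.fn(*(d.run() for d in self.deps))

source = Node(lambda: [1, 2, 3, 4])
mapped = Node(lambda xs: [x * x for x in xs], source)  # like rdd.map
total  = Node(lambda xs: sum(xs), mapped)              # like .reduce
# Nothing has executed yet; this triggers the whole pipeline:
result = total.run()
```

A real engine would also analyze the graph before running it (fusing steps, choosing partitioning), which is exactly where the database fundamentals below come back in.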
Kleppmann's book is a great way to skip to the relevant things in large system design.
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
It's very worth understanding how relational databases work, because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts all apply: how data is partitioned, written to disk, cached, and indexed, plus query optimization and transaction handling. Whether the input is SQL or Spark, you are usually generating the same few fundamental operations (google Relational Algebra) and asking the system to execute them the best way it knows how. We face the same data issues now that we did in the 70s, just at a larger scale.
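You can poke at one of those fundamentals, indexing and the query planner, with nothing but Python's bundled SQLite; the schema below is invented for illustration:

```python
import sqlite3

# Watch the planner pick an index instead of a full table scan --
# the "how does the database execute this?" habit the comment recommends.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, ts INTEGER);
    CREATE INDEX idx_user ON events(user_id);
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 7"
).fetchall()
# Each plan row's last column is a human-readable step description;
# with the index present it reports a SEARCH using idx_user.
uses_index = any("idx_user" in row[-1] for row in plan)
```

Drop the index and rerun it, and the plan degrades to a scan; the same reasoning scales up to partition pruning in distributed engines.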
Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.
https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638
I also saw this recently and by the ToC it covers lots of stuff.
https://www.amazon.com/Database-Internals-Deep-Distributed-Systems-ebook/dp/B07XW76VHZ/
But to keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single node relational databases systems. It's very clarifying to see things through that lens.
Designing Data Intensive Applications by Martin Kleppmann is a solid overview of the field and gives you plenty more references for further investigation. It starts on single-host databases and expands out to all kinds of distributed systems. Starting on single-host systems is important because it helps you appreciate the designs of the distributed systems that replaced them.
By now already dated, but a good top-to-bottom introduction to Postgres in the real world is PostgreSQL 9.0 High Performance.
Most of what Postgres does is exposed via system tables & views, for example pg_stat_activity & pg_locks. The rest of the documentation is great as well; give it a read.
If you are new to system administration & architecture, you may want to put Designing Data-Intensive Applications on your shopping list as well to broaden your horizon.
If you have Postgres-specific questions you can ask them here or reach out to the community.
This book is a very good treatment of distributed systems at a high level.
> scalability was a rare issue
Designing Data-Intensive Applications is a great book. Get yourself into some good personal habits, learn to cook efficiently, find a good gym near your new job, and spend some time sitting in the park reading that book.
Data Engineering is different everywhere and task dependent. The best advice I can give is have SQL be your second language. Then depending on your role or daily tasks you would be looking at extra materials.
General Insightful Reads:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
If you're doing backend/server side work, there's no better book than:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/
In terms of learning what it takes to level up, I highly recommend the following books:
The Senior Software Engineer: 11 Practices of an Effective Technical Leader https://www.amazon.com/dp/0990702804/
The Effective Engineer: How to Leverage Your Efforts In Software Engineering to Make a Disproportionate and Meaningful Impact https://www.amazon.com/dp/0996128107/
If you want to learn it systematically, consider the following:
The popular DDIA book: Designing Data-Intensive Applications gives you some insight into data systems, which are the main reason people study all those difficult distributed theories.
The underestimated textbook: Distributed Systems: An Algorithmic Approach shows you the reasoning behind the scene and gives you a taste of the algorithms used in distributed systems.
When you think it's finally over: Distributed Algorithms talks about the system models and algorithms in a more formal way.
this one is good https://amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
A few books that I have been going through: Programming Interviews Exposed, The Algorithm Design Manual, and Designing Data-Intensive Applications.
If you really want to push the envelope on TC, especially as a more experienced dev, you're going to need to ace the system design interview(s).
I'm still learning this myself, but a good book you might want to check out is Designing Data-Intensive Applications. I've also heard good things about Grokking the System Design Interview.
Good luck! I'm going through the studying process as well, it's brutal.
I strongly recommend this book: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
> Data Intensive systems book
Are you referring to this book? Seems like a good book according to Amazon.
Honestly, I just read a lot of blog posts. Sometimes for fun, but most of the time when I'm trying to solve a specific problem. I also make sure to document what I'm learning in github (like this (not mine)) and throw up any personal projects I work on. I also try to creatively mention in interviews that I'm self-taught and always ready to learn more. I know I've gotten lucky along the way, but I also spend hours and hours applying to jobs.
If you want hard resources: the Kimball approach was one of the first things I got familiar with, and Designing Data-Intensive Applications is a great modern-day resource. Both are pretty dry, but once you find yourself in a situation where their knowledge applies, you'll be thankful for it a thousand times over. I've even had the Kimball approach come up in an interview... so, you never know.
Edit: I also like to watch all of the PyCon videos that even remotely relate to data.
It's a normal thing in distributed systems. It's pretty logical :)
PostgreSQL is a classic single-server database, not a distributed database; it supports multiple replication strategies. I think the closest one to MongoDB's is this: https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION (notice: it's not even the default setting 😀)
Timescale uses PostgreSQL under the hood, so same thing as above.
There's little backend stuff in CTCI besides the parallelism/concurrency chapter, unfortunately. You may have heard of them, but I'm a big fan of Computer Systems: A Programmer's Perspective and Designing Data-Intensive Applications.
Robert Martin's books are a good read: "Clean Code" and his architecture book.
Learn design patterns: Head First Design Patterns: A Brain-Friendly Guide
Supplement with leetcode: Elements of programming interviews
You need some linux in your life: https://www.amazon.com/gp/product/0134277554/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1
Get some system design knowledge: https://www.amazon.com/gp/product/1449373321
You need some CI/CD knowledge: The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
Clean Architecture: https://www.amazon.com/dp/0134494164/ (also read Clean Code if you haven't).
Designing Data-Intensive Applications: https://www.amazon.com/dp/1449373321/
This one has some good info:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
And prob this:
http://erlang.org/download/armstrong_thesis_2003.pdf
Weird how I just finished the book Designing Data-Intensive Applications, and it ended with a section on ethics in computer science/big data that ties into this article really well. I'll add some of the sources from that section of the book here if people are curious. Cathy's book is in there, too.
There's also this fiasco: https://www.theguardian.com/technology/2018/jan/12/google-racism-ban-gorilla-black-people
We're definitely having our machine babies learn from our own racist, classist, sexist data and giving people with malicious intent access to unprecedented amounts of data.
Some forms of machine learning, like decision trees or random forests, have output that resembles a flow-chart which is nicer for humans because you can follow the decisions the algorithm is making. Deep learning with neural nets is real hard to understand. The model is basically just a ton of numbers.
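A toy contrast of those two styles (all thresholds and weights below are made up): the tree reads like a flow chart, while the "net" hides the same kind of decision inside opaque numbers:

```python
# A decision tree you can follow branch by branch, like a flow chart.
def tree_predict(income, age):
    if income > 50_000:          # each branch is a human-readable rule
        return "approve" if age > 25 else "review"
    return "deny"

# A one-neuron "net" over the same inputs: the logic lives in weights
# that no human can read a reason out of.
weights, bias = [0.00003, 0.02], -1.8

def net_predict(income, age):
    score = income * weights[0] + age * weights[1] + bias
    return "approve" if score > 0 else "deny"

tree_out = tree_predict(60_000, 30)
net_out = net_predict(60_000, 30)
```

Both can make the same call, but only the tree can tell you *why*, which is exactly the interpretability gap the paragraph above describes. Real deep nets have millions of such weights, not two.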
If you're curious about deep learning from the "how" side, Andrew Ng's deep learning courses on Coursera are really good: https://www.coursera.org/specializations/deep-learning
Andrew Ng is kind of like the Fred Rogers of machine learning. He also has a machine learning course on Coursera that I've heard is great.
On a side note: I am currently reading https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321. Loving it so far. The author clearly explains the difference between the relational & document models.
Highly recommended.
I'm a little late to the thread, but I work at a company that operates at a large scale, and I've found Designing Data Intensive Applications to be the best overview of modern techniques for scalable applications.