#96 in Computers & technology books
Reddit mentions of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Sentiment score: 20
Reddit mentions: 30
We found 30 Reddit mentions of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Here are the top ones.
Specs:
| Spec | Value |
| --- | --- |
| Height | 9.17321 inches |
| Length | 7.00786 inches |
| Number of items | 1 |
| Weight | 2.15 pounds |
| Width | 1.2389739 inches |
Here's my list of the classics:
General Computing
Computer Science
Software Development
Case Studies
Employment
Language-Specific
C
Python
C#
C++
Java
Linux Shell Scripts
Web Development
Ruby and Rails
Assembly
> When is it okay to get complacent in your job and when is it not?
That's 100% up to you. Different strokes for different folks and all that.
> How important is it to constantly be working on or learning new stuff?
Extremely important. So much so that I give almost no pushback if my people wanna spend a few days per month at a conference/training. Company will even pay for most of it. Find a company that has a line-item in the budget for professional development -- dollars that are specifically intended to be spent by the end of the year on training, conferences, etc.
And that's not exclusive to software/data/compsci. Any skilled labor is changing constantly. Professional development is important.
> For the data engineers out there what skills should I perfect that will make me employable / desirable anywhere?
Become familiar with a variety of query languages and syntax. SQL, Elastic, AQL, N1QL, a time series DB -- the specific one doesn't really matter, just know more than "basic SQL joins" that you'll see in an undergrad database course.
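As a concrete bar for "more than basic SQL joins": window functions. Here is a minimal sketch using Python's bundled sqlite3 module (the table and data are invented for illustration; SQLite has supported window functions since version 3.25):

```python
import sqlite3

# Hypothetical orders table: rank each customer's orders by amount
# with a window function -- the kind of SQL beyond basic joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30), ('alice', 50), ('bob', 20), ('bob', 80);
""")
rows = conn.execute("""
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM orders
""").fetchall()
```

The same `RANK() OVER (PARTITION BY ...)` idea carries over almost verbatim to Postgres, Spark SQL, and most analytic engines.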
Recommended reading: Designing Data Intensive Applications.
Designing Data-Intensive Applications seems to be the industry standard, although it's not Go specific.
Much of this stuff is learnable outside of work, too, at least at a superficially-passable level. Trust me.
Pick up a few seminal books and read them with vigor. That's all you need to do.
Here are some books I can personally recommend from my library:
Software Design
Software Industry
https://landing.google.com/sre/books/
SRE book is free, workbook is not.
https://cloud.google.com/solutions/best-practices-for-operating-containers
https://cloud.google.com/solutions/about-capacity-optimization-with-global-lb
Some of this is google cloud specific but the principles are the same with on-prem or a different provider. "State-of-the-art" deployments are usually learned by using best practices since each distributed app's deployment will vary. These books will help with best practices:
https://www.amazon.com/Microservices-Patterns-examples-Chris-Richardson/dp/1617294543/
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
https://www.amazon.com/Designing-Distributed-Systems-Patterns-Paradigms/dp/1491983647/
Designing Data Intensive Applications is your ticket here. It takes you through a lot of the algorithms and architecture present in the distributed technologies out there.
In a data engineering role you will probably just be munging data through a pipeline, making it useful for the analysts/scientists, so a book recommendation for that depends on the technology you will be using. Here are some of my favorite resources for the various tools I used in my experience as a Data Engineer:
Good luck in your new position!
i've been reading Designing Data-Intensive Applications by Martin Kleppmann and i would recommend it to all backend developers out there who want to step up their game.
(i also love that it's a language agnostic book)
Hey, DE here with lots of experience, and I was self-taught. I can be pretty specific about the subfield and what is and isn't necessary to know. In an inversion of the normal path, I did a mid-career M.Sc. in CS, so it was kind of amusing to see what was and was not relevant from traditional CS. Prestigious CS programs prepare you for an academic career in CS theory, but the down-and-dirty of moving and processing data uses only a specific subset. You can also get a lot done without the theory for a while.
If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering; at least look at their syllabus. They put you in front of employers and force you to finish a demo project. In terms of CS fundamentals, https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks.
Data Engineering is more fundamentally operational in nature than most software engineering. You care a lot about things happening reliably across multiple systems, and when using many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks. (Also I'm doing the inverse transition as you... trying to understand multivariate time series right now.)
I have trained jr coders to become data engineers and I focus a lot on operating system fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code; it's often much more I/O-centric. It's very useful to be quick on the command line too, as you are often shelling in to diagnose what's happening on this computer or that: checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 Linux sysadmin.
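A toy illustration of that I/O-centric triage style: rather than stepping through code, you scan logs and count symptoms. The log lines and the `peer=` field format below are made up for the example:

```python
import re
from collections import Counter

# Hypothetical pipeline logs: find which remote host is causing
# the most errors, the way you would with grep | sort | uniq -c.
log_lines = [
    "2024-01-01T00:00:01 INFO  task=ingest ok",
    "2024-01-01T00:00:02 ERROR task=ingest ConnectionResetError peer=10.0.0.5",
    "2024-01-01T00:00:03 ERROR task=load   Timeout peer=10.0.0.7",
    "2024-01-01T00:00:04 ERROR task=ingest ConnectionResetError peer=10.0.0.5",
]
errors = [ln for ln in log_lines if " ERROR " in ln]
by_peer = Counter(m.group(1) for ln in errors
                  if (m := re.search(r"peer=(\S+)", ln)))
worst_peer = by_peer.most_common(1)[0][0]  # the noisiest remote host
```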
It's good to be a language polyglot. (python, bash commands, SQL, Java)
Those massive Java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat-and-potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need for Spark is small compared to the full extent of the language. Mostly you are programmatically configuring a computation graph.
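That "configuring a computation graph, then executing it" style can be sketched in a few lines. The `Node` class here is a hypothetical stand-in to show the shape of the idea, not Spark's actual API:

```python
# Minimal lazy computation graph: building nodes does no work;
# evaluation only happens when you ask for a result (like Spark
# transformations vs. actions).
class Node:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps

    def run(self):
        # Recursively evaluate dependencies, then apply this node's fn.
        return self.fn(*(d.run() for d in self.deps))

source = Node(lambda: [1, 2, 3, 4])
mapped = Node(lambda xs: [x * x for x in xs], source)  # like rdd.map
total  = Node(lambda xs: sum(xs), mapped)              # like .reduce
# Nothing has executed yet; this triggers the whole pipeline:
result = total.run()
```

A real engine would also analyze the graph before running it (fusing steps, choosing partitioning), which is exactly where the database fundamentals below come back in.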
Kleppmann's book is a great way to skip to the relevant things in large system design.
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
It's very worth understanding how relational databases work, because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts all apply: how data is partitioned, written to disk, cached, and indexed, plus query optimization and transaction handling. Whether the input is SQL or Spark, you are usually generating the same few fundamental operations (google Relational Algebra) and asking the system to execute them the best way it knows how. We face the same data issues now that we did in the 70s, just at a larger scale.
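You can poke at one of those fundamentals, indexing and the query planner, with nothing but Python's bundled SQLite; the schema below is invented for illustration:

```python
import sqlite3

# Watch the planner pick an index instead of a full table scan --
# the "how does the database execute this?" habit the comment recommends.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, ts INTEGER);
    CREATE INDEX idx_user ON events(user_id);
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 7"
).fetchall()
# Each plan row's last column is a human-readable step description;
# with the index present it reports a SEARCH using idx_user.
uses_index = any("idx_user" in row[-1] for row in plan)
```

Drop the index and rerun it, and the plan degrades to a scan; the same reasoning scales up to partition pruning in distributed engines.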
Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.
https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638
I also saw this recently and by the ToC it covers lots of stuff.
https://www.amazon.com/Database-Internals-Deep-Distributed-Systems-ebook/dp/B07XW76VHZ/
But to keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single node relational databases systems. It's very clarifying to see things through that lens.
Designing Data Intensive Applications by Martin Kleppmann is a solid overview of the field and gives you plenty more references for further investigation. It starts on single-host databases and expands out to all kinds of distributed systems. Starting on single-host systems is important because it helps you appreciate the designs of the distributed systems that replaced them.
By now already dated, but a good top-to-bottom introduction to Postgres in the real world is PostgreSQL 9.0 High Performance.
Most of what Postgres does is exposed via system tables & views, for example pg_stat_activity & pg_locks. The rest of the documentation is great as well; give it a read.
If you are new to system administration & architecture, you may want to put Designing Data-Intensive Applications on your shopping list as well to broaden your horizon.
If you have Postgres-specific questions you can ask them here or reach out to the community.
This book is a very good treatment of distributed systems at a high level.
> scalability was a rare issue
Designing Data-Intensive Applications is a great book. Get yourself into some good personal habits, learn to cook efficiently, find a good gym near your new job, and spend some time sitting in the park reading that book.
Data Engineering is different everywhere and task dependent. The best advice I can give is have SQL be your second language. Then depending on your role or daily tasks you would be looking at extra materials.
General Insightful Reads:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
If you're doing backend/server side work, there's no better book than:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/
In terms of learning what it takes to level up, I highly recommend the following books:
The Senior Software Engineer: 11 Practices of an Effective Technical Leader https://www.amazon.com/dp/0990702804/
The Effective Engineer: How to Leverage Your Efforts In Software Engineering to Make a Disproportionate and Meaningful Impact https://www.amazon.com/dp/0996128107/
If you want to learn it systematically, consider the following:
The popular DDIA book: Designing Data-Intensive Applications gives you some insight into data systems, which are the main reason people study all those difficult distributed theories.
The underestimated textbook: Distributed Systems: An Algorithmic Approach shows you the reasoning behind the scene and gives you a taste of the algorithms used in distributed systems.
When you think it's finally over: Distributed Algorithms talks about the system models and algorithms in a more formal way.
this one is good https://amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
A few books that I have been going through: Programming Interviews Exposed, The Algorithm Design Manual, and Designing Data-Intensive Applications.
If you really want to push the envelope on TC, especially as a more experienced dev, you're going to need to ace the system design interview(s).
I'm still learning this myself, but a good book you might want to check out is Designing Data-Intensive Applications. I've also heard good things about Grokking the System Design Interview.
Good luck! I'm going through the studying process as well, it's brutal.
I strongly recommend this book: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
> Data Intensive systems book
Are you referring to this book? Seems like a good book according to Amazon.
Honestly, I just read a lot of blog posts. Sometimes for fun, but most of the time when I'm trying to solve a specific problem. I also make sure to document what I'm learning in github (like this (not mine)) and throw up any personal projects I work on. I also try to creatively mention in interviews that I'm self-taught and always ready to learn more. I know I've gotten lucky along the way, but I also spend hours and hours applying to jobs.
If you want hard resources: the Kimball approach was one of the first things I got familiar with, and Designing Data-Intensive Applications is a great modern-day resource. Both are pretty dry, but once you find yourself in a situation where their knowledge applies, you'll be thankful for it a thousand times over. I've even had the Kimball approach come up in an interview... so, you never know.
Edit: I also like to watch all of the PyCon videos that even remotely relate to data.
It's a normal thing in distributed systems. It's pretty logical :)
PostgreSQL is a classic single-server database, not a distributed database; it supports multiple replication strategies. I think the closest one to MongoDB's is this: https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION (notice: it's not even the default setting 😀)
Timescale uses PostgreSQL under the hood, so same thing as above.
There's little backend stuff in CTCI besides the parallelism/concurrency chapter, unfortunately. You may have heard of them, but I'm a big fan of Computer Systems: A Programmer's Perspective and Designing Data-Intensive Applications.
Robert Martin's books are a good read: "Clean Code" and his architecture book.
Learn design patterns: Head First Design Patterns: A Brain-Friendly Guide
Supplement with leetcode: Elements of programming interviews
You need some linux in your life: https://www.amazon.com/gp/product/0134277554/ref=ox_sc_act_title_1?smid=ATVPDKIKX0DER&psc=1
Get some system design knowledge: https://www.amazon.com/gp/product/1449373321
You need some CI/CD knowledge: The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
Clean Architecture: https://www.amazon.com/dp/0134494164/ (also read Clean Code if you haven't).
Designing Data-Intensive Applications: https://www.amazon.com/dp/1449373321/
This one has some good info:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
And prob this:
http://erlang.org/download/armstrong_thesis_2003.pdf
Weird how I just finished the book Designing Data-Intensive Applications, and it ended with a section on ethics in computer science/big data that ties into this article really well. I'll add some of the sources from that section of the book here if people are curious. Cathy's book is in there, too.
There's also this fiasco: https://www.theguardian.com/technology/2018/jan/12/google-racism-ban-gorilla-black-people
We're definitely having our machine babies learn from our own racist, classist, sexist data and giving people with malicious intent access to unprecedented amounts of data.
Some forms of machine learning, like decision trees or random forests, have output that resembles a flow-chart which is nicer for humans because you can follow the decisions the algorithm is making. Deep learning with neural nets is real hard to understand. The model is basically just a ton of numbers.
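A toy contrast of those two styles (all thresholds and weights below are made up): the tree reads like a flow chart, while the "net" hides the same kind of decision inside opaque numbers:

```python
# A decision tree you can follow branch by branch, like a flow chart.
def tree_predict(income, age):
    if income > 50_000:          # each branch is a human-readable rule
        return "approve" if age > 25 else "review"
    return "deny"

# A one-neuron "net" over the same inputs: the logic lives in weights
# that no human can read a reason out of.
weights, bias = [0.00003, 0.02], -1.8

def net_predict(income, age):
    score = income * weights[0] + age * weights[1] + bias
    return "approve" if score > 0 else "deny"

tree_out = tree_predict(60_000, 30)
net_out = net_predict(60_000, 30)
```

Both can make the same call, but only the tree can tell you *why*, which is exactly the interpretability gap the paragraph above describes. Real deep nets have millions of such weights, not two.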
If you're curious about deep learning from the "how" side, Andrew Ng's deep learning courses on Coursera are really good: https://www.coursera.org/specializations/deep-learning
Andrew Ng is kind of like the Fred Rogers of machine learning. He also has a machine learning course on Coursera that I've heard is great.
On a side note: I am currently reading https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321. Loving it so far. The author clearly explains the difference between the relational & document models.
Highly recommended.
I'm a little late to the thread, but I work at a company that operates at a large scale, and I've found Designing Data Intensive Applications to be the best overview of modern techniques for scalable applications.