Designing Data-Intensive Applications. The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. 2nd Edition (ebook)

Description

Data is at the center of many challenges in system design today. Difficult issues such as scalability, consistency, reliability, efficiency, and maintainability need to be resolved. In addition, there's an overwhelming variety of systems, including relational databases, NoSQL datastores, data warehouses, and data lakes. There are cloud services, on-premises services, and embedded databases. What are the right choices for your application? How do you make sense of all these buzzwords?

In this second edition, authors Martin Kleppmann and Chris Riccomini build on the foundation laid in the acclaimed first edition, integrating new technologies and emerging trends. You'll be guided through the maze of decisions and trade-offs involved in building a modern data system, learn how to choose the right tools for your needs, and understand the fundamentals of distributed systems.

Peer under the hood of the systems you already use, and learn to use them more effectively
Make informed decisions by identifying the strengths and weaknesses of different tools
Learn how major cloud services are designed for scalability, fault tolerance, and consistency
Understand the core principles upon which modern databases are built

Table of contents:

Preface
Who Should Read This Book?; What's New in the Second Edition?; References and Further Reading; Conventions Used in This Book; O'Reilly Online Learning; How to Contact Us; Acknowledgments

1. Trade-Offs in Data Systems Architecture
Operational Versus Analytical Systems; Characterizing Transaction Processing and Analytics; Data Warehousing; From data warehouse to data lake; Beyond the data lake; Systems of Record and Derived Data; Cloud Versus Self-Hosting; Pros and Cons of Cloud Services; Cloud Native System Architecture; Layering of cloud services; Separation of storage and compute; Operations in the Cloud Era; Distributed Versus Single-Node Systems; Problems with Distributed Systems; Microservices and Serverless; Cloud Computing Versus Supercomputing; Data Systems, Law, and Society; Summary

2. Defining Nonfunctional Requirements
Case Study: Social Network Home Timelines; Representing Users, Posts, and Follows; Materializing and Updating Timelines; Describing Performance; Latency and Response Time; Average, Median, and Percentiles; Use of Response Time Metrics; Reliability and Fault Tolerance; Fault Tolerance; Hardware and Software Faults; Tolerating hardware faults through redundancy; Software faults; Humans and Reliability; Scalability; Understanding Load; Shared-Memory, Shared-Disk, and Shared-Nothing Architectures; Principles for Scalability; Maintainability; Operability: Making Life Easy for Operations; Simplicity: Managing Complexity; Evolvability: Making Change Easy; Summary

3. Data Models and Query Languages
Relational Versus Document Models; The Object-Relational Mismatch; Object-relational mapping; The document data model for one-to-many relationships; Normalization, Denormalization, and Joins; Trade-offs of normalization; Denormalization in the social networking case study; Many-to-One and Many-to-Many Relationships; Stars and Snowflakes: Schemas for Analytics; When to Use Which Model; Schema flexibility in the document model; Data locality for reads and writes; Query languages for documents; Convergence of document and relational databases; Graph-Like Data Models; Property Graphs; The Cypher Query Language; Graph Queries in SQL; Triple Stores and SPARQL; The RDF data model; The SPARQL query language; Datalog: Recursive Relational Queries; GraphQL; Event Sourcing and CQRS; DataFrames, Matrices, and Arrays; Summary

4. Storage and Retrieval
Storage and Indexing for OLTP; Log-Structured Storage; The SSTable file format; Constructing and merging SSTables; Bloom filters; Compaction strategies; B-Trees; Making B-trees reliable; Using B-tree variants; Comparing B-Trees and LSM-Trees; Read performance; Sequential versus random writes; Write amplification; Disk space usage; Multicolumn and Secondary Indexes; Storing Values Within the Index; Keeping Everything in Memory; Data Storage for Analytics; Cloud Data Warehouses; Column-Oriented Storage; Column compression; Sort order in column storage; Writing to column-oriented storage; Query Execution: Compilation and Vectorization; Materialized Views and Data Cubes; Multidimensional and Full-Text Indexes; Full-Text Search; Vector Embeddings; Summary

5. Encoding and Evolution
Formats for Encoding Data; Language-Specific Formats; JSON, XML, and Binary Variants; JSON Schema; Binary encodings; Protocol Buffers; Field tags and schema evolution; Avro; The writer's schema and the reader's schema; Schema evolution rules; But what is the writer's schema?; Dynamically generated schemas; The Merits of Schemas; Modes of Dataflow; Dataflow Through Databases; Different values written at different times; Archival storage; Dataflow Through Services: REST and RPC; Web services; The problems with remote procedure calls; Load balancers, service discovery, and service meshes; Data encoding and evolution for RPC; Durable Execution and Workflows; Event-Driven Architectures; Message brokers; Distributed actor frameworks; Summary

6. Replication
Single-Leader Replication; Synchronous Versus Asynchronous Replication; Setting Up New Followers; Handling Node Outages; Follower failure: Catch-up recovery; Leader failure: Failover; Implementation of Replication Logs; Statement-based replication; Write-ahead log shipping; Logical (row-based) log replication; Problems with Replication Lag; Reading your own writes; Monotonic reads; Consistent prefix reads; Solutions for Replication Lag; Multi-Leader Replication; Geographically Distributed Operation; Multi-leader replication topologies; Problems with different topologies; Sync Engines and Local-First Software; Real-time collaboration, offline-first, and local-first apps; Pros and cons of sync engines; Dealing with Conflicting Writes; Conflict avoidance; Last write wins (discarding concurrent writes); Manual conflict resolution; Automatic conflict resolution; Conflict-free replicated datatypes and operational transformation; Types of conflict; Leaderless Replication; Writing to the Database When a Node Is Down; Catching up on missed writes; Using quorums for reading and writing; Understanding the limitations of quorum consistency; Monitoring staleness; Single-Leader Versus Leaderless Replication Performance; Multi-Region Operation; Detecting Concurrent Writes; The happens-before relation and concurrency; Capturing the happens-before relationship; Version vectors; Summary

7. Sharding
Pros and Cons of Sharding; Sharding for Multitenancy; Sharding of Key-Value Data; Sharding by Key Range; Rebalancing key-range sharded data; Sharding by Hash of Key; Hash modulo number of nodes; Fixed number of shards; Sharding by hash range; Consistent hashing; Skewed Workloads and Relieving Hot Spots; Operations: Automatic Versus Manual Rebalancing; Request Routing; Sharding and Secondary Indexes; Local Secondary Indexes; Global Secondary Indexes; Summary

8. Transactions
What Exactly Is a Transaction?; The Meaning of ACID; Atomicity; Consistency; Isolation; Durability; Single-Object and Multi-Object Operations; Single-object writes; The need for multi-object transactions; Handling errors and aborts; Weak Isolation Levels; Read Committed; No dirty reads; No dirty writes; Implementing read-committed; Snapshot Isolation and Repeatable Read; Multiversion concurrency control; Visibility rules for observing a consistent snapshot; Indexes and snapshot isolation; Snapshot isolation, repeatable read, and naming confusion; Preventing Lost Updates; Atomic write operations; Explicit locking; Automatically detecting lost updates; Conditional writes (compare-and-set); Conflict resolution and replication; Write Skew and Phantoms; Characterizing write skew; More examples of write skew; Phantoms causing write skew; Materializing conflicts; Serializability; Actual Serial Execution; Encapsulating transactions in stored procedures; Pros and cons of stored procedures; Sharding; Summary of serial execution; Two-Phase Locking; Implementation of 2PL; Performance of 2PL; Predicate locks; Index-range locks; Serializable Snapshot Isolation; Pessimistic versus optimistic concurrency control; Decisions based on an outdated premise; Detection of stale MVCC reads; Detection of writes that affect prior reads; Performance of serializable snapshot isolation; Distributed Transactions; Two-Phase Commit; A system of promises; Coordinator failure; Three-phase commit; Distributed Transactions Across Different Systems; Exactly-once message processing; XA transactions; Holding locks while in doubt; Recovering from coordinator failure; Problems with XA transactions; Database-Internal Distributed Transactions; Exactly-Once Message Processing Revisited; Summary

9. The Trouble with Distributed Systems
Faults and Partial Failures; Unreliable Networks; The Limitations of TCP; Network Faults in Practice; Fault Detection; Timeouts and Unbounded Delays; Network congestion and queueing; Variability of network delays; Synchronous Versus Asynchronous Networks; Can we not simply make network delays predictable?; Combining circuit switching and packet switching; Unreliable Clocks; Monotonic Versus Time-of-Day Clocks; Time-of-day clocks; Monotonic clocks; Clock Synchronization and Accuracy; Relying on Synchronized Clocks; Timestamps for ordering events; Clock readings with a confidence interval; Synchronized clocks for global snapshots; Process Pauses; Providing response time guarantees; Limiting the impact of garbage collection; Knowledge, Truth, and Lies; The Majority Rules; Distributed Locks and Leases; Fencing off zombies and delayed requests; Fencing with multiple replicas; Byzantine Faults; Uses of Byzantine fault tolerance; Weak forms of lying; System Model and Reality; Defining the correctness of an algorithm; Distinguishing between safety and liveness; Mapping system models to the real world; Formal Methods and Randomized Testing; Model checking and specification languages; Fault injection; Deterministic simulation testing; Summary

10. Consistency and Consensus
Linearizability; What Makes a System Linearizable?; Relying on Linearizability; Locking and leader election; Constraints and uniqueness guarantees; Cross-channel timing dependencies; Implementing Linearizable Systems; The Cost of Linearizability; The CAP theorem; Linearizability and network delays; ID Generators and Logical Clocks; Logical Clocks; Lamport timestamps; Hybrid logical clocks; Lamport/hybrid logical clocks versus vector clocks; Linearizable ID Generators; Implementing a linearizable ID generator; Enforcing constraints using logical clocks; Consensus; The Many Faces of Consensus; Single-value consensus; Compare-and-set as consensus; Shared logs as consensus; Fetch-and-add as consensus; Atomic commitment as consensus; Consensus in Practice; Using shared logs; From single-leader replication to consensus; Subtleties of consensus; Pros and cons of consensus; Coordination Services; Allocating work to nodes; Service discovery; Summary

11. Batch Processing
Batch Processing with Unix Tools; Simple Log Analysis; Chain of Commands Versus Custom Program; Sorting Versus In-Memory Aggregation; Batch Processing in Distributed Systems; Distributed Filesystems; Object Stores; Distributed Job Orchestration; Resource allocation; Scheduling workflows; Handling faults; Batch Processing Models; MapReduce; Dataflow Engines; Shuffling Data; Joins and Grouping; Query Languages; DataFrames; Batch Use Cases; Extract-Transform-Load; Analytics; Machine Learning; Serving Derived Data; Summary

12. Stream Processing
Transmitting Event Streams; Messaging Systems; Direct messaging from producers to consumers; Message brokers; Message brokers compared to databases; Multiple consumers; Acknowledgments and redelivery; Log-Based Message Brokers; Using logs for message storage; Logs compared to traditional messaging; Consumer offsets; Disk space usage; When consumers cannot keep up with producers; Replaying old messages; Databases and Streams; Keeping Systems in Sync; Change Data Capture; Implementing CDC; Initial snapshot; Log compaction; API support for change streams; CDC versus event sourcing; State, Streams, and Immutability; Advantages of immutable events; Deriving several views from the same event log; Concurrency control; Limitations of immutability; Processing Streams; Uses of Stream Processing; Complex event processing; Stream analytics; Maintaining materialized views; Search on streams; Event-driven architectures and RPC; Reasoning About Time; Event time versus processing time; Handling straggler events; Whose clock are you using, anyway?; Types of windows; Stream Joins; Stream-stream join (window join); Stream-table join (stream enrichment); Table-table join (materialized view maintenance); Time dependence of joins; Fault Tolerance; Microbatching and checkpointing; Atomic commit revisited; Idempotence; Rebuilding state after a failure; Summary

13. A Philosophy of Streaming Systems
Data Integration; Combining Specialized Tools by Deriving Data; Reasoning about dataflows; Derived data versus distributed transactions; The limits of total ordering; Ordering events to capture causality; Batch and Stream Processing; Maintaining derived state; Reprocessing data for application evolution; Unifying batch and stream processing; Unbundling Databases; Composing Data Storage Technologies; Creating an index; The meta-database of everything; Making unbundling work; Unbundled versus integrated systems; Designing Applications Around Dataflow; Application code as a derivation function; Separation of application code and state; Dataflow: Interplay between state changes and application code; Stream processors and services; Observing Derived State; Materialized views and caching; Stateful, offline-capable clients; Pushing state changes to clients; End-to-end event streams; Reads are events too; Multishard data processing; Aiming for Correctness; The End-to-End Argument for Databases; Exactly-once execution of an operation; Duplicate suppression; Uniquely identifying requests; The end-to-end argument; Applying end-to-end thinking in data systems; Enforcing Constraints; Uniqueness constraints require consensus; Uniqueness in log-based messaging; Multishard request processing; Timeliness and Integrity; Correctness of dataflow systems; Loosely interpreted constraints; Coordination-avoiding data systems; Trust, but Verify; Maintaining integrity in the face of software bugs; Don't just blindly trust what they promise; Designing for auditability; The end-to-end argument again; Tools for auditable data systems; Summary

14. Doing the Right Thing
Predictive Analytics; Bias and Discrimination; Responsibility and Accountability; Feedback Loops; Privacy and Tracking; Surveillance; Consent and Freedom of Choice; Privacy and Use of Data; Data as Assets and Power; Remembering the Industrial Revolution; Legislation and Self-Regulation; Summary

Glossary
Index

Specifications

Basic information

Author	Martin Kleppmann, Chris Riccomini
Publication year	2026

Technical details

Format	MOBI, EPUB
Number of pages	672

Additional information

Publisher	O'Reilly Media