Cost-Effective Data Pipelines Chorzów

From 186,15 Nearest: 21 km

Number of offers: 1

Store offer

Description

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?

With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.

By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products. This book helps you:
  • Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
  • Minimize waste without sacrificing performance by rightsizing compute resources
  • Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
  • Set up development and test environments that minimize cloud service dependencies
  • Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
  • Improve data quality and pipeline operation through validation and testing

Table of contents:

Preface: Who This Book Is For; What You Will Learn; What This Book Is Not; Running Example; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Acknowledgments

1. Designing Compute for Data Pipelines: Understanding Availability of Cloud Compute; Outages; Capacity Limits; Account Limits; Infrastructure; Leveraging Different Purchasing Options in Pipeline Design; On Demand; Spot/Interruptible; Contractual Discounts; Contractual Discounts in the Real World: A Cautionary Tale; Requirements Gathering for Compute Design; Business Requirements; Architectural Requirements; Requirements-Gathering Example: HoD Batch Ingest; Data; Performance; Purchasing options; Benchmarking; Instance Family Identification; Cluster Sizing; Monitoring; Cluster resource utilization; Data processing engine introspection; Benchmarking Example; Undersized; Oversized; Right-Sized; Summary; Recommended Readings

2. Responding to Changes in Demand by Scaling Compute: Identifying Scaling Opportunities; Variation in Data Pipelines; Scaling Metrics; Pipeline Scaling Example; Designing for Scaling; Implementing Scaling Plans; Scaling Mechanics; Common Autoscaling Pitfalls; Scale-out threshold is too high; Flapping; Over-scaling; Autoscaling Example; Summary; Recommended Readings

3. Data Organization in the Cloud: Cloud Storage Costs; Storage at Rest; Egress; Data Access; Cloud Storage Organization; Storage Bucket Strategies; Lifecycle Configurations; File Structure Design; File Formats; Partitioning; Compaction; Summary; Recommended Readings

4. Economical Pipeline Fundamentals: Idempotency; Preventing Data Duplication; Tolerating Data Duplication; Checkpointing; Automatic Retries; Retry Considerations; Retry Levels in Data Pipelines; Data Validation; Validating Data Characteristics; Schemas; Creating schemas; Validating with schemas; Keeping schemas up to date; Summary

5. Setting Up Effective Development Environments: Environments; Software Environments; Data Environments; Data Pipeline Environments; Environment Planning; Design; Costs; Environment uptime; Local Development; Containers; Container lifecycle; Container composition; Running local code against production dependencies; Using environment variables; Sharing configurations; Consolidating common settings; Resource Dependency Reduction; Resource Cleanup; Summary

6. Software Development Strategies: Managing Different Coding Environments; Example: A Multimodal Pipeline; Notebooks; Web UIs; Example: How Code Becomes Difficult to Change; Modular Design; Single Responsibility; Dependency Inversion; Supporting multicloud; Plugging in other data sinks; Testing; Modular Design with DataFrames; Configurable Design; Summary; Recommended Readings

7. Unit Testing: The Role of Unit Testing in Data Pipelines; Unit Testing Overview; Example: Identifying Unit Testing Needs; Pipeline Areas to Unit-Test; Data Logic; Connections; Observability; Data Modification Processes; Cloud Components; Working with Dependencies; Interfaces; Data; Example: Unit Testing Plan; Identifying Components to Test; Identifying Dependencies; Summary

8. Mocks: Considerations for Replacing Dependencies; Placement; Dependency Stability; Complexity Versus Criticality; Mocking Generic Interfaces; Responses; Requests; Connectivity; Mocking Cloud Services; Building Your Own Mocks; Mocking with Moto; Testing with Databases; Test Database Example; Working with Test Databases; Summary; Further Exploration; More Moto Mocks; Mock Placement

9. Data for Testing: Working with Live Data; Benefits; Challenges; Working with Synthetic Data; Benefits; Challenges; Is Synthetic Data the Right Approach?; Manual Data Generation; Automated Data Generation; Synthetic Data Libraries; Customizing generated data; Distributing cases in test data; Schema-Driven Generation; Mapping data generation to schemas; Example: catching schema change impacts with CI tests; Property-Based Testing; Summary

10. Logging: Logging Costs; Impact of Scale; Impact of Cloud Storage Elasticity; Reducing Logging Costs; Effective Logging; Summary

11. Finding Your Way with Monitoring: Costs of Inadequate Monitoring; Getting Lost in the Woods; Navigation to the Rescue; Job metrics; Autoscaling events; Job runtime alerting; Error metrics; System Monitoring; Data Volume; Throughput; Consumer Lag; Worker Utilization; Resource Monitoring; Understanding the Bounds; Understanding Reliability Impacts; Pipeline Performance; Pipeline Stage Duration; Profiling; Errors to Watch Out For; Ingestion success and failure; Stage failures; Validation failures; Communication failures; Stage timeouts; Query Monitoring; Minimizing Monitoring Costs; Summary; Recommended Readings

12. Essential Takeaways: An Ounce of Prevention Is Worth a Pound of Cure; Reign In Compute Spend; Organize Your Resources; Design for Interruption; Build In Data Quality; Change Is the Only Constant; Design for Change; Monitor for Change; Parting Thoughts

A. Preparing a Cloud Budget: It's All About the Details; Historical Data; Estimating for New Projects; Changes That Impact Costs; Data landscape; Load; Infrastructure; Creating a Budget; Budget Summary; Changes Between Previous and Next Budget Periods; Cost Breakdown; Communicating the Budget; Summary

Index

Specifications

Basic information

Author
  • Sev Leonard
Year of publication
  • 2023
Format
  • MOBI
  • EPUB
Number of pages
  • 288