Learning Spark - Jules S. Damji


Description

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

  • Learn Python, SQL, Scala, or Java high-level Structured APIs
  • Understand Spark operations and the SQL engine
  • Inspect, tune, and debug Spark operations with Spark configurations and the Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow

(A short, illustrative PySpark sketch in this style follows the table of contents below.)

Table of contents:

Foreword

Preface
  Who This Book Is For; How the Book Is Organized; How to Use the Code Examples; Software and Configuration Used; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Acknowledgments

1. Introduction to Apache Spark: A Unified Analytics Engine
  The Genesis of Spark; Big Data and Distributed Computing at Google; Hadoop at Yahoo!; Spark's Early Years at AMPLab; What Is Apache Spark?; Speed; Ease of Use; Modularity; Extensibility; Unified Analytics; Apache Spark Components as a Unified Stack; Spark SQL; Spark MLlib; Spark Structured Streaming; GraphX; Apache Spark's Distributed Execution; Spark driver; SparkSession; Cluster manager; Spark executor; Deployment modes; Distributed data and partitions; The Developer's Experience; Who Uses Spark, and for What?; Data science tasks; Data engineering tasks; Popular Spark use cases; Community Adoption and Expansion

2. Downloading Apache Spark and Getting Started
  Step 1: Downloading Apache Spark; Spark's Directories and Files; Step 2: Using the Scala or PySpark Shell; Using the Local Machine; Step 3: Understanding Spark Application Concepts; Spark Application and SparkSession; Spark Jobs; Spark Stages; Spark Tasks; Transformations, Actions, and Lazy Evaluation; Narrow and Wide Transformations; The Spark UI; Your First Standalone Application; Counting M&Ms for the Cookie Monster; Building Standalone Applications in Scala; Summary

3. Apache Spark's Structured APIs
  Spark: What's Underneath an RDD?; Structuring Spark; Key Merits and Benefits; The DataFrame API; Spark's Basic Data Types; Spark's Structured and Complex Data Types; Schemas and Creating DataFrames; Two ways to define a schema; Columns and Expressions; Rows; Common DataFrame Operations; Using DataFrameReader and DataFrameWriter; Saving a DataFrame as a Parquet file or SQL table; Transformations and actions; Projections and filters; Renaming, adding, and dropping columns; Aggregations; Other common DataFrame operations; End-to-End DataFrame Example; The Dataset API; Typed Objects, Untyped Objects, and Generic Rows; Creating Datasets; Scala: Case classes; Dataset Operations; End-to-End Dataset Example; DataFrames Versus Datasets; When to Use RDDs; Spark SQL and the Underlying Engine; The Catalyst Optimizer; Phase 1: Analysis; Phase 2: Logical optimization; Phase 3: Physical planning; Phase 4: Code generation; Summary

4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
  Using Spark SQL in Spark Applications; Basic Query Examples; SQL Tables and Views; Managed Versus Unmanaged Tables; Creating SQL Databases and Tables; Creating a managed table; Creating an unmanaged table; Creating Views; Temporary views versus global temporary views; Viewing the Metadata; Caching SQL Tables; Reading Tables into DataFrames; Data Sources for DataFrames and SQL Tables; DataFrameReader; DataFrameWriter; Parquet; Reading Parquet files into a DataFrame; Reading Parquet files into a Spark SQL table; Writing DataFrames to Parquet files; Writing DataFrames to Spark SQL tables; JSON; Reading a JSON file into a DataFrame; Reading a JSON file into a Spark SQL table; Writing DataFrames to JSON files; JSON data source options; CSV; Reading a CSV file into a DataFrame; Reading a CSV file into a Spark SQL table; Writing DataFrames to CSV files; CSV data source options; Avro; Reading an Avro file into a DataFrame; Reading an Avro file into a Spark SQL table; Writing DataFrames to Avro files; Avro data source options; ORC; Reading an ORC file into a DataFrame; Reading an ORC file into a Spark SQL table; Writing DataFrames to ORC files; Images; Reading an image file into a DataFrame; Binary Files; Reading a binary file into a DataFrame; Summary

5. Spark SQL and DataFrames: Interacting with External Data Sources
  Spark SQL and Apache Hive; User-Defined Functions; Spark SQL UDFs; Evaluation order and null checking in Spark SQL; Speeding up and distributing PySpark UDFs with Pandas UDFs; Querying with the Spark SQL Shell, Beeline, and Tableau; Using the Spark SQL Shell; Create a table; Insert data into the table; Running a Spark SQL query; Working with Beeline; Start the Thrift server; Connect to the Thrift server via Beeline; Execute a Spark SQL query with Beeline; Stop the Thrift server; Working with Tableau; Start the Thrift server; Start Tableau; Stop the Thrift server; External Data Sources; JDBC and SQL Databases; The importance of partitioning; PostgreSQL; MySQL; Azure Cosmos DB; MS SQL Server; Other External Sources; Higher-Order Functions in DataFrames and Spark SQL; Option 1: Explode and Collect; Option 2: User-Defined Function; Built-in Functions for Complex Data Types; Higher-Order Functions; transform(); filter(); exists(); reduce(); Common DataFrames and Spark SQL Operations; Unions; Joins; Windowing; Modifications; Adding new columns; Dropping columns; Renaming columns; Pivoting; Summary

6. Spark SQL and Datasets
  Single API for Java and Scala; Scala Case Classes and JavaBeans for Datasets; Working with Datasets; Creating Sample Data; Transforming Sample Data; Higher-order functions and functional programming; Converting DataFrames to Datasets; Memory Management for Datasets and DataFrames; Dataset Encoders; Spark's Internal Format Versus Java Object Format; Serialization and Deserialization (SerDe); Costs of Using Datasets; Strategies to Mitigate Costs; Summary

7. Optimizing and Tuning Spark Applications
  Optimizing and Tuning Spark for Efficiency; Viewing and Setting Apache Spark Configurations; Scaling Spark for Large Workloads; Static versus dynamic resource allocation; Configuring Spark executors' memory and the shuffle service; Maximizing Spark parallelism; How partitions are created; Caching and Persistence of Data; DataFrame.cache(); DataFrame.persist(); When to Cache and Persist; When Not to Cache and Persist; A Family of Spark Joins; Broadcast Hash Join; When to use a broadcast hash join; Shuffle Sort Merge Join; Optimizing the shuffle sort merge join; When to use a shuffle sort merge join; Inspecting the Spark UI; Journey Through the Spark UI Tabs; Jobs and Stages; Executors; Storage; SQL; Environment; Debugging Spark applications; Summary

8. Structured Streaming
  Evolution of the Apache Spark Stream Processing Engine; The Advent of Micro-Batch Stream Processing; Lessons Learned from Spark Streaming (DStreams); The Philosophy of Structured Streaming; The Programming Model of Structured Streaming; The Fundamentals of a Structured Streaming Query; Five Steps to Define a Streaming Query; Step 1: Define input sources; Step 2: Transform data; Step 3: Define output sink and output mode; Step 4: Specify processing details; Step 5: Start the query; Putting it all together; Under the Hood of an Active Streaming Query; Recovering from Failures with Exactly-Once Guarantees; Monitoring an Active Query; Querying current status using StreamingQuery; Get current metrics using StreamingQuery; Get current status using StreamingQuery.status(); Publishing metrics using Dropwizard Metrics; Publishing metrics using custom StreamingQueryListeners; Streaming Data Sources and Sinks; Files; Reading from files; Writing to files; Apache Kafka; Reading from Kafka; Writing to Kafka; Custom Streaming Sources and Sinks; Writing to any storage system; Using foreachBatch(); Using foreach(); Reading from any storage system; Data Transformations; Incremental Execution and Streaming State; Stateless Transformations; Stateful Transformations; Distributed and fault-tolerant state management; Types of stateful operations; Stateful Streaming Aggregations; Aggregations Not Based on Time; Aggregations with Event-Time Windows; Handling late data with watermarks; Semantic guarantees with watermarks; Supported output modes; Streaming Joins; Stream-Static Joins; Stream-Stream Joins; Inner joins with optional watermarking; Outer joins with watermarking; Arbitrary Stateful Computations; Modeling Arbitrary Stateful Operations with mapGroupsWithState(); Using Timeouts to Manage Inactive Groups; Processing-time timeouts; Event-time timeouts; Generalization with flatMapGroupsWithState(); Performance Tuning; Summary

9. Building Reliable Data Lakes with Apache Spark
  The Importance of an Optimal Storage Solution; Databases; A Brief Introduction to Databases; Reading from and Writing to Databases Using Apache Spark; Limitations of Databases; Data Lakes; A Brief Introduction to Data Lakes; Reading from and Writing to Data Lakes Using Apache Spark; Limitations of Data Lakes; Lakehouses: The Next Step in the Evolution of Storage Solutions; Apache Hudi; Apache Iceberg; Delta Lake; Building Lakehouses with Apache Spark and Delta Lake; Configuring Apache Spark with Delta Lake; Loading Data into a Delta Lake Table; Loading Data Streams into a Delta Lake Table; Enforcing Schema on Write to Prevent Data Corruption; Evolving Schemas to Accommodate Changing Data; Transforming Existing Data; Updating data to fix errors; Deleting user-related data; Upserting change data to a table using merge(); Deduplicating data while inserting using insert-only merge; Auditing Data Changes with Operation History; Querying Previous Snapshots of a Table with Time Travel; Summary

10. Machine Learning with MLlib
  What Is Machine Learning?; Supervised Learning; Unsupervised Learning; Why Spark for Machine Learning?; Designing Machine Learning Pipelines; Data Ingestion and Exploration; Creating Training and Test Data Sets; Preparing Features with Transformers; Understanding Linear Regression; Using Estimators to Build Models; Creating a Pipeline; One-hot encoding; Evaluating Models; RMSE; Interpreting the value of RMSE; R²; Saving and Loading Models; Hyperparameter Tuning; Tree-Based Models; Decision trees; Random forests; k-Fold Cross-Validation; Optimizing Pipelines; Summary

11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
  Model Management; MLflow; Tracking; Model Deployment Options with MLlib; Batch; Streaming; Model Export Patterns for Real-Time Inference; Leveraging Spark for Non-MLlib Models; Pandas UDFs; Spark for Distributed Hyperparameter Tuning; Joblib; Hyperopt; Summary

12. Epilogue: Apache Spark 3.0
  Spark Core and Spark SQL; Dynamic Partition Pruning; Adaptive Query Execution; The AQE framework; SQL Join Hints; Shuffle sort merge join (SMJ); Broadcast hash join (BHJ); Shuffle hash join (SHJ); Shuffle-and-replicate nested loop join (SNLJ); Catalog Plugin API and DataSourceV2; Accelerator-Aware Scheduler; Structured Streaming; PySpark, Pandas UDFs, and Pandas Function APIs; Redesigned Pandas UDFs with Python Type Hints; Iterator Support in Pandas UDFs; New Pandas Function APIs; Changed Functionality; Languages Supported and Deprecated; Changes to the DataFrame and Dataset APIs; DataFrame and SQL Explain Commands; Summary

Index
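To give a flavor of the high-level Structured APIs the book teaches, here is a minimal, self-contained PySpark sketch in the spirit of the "Counting M&Ms for the Cookie Monster" walk-through in Chapter 2. It is an illustration only, not code from the book; the input path and column names (data/mnm_dataset.csv, State, Color, Count) are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

# Build (or reuse) a SparkSession, the entry point to the Structured APIs
spark = (SparkSession.builder
         .appName("MnMCount")
         .getOrCreate())

# Read a CSV file into a DataFrame; path and columns are placeholders
mnm_df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("data/mnm_dataset.csv"))

# Aggregate: total count per state and color, highest totals first
count_df = (mnm_df.select("State", "Color", "Count")
            .groupBy("State", "Color")
            .agg(count("Count").alias("Total"))
            .orderBy("Total", ascending=False))

count_df.show(10)

# Persist the aggregated result as Parquet, one of the built-in sinks
count_df.write.mode("overwrite").parquet("output/mnm_counts")

spark.stop()

Run with spark-submit or paste into a pyspark shell against Spark 3.0 or later; the same pattern carries over to the Scala, Java, and SQL APIs covered in the book.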

Specification

Basic information

Author
  • Jules S. Damji
  • Brooke Wenig
  • Tathagata Das
Year of publication
  • 2020
Categories
  • Foreign-language literature
Format
  • MOBI
  • EPUB
Number of pages
  • 400
Publisher
  • O'Reilly Media