Scaling Machine Learning with Spark


Description

Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better.

Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.

You will:
- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Use PyTorch to scale machine learning and explore its internal architecture

Table of contents:

Preface
Who Should Read This Book?; Do You Need Distributed Machine Learning?; Navigating This Book; What Is Not Covered; The Environment and Tools; The Tools; The Datasets; Conventions Used in This Book; Using Code Examples; O'Reilly Online Learning; How to Contact Us; Acknowledgments

1. Distributed Machine Learning Terminology and Concepts
The Stages of the Machine Learning Workflow; Tools and Technologies in the Machine Learning Pipeline; Distributed Computing Models; General-Purpose Models; MapReduce; MPI; Barrier; Shared memory; Dedicated Distributed Computing Models; Introduction to Distributed Systems Architecture; Centralized Versus Decentralized Systems; Interaction Models; Client/server; Peer-to-peer; Geo-distributed; Communication in a Distributed Setting; Asynchronous; Synchronous; Introduction to Ensemble Methods; High Versus Low Bias; Types of Ensemble Methods; Distributed Training Topologies; Centralized ensemble learning; Decentralized decision trees; Centralized, distributed training with parameter servers; Centralized, distributed training in a P2P topology; The Challenges of Distributed Machine Learning Systems; Performance; Data parallelism versus model parallelism; Combining data parallelism and model parallelism; Deep learning; Resource Management; Fault Tolerance; Privacy; Portability; Setting Up Your Local Environment; Chapters 2–6 Tutorials Environment; Chapters 7–10 Tutorials Environment; Summary

2. Introduction to Spark and PySpark
Apache Spark Architecture; Intro to PySpark; Apache Spark Basics; Software Architecture; Creating a custom schema; Key Spark data abstractions and APIs; DataFrames are immutable; PySpark and Functional Programming; Executing PySpark Code; pandas DataFrames Versus Spark DataFrames; Scikit-Learn Versus MLlib; Summary

3. Managing the Machine Learning Experiment Lifecycle with MLflow
Machine Learning Lifecycle Management Requirements; What Is MLflow?; Software Components of the MLflow Platform; Users of the MLflow Platform; MLflow Components; MLflow Tracking; Using MLflow Tracking to record runs; Logging your dataset path and version; MLflow Projects; MLflow Models; MLflow Model Registry; Registering models; Transitioning between model stages; Using MLflow at Scale; Summary

4. Data Ingestion, Preprocessing, and Descriptive Statistics
Data Ingestion with Spark; Working with Images; Image format; Binary format; Working with Tabular Data; Preprocessing Data; Preprocessing Versus Processing; Why Preprocess the Data?; Data Structures; MLlib Data Types; Preprocessing with MLlib Transformers; Working with text data; From nominal categorical features to indices; Structuring continuous numerical data; Additional transformers; Preprocessing Image Data; Extracting labels; Transforming labels to indices; Extracting image size; Save the Data and Avoid the Small Files Problem; Avoiding small files; Image compression and Parquet; Descriptive Statistics: Getting a Feel for the Data; Calculating Statistics; Descriptive Statistics with Spark Summarizer; Data Skewness; Correlation; Pearson correlation; Spearman correlation; Summary

5. Feature Engineering
Features and Their Impact on Models; MLlib Featurization Tools; Extractors; Selectors; Example: Word2Vec; The Image Featurization Process; Understanding Image Manipulation; Grayscale; Defining image boundaries using image gradients; Extracting Features with Spark APIs; pyspark.sql.functions: pandas_udf and Python type hints; pyspark.sql.GroupedData: applyInPandas and mapInPandas; The Text Featurization Process; Bag-of-Words; TF-IDF; N-Gram; Additional Techniques; Enriching the Dataset; Summary

6. Training Models with Spark MLlib
Algorithms; Supervised Machine Learning; Classification; MLlib classification algorithms; Implementing multilabel classification support; What about imbalanced class labels?; Regression; Recommendation systems; ALS for collaborative filtering; Unsupervised Machine Learning; Frequent Pattern Mining; Clustering; Evaluating; Supervised Evaluators; Unsupervised Evaluators; Hyperparameters and Tuning Experiments; Building a Parameter Grid; Splitting the Data into Training and Test Sets; Cross-Validation: A Better Way to Test Your Models; Machine Learning Pipelines; Constructing a Pipeline; How Does Splitting Work with the Pipeline API?; Persistence; Summary

7. Bridging Spark and Deep Learning Frameworks
The Two Clusters Approach; Implementing a Dedicated Data Access Layer; Features of a DAL; Selecting a DAL; What Is Petastorm?; SparkDatasetConverter; Petastorm as a Parquet Store; Project Hydrogen; Barrier Execution Mode; Accelerator-Aware Scheduling; A Brief Introduction to the Horovod Estimator API; Summary

8. TensorFlow Distributed Machine Learning Approach
A Quick Overview of TensorFlow; What Is a Neural Network?; TensorFlow Cluster Process Roles and Responsibilities; Loading Parquet Data into a TensorFlow Dataset; An Inside Look at TensorFlow's Distributed Machine Learning Strategies; ParameterServerStrategy; CentralStorageStrategy: One Machine, Multiple Processors; MirroredStrategy: One Machine, Multiple Processors, Local Copy; MultiWorkerMirroredStrategy: Multiple Machines, Synchronous; TPUStrategy; What Things Change When You Switch Strategies?; Training APIs; Keras API; MobileNetV2 transfer learning case study; Training the Keras MobileNetV2 algorithm from scratch; Custom Training Loop; Estimator API; Putting It All Together; Troubleshooting; Summary

9. PyTorch Distributed Machine Learning Approach
A Quick Overview of PyTorch Basics; Computation Graph; PyTorch Mechanics and Concepts; PyTorch Distributed Strategies for Training Models; Introduction to PyTorch's Distributed Approach; Distributed Data-Parallel Training; RPC-Based Distributed Training; Remote execution; Remote references; Using RRefs to orchestrate distributed algorithms; Identifying objects by reference; Distributed autograd; The distributed optimizer; Communication Topologies in PyTorch (c10d); Collective communication in PyTorch; Peer-to-peer communication in PyTorch; What Can We Do with PyTorch's Low-Level APIs?; Loading Data with PyTorch and Petastorm; Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch; The Enigma of Mismatched Data Types; The Mystery of Straggling Workers; How Does PyTorch Differ from TensorFlow?; Summary

10. Deployment Patterns for Machine Learning Models
Deployment Patterns; Pattern 1: Batch Prediction; Pattern 2: Model-in-Service; Pattern 3: Model-as-a-Service; Determining Which Pattern to Use; Production Software Requirements; Monitoring Machine Learning Models in Production; Data Drift; Model Drift, Concept Drift; Distributional Domain Shift (the Long Tail); What Metrics Should I Monitor in Production?; How Do I Measure Changes Using My Monitoring System?; Define a reference; Measure the reference against fresh metrics values; Algorithms to use for measuring; What It Looks Like in Production; The Production Feedback Loop; Deploying with MLlib; Production Machine Learning Pipelines with Structured Streaming; Deploying with MLflow; Defining an MLflow Wrapper; Deploying the Model as a Microservice; Loading the Model as a Spark UDF; How to Develop Your System Iteratively; Summary

Index

About the author:

Adi Polak is an experienced engineer, Vice President of Developer Experience at Treeverse, and a member of many expert groups. She helps organize conferences such as Data + AI Summit by Databricks, Current by Confluent, and Scale by the Bay. She gained her machine learning experience conducting research for many Fortune 500 companies.


Specification

Basic information

Author	Adi Polak
Year of publication	2023

Technical details

Format	MOBI, EPUB
Number of pages	294
