Deep Learning at Scale

Price: from 228.65

Description

Bringing a deep-learning project into production at scale is challenging. To scale your project successfully, you need a foundational understanding of full-stack deep learning: the knowledge that lies at the intersection of hardware, software, data, and algorithms. This book illustrates complex concepts of full-stack deep learning and reinforces them through hands-on exercises to arm you with the tools and techniques to scale your project. A scaling effort is only beneficial when it is effective and efficient. To that end, this guide explains the intricate concepts and techniques that will help you scale effectively and efficiently.

You'll gain a thorough understanding of:
  • How data flows through the deep-learning network and the role the computation graphs play in building your model
  • How accelerated computing speeds up your training and how best you can utilize the resources at your disposal
  • How to train your model using distributed training paradigms, i.e., data, model, and pipeline parallelism
  • How to leverage the PyTorch ecosystem in conjunction with NVIDIA libraries and Triton to scale your model training
  • How to debug, monitor, and investigate the bottlenecks that slow down your model training
  • How to expedite the training lifecycle and streamline your feedback loop to iterate on model development
  • A set of data tricks and techniques and how to apply them to scale your model training
  • How to select the right tools and techniques for your deep-learning project
  • Options for managing the compute infrastructure when running at scale

Table of Contents:

Preface
  Why Scaling Matters • Who This Book Is For • How This Book Is Organized • Introduction • Part I: Foundational Concepts of Deep Learning • Part II: Distributed Training • Part III: Extreme Scaling • What You Need to Use This Book • Setting Up Your Environment for Hands-on Exercises • Using Code Examples • Conventions Used in This Book • O'Reilly Online Learning • How to Contact Us • Acknowledgments

1. What Nature and History Have Taught Us About Scale
  The Philosophy of Scaling • The General Law of Scaling • History of Scaling Law • Scalable Systems • Nature as a Scalable System • Our Visual System: A Biological Inspiration • Artificial Intelligence: The Evolution of Learnable Systems • It Takes Four to Tango • The hardware • The data • The software • The (deep learning) algorithms • Evolving Deep Learning Trends • General evolution of deep learning • Evolution in specialized domains • Math and compute • Protein folding • Simulated world • Scale in the Context of Deep Learning • Six Development Considerations • Well-defined problem • Domain knowledge (a.k.a. the constraints) • Ground truth • Model development • Deployment • Feedback • Scaling Considerations • Questions to ask before scaling • Characteristics of scalable systems • Reliability • Availability • Adaptability • Performance • Considerations of scalable systems • Avoiding single points of failure • Designing for high availability • Scaling paradigms • Coordination and communication • Caching and intermittent storage • Process state • Graceful recovery and checkpointing • Maintainability and observability • Scaling effectively • Summary

I. Foundational Concepts of Deep Learning

2. Deep Learning
  The Role of Data in Deep Learning • Data Flow in Deep Learning • Hands-On Exercise #1: Implementing Minimalistic Deep Learning • Developing the Model • Model input data and pipeline • Model • Training loop • Loss • Metrics • The Embedded/Latent Space • A Word of Caution • The Learning Rate and Loss Landscape • Scaling Consideration • Profiling • Hands-On Exercise #2: Getting Complex with PyTorch • Model Input Data and Pipeline • Model • Auxiliary Utilities • Callbacks • Loggers • Profilers • Putting It All Together • Computation Graphs • Inference • Summary

3. The Computational Side of Deep Learning
  The Higgs Boson of the Digital World • Floating-Point Numbers: The Faux Continuous Numbers • Floating-point encoding • Floating-point standards • Units of Data Measurement • Data Storage Formats: The Trade-off of Latency and Throughput • Computer Architecture • The Birth of the Electromechanical Engine • Memory and Persistence • Virtual memory • Input/output • Memory and Moore's law • Computation and Memory Combined • The Scaling Laws of Electronics • Scaling Out Computation with Parallelization • Threads Versus Processes: The Unit of Parallelization • Simultaneous multithreading • Scenario walkthrough: A web crawler to curate a links dataset • Hardware-Optimized Libraries for Acceleration • Parallel Computer Architectures: Flynn's and Duncan's Taxonomies • Accelerated Computing • Popular Accelerated Devices for Deep Learning • Graphics processing units (GPUs) • GPU microarchitecture • CUDA • NVIDIA's dominance: The competition landscape • Application-specific integrated circuits (ASICs) • Tensor Processing Units (TPUs) • Intelligence Processing Units (IPUs) • Field programmable gate arrays (FPGAs) • Wafer Scale Engines (WSEs) • Accelerator Benchmarking • Summary

4. Putting It All Together: Efficient Deep Learning
  Hands-On Exercise #1: GPT-2 • Exercise Objectives • Model Architecture • Key contributors to scale • Transformer attention block • Unsupervised training • Zero-shot learning • Parameter scale • Implementation • model.py • dataset.py • app.py • Running the Example • Experiment Tracking • Measuring to Understand the Limitations and Scale Out • Running on a CPU • Running on a GPU • Transitioning from Language to Vision • Hands-On Exercise #2: Vision Model with Convolution • Model Architecture • Key contributors to scale in the scene parsing exercise • Scaling with convolutions • Scaling with EfficientNet • Implementation • Running the Example • Observations • Graph Compilation Using PyTorch 2.0 • New Components of PyTorch 2.0 • Graph Execution in PyTorch 2.0 • Graph acquisition • Graph lowering • Graph compilation • Modeling Techniques to Scale Training on a Single Device • Graph Compilation • Reduced- and Mixed-Precision Training • Mixed precision • The effect of precision on gradients • Gradient scaling • Gradient clipping • 8-bit optimizers and quantization • A mixed-precision algorithm • Memory Tricks for Efficiency • Memory layout • Feature compression • Meta and fake tensors • Optimizer Efficiencies • Stochastic gradient descent (SGD) • Gradient accumulation • Gradient checkpointing • Patch Gradient Descent • Learning rate and weight decay • Model Input Pipeline Tricks • Writing Custom Kernels in PyTorch 2.0 with Triton • Summary

II. Distributed Training

5. Distributed Systems and Communications
  Distributed Systems • The Eight Fallacies of Distributed Computing • The Consistency, Availability, and Partition Tolerance (CAP) Theorem • The Scaling Law of Distributed Systems • Types of Distributed Systems • Centralized • Decentralized • Communication in Distributed Systems • Communication Paradigm • Communication Patterns • Basic communication patterns • Collective communication patterns • Communication Technologies • RPC • MPI • NCCL • Communication technology summary • Communication Initialization: Rendezvous • Hands-On Exercise • Scaling Compute Capacity • Infrastructure Setup Options • Private cloud (on-premise/DIY data centers) • Public cloud • Hybrid cloud • Multicloud • Federation • Provisioning of Accelerated Devices • Workload Management • Slurm • Kubernetes • Ray • Distributed memory layer • Asynchronous model • Amazon SageMaker • Google Vertex AI • Deep Learning Infrastructure Review • Overview of Leading Deep Learning Clusters • Similarities Between Today's Most Powerful Systems • Summary

6. Theoretical Foundations of Distributed Deep Learning
  Distributed Deep Learning • Centralized DDL • Parameter server configurations • Subtypes of centralized DDL • Synchronous centralized DDL • Asynchronous centralized DDL • Decentralized DDL • Limiting divergence • Subtypes of decentralized DDL • Synchronous decentralized DDL • Asynchronous decentralized DDL • Dimensions of Scaling Distributed Deep Learning • Partitioning Dimensions of Distributed Deep Learning • Types of Distributed Deep Learning Techniques • Ensembling • Data parallelism • Model parallelism • Pipeline parallelism • Tensor parallelism • Hybrid parallelism • Federation/collaborative learning • Choosing a Scaling Technique • Measuring Scale • End-to-End Metrics and Benchmarks • Time to convergence • Cost to train • Multilevel benchmarks • Measuring Incrementally in a Reproducible Environment • Summary

7. Data Parallelism
  Data Partitioning • Implications of Data Sampling Strategies • Working with Remote Datasets • Introduction to Data Parallel Techniques • Hands-On Exercise #1: Centralized Parameter Server Using RPC • Setup • Observations • Inspecting involved processes • Inspecting connections • Communication patterns • Discussion • Hands-On Exercise #2: Centralized Gradient-Partitioned Joint Worker/Server Distributed Training • Setup • Observations • Communication patterns • Discussion • Hands-On Exercise #3: Decentralized Asynchronous Distributed Training • Setup • Observations • Communication patterns • Discussion • Centralized Synchronous Data Parallel Strategies • Data Parallel (DP) • Distributed Data Parallel (DDP) • Devil in the details • Distributed Data Parallel 2 (DDP2) • Zero Redundancy Optimizer-Powered Data Parallelism (ZeRO-DP) • Fault-Tolerant Training • Hands-On Exercise #4: Scene Parsing with DDP • Setup • Observations • Baseline • Multi-GPU training • Multinode • Mixed-precision training • Hands-On Exercise #5: Distributed Sharded DDP (ZeRO) • Setup • Runtime configuration • Observations • Discussion • Building Efficient Pipelines • Dataset Format • Local Versus Remote • Staging • Threads Versus Processes: Scaling Your Pipelines • Memory Tricks • Data Augmentations: CPU Versus GPU • JIT Acceleration • Hands-On Exercise #6: Pipeline Efficiency with FFCV • Setup • Runtime configuration • Observations • Summary

8. Scaling Beyond Data Parallelism: Model, Pipeline, Tensor, and Hybrid Parallelism
  Questions to Ask Before Scaling Vertically • Theoretical Foundations of Vertical Scaling • Revisiting the Dimensions of Scaling • Implementing tensor parallelism • Implementing model parallelism • Choosing a scaling dimension • Operators' Perspective of Parallelism Dimensions • Data Flow and Communications in Vertical Scaling • Tensor parallelism • Model parallelism • Pipeline parallelism: An evolution of model parallelism • GPipe • PipeDream • Hybrid parallelism • 2D hybrid parallelism • 3D hybrid parallelism • Basic Building Blocks for Scaling Beyond DP • PyTorch Primitives for Vertical Scaling • Device mesh: Mapping model architecture to physical devices • Distributed tensors: Tensors with sharding and replication • Sharding and replication examples • Partial tensors • Logical tensors: Representation without materialization • Meta tensors • Fake tensors • Working with Larger Models • Distributed Checkpointing: Saving the Partitioned Model • Summary

9. Gaining Practical Expertise with Scaling Across All Dimensions
  Hands-On Exercises: Model, Tensor, Pipeline, and Hybrid Parallelism • The Dataset • Hands-On Exercise #1: Baseline DeepFM Training • Observations • Hands-On Exercise #2: Model Parallel DeepFM • Implementation details • Observations • Hands-On Exercise #3: Pipeline Parallel DeepFM • Implementation details • Observations • Hands-On Exercise #4: Pipeline Parallel DeepFM with RPC • Implementation details • Observations • Hands-On Exercise #5: Tensor Parallel DeepFM • Implementation details • Observations • Hands-On Exercise #6: Hybrid Parallel DeepFM • Implementation details • Observations • Tools and Libraries for Vertical Scaling • OneFlow • FairScale • DeepSpeed • FSDP • Overview and Comparison • Hands-On Exercise #7: Automatic Vertical Scaling with DeepSpeed • Observations • Summary

III. Extreme Scaling

10. Data-Centric Scaling
  The Seven Vs of Data Through a Deep Learning Lens • The Scaling Law of Data • Data Quality • Validity • Variety • Handling too much variety • Heuristic-based pruning • Algorithmic outlier pruning • Hands-on exercise #1: Outlier detection • Scaling outlier detection • Handling too-low variety • Data augmentation • Advanced data augmentation • Automated augmentation • Synthetic data generation • Handling imbalance • Sampling • Hands-on exercise #2: Handling imbalance in a multilabel dataset • Loss tricks • Veracity • Reasons for error in labels • Approaches to labeling • Techniques to increase veracity/decrease noise • Using heuristics to identify noise • Using inter-label information, such as ontology • Continuous feedback • Handling disagreements from multiple annotators • Identifying noisy samples by loss gradients • Hands-on exercise #3: Loss tricks to find noisy samples • Using confident learning • Summary of veracity tactics • Value and Volume • Core principles driving value • Volume reduction via compression and pruning • Volume reduction via dimensionality reduction • Volume reduction via approximation • Volume reduction via distillation • Value via regularization • The Data Engine and Continual Learning • Volatility • Velocity • Summary

11. Scaling Experiments: Effective Planning and Management
  Model Development Is Iterative • Planning for Experiments and Execution • Simplify the Complex • Fast Iteration for Fast Feedback • Decoupled Iterations • Feasibility Testing • Developing and Scaling a Minimal Viable Solution • Setting Up for Iterative Execution • Techniques to Scale Your Experiments • Accelerating Model Convergence • Using transfer learning • Retraining • Fine tuning • Pretraining • Knowledge distillation • Accelerating Learning Via Optimization and Automation • Hyperparameter optimization • AutoML • Neural architecture search • Model validation • Simulating optimization behavior with Daydream • Accelerating Learning by Increasing Expertise • Continuous learning • Learning to learn via meta-learning • Curriculum learning • Mixture of experts • Learning with Scarce Supervision • Self-supervised learning • Contrastive learning • Hands-On Exercises • Hands-On Exercise #1: Transfer Learning • Hands-On Exercise #2: Hyperparameter Optimization • Hands-On Exercise #3: Knowledge Distillation • Hands-On Exercise #4: Mixture of Experts • Mock MoE • DeepSpeed-MoE • Hands-On Exercise #5: Contrastive Learning • Hands-On Exercise #6: Meta-Learning • Summary

12. Efficient Fine-Tuning of Large Models
  Review of Fine-Tuning Techniques • Standard Fine Tuning • Meta-Learning (Zero-/Few-Shot Learning) • Adapter-Based Fine Tuning • Low-Rank Tuning • LoRA: Parameter-Efficient Fine Tuning • Quantized LoRA (QLoRA) • Hands-on Exercise: QLoRA-Based Fine Tuning • Implementation Details • Inference • Exercise Summary • Summary

13. Foundation Models
  What Are Foundation Models? • The Evolution of Foundation Models • Challenges Involved in Developing Foundation Models • Measurement Complexity • Deployment Challenges • Propagation of Defects to All Downstream Models • Legal and Ethical Considerations • Ensuring Consistency and Coherency • Multimodal Large Language Models • Projection • Gated Cross-Attention • Query-Based Encoding • Further Exploration • Summary

Index

Specification

Basic information

Author
  • Suneeta Mall
Format
  • MOBI
  • EPUB
Number of pages
  • 448
Year of publication
  • 2024