Effective Monitoring and Alerting. For Web Operations (e-book) Katowice


from 55.24 Nearest: 26 km

Number of offers: 1

Store offer

Description

With this practical book, you'll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service.

Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you're a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

  • Monitor every component of your application stack, from the network to user experience
  • Learn how to draw the right conclusions from the metrics you obtain
  • Develop a robust alerting system that can identify problematic anomalies without raising false alarms
  • Address system failures by their impact on resource utilization and user experience
  • Plan an alerting configuration that scales with your expanding network
  • Learn how to choose appropriate maintenance times automatically
  • Develop a work environment that fosters flexibility and adaptability

Table of contents:

Effective Monitoring and Alerting
SPECIAL OFFER: Upgrade this ebook with O'Reilly
Preface: Who Should Read This Book Conventions Used in This Book Using Code Examples Safari Books Online How to Contact Us Acknowledgements
1. Introduction: Monitoring, Alerting, and What They Can Do for You Early Problem Detection Availability Performance Decision Making Baselining Predictions Automation Admission Control Autonomic Computing Monitoring and Alerting in a Nutshell Metrics and Timeseries Alarms, Alerts, and Monitors Monitoring System The Process of Alerting Issue Tracking Tickets and queues The Challenges Important Terms
2. Monitoring: The Building Blocks Data Collection Coverage Resources Network Computational resources Solution stack Operating system Middleware Application User experience Metrics Summary statistics Frequency distribution and percentiles Rate of change Time granularity Metric aggregation Example: Inputs, Metrics, and Timeseries Understanding Metrics Type of unit Data Collection Mode Data Source Number of Inputs per Data Point Type of Quantity Timeseries Patterns Drawing Conclusions from Timeseries Plots Interpretation of Anomalies Flow Stock Availability Throughput Applications of quantities Frequently Encountered Anomalies Flattening Effect Warm-Up Effect Regular Anomalies Spikes During Troughs Determining Causality Capturing the Daily Cycle, Trends, and Seasonal Changes
3. Alerting: The Challenge Prerequisites Monitoring and Alerting Platform Audit Trail Issue Tracking Understanding Failure and Its Impact Establishing Significance Identifying Causes Anatomy of an Alarm Boolean Function Metric Monitor Upper Limit Lower Limit Outside Range Data Points Not Recorded Time Evaluation Another Alarm as Input Source Suppression Aggregation Case Study: A Data Pipeline Types of Alerts Setting Up Alarms Identifying Impact Establishing Severity Picking the Right Timeseries Configuring Monitors Coming Up with a Threshold Static thresholds Data-driven thresholds Breach and Clear Delay Setting Up Alarms Testing Alerting Configurations Alerting Suggestions
4. At Scale: Implications of Scale Composition of Large-Scale Systems Commonalities of Large-Scale Alerting Configurations Monitoring Coverage Reflecting Dimensions in Metrics Managing Large Alerting Configurations Addressing the Problems Organize alarms and monitors in a namespace Calculate threshold values from metric data Periodically refresh and clean up the configuration Suggested Solution Refresh intervals Running the engine Naming Alarm creation and threshold calculation Cleanup procedures Writing Modules Suppression Extra Features Result
5. Monitoring in System Automation: Choosing Appropriate Maintenance Times Automatically Controlling the Rate of Upgrade Recovery-Oriented Admission Control Automated Deployment and Rollback
6. The Work Environment: Keeping an Audit Trail Working with Tickets Root Cause Analysis The Five Whys Extracting Categories Dealing with Anomalies Learning from Outages Using Checklists Creating Dashboards Service-Level Agreements Preventing the Ironies of Automation Culture
7. Measuring Success: The Feedback Loop Root Cause Classification A Short Story of a Long Classifier List Timing Ticket Reporting Frequency of Incidence Incidence Times Time to Respond and Time to Resolution Measuring Detectability False Positives and False Negatives Precision and Recall The F-Measure Transition to Automated Alarms Maintenance Overhead How (Not) to Measure
8. The Principles: Get in the Habit of Measuring Draw Conclusions Reliably Monitor Extensively Alarm Selectively Work Smart, Not Hard Learn from the Experience of Others Have a Tactic Run a Bank of Cases Enjoy the Process
A. Setting Up OpenTSDB: The Software Architecture Getting OpenTSDB First Steps Starting TSD Pushing Data Input Tagging Tag Wildcards Temporal Aggregation Summary Statistics Rate of Change Gathering Data System-Wide Running tcollector Writing a Custom Collector Timeseries Plots Plotting Tips Get Involved
About the Author
SPECIAL OFFER: Upgrade this ebook with O'Reilly
Copyright

Specifications

Basic information

Author
  • Slawek Ligus
Year of publication
  • 2012
Format
  • MOBI
  • EPUB
Number of pages
  • 166
Categories
  • Hacking
Selected publishers
  • O'Reilly Media