Effective Monitoring and Alerting. For Web Operations (e-book)


Description

With this practical book, you'll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service. Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you're a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

- Monitor every component of your application stack, from the network to user experience
- Learn how to draw the right conclusions from the metrics you obtain
- Develop a robust alerting system that can identify problematic anomalies without raising false alarms
- Address system failures by their impact on resource utilization and user experience
- Plan an alerting configuration that scales with your expanding network
- Learn how to choose appropriate maintenance times automatically
- Develop a work environment that fosters flexibility and adaptability

Table of contents:

Effective Monitoring and Alerting
SPECIAL OFFER: Upgrade this ebook with O'Reilly

Preface
Who Should Read This Book Conventions Used in This Book Using Code Examples Safari Books Online How to Contact Us Acknowledgements

1. Introduction
Monitoring, Alerting, and What They Can Do for You Early Problem Detection Availability Performance Decision Making Baselining Predictions Automation Admission Control Autonomic Computing Monitoring and Alerting in a Nutshell Metrics and Timeseries Alarms, Alerts, and Monitors Monitoring System The Process of Alerting Issue Tracking Tickets and queues The Challenges Important Terms

2. Monitoring
The Building Blocks Data Collection Coverage Resources Network Computational resources Solution stack Operating system Middleware Application User experience Metrics Summary statistics Frequency distribution and percentiles Rate of change Time granularity Metric aggregation Example: Inputs, Metrics, and Timeseries Understanding Metrics Type of unit Data Collection Mode Data Source Number of Inputs per Data Point Type of Quantity Timeseries Patterns Drawing Conclusions from Timeseries Plots Interpretation of Anomalies Flow Stock Availability Throughput Applications of quantities Frequently Encountered Anomalies Flattening Effect Warm-Up Effect Regular Anomalies Spikes During Troughs Determining Causality Capturing the Daily Cycle, Trends, and Seasonal Changes

3. Alerting
The Challenge Prerequisites Monitoring and Alerting Platform Audit Trail Issue Tracking Understanding Failure and Its Impact Establishing Significance Identifying Causes Anatomy of an Alarm Boolean Function Metric Monitor Upper Limit Lower Limit Outside Range Data Points Not Recorded Time Evaluation Another Alarm as Input Source Suppression Aggregation Case Study: A Data Pipeline Types of Alerts Setting Up Alarms Identifying Impact Establishing Severity Picking the Right Timeseries Configuring Monitors Coming Up with a Threshold Static thresholds Data-driven thresholds Breach and Clear Delay Setting Up Alarms Testing Alerting Configurations Alerting Suggestions

4. At Scale
Implications of Scale Composition of Large-Scale Systems Commonalities of Large-Scale Alerting Configurations Monitoring Coverage Reflecting Dimensions in Metrics Managing Large Alerting Configurations Addressing the Problems Organize alarms and monitors in a namespace Calculate threshold values from metric data Periodically refresh and clean up the configuration Suggested Solution Refresh intervals Running the engine Naming Alarm creation and threshold calculation Cleanup procedures Writing Modules Suppression Extra Features Result

5. Monitoring in System Automation
Choosing Appropriate Maintenance Times Automatically Controlling the Rate of Upgrade Recovery-Oriented Admission Control Automated Deployment and Rollback

6. The Work Environment
Keeping an Audit Trail Working with Tickets Root Cause Analysis The Five Whys Extracting Categories Dealing with Anomalies Learning from Outages Using Checklists Creating Dashboards Service-Level Agreements Preventing the Ironies of Automation Culture

7. Measuring Success
The Feedback Loop Root Cause Classification A Short Story of a Long Classifier List Timing Ticket Reporting Frequency of Incidence Incidence Times Time to Respond and Time to Resolution Measuring Detectability False Positives and False Negatives Precision and Recall The F-Measure Transition to Automated Alarms Maintenance Overhead How (Not) to Measure

8. The Principles
Get in the Habit of Measuring Draw Conclusions Reliably Monitor Extensively Alarm Selectively Work Smart, Not Hard Learn from the Experience of Others Have a Tactic Run a Bank of Cases Enjoy the Process

A. Setting Up OpenTSDB
The Software Architecture Getting OpenTSDB First Steps Starting TSD Pushing Data Input Tagging Tag Wildcards Temporal Aggregation Summary Statistics Rate of Change Gathering Data System-Wide Running tcollector Writing a Custom Collector Timeseries Plots Plotting Tips Get Involved

About the Author
SPECIAL OFFER: Upgrade this ebook with O'Reilly
Copyright


Specifications

Basic information

Author	Slawek Ligus
Year of publication	2012

Technical details

Format	MOBI, EPUB
Number of pages	166

Additional information

Categories	Hacking
Publisher	O'Reilly Media
