Data Wrangling with Python (e-book) Oświęcim

From 107.10; nearest: 48 km

Number of offers: 1

Description

Simplify your ETL processes with these hands-on data hygiene tips, tricks, and best practices.

Key Features
  • Focus on the basics of data wrangling
  • Study various ways to extract the most out of your data in less time
  • Boost your learning curve with bonus topics like random data generation and data integrity checks

Book Description

For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you the core ideas behind these processes and equips you with knowledge of the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures. It then delves into the fundamental tools of data wrangling like NumPy and Pandas libraries. You'll explore useful insights into why you should stay away from traditional ways of data cleaning, as done in other languages, and take advantage of the specialized pre-built routines in Python. This combination of Python tips and tricks will also demonstrate how to use the same Python backend and extract/transform data from an array of sources including the Internet, large database vaults, and Excel financial tables.

To help you prepare for more challenging scenarios, you'll cover how to handle missing or wrong data, and reformat it based on the requirements from the downstream analytics tool. The book will further help you grasp concepts through real-world examples and datasets.

By the end of this book, you will be confident in using a diverse array of sources to extract, clean, transform, and format your data efficiently.

What you will learn
  • Use and manipulate complex and simple data structures
  • Harness the full potential of DataFrames and numpy.array at run time
  • Perform web scraping with BeautifulSoup4 and html5lib
  • Execute advanced string search and manipulation with RegEx
  • Handle outliers and perform data imputation with Pandas
  • Use descriptive statistics and plotting techniques
  • Practice data wrangling and modeling using data generation techniques

Who this book is for

Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.

Table of contents

Preface
About the Book; About the Authors; Learning Objectives; Approach; Audience; Minimum Hardware Requirements; Software Requirements; Conventions; Installation and Setup; Installing the Code Bundle; Additional Resources

Chapter 1 Introduction to Data Wrangling with Python
Introduction; Importance of Data Wrangling; Python for Data Wrangling; Lists, Sets, Strings, Tuples, and Dictionaries; Lists; Exercise 1: Accessing the List Members; Exercise 2: Generating a List; Exercise 3: Iterating over a List and Checking Membership; Exercise 4: Sorting a List; Exercise 5: Generating a Random List; Activity 1: Handling Lists; Sets; Introduction to Sets; Union and Intersection of Sets; Creating Null Sets; Dictionary; Exercise 6: Accessing and Setting Values in a Dictionary; Exercise 7: Iterating Over a Dictionary; Exercise 8: Revisiting the Unique Valued List Problem; Exercise 9: Deleting Value from Dict; Exercise 10: Dictionary Comprehension; Tuples; Creating a Tuple with Different Cardinalities; Unpacking a Tuple; Exercise 11: Handling Tuples; Strings; Exercise 12: Accessing Strings; Exercise 13: String Slices; String Functions; Exercise 14: Split and Join; Activity 2: Analyze a Multiline String and Generate the Unique Word Count; Summary

Chapter 2 Advanced Data Structures and File Handling
Introduction; Advanced Data Structures; Iterator; Exercise 15: Introduction to the Iterator; Stacks; Exercise 16: Implementing a Stack in Python; Exercise 17: Implementing a Stack Using User-Defined Methods; Exercise 18: Lambda Expression; Exercise 19: Lambda Expression for Sorting; Exercise 20: Multi-Element Membership Checking; Queue; Exercise 21: Implementing a Queue in Python; Activity 3: Permutation, Iterator, Lambda, List; Basic File Operations in Python; Exercise 22: File Operations; File Handling; Exercise 23: Opening and Closing a File; The with Statement; Opening a File Using the with Statement; Exercise 24: Reading a File Line by Line; Exercise 25: Write to a File; Activity 4: Design Your Own CSV Parser; Summary

Chapter 3 Introduction to NumPy, Pandas, and Matplotlib
Introduction; NumPy Arrays; NumPy Array and Features; Exercise 26: Creating a NumPy Array (from a List); Exercise 27: Adding Two NumPy Arrays; Exercise 28: Mathematical Operations on NumPy Arrays; Exercise 29: Advanced Mathematical Operations on NumPy Arrays; Exercise 30: Generating Arrays Using arange and linspace; Exercise 31: Creating Multi-Dimensional Arrays; Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array; Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors; Exercise 34: Reshaping, Ravel, Min, Max, and Sorting; Exercise 35: Indexing and Slicing; Conditional Subsetting; Exercise 36: Array Operations (array-array, array-scalar, and universal functions); Stacking Arrays; Pandas DataFrames; Exercise 37: Creating a Pandas Series; Exercise 38: Pandas Series and Data Handling; Exercise 39: Creating Pandas DataFrames; Exercise 40: Viewing a DataFrame Partially; Indexing and Slicing Columns; Indexing and Slicing Rows; Exercise 41: Creating and Deleting a New Column or Row; Statistics and Visualization with NumPy and Pandas; Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization); Exercise 42: Introduction to Matplotlib Through a Scatter Plot; Definition of Statistical Measures - Central Tendency and Spread; Random Variables and Probability Distribution; What Is a Probability Distribution?; Discrete Distributions; Continuous Distributions; Data Wrangling in Statistics and Visualization; Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame; Random Number Generation Using NumPy; Exercise 43: Generating Random Numbers from a Uniform Distribution; Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot; Exercise 45: Generating Random Numbers from Normal Distribution and Histograms; Exercise 46: Calculation of Descriptive Statistics from a DataFrame; Exercise 47: Built-in Plotting Utilities; Activity 5: Generating Statistics from a CSV File; Summary

Chapter 4 A Deep Dive into Data Wrangling with Python
Introduction; Subsetting, Filtering, and Grouping; Exercise 48: Loading and Examining a Superstores Sales Data from an Excel File; Subsetting the DataFrame; An Example Use Case: Determining Statistics on Sales and Profit; Exercise 49: The unique Function; Conditional Selection and Boolean Filtering; Exercise 50: Setting and Resetting the Index; Exercise 51: The GroupBy Method; Detecting Outliers and Handling Missing Values; Missing Values in Pandas; Exercise 52: Filling in the Missing Values with fillna; Exercise 53: Dropping Missing Values with dropna; Outlier Detection Using a Simple Statistical Test; Concatenating, Merging, and Joining; Exercise 54: Concatenation; Exercise 55: Merging by a Common Key; Exercise 56: The join Method; Useful Methods of Pandas; Exercise 57: Randomized Sampling; The value_counts Method; Pivot Table Functionality; Exercise 58: Sorting by Column Values - the sort_values Method; Exercise 59: Flexibility for User-Defined Functions with the apply Method; Activity 6: Working with the Adult Income Dataset (UCI); Summary

Chapter 5 Getting Comfortable with Different Kinds of Data Sources
Introduction; Reading Data from Different Text-Based (and Non-Text-Based) Sources; Data Files Provided with This Chapter; Libraries to Install for This Chapter; Exercise 60: Reading Data from a CSV File Where Headers Are Missing; Exercise 61: Reading from a CSV File where Delimiters are not Commas; Exercise 62: Bypassing the Headers of a CSV File; Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File; Reading Only the First N Rows (Especially Useful for Large Files); Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks; Setting the skip_blank_lines Option; Read CSV from a Zip file; Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name; Exercise 65: Reading a General Delimited Text File; Reading HTML Tables Directly from a URL; Exercise 66: Further Wrangling to Get the Desired Data; Exercise 67: Reading from a JSON File; Reading a Stata File; Exercise 68: Reading Tabular Data from a PDF File; Introduction to Beautiful Soup 4 and Web Page Parsing; Structure of HTML; Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup; Exercise 70: DataFrames and BeautifulSoup; Exercise 71: Exporting a DataFrame as an Excel File; Exercise 72: Stacking URLs from a Document using bs4; Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames; Summary

Chapter 6 Learning the Hidden Secrets of Data Wrangling
Introduction; Additional Software Required for This Section; Advanced List Comprehension and the zip Function; Introduction to Generator Expressions; Exercise 73: Generator Expressions; Exercise 74: One-Liner Generator Expression; Exercise 75: Extracting a List with Single Words; Exercise 76: The zip Function; Exercise 77: Handling Messy Data; Data Formatting; The % operator; Using the format Function; Exercise 78: Data Representation Using {}; Identify and Clean Outliers; Exercise 79: Outliers in Numerical Data; Z-score; Exercise 80: The Z-Score Value to Remove Outliers; Exercise 81: Fuzzy Matching of Strings; Activity 8: Handling Outliers and Missing Data; Summary

Chapter 7 Advanced Web Scraping and Data Gathering
Introduction; The Basics of Web Scraping and the Beautiful Soup Library; Libraries in Python; Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page; Exercise 82: Checking the Status of the Web Request; Checking the Encoding of the Web Page; Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length; Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object; Extracting Text from a Section; Extracting Important Historical Events that Happened on Today's Date; Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text; Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page; Reading Data from XML; Exercise 87: Creating an XML File and Reading XML Element Objects; Exercise 88: Finding Various Elements of Data within a Tree (Element); Reading from a Local XML File into an ElementTree Object; Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes; Exercise 90: Using the text Method to Extract Meaningful Data; Extracting and Printing the GDP/Per Capita Information Using a Loop; Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them; Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping; Reading Data from an API; Defining the Base URL (or API Endpoint); Exercise 93: Defining and Testing a Function to Pull Country Data from an API; Using the Built-In JSON Library to Read and Examine Data; Printing All the Data Elements; Using a Function that Extracts a DataFrame Containing Key Information; Exercise 94: Testing the Function by Building a Small Database of Countries' Information; Fundamentals of Regular Expressions (RegEx); Regex in the Context of Web Scraping; Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence; Using the Compile Method to Create a Regex Program; Exercise 96: Compiling Programs to Match Objects; Exercise 97: Using Additional Parameters in Match to Check for Positional Matching; Finding the Number of Words in a List That End with "ing"; Exercise 98: The search Method in Regex; Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern; Exercise 100: Examples of Single Character Pattern Matching with search; Exercise 101: Examples of Pattern Matching at the Start or End of a String; Exercise 102: Examples of Pattern Matching with Multiple Characters; Exercise 103: Greedy versus Non-Greedy Matching; Exercise 104: Controlling Repetitions to Match; Exercise 105: Sets of Matching Characters; Exercise 106: The use of OR in Regex using the OR Operator; The findall Method; Activity 9: Extracting the Top 100 eBooks from Gutenberg; Activity 10: Building Your Own Movie Database by Reading an API; Summary

Chapter 8 RDBMS and SQL
Introduction; Refresher of RDBMS and SQL; How is an RDBMS Structured?; SQL; Using an RDBMS (MySQL/PostgreSQL/SQLite); Exercise 107: Connecting to Database in SQLite; Exercise 108: DDL and DML Commands in SQLite; Reading Data from a Database in SQLite; Exercise 109: Sorting Values that are Present in the Database; Exercise 110: Altering the Structure of a Table and Updating the New Fields; Exercise 111: Grouping Values in Tables; Relation Mapping in Databases; Adding Rows in the comments Table; Joins; Retrieving Specific Columns from a JOIN query; Exercise 112: Deleting Rows; Updating Specific Values in a Table; Exercise 113: RDBMS and DataFrames; Activity 11: Retrieving Data Correctly From Databases; Summary

Chapter 9 Application of Data Wrangling in Real Life
Introduction; Applying Your Knowledge to a Real-life Data Wrangling Task; Activity 12: Data Wrangling Task - Fixing UN Data; Activity 13: Data Wrangling Task - Cleaning GDP Data; Activity 14: Data Wrangling Task - Merging UN Data and GDP Data; Activity 15: Data Wrangling Task - Connecting the New Data to the Database; An Extension to Data Wrangling; Additional Skills Required to Become a Data Scientist; Basic Familiarity with Big Data and Cloud Technologies; What Goes with Data Wrangling?; Tips and Tricks for Mastering Machine Learning; Summary

Appendix
Solution of Activity 1: Handling Lists; Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count; Solution of Activity 3: Permutation, Iterator, Lambda, List; Solution of Activity 4: Design Your Own CSV Parser; Solution of Activity 5: Generating Statistics from a CSV File; Solution of Activity 6: Working with the Adult Income Dataset (UCI); Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames; Solution of Activity 8: Handling Outliers and Missing Data; Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg; Solution of Activity 10: Extracting the top 100 eBooks from Gutenberg.org; Solution of Activity 11: Retrieving Data Correctly from Databases; Solution of Activity 12: Data Wrangling Task - Fixing UN Data; Activity 13: Data Wrangling Task - Cleaning GDP Data; Solution of Activity 14: Data Wrangling Task - Merging UN Data and GDP Data; Activity 15: Data Wrangling Task - Connecting the New Data to a Database

Specification

Basic information

Author
  • Dr. Tirthajyoti Sarkar, Shubhadeep Roychowdhury
Year of publication
  • 2019
Format
  • PDF
  • MOBI
  • EPUB
Number of pages
  • 453
Categories
  • Programming
Selected publishers
  • Packt Publishing