Buy A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R
A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R





International Edition


About the Book

The only how-to guide offering a unified, systemic approach to acquiring, cleaning, and managing data in R

Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R.

Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling.

They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.

  • The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
  • Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
  • Provides expert guidance on how to document the processes described so that they are reproducible
  • Written by seasoned professionals, it provides both introductory and advanced techniques
  • Features case studies with supporting data and R code, hosted on a companion website

A Data Scientist's Guide to Acquiring, Cleaning and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.

Table of Contents:
About the Authors
Preface
Acknowledgments
About the Companion Website

1 R
  1.1 Introduction
    1.1.1 What Is R?
    1.1.2 Who Uses R and Why?
    1.1.3 Acquiring and Installing R
    1.1.4 Starting and Quitting R
  1.2 Data
    1.2.1 Acquiring Data
    1.2.2 Cleaning Data
    1.2.3 The Goal of Data Cleaning
    1.2.4 Making Your Work Reproducible
  1.3 The Very Basics of R
    1.3.1 Top Ten Quick Facts You Need to Know about R
    1.3.2 Vocabulary
    1.3.3 Calculating and Printing in R
  1.4 Running an R Session
    1.4.1 Where Your Data Is Stored
    1.4.2 Options
    1.4.3 Scripts
    1.4.4 R Packages
    1.4.5 RStudio and Other GUIs
    1.4.6 Locales and Character Sets
  1.5 Getting Help
    1.5.1 At the Command Line
    1.5.2 The Online Manuals
    1.5.3 On the Internet
    1.5.4 Further Reading
  1.6 How to Use This Book
    1.6.1 Syntax and Conventions in This Book
    1.6.2 The Chapters

2 R Data, Part 1: Vectors
  2.1 Vectors
    2.1.1 Creating Vectors
    2.1.2 Sequences
    2.1.3 Logical Vectors
    2.1.4 Vector Operations
    2.1.5 Names
  2.2 Data Types
    2.2.1 Some Less-Common Data Types
    2.2.2 What Type of Vector Is This?
    2.2.3 Converting from One Type to Another
  2.3 Subsets of Vectors
    2.3.1 Extracting
    2.3.2 Vectors of Length 0
    2.3.3 Assigning or Replacing Elements of a Vector
  2.4 Missing Data (NA) and Other Special Values
    2.4.1 The Effect of NAs in Expressions
    2.4.2 Identifying and Removing or Replacing NAs
    2.4.3 Indexing with NAs
    2.4.4 NaN and Inf Values
    2.4.5 NULL Values
  2.5 The table() Function
    2.5.1 Two- and Higher-Way Tables
    2.5.2 Operating on Elements of a Table
  2.6 Other Actions on Vectors
    2.6.1 Rounding
    2.6.2 Sorting and Ordering
    2.6.3 Vectors as Sets
    2.6.4 Identifying Duplicates and Matching
    2.6.5 Finding Runs of Duplicate Values
  2.7 Long Vectors and Big Data
  2.8 Chapter Summary and Critical Data Handling Tools

3 R Data, Part 2: More Complicated Structures
  3.1 Introduction
  3.2 Matrices
    3.2.1 Extracting and Assigning
    3.2.2 Row and Column Names
    3.2.3 Applying a Function to Rows or Columns
    3.2.4 Missing Values in Matrices
    3.2.5 Using a Matrix Subscript
    3.2.6 Sparse Matrices
    3.2.7 Three- and Higher-Way Arrays
  3.3 Lists
    3.3.1 Extracting and Assigning
    3.3.2 Lists in Practice
  3.4 Data Frames
    3.4.1 Missing Values in Data Frames
    3.4.2 Extracting and Assigning in Data Frames
    3.4.3 Extracting Things That Aren’t There
  3.5 Operating on Lists and Data Frames
    3.5.1 Split, Apply, Combine
    3.5.2 All-Numeric Data Frames
    3.5.3 Convenience Functions
    3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
  3.6 Date and Time Objects
    3.6.1 Formatting Dates
    3.6.2 Common Operations on Date Objects
    3.6.3 Differences between Dates
    3.6.4 Dates and Times
    3.6.5 Creating POSIXt Objects
    3.6.6 Mathematical Functions for Date and Times
    3.6.7 Missing Values in Dates
    3.6.8 Using Apply Functions with Dates and Times
  3.7 Other Actions on Data Frames
    3.7.1 Combining by Rows or Columns
    3.7.2 Merging Data Frames
    3.7.3 Comparing Two Data Frames
    3.7.4 Viewing and Editing Data Frames Interactively
  3.8 Handling Big Data
  3.9 Chapter Summary and Critical Data Handling Tools

4 R Data, Part 3: Text and Factors
  4.1 Character Data
    4.1.1 The length() and nchar() Functions
    4.1.2 Tab, New-Line, Quote, and Backslash Characters
    4.1.3 The Empty String
    4.1.4 Substrings
    4.1.5 Changing Case and Other Substitutions
  4.2 Converting Numbers into Text
    4.2.2 Scientific Notation
    4.2.3 Discretizing a Numeric Variable
  4.3 Constructing Character Strings: Paste in Action
    4.3.1 Constructing Column Names
    4.3.2 Tabulating Dates by Year and Month or Quarter Labels
    4.3.3 Constructing Unique Keys
    4.3.4 Constructing File and Path Names
  4.4 Regular Expressions
    4.4.1 Types of Regular Expressions
    4.4.2 Tools for Regular Expressions in R
    4.4.3 Special Characters in Regular Expressions
    4.4.4 Examples
    4.4.5 The regexpr() Function and Its Variants
    4.4.6 Using Regular Expressions in Replacement
    4.4.7 Splitting Strings at Regular Expressions
    4.4.8 Regular Expressions versus Wildcard Matching
    4.4.9 Common Data Cleaning Tasks Using Regular Expressions
    4.4.10 Documenting and Debugging Regular Expressions
  4.5 UTF-8 and Other Non-ASCII Characters
    4.5.1 Extended ASCII for Latin Alphabets
    4.5.2 Non-Latin Alphabets
    4.5.3 Character and String Encoding in R
  4.6 Factors
    4.6.1 What Is a Factor?
    4.6.2 Factor Levels
    4.6.3 Converting and Combining Factors
    4.6.4 Missing Values in Factors
    4.6.5 Factors in Data Frames
  4.7 R Object Names and Commands as Text
    4.7.1 R Object Names as Text
    4.7.2 R Commands as Text
  4.8 Chapter Summary and Critical Data Handling Tools

5 Writing Functions and Scripts
  5.1 Functions
    5.1.1 Function Arguments
    5.1.2 Global versus Local Variables
    5.1.3 Return Values
    5.1.4 Creating and Editing Functions
  5.2 Scripts and Shell Scripts
    5.2.1 Line-by-Line Parsing
  5.3 Error Handling and Debugging
    5.3.1 Debugging Functions
    5.3.2 Issuing Error and Warning Messages
    5.3.3 Catching and Processing Errors
  5.4 Interacting with the Operating System
    5.4.1 File and Directory Handling
    5.4.2 Environment Variables
  5.5 Speeding Things Up
    5.5.1 Profiling
    5.5.2 Vectorizing Functions
    5.5.3 Other Techniques to Speed Things Up
  5.6 Chapter Summary and Critical Data Handling Tools
    5.6.1 Programming Style
    5.6.2 Common Bugs
    5.6.3 Objects, Classes, and Methods

6 Getting Data into and out of R
  6.1 Reading Tabular ASCII Data into Data Frames
    6.1.1 Files with Delimiters
    6.1.2 Column Classes
    6.1.3 Common Pitfalls in Reading Tables
    6.1.4 An Example of When read.table() Fails
    6.1.5 Other Uses of the scan() Function
    6.1.6 Writing Delimited Files
    6.1.7 Reading and Writing Fixed-Width Files
    6.1.8 A Note on End-of-Line Characters
  6.2 Reading Large, Non-Tabular, or Non-ASCII Data
    6.2.1 Opening and Closing Files
    6.2.2 Reading and Writing Lines
    6.2.3 Reading and Writing UTF-8 and Other Encodings
    6.2.4 The Null Character
    6.2.5 Binary Data
    6.2.6 Reading Problem Files in Action
  6.3 Reading Data From Relational Databases
    6.3.1 Connecting to the Database Server
    6.3.2 Introduction to SQL
  6.4 Handling Large Numbers of Input Files
  6.5 Other Formats
    6.5.1 Using the Clipboard
    6.5.2 Reading Data from Spreadsheets
    6.5.3 Reading Data from the Web
    6.5.4 Reading Data from Other Statistical Packages
  6.6 Reading and Writing R Data Directly
  6.7 Chapter Summary and Critical Data Handling Tools

7 Data Handling in Practice
  7.1 Acquiring and Reading Data
  7.2 Cleaning Data
  7.3 Combining Data
    7.3.1 Combining by Row
    7.3.2 Combining by Column
    7.3.3 Merging by Key
  7.4 Transactional Data
    7.4.1 Example of Transactional Data
    7.4.2 Combining Tabular and Transactional Data
  7.5 Preparing Data
  7.6 Documentation and Reproducibility
  7.7 The Role of Judgment
  7.8 Data Cleaning in Action
    7.8.1 Reading and Cleaning BedBath1.csv
    7.8.2 Reading and Cleaning BedBath2.csv
    7.8.3 Combining the BedBath Data Frames
    7.8.4 Reading and Cleaning EnergyUsage.csv
    7.8.5 Merging the BedBath and EnergyUsage Data Frames
  7.9 Chapter Summary and Critical Data Handling Tools

8 Extended Exercise
  8.1 Introduction to the Problem
    8.1.1 The Goal
    8.1.2 Modeling Considerations
    8.1.3 Examples of Things to Check
  8.2 The Data
  8.3 Five Important Fields
  8.4 Loan and Application Portfolios
    8.4.1 Layout of the Beachside Lenders Data
    8.4.2 Layout of the Wilson and Sons Data
    8.4.3 Combining the Two Portfolios
  8.5 Scores
    8.5.1 Scores Layout
  8.6 Co-borrower Scores
    8.6.1 Co-borrower Score Examples
  8.7 Updated KScores
    8.7.1 Updated KScores Layout
  8.8 Loans to Be Excluded
    8.8.1 Sample Exclusion File
  8.9 Response Variable
  8.10 Assembling the Final Data Sets
    8.10.1 Final Data Layout
    8.10.2 Concluding Remarks

A Hints and Pseudocode
  A.1 Loan Portfolios
    A.1.1 Things to Check
  A.2 Scores Database
    A.2.1 Things to Check
  A.3 Co-borrower Scores
    A.3.1 Things to Check
  A.4 Updated KScores
    A.4.1 Things to Check
  A.5 Excluder Files
    A.5.1 Things to Check
  A.6 Payment Matrix
    A.6.1 Things to Check
  A.7 Starting the Modeling Process

Bibliography
Index

About the Authors:
SAMUEL E. BUTTREY, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.
LYN R. WHITAKER, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.




Product Details
  • ISBN-13: 9781119080022
  • Publisher: John Wiley & Sons Inc
  • Publisher Imprint: John Wiley & Sons Inc
  • Height: 231 mm
  • No of Pages: 312
  • Returnable: N
  • Weight: 522 g
  • ISBN-10: 1119080029
  • Publisher Date: 01 Dec 2017
  • Binding: Hardback
  • Language: English
  • Spine Width: 20 mm
  • Width: 155 mm



