Buy The Data Science Handbook by Field Cady at Bookstore UAE
The Data Science Handbook

About the Book

A comprehensive overview of data science covering the analytics, programming, and business skills necessary to master the discipline

Finding a good data scientist has been likened to hunting for a unicorn: the required combination of technical skills is simply very hard to find in one person. In addition, good data science is not just rote application of trainable skill sets; it requires the ability to think flexibly about all these areas and understand the connections between them. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.

Unlike many analytics books, computer science and software engineering are given extensive coverage since they play such a central role in the daily work of a data scientist. The author also describes classic machine learning algorithms, from their mathematical foundations to real-world applications. Visualization tools are reviewed, and their central importance in data science is highlighted. Classical statistics is addressed to help readers think critically about the interpretation of data and its common pitfalls. The clear communication of technical results, which is perhaps the most undertrained of data science skills, is given its own chapter, and all topics are explained in the context of solving real-world data problems. The book also features:

• Extensive sample code and tutorials using Python™ along with its technical libraries

• Core technologies of “Big Data,” including their strengths and limitations and how they can be used to solve real-world problems

• Coverage of the practical realities of the tools, keeping theory to a minimum; however, when theory is presented, it is done in an intuitive way to encourage critical thinking and creativity

• A wide variety of case studies from industry

• Practical advice on the realities of being a data scientist today, including the overall workflow, where time is spent, the types of datasets worked on, and the skill sets needed

The Data Science Handbook is an ideal resource for data analysis methodology and big data software tools. The book is appropriate for people who want to practice data science, but lack the required skill sets. This includes software professionals who need to better understand analytics and statisticians who need to understand software. Modern data science is a unified discipline, and it is presented as such. This book is also an appropriate reference for researchers and entry-level graduate students who need to learn real-world analytics and expand their skill set.

FIELD CADY is the data scientist at the Allen Institute for Artificial Intelligence, where he develops tools that use machine learning to mine scientific literature. He has also worked at Google and several Big Data startups. He has a BS in physics and math from Stanford University, and an MS in computer science from Carnegie Mellon.



Table of Contents:

Preface

1 Introduction: Becoming a Unicorn
1.1 Aren’t Data Scientists Just Overpaid Statisticians?
1.2 How is This Book Organized?
1.3 How to Use This Book?
1.4 Why is It All in Python, Anyway?
1.5 Example Code and Datasets
1.6 Parting Words

Part I The Stuff You’ll Always Use

2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary

3 Programming Languages
3.1 Why Use a Programming Language? What are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.2.1 Python
3.2.2 R
3.2.3 MATLAB® and Octave
3.2.4 SAS®
3.2.5 Scala®
3.3 Python Crash Course
3.3.1 A Note on Versions
3.3.2 “Hello World” Script
3.3.3 More Complicated Script
3.3.4 Atomic Data Types
3.4 Strings
3.4.1 Comments and Docstrings
3.4.2 Complex Data Types
3.4.3 Lists
3.4.4 Strings and Lists
3.4.5 Tuples
3.4.6 Dictionaries
3.4.7 Sets
3.5 Defining Functions
3.5.1 For Loops and Control Structures
3.5.2 A Few Key Functions
3.5.3 Exception Handling
3.5.4 Libraries
3.5.5 Classes and Objects
3.5.6 GOTCHA: Hashable and Unhashable Types
3.6 Python’s Technical Libraries
3.6.1 Data Frames
3.6.2 Series
3.6.3 Joining and Grouping
3.7 Other Python Resources
3.8 Further Reading
3.9 Glossary

3a Interlude: My Personal Toolkit

4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.3.1 Duplicate Entries
4.3.2 Multiple Entries for a Single Entity
4.3.3 Missing Entries
4.3.4 NULLs
4.3.5 Huge Outliers
4.3.6 Out‐of‐Date Data
4.3.7 Artificial Entries
4.3.8 Irregular Spacings
4.4 Formatting Issues
4.4.1 Formatting is Irregular between Different Tables/Columns
4.4.2 Extra Whitespace
4.4.3 Irregular Capitalization
4.4.4 Inconsistent Delimiters
4.4.5 Irregular NULL Format
4.4.6 Invalid Characters
4.4.7 Weird or Incompatible Datetimes
4.4.8 Operating System Incompatibilities
4.4.9 Wrong Software Versions
4.5 Example Formatting Script
4.6 Regular Expressions
4.6.1 Regular Expression Syntax
4.7 Life in the Trenches
4.8 Glossary

5 Visualizations and Simple Metrics
5.1 A Note on Python’s Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians, and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe’s Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary

6 Machine Learning Overview
6.1 Historical Context
6.2 Supervised versus Unsupervised
6.3 Training Data, Testing Data, and the Great Boogeyman of Overfitting
6.4 Further Reading
6.5 Glossary

7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features That Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 Defining the Feature You Want to Predict

8 Machine Learning Classification
8.1 What is a Classifier, and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.5.1 Decision Trees
8.5.2 Random Forests
8.5.3 Ensemble Classifiers
8.5.4 Support Vector Machines
8.5.5 Logistic Regression
8.5.6 Lasso Regression
8.5.7 Naive Bayes
8.5.8 Neural Nets
8.6 Evaluating Classifiers
8.6.1 Confusion Matrices
8.6.2 ROC Curves
8.6.3 Area under the ROC Curve
8.7 Selecting Classification Cutoffs
8.7.1 Other Performance Metrics
8.7.2 Lift–Reach Curves
8.8 Further Reading
8.9 Glossary

9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.1.1 Know Your Audience
9.1.2 Show Why It Matters
9.1.3 Make It Concrete
9.1.4 A Picture is Worth a Thousand Words
9.1.5 Don’t Be Arrogant about Your Tech Knowledge
9.1.6 Make It Look Decent
9.2 Slide Decks
9.2.1 C.R.A.P. Design
9.2.2 A Few Tips and Rules of Thumb
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary

Part II Stuff You Still Need to Know

10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Skree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.7.1 Real‐World Assessment of Clusters
10.7.2 k‐Means Clustering
10.7.3 Gaussian Mixture Models
10.7.4 Agglomerative Clustering
10.7.5 Evaluating Cluster Quality
10.7.6 Silhouette Score
10.7.7 Rand Index and Adjusted Rand Index
10.7.8 Mutual Information
10.8 Further Reading
10.9 Glossary

11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Least Squares
11.3 Fitting Nonlinear Curves
11.4 Goodness of Fit: R² and Correlation
11.5 Correlation of Residuals
11.6 Linear Regression
11.7 LASSO Regression and Feature Selection
11.8 Further Reading
11.9 Glossary

12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.1.1 Text Files
12.1.2 Dense Numerical Arrays
12.1.3 Program‐Specific Data Formats
12.1.4 Compressed or Archived Data
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It’s All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary

13 Big Data
13.1 What is Big Data?
13.2 Hadoop: The File System and the Processor
13.3 Using HDFS
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 Two Ways to Run PySpark
13.8 Configuring Spark
13.9 Under the Hood
13.10 Spark Tips and Gotchas
13.11 The MapReduce Paradigm
13.12 Performance Considerations
13.13 Further Reading
13.14 Glossary

14 Databases
14.1 Relational Databases and MySQL®
14.1.1 Basic Queries and Grouping
14.1.2 Joins
14.1.3 Nesting Queries
14.1.4 Running MySQL and Managing the DB
14.2 Key-Value Stores
14.3 Wide Column Stores
14.4 Document Stores
14.4.1 MongoDB®
14.5 Further Reading
14.6 Glossary

15 Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.3.1 Unit Tests
15.3.2 Integration Tests
15.4 Test-Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary

16 Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag‐of‐Words
16.7 Word Weighting: TF‐IDF
16.8 n‐Grams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.13.1 Search
16.13.2 Sentiment Analysis
16.13.3 Entity Recognition and Topic Modeling
16.14 Advanced NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary

17 Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series versus Time-Stamped Events
17.4 Resampling an Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary

18 Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log-Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary

19 Statistics
19.1 Statistics in Perspective
19.2 Bayesian versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t-Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary

20 Programming Language Concepts
20.1 Programming Paradigms
20.1.1 Imperative
20.1.2 Functional
20.1.3 Object‐Oriented
20.2 Compilation and Interpretation
20.3 Type Systems
20.3.1 Static versus Dynamic Typing
20.3.2 Strong versus Weak Typing
20.4 Further Reading
20.5 Glossary

21 Performance and Computer Memory
21.1 Example Script
21.2 Algorithm Performance and Big‐O Notation
21.3 Some Classic Problems: Sorting a List and Binary Search
21.4 Amortized Performance and Average Performance
21.5 Two Principles: Reducing Overhead and Managing Memory
21.6 Performance Tip: Use Numerical Libraries When Applicable
21.7 Performance Tip: Delete Large Structures You Don’t Need
21.8 Performance Tip: Use Built‐In Functions When Possible
21.9 Performance Tip: Avoid Superfluous Function Calls
21.10 Performance Tip: Avoid Creating Large New Objects
21.11 Further Reading
21.12 Glossary

Part III Specialized or Advanced Topics

22 Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.6.1 Strings
22.6.2 Adjustable‐Size Arrays
22.6.3 Hash Tables
22.6.4 Linked Lists
22.6.5 Binary Search Trees
22.7 Further Reading
22.8 Glossary

23 Maximum Likelihood Estimation and Optimization
23.1 Maximum Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent and Convex Optimization
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary

24 Advanced Classifiers
24.1 A Note on Libraries
24.2 Basic Deep Learning
24.3 Convolutional Neural Networks
24.4 Different Types of Layers. What the Heck is a Tensor?
24.5 Example: The MNIST Handwriting Dataset
24.6 Recurrent Neural Networks
24.7 Bayesian Networks
24.8 Training and Prediction
24.9 Markov Chain Monte Carlo
24.10 PyMC Example
24.11 Further Reading
24.12 Glossary

25 Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Markov Chain Monte Carlo
25.4 Hidden Markov Models and the Viterbi Algorithm
25.5 The Viterbi Algorithm
25.6 Random Walks
25.7 Brownian Motion
25.8 ARIMA Models
25.9 Continuous‐Time Markov Processes
25.10 Poisson Processes
25.11 Further Reading
25.12 Glossary

25a Parting Words: Your Future as a Data Scientist

Index







Product Details
  • ISBN-13: 9781119092940
  • ISBN-10: 1119092949
  • Publisher: John Wiley & Sons Inc
  • Publisher Imprint: John Wiley & Sons Inc
  • Publication Date: 14 Apr 2017
  • Binding: Hardback
  • Language: English
  • Number of Pages: 416
  • Height: 239 mm
  • Width: 152 mm
  • Spine Width: 23 mm
  • Weight: 752 g
  • Returnable: N


