# Foundations of Data Science - Textbook

Avrim Blum, John Hopcroft, Ravindran Kannan

June, 2016

Introduction:

Computer science as an academic discipline began in the 1960’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context free languages, and computability.

In the 1970’s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role.

The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory. While traditional areas of computer science remain highly important, increasingly researchers of the future will be involved with using computers to understand and extract usable information from massive data arising in applications, not just how to make computers useful on specific well-defined problems.

With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods. Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.

This book starts with the treatment of high dimensional geometry. Modern data in diverse fields such as Information Processing, Search, Machine Learning, etc., is often represented advantageously as vectors with a large number of components. This is so even in cases when the vector representation is not the natural first choice. Our intuition from two or three dimensional space can be surprisingly off the mark when it comes to high dimensional space.

Chapter 2 works out the fundamentals needed to understand the differences. The emphasis of the chapter, as well as the book in general, is to get across the mathematical foundations rather than dwell on particular applications that are only briefly described. The mathematical areas most relevant to dealing with high-dimensional data are matrix algebra and algorithms. We focus on singular value decomposition, a central tool in this area.

Chapter 3 gives a from-first-principles description of this. Applications of singular value decomposition include principal component analysis, a widely used technique which we touch upon, as well as modern applications to statistical mixtures of probability densities, discrete optimization, etc., which are described in more detail.

Central to our understanding of large structures, like the web and social networks, is building models to capture essential properties of these structures. The simplest model is that of a random graph formulated by Erd¨os and Renyi, which we study in detail proving that certain global phenomena, like a giant connected component, arise in such structures with only local choices. We also describe other models of random graphs.

One of the surprises of computer science over the last two decades is that some domain independent methods have been immensely successful in tackling problems from diverse areas. Machine learning is a striking example. We describe the foundations of machine learning, both algorithms for optimizing over given training examples, as well as the theory for understanding when such optimization can be expected to lead to good performance on new, unseen data, including important measures such as Vapnik-Chervonenkis dimension. Another important domain-independent technique is based on Markov chains. The underlying mathematical theory, as well as the connections to electrical networks, forms the core of our chapter on Markov chains.

The field of algorithms has traditionally assumed that the input data to a problem is presented in random access memory, which the algorithm can repeatedly access. This is not feasible for modern problems.

The streaming model and other models have been formulated to better reflect this. In this setting, sampling plays a crucial role and, indeed, we have to sample on the fly. In Chapter 7 we study how to draw good samples efficiently and how to estimate statistical and linear algebra quantities, with such samples.

Another important tool for understanding data is clustering, dividing data into groups of similar objects. After describing some of the basic methods for clustering, such as the k-means algorithm, we focus on modern developments in understanding these, as well as newer algorithms. The chapter ends with a study of clustering criteria.

This book also covers graphical models and belief propagation, ranking and voting, sparse vectors, and compressed sensing. The appendix includes a wealth of background material.

A word about notation in the book. To help the student, we have adopted certain notations, and with a few exceptions, adhered to them.

We use lower case letters for scalar variables and functions, bold face lower case for vectors, and upper case letters for matrices. Lower case near the beginning of the alphabet tend to be constants, in the middle of the alphabet, such as i, j, and k, are indices in summations, n and m for integer sizes, and x, y and z for variables.

If A is a matrix its elements are aij and its rows are ai . If ai is a vector its coordinates are aij . Where the literature traditionally uses a symbol for a quantity, we also used that symbol, even if it meant abandoning our convention. If we have a set of points in some vector space, and work with a subspace, we use n for the number of points, d for the dimension of the space, and k for the dimension of the subspace.

The term ”almost surely” means with probability one. We use ln n for the natural logarithm and log n for the base two logarithm. If we want base ten, we will use log10 . To simplify notation and to make it easier to read we use E 2 (1 − x) for E(1 − x) 2 and E(1 − x) 2 for E (1 − x) 2 . When we say “randomly select” some number of points from a given probability distribution D, independence is always assumed unless otherwise stated.