Computational and Statistical Methods of Data Reduction - Data Science PhD Course
1. Robust Statistics for Data Reduction (March 9th - 10th 2020, 09:00-13:00, DIAG, Aula B203)
Prof. Alessio Farcomeni (Tor Vergata)
2. Dimensionality Reduction in Clustering and Streaming (March 16th - 17th 2020, 09:00-13:00, DIAG, Aula B203)
Prof. Chris Schwiegelshohn (Sapienza)
2. Dimensionality Reduction in Clustering and Streaming
First Day:The curse of dimensionality is a common occurrence when working with large data sets. In few dimensions (such as the Euclidean plane), we visualize problems very well and can often find interesting properties of a data set just by hand. In more than three dimensions, our ability to visualize a problem is already severely impacted and our intuition from the Euclidean plane may lead us completely astray. Moreover, algorithms often scale poorly:
- Finding nearest neighbors in 2d can be done in nearly linear time. In high dimensions, it becomes very difficult to improve over either n^2.
- Geometric data structures and decompositions become hard to implement. Line sweeps, Voronoi diagrams, grids, nets usually scale by at least a factor 2^d, where d is the dimension. In some cases, it may be even worse.
- Many problems that are easy to solve in 2D, such as clustering, become computationally intractable in high dimensions. Often, exact solutions require running times that are exponential in the number of dimensions.
Unfortunately, high dimensional data sets are not the exception, but rather the norm in modern data analysis. As such, much of computational data analysis has been devoted with finding ways to reduce the dimension. In this course, we will study two popular methods, namely principal component analysis (PCA) and random projections. Principal component analysis originated in statistics, but is also known under various other names, depending on the fields (e.g. eigenvector problem, low rank approximation, etc). We will illustrate the method, highlighting the problem that is solved and the underlying assumptions of PCA. Next, we will see a powerful tool for dimension reduction known as the Johnson-Lindenstrauss lemma. The Johnson-Lindenstrauss lemma states that given a point set A in an arbitrary high dimension, we can transform A into a point set A' in dimension log |A|, while preserving all pairwise distances. For both of these problems, we will see applications, including k-nearest neighbor classification and k-means.
Second day:Large data sets form a sister topic to dimension reduction. While the benefits of having a small dimension are immediately understood, reducing the size of the data is a comparatively recent paradigm. There are many reasons for data compression. Aside from data storage and retrieval, we want to minimize the amount of communication in distributed computing, enable online and streaming algorithms, or simply run an accurate (but expensive) algorithm on a smaller dataset. A key concept in large-scale data analysis are coresets. We view coresets as a succinct summary of a data set that behaves, for any candidate solution, like the original data set. The surprising success story of data compression is that for many problems, we can construct coresets of size independent of the input. For example, linear regression in d dimensions admits coresets of size O(d), k-means has coresets of size O(k), irrespective of the number of data points of the original data set. In our course, we will describe the coreset paradigm formally. Moreover, we will give an overview of methods to construct coresets for various problems. Examples include constructing coresets from random projections, by analyzing gradients, or via sampling. We will further highlight a number of applications.