In the internet era, the data being collected on consumers like us are growing exponentially, and attacks on our privacy are becoming a real threat. To better protect our privacy, it is safer to let the data owner control which data are uploaded to the network, as opposed to taking chances with data servers or third parties. To this end, we propose a privacy-preserving technique, named Compressive Privacy (CP), that enables the data creator to compress data via collaborative learning, so that the compressed data uploaded onto the internet will be useful only for the intended utility and will not be easily diverted to malicious applications.
For data in a high-dimensional feature vector space, a common approach to data compression is dimension reduction or, equivalently, subspace projection. The most prominent tool is Principal Component Analysis (PCA). For unsupervised learning, PCA best recovers the original data, in the mean-squared-error sense, for a given reduced dimensionality. In a supervised learning environment, however, it is more effective to adopt a supervised PCA, known as Discriminant Component Analysis (DCA), in order to maximize the discriminant capability.
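The recovery property of PCA mentioned above can be illustrated with a minimal sketch. It assumes NumPy and synthetic data; the helper names `pca_fit` and `reconstruction_error` are hypothetical and not taken from the paper. It compares the squared reconstruction error of a PCA projection with that of a random orthonormal projection of the same dimensionality.

```python
import numpy as np

def pca_fit(X, m):
    """Return the data mean and the top-m principal directions (as columns)."""
    mu = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:m].T

def reconstruction_error(X, mu, W):
    """Total squared error after projecting onto span(W) and mapping back."""
    Xc = X - mu
    return float(np.sum((Xc - Xc @ W @ W.T) ** 2))

# Synthetic data (an assumption for illustration): 8 features with unequal variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) * np.array([5, 3, 2, 1, 0.5, 0.5, 0.2, 0.1])

m = 3
mu, W_pca = pca_fit(X, m)
Q, _ = np.linalg.qr(rng.normal(size=(8, m)))  # random orthonormal basis, same size

err_pca = reconstruction_error(X, mu, W_pca)
err_rand = reconstruction_error(X, mu, Q)
print(f"PCA error: {err_pca:.2f}, random-subspace error: {err_rand:.2f}")
```

For any choice of orthonormal basis with the same number of columns, the PCA error is never larger than the alternative, which is the sense in which PCA "best recovers" the data.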
DCA subspace analysis involves two different subspaces. The signal-subspace components of DCA are associated with the discriminant distance/power (related to classification effectiveness), while the noise-subspace components of DCA are tightly coupled with recoverability and/or privacy protection. This paper presents three DCA-related data compression methods useful for privacy-preserving applications.
- Utility-driven DCA: Because the rank of the signal subspace is limited by the number of classes, DCA can effectively support classification using a relatively small dimensionality (i.e., high compression).
- Desensitized PCA: Incorporating a signal-subspace ridge into DCA leads to a variant that is especially effective for extracting privacy-preserving components. In this case, the eigenvalues of the noise subspace are made insensitive to the privacy labels and are ordered according to their corresponding component powers.
- Desensitized K-means/SOM: Since revealing the K-means or SOM cluster structure could leak sensitive information, it is safer to perform K-means or SOM clustering on the desensitized PCA subspace.
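The rank limit behind utility-driven DCA can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a commonly used ridge-regularized formulation in which the discriminant directions solve S_B w = λ (S̄ + ρI) w, where S_B is the between-class scatter and S̄ the center-adjusted total scatter. The function name `dca_components`, the ridge value `rho`, and the synthetic data are assumptions made for the example. The signal-subspace ridge used in desensitized PCA is not implemented here.

```python
import numpy as np

def dca_components(X, y, rho=1e-3):
    """Solve S_B w = lam (S_bar + rho I) w; return (lam, W) sorted by decreasing lam."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu = X.mean(axis=0)
    Xc = X - mu
    S_bar = Xc.T @ Xc                      # center-adjusted (total) scatter
    S_B = np.zeros_like(S_bar)             # between-class scatter
    for c in np.unique(y):
        Xk = X[y == c]
        d = (Xk.mean(axis=0) - mu)[:, None]
        S_B += len(Xk) * (d @ d.T)
    # Whitening via Cholesky turns the generalized problem into a symmetric one.
    C = np.linalg.cholesky(S_bar + rho * np.eye(X.shape[1]))
    Ci = np.linalg.inv(C)
    lam, V = np.linalg.eigh(Ci @ S_B @ Ci.T)
    order = np.argsort(lam)[::-1]
    return lam[order], Ci.T @ V[:, order]

# Synthetic labeled data (an assumption for illustration): L classes in `dim` dimensions.
rng = np.random.default_rng(0)
L, n_per, dim = 3, 50, 10
means = rng.normal(scale=3.0, size=(L, dim))
X = np.vstack([rng.normal(size=(n_per, dim)) + m for m in means])
y = np.repeat(np.arange(L), n_per)

lam, W = dca_components(X, y)
signal_rank = int(np.sum(lam > 1e-8))      # number of non-negligible eigenvalues
Z = (X - X.mean(axis=0)) @ W[:, :L - 1]    # compressed, utility-oriented features
print("non-negligible eigenvalues:", signal_rank, "of", dim)
```

In this sketch at most L - 1 eigenvalues are non-negligible, so the first L - 1 columns of `W` play the role of the signal subspace and the remaining columns that of the noise subspace.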