Data-Science — Data Reduction Techniques In Data Pre-Processing

Data Reduction

The goal of data reduction is to feed only relevant information into the data mining algorithm. To that end, data reduction identifies and ultimately discards material that is useless or redundant. After data reduction, the user should ideally be working with a dataset of reduced dimensions that still represents the original dataset and carries essentially the same amount of information.

When the available datasets are vast, the goal of data reduction is to improve the efficiency of machine learning without losing extractable information. In many data mining techniques, it is the key step for obtaining information from massive data.

Scaling up can have unintended repercussions, which reducing the data size helps to mitigate. Excessive memory requirements, increased computational complexity, and poor learning performance are all examples of such repercussions.


Loading the dataset
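A minimal loading sketch, assuming the Iris dataset is pulled from scikit-learn with load_iris:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset as a pandas DataFrame: 150 samples, 4 features.
iris = load_iris(as_frame=True)
X = iris.data      # sepal length/width, petal length/width
y = iris.target    # species label encoded as 0, 1, 2
print(X.shape)     # (150, 4)
```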

Now let's take a quick look at the dataset:
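A quick inspection with standard pandas methods, continuing from the loading sketch above (X and y):

```python
# Continuing from the loading sketch above.
print(X.head())           # first five rows
print(X.describe())       # summary statistics per feature
print(y.value_counts())   # 50 samples per class
```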

Adding noise to the dataset:
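One way to end up with the 14 features mentioned below is to append random noise columns to the 4 original Iris features; the number of noise columns (10) and the random seed used here are assumptions:

```python
import numpy as np
import pandas as pd

# Append 10 Gaussian noise columns to the 4 original features -> 14 features in total.
rng = np.random.RandomState(42)                      # seed chosen arbitrarily
noise = pd.DataFrame(rng.normal(size=(len(X), 10)),
                     columns=[f"noise_{i}" for i in range(10)],
                     index=X.index)
X_noisy = pd.concat([X, noise], axis=1)
print(X_noisy.shape)                                 # (150, 14)
```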

There are 14 features in total in the dataset. Before applying any feature selection approach, we must first split the data, because features should be chosen based only on the training set, not the entire dataset. To evaluate both the feature selection and the model, we set aside a portion of the data as a test set, so the information in the test set stays hidden while we select features and train the model.

Splitting the data into train and test sets
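A typical split, done before any feature selection; the test size and random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out a test set before selecting features or training the model.
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)   # (105, 14) (45, 14)
```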

Variance Threshold:

Manually computing variances and thresholding them can be a lot of work. Fortunately, scikit-learn provides an estimator, VarianceThreshold, that does the work for us: just pass a cut-off value and every feature whose variance falls below it is dropped.

Variance threshold
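A sketch of how scikit-learn's VarianceThreshold might be applied to the training data; the cut-off value here is purely illustrative:

```python
from sklearn.feature_selection import VarianceThreshold

# Drop every feature whose variance on the training data falls below the cut-off.
selector = VarianceThreshold(threshold=0.1)   # cut-off chosen for illustration
X_train_reduced = selector.fit_transform(X_train)
print(selector.get_support())                 # boolean mask of the kept features
print(X_train_reduced.shape)
```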

Univariate Feature Selection:

(1) Use the chi-square test.
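A sketch using SelectKBest with the chi2 score function; k=4 is an assumption matching the four original features, and the data is rescaled first because chi2 requires non-negative values:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 requires non-negative feature values, so the noisy training data is
# rescaled to [0, 1]. k=4 is an assumption matching the 4 original features.
X_train_pos = MinMaxScaler().fit_transform(X_train)
chi_selector = SelectKBest(chi2, k=4).fit(X_train_pos, y_train)
print(chi_selector.get_support())   # expected: True for the 4 original features
```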

The first four entries of the returned array are True, indicating that this technique picked the first four features. The chi-square test performs well here because those four are the original features in the data.

(2) Use the F-test.
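The same idea with the ANOVA F-test (f_classif); k=4 is again an assumption:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# ANOVA F-test between each feature and the class labels.
f_selector = SelectKBest(f_classif, k=4).fit(X_train, y_train)
print(f_selector.get_support())
```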

The F-test is also able to correctly identify the original features.

(3) Use the mutual_info_classif test.
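And with mutual information (mutual_info_classif); k=4 is again an assumption:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information between each feature and the class labels.
mi_selector = SelectKBest(mutual_info_classif, k=4).fit(X_train, y_train)
print(mi_selector.get_support())
```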

All three univariate feature selection strategies produce the same outcome.
The Iris data poses a classification problem; for regression problems we can perform feature selection in the same way using f_regression and mutual_info_regression.

Recursive Feature Elimination:
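Recursive feature elimination repeatedly fits a model and removes the weakest feature until the desired number remains. A minimal sketch; the logistic regression estimator and the number of features to keep are assumptions:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively fit the model and drop the weakest feature until 4 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_train, y_train)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected
```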

Principal Component Analysis (PCA):

Naturally, reducing the number of variables in a dataset costs some accuracy; the trade-off at the heart of dimensionality reduction is exchanging a little accuracy for simplicity. Smaller datasets are easier to explore and visualize, and machine learning algorithms can analyze them faster because they no longer have to deal with superfluous variables.

PCA Projection to 2D:

When Principal Component Analysis (PCA) is applied to this data, it identifies the combinations of attributes (principal components, or directions in feature space) that account for the most variance in the data. We then plot the samples on the first two principal components.

The goal of Linear Discriminant Analysis (LDA), by contrast, is to find the components that account for the most variance between classes. Unlike PCA, LDA is a supervised approach that uses the known class labels.
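As a side-by-side comparison, a minimal LDA sketch on the same data, assuming scikit-learn's LinearDiscriminantAnalysis:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is supervised: it uses the class labels y to find the directions that
# best separate the three species. With 3 classes there are at most 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)   # (150, 2)
```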

Code for PCA 2D projection of Iris dataset
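A sketch of a 2-component PCA on the original four Iris features (X, y from the loading sketch); the standardization step and the plot details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 4 Iris features, then project onto the first 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)   # share of variance captured by each component

# Scatter plot of the samples on the first two principal components, colored by species.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA projection of the Iris dataset to 2D")
plt.show()
```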

PCA Projection to 3D:

The code in this part projects the four-dimensional data down to three dimensions. The new components represent the three main directions of variation.

Code for PCA 3D projection of Iris dataset
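A sketch of the 3-component version with a 3D scatter plot; again, the standardization and plotting details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project the standardized 4-dimensional data onto the first 3 principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca3 = PCA(n_components=3).fit_transform(X_scaled)

# 3D scatter plot of the samples, colored by species.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_pca3[:, 0], X_pca3[:, 1], X_pca3[:, 2], c=y, cmap="viridis")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
plt.show()
```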