Data-Science — Data Reduction Techniques In Data Pre-Processing

Sarjakmodi
6 min read · Oct 29, 2021

Data Reduction

The goal of reducing the amount of data is to feed only relevant information into the data mining algorithm. Data reduction therefore identifies and discards material that is useless or redundant. Ideally, the user is then left with a dataset of reduced dimensionality that still represents the original dataset and retains essentially the same information.

When the available datasets are vast, the goal of data reduction is to improve the efficiency of machine learning without losing extractable information. In many data mining techniques, it is the most important step for obtaining information from massive data.

Scaling up can have unintended repercussions, which can be mitigated by reducing the data size. Excessive memory requirements, increased computational complexity, and poor learning performance are all examples of such repercussions.

Dataset:

The dataset chosen for data reduction is the ‘Iris’ dataset from sklearn.datasets, which is well suited to illustrating these techniques.

Loading the dataset:
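A minimal sketch of loading the Iris dataset with scikit-learn (the variable names X and y are illustrative, not taken from the original code):

```python
# Load the Iris dataset from sklearn.datasets into a DataFrame.
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)  # 150 samples, 4 features
y = iris.target                                          # class labels 0, 1, 2
```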

Now let us take a quick look at the dataset:
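For example, a brief inspection of the shape, feature names, and class names (an illustrative sketch continuing from the loading code above):

```python
# Basic facts about the dataset.
print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal length/width, petal length/width (cm)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(X.head())            # first few rows of the feature table
```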

Adding noise in the dataset:
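One way to add noise, sketched below, is to append random columns to the four original features so that the selectors have something to discard; the choice of ten Gaussian noise columns (giving 14 features in total) and the random seed are assumptions.

```python
import numpy as np

rng = np.random.RandomState(42)            # assumed seed, for reproducibility
noise = rng.normal(size=(X.shape[0], 10))  # 10 pure-noise columns
X_noisy = np.hstack([X.values, noise])     # shape: (150, 14)
```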

There are a total of 14 features in the dataset (the four original Iris features plus the noise columns). We must split the data before applying any feature selection approach, because features should be chosen using information from the training set only, not the entire dataset. To evaluate both the feature selection and the model, we set aside a portion of the data as a test set, so its information remains hidden while we choose features and train the model.

Splitting the data into train and test sets:
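A sketch of the split; the test size, random state, and stratification are assumptions:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, random_state=0, stratify=y
)
```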

Variance Threshold:

A basic baseline approach to feature selection is the variance threshold. It removes every feature whose variance falls below a chosen level. By default it removes all zero-variance features, i.e., features that have the same value in every sample.

Manually computing variances and thresholding them can be a lot of work. Fortunately, scikit-learn provides an estimator that does it all for us: just pass a threshold cut-off, and all features whose variance falls below it are dropped.

Variance threshold:
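A sketch using scikit-learn's VarianceThreshold on the training data; the cut-off value is an assumption:

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)   # drop features with variance < 0.1
X_train_vt = selector.fit_transform(X_train)
print(selector.get_support())                 # boolean mask of surviving features
```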

Univariate Feature Selection:

In univariate feature selection, the best features are chosen using univariate statistical tests. Each feature is compared to the target variable to check whether there is a statistically significant relationship between the two; a classic example of such a test is the analysis of variance (ANOVA). While analyzing the relationship between one feature and the target variable, we ignore the other features, which is why the approach is called “univariate.” Each feature gets a test score; finally, all of the scores are compared, and the features with the highest scores are chosen.

(1) Using the chi-square test:
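A sketch using SelectKBest with the chi2 score function, keeping the four highest-scoring features; chi2 requires non-negative inputs, so the data is shifted first (continuing from the split above):

```python
from sklearn.feature_selection import SelectKBest, chi2

X_train_pos = X_train - X_train.min(axis=0)      # make every column non-negative
chi_selector = SelectKBest(chi2, k=4).fit(X_train_pos, y_train)
print(chi_selector.get_support())                # True for the selected features
```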

The first four entries of the returned mask are True, indicating that this technique picked the first four features. The chi-square test performs well here because these are the original features in the data.

(2) Using the F-test:
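The same idea with the ANOVA F-test via f_classif, which has no non-negativity requirement:

```python
from sklearn.feature_selection import SelectKBest, f_classif

f_selector = SelectKBest(f_classif, k=4).fit(X_train, y_train)
print(f_selector.get_support())
```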

The F-test is also able to accurately identify the original features.

(3) Using the mutual_info_classif test:
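And once more with mutual information between each feature and the class labels:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

mi_selector = SelectKBest(mutual_info_classif, k=4).fit(X_train, y_train)
print(mi_selector.get_support())
```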

All three univariate feature selection strategies arrive at the same result.
The Iris data poses a classification problem. For regression problems, we can perform the analogous feature selection with f_regression and mutual_info_regression.

Recursive Feature Elimination:

Recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features, given an external estimator that assigns weights to features (e.g., the coefficients of a linear model). The estimator is first trained on the full set of features, and the importance of each feature is obtained from either the coef_ or feature_importances_ attribute. The least important features are then pruned from the current set, and this procedure is repeated recursively on the pruned set until the desired number of features is reached.
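A sketch of RFE wrapped around a logistic regression; the choice of estimator and of keeping four features are assumptions, and any model exposing coef_ or feature_importances_ would work:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_train, y_train)
print(rfe.support_)   # mask of the four features RFE keeps
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```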

Principal Component Analysis (PCA):

Principal Component Analysis, or PCA, is a dimensionality-reduction approach for reducing the dimensionality of big data sets by converting a large collection of variables into a smaller one that retains the majority of the information in the large set.

Naturally, reducing the number of variables in a data set costs some accuracy; the trade-off in dimensionality reduction is to exchange a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without having to deal with superfluous variables.

PCA Projection to 2D:

The Iris dataset contains four attributes for three species of Iris flowers (Setosa, Versicolor, and Virginica): sepal length, sepal width, petal length, and petal width.

When Principal Component Analysis (PCA) is applied to this data, it identifies the combinations of attributes (principal components, i.e., directions in feature space) that account for the most variance in the data. We then plot the samples on the first two principal components.

By contrast, the goal of Linear Discriminant Analysis (LDA) is to find the directions that account for the most variance between classes. Unlike PCA, LDA is a supervised approach that uses the known class labels.

Code for PCA 2D projection of Iris dataset
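A sketch of the 2D projection: standardize the four features, fit PCA with two components, and scatter-plot the samples colored by species (the plot styling is assumed):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(iris.data)  # standardize the 4 features
pca2 = PCA(n_components=2)
components_2d = pca2.fit_transform(X_scaled)

plt.scatter(components_2d[:, 0], components_2d[:, 1], c=iris.target)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()
print(pca2.explained_variance_ratio_)  # share of variance captured by each component
```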

PCA Projection to 3D:

There are four columns in the original data (sepal length, sepal width, petal length, and petal width).

The code in this part converts four-dimensional data into three-dimensional data. The three primary dimensions of variation are represented by the new components.

Code for PCA 3D projection of Iris dataset
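A sketch of the same idea with three components, plotted on 3D axes:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(iris.data)
pca3 = PCA(n_components=3)
components_3d = pca3.fit_transform(X_scaled)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(components_3d[:, 0], components_3d[:, 1], components_3d[:, 2],
           c=iris.target)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
plt.show()
```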

Summary:

In this blog, I compared and contrasted the results of several feature selection approaches on the same data. The model performs better when trained on only the features retained after feature selection than when trained on all of the features. Following feature selection, PCA was used to display the data in 2D and 3D with a reduced number of components.
