Comparison of non-linear activation functions for deep neural networks on Forest Cover multi-class classification

Ayush Chauhan · Analytics Vidhya · Nov 4, 2019

Forest Soil type prediction.

Objective

Deep neural networks have been used successfully in diverse emerging domains to solve complex real-world problems, and many deep learning (DL) architectures have been developed to date. To achieve their state-of-the-art performance, DL architectures use activation functions (AFs) to perform diverse computations between the hidden layers and the output layer. Here, we introduce common non-linear activation functions that serve as alternatives to the well-known sigmoid function and evaluate their characteristics. Moreover, deeper neural networks are analyzed, because they positively influence final performance compared to shallower networks. They also depend strongly on weight initialization, so the effect of drawing weights from Gaussian and uniform distributions is analyzed as well, paying particular attention to how the number of incoming and outgoing connections of a node influences the whole network.

Introduction

The activation function that was long most widely used is the Sigmoid, but when the Rectified Linear Unit (ReLU) (Nair & Hinton, 2010) was introduced, it soon became a better replacement due to its positive impact on different machine learning tasks. Since then, different variants of the ReLU activation function have been introduced, and this experiment explores them and their impact on the Forest Cover Type dataset. The task is to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30-meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns for qualitative independent variables such as wilderness areas and soil types.

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.

· Introduction to the data set

Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

1 — Spruce/Fir
2 — Lodgepole Pine
3 — Ponderosa Pine
4 — Cottonwood/Willow
5 — Aspen
6 — Douglas-fir
7 — Krummholz

The training set (15,120 observations) contains both the features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565,892 observations).

o Data Fields

· Elevation — Elevation in meters

· Aspect — Aspect in degrees azimuth

· Slope — Slope in degrees

· Horizontal_Distance_To_Hydrology — Horz Dist to nearest surface water features

· Vertical_Distance_To_Hydrology — Vert Dist to nearest surface water features

· Horizontal_Distance_To_Roadways — Horz Dist to the nearest roadway

· Hillshade_9am (0 to 255 index) — Hillshade index at 9am, summer solstice

· Hillshade_Noon (0 to 255 index) — Hillshade index at noon, summer solstice

· Hillshade_3pm (0 to 255 index) — Hillshade index at 3pm, summer solstice

· Horizontal_Distance_To_Fire_Points — Horz Dist to nearest wildfire ignition points

· Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) — Wilderness area designation

· Soil_Type (40 binary columns, 0 = absence or 1 = presence) — Soil Type designation

· Cover_Type (7 types, integers 1 to 7) — Forest Cover Type designation

o The wilderness areas are:

· Rawah Wilderness Area

· Neota Wilderness Area

· Comanche Peak Wilderness Area

· Cache la Poudre Wilderness Area

o The soil types are:

1) Cathedral family — Rock outcrop complex, extremely stony.
2) Vanet — Ratake families complex, very stony.
3) Haploborolis — Rock outcrop complex, rubbly.
4) Ratake family — Rock outcrop complex, rubbly.
5) Vanet family — Rock outcrop complex, rubbly.
6) Vanet — Wetmore families — Rock outcrop complex, stony.
7) Gothic family.
8) Supervisor — Limber families complex.
9) Troutville family, very stony.
10) Bullwark — Catamount families — Rock outcrop complex, rubbly.
11) Bullwark — Catamount families — Rock land complex, rubbly.
12) Legault family — Rock land complex, stony.
13) Catamount family — Rock land — Bullwark family complex, rubbly.
14) Pachic Argiborolis — Aquolis complex.
15) Unspecified in the USFS Soil and ELU Survey.
16) Cryaquolis — Cryoborolis complex.
17) Gateview family — Cryaquolis complex.
18) Rogert family, very stony.
19) Typic Cryaquolis — Borohemists complex.
20) Typic Cryaquepts — Typic Cryaquolls complex.
21) Typic Cryaquolls — Leighcan family, till substratum complex.
22) Leighcan family, till substratum, extremely bouldery.
23) Leighcan family, till substratum — Typic Cryaquolls complex.
24) Leighcan family, extremely stony.
25) Leighcan family, warm, extremely stony.
26) Granile — Catamount families complex, very stony.
27) Leighcan family, warm — Rock outcrop complex, extremely stony.
28) Leighcan family — Rock outcrop complex, extremely stony.
29) Como — Legault families complex, extremely stony.
30) Como family — Rock land — Legault family complex, extremely stony.
31) Leighcan — Catamount families complex, extremely stony.
32) Catamount family — Rock outcrop — Leighcan family complex, extremely stony.
33) Leighcan — Catamount families — Rock outcrop complex, extremely stony.
34) Cryorthents — Rock land complex, extremely stony.
35) Cryumbrepts — Rock outcrop — Cryaquepts complex.
36) Bross family — Rock land — Cryumbrepts complex, extremely stony.
37) Rock outcrop — Cryumbrepts — Cryorthents complex, extremely stony.
38) Leighcan — Moran families — Cryaquolls complex, extremely stony.
39) Moran family — Cryorthents — Leighcan family complex, extremely stony.
40) Moran family — Cryorthents — Rock land complex, extremely stony.

An exploratory study on feature selection

Feature engineering is the process of transforming the given data into a form that is easier to interpret. Here, we are mainly interested in making the data more transparent for ML and DL models, but features can also be generated so that data visualizations prepared for people without a data-related background become more digestible. However, transparency for learning models is a complicated notion, as different models often require different approaches for different kinds of data.

· LOADING OF DATASET

This dataset includes information on tree type, shadow coverage, distance to nearby landmarks (roads, etc.), soil type, and local topography. Here, we load our dataset and import all the Python packages that will be useful for exploring and visualizing our data and building our model.

Forest Soil type prediction-loading dataset -Ayush
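A minimal sketch of this step, assuming the raw cover type data has been saved locally as covtype.csv with named columns (the file name is an assumption, not taken from the original notebook):

import pandas as pd

# Load the Forest Cover Type dataset (hypothetical local path).
df = pd.read_csv("covtype.csv")

# Separate the features from the target column.
X = df.drop(columns=["Cover_Type"])
y = df["Cover_Type"]
print(X.shape, y.shape)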

· Data Statistics

1. Head: The pandas head() function gives a quick overview of our dataset, showing the first 5 rows along with all the available columns.
Forest Soil type prediction-Data Statistics: head()-Ayush

2. Shape: The shape attribute of a DataFrame gives its size; as we can see, there are 581,012 instances with 54 attributes. This also confirms that our dataset loaded successfully, as it matches the data description.

Forest Soil type prediction-Data Statistics: shape()- Ayush

3. Data types: We explore data types through the info() function, which gives a lot of other insights about our dataset.

Forest Soil type prediction-Data Statistics: info() -Ayush

· There are 581,012 entries in each column.

· All columns have been inferred as int64.

· There are no null values, so there is no need for any missing-value techniques (e.g. an imputer).

· Total memory usage of our dataset is 243.8 MB.

4. Statistical description: This can be explored through the describe() function, which computes summary statistics for the DataFrame columns: the mean, the standard deviation, and the quartiles (IQR). It excludes character columns (although our dataset doesn't have any) and summarizes the numeric columns.

Forest Soil type prediction-Data Statistics: statistical description — Ayush
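A compact sketch of these four inspection steps, continuing with the df from the loading sketch above:

print(df.head())      # first 5 rows with all columns
print(df.shape)       # (rows, columns); shape is an attribute, not a method
df.info()             # dtypes, non-null counts, memory usage
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column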

· No attribute is missing, as the count is 581,012 for all attributes. Hence, all rows can be used.

· Negative values are present in Vertical_Distance_To_Hydrology. Hence, some tests such as chi-squared can't be used.

· Wilderness_Area and Soil_Type are one-hot encoded. Hence, they could be converted back for some analyses.

· All attributes have some distinct values and there are no constant attributes, so there is no sense in removing any columns.

· Scales are not the same for all attributes; hence, rescaling and standardization may be necessary for some algorithms.

5. Skew: In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or undefined, with the curve appearing distorted or skewed either to the left or to the right. Skewness can be quantified to define the extent to which a distribution differs from a normal distribution.

Forest Soil type prediction-Data Statistics: skewness -Ayush
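Skewness can be computed per column in one line with pandas (a sketch; the threshold of 1.0 is an arbitrary choice for illustration):

skew = df.skew(numeric_only=True)
print(skew[skew.abs() > 1.0])  # flag attributes with a large skew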

· Values close to 0 show less skew

· Several attributes in Soil_Type show a large skew. Hence, some algorithms may benefit if the skew is corrected.

6. Class Distribution: Class distribution is a way to determine whether all the classes have the same number of instances. An imbalanced class distribution is a scenario where the number of observations belonging to one class is significantly lower than those belonging to the other classes.

Learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced dataset, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.

Forest Soil type prediction-Data Statistics: class distribution (1)- ayush
Forest Soil type prediction-Data Statistics: class distribution (2)- Ayush

Class 4 (the class with the lowest count in ‘Cover_Type’): 2,747

Total no. of observations: 581,012

% of Class 4: 0.47

· Class 4 has less than 1% presence.

· Classes 3, 4, 5, 6, and 7 have a very low presence in comparison to Classes 1 & 2 in ‘Cover_Type’.

Here, we resampled each class to the same number of rows to make the imbalanced class distribution ineffective, as sketched below.
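A sketch of both the distribution check and a naive per-class resampling (the random_state is an arbitrary choice; the exact balancing method used in the original notebook is not shown, so this is one plausible reading):

counts = df["Cover_Type"].value_counts()
print(counts / len(df) * 100)  # percentage of each class

# Naive balancing: sample the same number of rows from every class.
n = counts.min()
balanced = df.groupby("Cover_Type", group_keys=False).apply(
    lambda g: g.sample(n, random_state=0)
)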

Exploratory Data Analysis

· PCA ( Principal Component Analysis)

One of the common problems in the analysis of complex data comes from a large number of variables, which requires a large amount of memory and computation power. This is where Principal Component Analysis (PCA) comes in: it is a technique for reducing the dimension of the feature space by feature extraction.

Forest Soil type prediction-Exploratory Data Analysis: PCA- Ayush
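A minimal PCA sketch with scikit-learn, continuing with the X from the loading sketch (the number of components is an assumption for illustration; PCA is scale-sensitive, so we standardize first):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance
pca = PCA(n_components=10)                     # keep 10 components (assumed)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance explained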

· Data Integration

1. Correlation:

· Correlation measures the relationship between two attributes.

· Correlation requires continuous data. Hence, we ignore Wilderness_Area and Soil_Type, as they are binary.

Forest Soil type prediction-Data Integration: Correlation- Ayush
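A sketch of the correlation computation restricted to the continuous columns (the list mirrors the data fields above):

continuous = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon",
    "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points",
]
corr = df[continuous].corr()  # Pearson correlation by default
print(corr.round(2))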

2. Scatter Plot (pair plot): Scatter plots are extremely intuitive yet powerful: just plot each data point in the sample by its vertical and horizontal coordinates. If the relationship is non-linear, or there are outliers, these will be visible in the scatter plot. In the case of many features, i.e. dimensions, a scatter-plot matrix can be used.

Forest Soil type prediction-Data Integration: Scatter plot(1)- Ayush
Forest Soil type prediction-Data Integration: Scatter plot(2)- Ayush

· The plots show which class a point belongs to. The class distributions overlap in the plots.

· The hillshade attributes form nice ellipsoid patterns with each other.

· Aspect and the hillshade attributes form a sigmoid pattern.

· Horizontal and vertical distance to hydrology show an almost linear pattern.

· Data Visualization

1. HEAT MAP: A heat map, with the variables as row and column headers and each variable against itself on the diagonal, is an extremely powerful way to visualize relationships between variables in high-dimensional space.

Forest Soil type prediction-Data Visualization: Heat Map- Ayush
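A sketch of a heat map over the correlation matrix computed earlier (styling parameters are arbitrary):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()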

2. BOX PLOT: It displays essential statistics about a distribution in a concise visual form. Also known as a box-and-whisker plot (and similar to the candlestick plot popular in finance), it is popular among statisticians for visualizing range, and it can be drawn horizontally as well.

· Max, 3rd quartile, median, 1st quartile, and min are easily readable from a box plot.

· A box-and-whisker plot displays outliers as dots.

Forest Soil type prediction-Data Visualization: Box Plot -Ayush
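A sketch for a single feature (Elevation is chosen only for illustration):

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x="Cover_Type", y="Elevation", data=df)  # one box per class
plt.show()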

3. PAIR PLOT: If we want a higher-dimensional view than a single scatter plot, this is when the pair plot from the seaborn package comes into play. Say we have n features in the data; a pair plot creates an (n x n) figure where the diagonal plots are histograms of the feature corresponding to that row, and the remaining plots combine the feature from each row on the y-axis with the feature from each column on the x-axis.

Forest Soil type prediction-Data Visualization: Pair Plot -Ayush
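A sketch over a small subset of features, with the class as hue (the subset is an arbitrary choice to keep the n x n grid readable):

import seaborn as sns

sns.pairplot(df[["Elevation", "Slope", "Aspect", "Cover_Type"]],
             hue="Cover_Type")  # diagonal: per-feature distributions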

4. LM PLOT: Every plot in Seaborn has a set of fixed parameters: three mandatory ones, namely the values for the x-axis, the values for the y-axis, and a reference to the dataset, while the rest are optional and used as per our requirements. These three are visible in almost all Seaborn plots. Besides them, there is one optional parameter I would like you to memorize, as it comes in very handy: the hue parameter, which takes a categorical column and groups the plotted data by its values.

· Horizontal_Distance_To_Hydrology & Vertical_Distance_To_Hydrology with Soil_Type2.

· Horizontal_Distance_To_Hydrology & Vertical_Distance_To_Hydrology with Wilderness_Area1.

Forest Soil type prediction-Data Visualization: LM Plot -Ayush
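A sketch of the two lmplots listed above (x, y, data, and the hue parameter, exactly as described):

import seaborn as sns

sns.lmplot(x="Horizontal_Distance_To_Hydrology",
           y="Vertical_Distance_To_Hydrology",
           hue="Soil_Type2", data=df)
sns.lmplot(x="Horizontal_Distance_To_Hydrology",
           y="Vertical_Distance_To_Hydrology",
           hue="Wilderness_Area1", data=df)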

5. VIOLIN PLOT: A violin plot can be read as a combination of a box plot in the middle and distribution/density plots (kernel density estimates) on both sides of the data. It gives us details of the distribution, such as whether it is multimodal or skewed, and also shows useful information like the 95% confidence interval.

Forest Soil type prediction-Data Visualization: Violin Plot(1) -Ayush
Forest Soil type prediction-Data Visualization: Violin Plot(2) -Ayush
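A sketch for one feature against the classes (Elevation again, as an arbitrary example):

import matplotlib.pyplot as plt
import seaborn as sns

sns.violinplot(x="Cover_Type", y="Elevation", data=df)
plt.show()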

· Elevation has a separate distribution for most classes. It is highly correlated with the target and hence an important attribute.

· Aspect contains a couple of normal distributions for several classes.

· Horizontal distance to roadways and to hydrology have a similar distribution.

· Hillshade at 9 am and noon display left skew.

· Hillshade at 3 pm is normal.

· There are lots of 0s in the vertical distance to hydrology.

· Wilderness_Area3 gives no class distinction. As its values are largely absent, the others give some scope to distinguish.

· Soil_Types 1, 5, 8, 9, 12, 14, 18–22, 25–30, and 35–40 offer class distinction, as values are not present for many classes.

Grouping of one-hot encoded attributes: For categorical variables where no ordinal relationship exists, integer encoding is not enough. In this case, a one-hot encoding can be applied to the integer representation: the integer-encoded variable is removed and a new binary variable is added for each unique integer value. Here, the binary columns are grouped back into single categorical attributes for analysis, as sketched below.

Forest Soil type prediction-Data Visualization: One hot encoded(1)-Ayush
Forest Soil type prediction-Data Visualization: One hot encoded(2)-Ayush
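A sketch of collapsing the binary columns back into single categorical columns (idxmax picks the name of the active column per row; the new column names are assumptions):

soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
df["Soil"] = df[soil_cols].idxmax(axis=1)        # active soil-type column per row

wild_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]
df["Wilderness"] = df[wild_cols].idxmax(axis=1)  # active wilderness column per row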

· WildernessArea_4 has a lot of presence for cover_type 4, giving good class distinction.

· WildernessArea_3 has not much class distinction.

· SoilTypes 1–6, 10–14, 17, 22–23, 29–33, 35, and 38–40 offer a lot of class distinction, as the counts for some are very high.

· Data Cleaning

Removing Unnecessary Columns: There may be some columns in our dataset which will not help with prediction because they have no distinct values to make them worth considering; they just add noise. This can be determined by calculating the standard deviation of each column, which we already did in the statistical description.

· Identify constant columns, as they don't help in the prediction process.

· Remove the ones whose standard deviation equals zero.

· No columns meet that criterion, so there is no need to remove any column.

Forest Soil type prediction-Data Cleaning-Ayush
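A sketch of the zero-variance check described above (numeric_only keeps the computation on numeric columns):

stds = df.std(numeric_only=True)
constant_cols = stds[stds == 0].index.tolist()
df = df.drop(columns=constant_cols)  # a no-op here, since no column is constant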

· Data Preparation

1. Standardization (StandardScaler): The idea behind the StandardScaler is that it transforms your data such that its distribution has a mean of 0 and a standard deviation of 1. Given the distribution of the data, each value in the dataset has the sample mean subtracted from it and is then divided by the standard deviation of the whole dataset.
Forest Soil type prediction-Data Preparation: Standardization (1)-Ayush
Forest Soil type prediction-Data Preparation: Standardization (2)-Ayush
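A sketch with scikit-learn's StandardScaler, assuming a conventional train/test split has been made first (the split parameters are arbitrary):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on training data only...
X_test_std = scaler.transform(X_test)        # ...then apply to the test data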

2. Normalization: Normalization is one of the methods used in data science to bring features in a dataset to the same scale. When you normalize a feature, all feature values will be in the range of 0 to 1.

· Select the numerical columns which need to be normalized.

· Normalize the training data.

· Convert the numpy array back to a data frame.

· Normalize the testing data using the mean and SD of the training set, as sketched below.

Forest Soil type prediction-Data Preparation: Normalization(1)-Ayush
Forest Soil type prediction-Data Preparation: Normalization(2)-Ayush
Forest Soil type prediction-Data Preparation: Normalization(3)-Ayush
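A sketch of the normalization steps exactly as described in the bullets above, reusing the training set's statistics for the test set (the column list is the continuous one defined in the correlation sketch):

X_train, X_test = X_train.copy(), X_test.copy()
num_cols = continuous                 # numerical columns to be normalized
mu = X_train[num_cols].mean()
sd = X_train[num_cols].std()

X_train[num_cols] = (X_train[num_cols] - mu) / sd  # normalize training data
X_test[num_cols] = (X_test[num_cols] - mu) / sd    # reuse train mean and SD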

Evaluation, prediction, and analysis

This study aims to analyze the performance of a generalized MLP architecture trained with the back-propagation algorithm, using various activation functions for the neurons of the hidden and output layers. Sigmoid, Tanh, ReLU & Softmax are used for the experimental comparisons.

An activation function transforms the activation level of a unit (neuron) into an output signal. There are several common activation functions in use with artificial neural networks (ANNs), and the sigmoid has long been the most common choice of transfer function for the multi-layered perceptron (MLP) in research and engineering. Among the reasons for this popularity are its boundedness in the unit interval, the fast computability of the function and its derivative, and several amenable mathematical properties in the realm of approximation theory. However, considering the huge variety of problem domains MLPs are applied in, it is natural to suspect that specific problems call for a single specific activation function or a set of them.

· Here, we can’t use the sigmoid activation function. Our dataset output layer has 7 different classes to determine, which means it’s a multi-classification task but sigmoid is a binary classification algorithm as it just transforms the output to either 0 or 1.

· The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.

· Instead of sigmoid activation, we can use softmax activation function which significantly great in building a multi classifier as it a form of logistic regression that normalizes an input value into a vector of values that follows a probability distribution whose total sums up to 1.

· It should be noted that softmax is not ideally used as an activation function like Sigmoid, Tanh & ReLU (Rectified Linear Units) between layers which may be multiple or just a single one but as an output layer.

1. Validating Data Through ReLU (Rectified Linear Units): A solution to the vanishing gradient problem is to use the rectified linear activation function, or ReL for short; the main idea is to let the gradient stay non-zero and recover during training. A node or unit that implements this activation function is referred to as a rectified linear activation unit, or ReLU for short, and networks that use the rectifier function for the hidden layers are often referred to as rectified networks. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations, which is a good point to consider when designing deep neural nets. The rectified linear activation is a simple calculation that returns the value provided as input directly, or the value 0.0 if the input is 0.0 or less.

· Softmax activation is great to use as output layer (as it normalizes an input value into a vector of values that follows a probability distribution whose total sums up to 1) in neural networks with ReLU as the middle layer.

Forest Soil type prediction-Data Prediction: Validating ReLU-Ayush

· As the Y variable is a multi-class categorical variable, we use softmax as the output activation function and sparse categorical cross-entropy as the loss function.

· We use 2 hidden layers with 64 neurons each and 8 neurons on the output layer.

· We train for 26 epochs; an epoch is one forward pass and one backward pass over all the training examples.

· Here we take a batch_size of 60 for this model: the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need. A sketch of this model follows the list.
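A minimal Keras sketch of the model described in the bullets above (the optimizer and the validation split are assumptions; the layer sizes, loss, epochs, and batch size follow the bullets):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(54,)),
    layers.Dense(64, activation="relu"),
    # 8 output units let the raw labels 1..7 index directly (unit 0 unused).
    layers.Dense(8, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train,
                    epochs=26, batch_size=60, validation_split=0.2)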

Visualize Training History

Forest Soil type prediction-Data Prediction: Visualizing ReLU(1)-Ayush
Forest Soil type prediction-Data Prediction: Visualizing ReLU(2)-Ayush
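A sketch of plotting the history returned by fit() (in older Keras versions the keys are 'acc' and 'val_acc' rather than 'accuracy' and 'val_accuracy'):

import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="train")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()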

2. Validating Data Through Tanh (the Hyperbolic Tangent Activation Function): Like the logistic sigmoid, the tanh function is sigmoid ("s"-shaped). The logistic sigmoid outputs values ranging from 0 to 1, which are often interpreted as probabilities (in, say, logistic regression), and it maps strongly negative inputs to values very near zero. The tanh function, a.k.a. the hyperbolic tangent function, is a rescaling of the logistic sigmoid such that its outputs range from -1 to 1.

The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be mapped near zero in the tanh graph.

Forest Soil type prediction-Data Prediction: Validating TanH(1)-Ayush
Forest Soil type prediction-Data Prediction: Validating TanH(2)-Ayush

· The function is differentiable.

· The function is monotonic while its derivative is not monotonic.

· We use 54 neurons on the input layer, matching the number of feature columns.

· Here 4 hidden layers with 100 neurons each are used, and the softmax function is used in the output layer.

· We use a dropout of 0.2 for regularization in the deep learning model; a sketch of this network follows the list.

· The tanh function is mainly used for classification between two classes.
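A sketch of the tanh network described in this list, with the same assumed variable names as in the ReLU sketch (the training settings reuse the ReLU run's epochs and batch size, which is an assumption):

from tensorflow import keras
from tensorflow.keras import layers

model_tanh = keras.Sequential([
    layers.Dense(100, activation="tanh", input_shape=(54,)),
    layers.Dropout(0.2),
    layers.Dense(100, activation="tanh"),
    layers.Dropout(0.2),
    layers.Dense(100, activation="tanh"),
    layers.Dropout(0.2),
    layers.Dense(100, activation="tanh"),
    layers.Dense(8, activation="softmax"),
])
model_tanh.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
history_tanh = model_tanh.fit(X_train, y_train,
                              epochs=26, batch_size=60, validation_split=0.2)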

For each run we report: the number of epochs, the loss on validation, the accuracy on validation, the actual loss of the model, and the actual accuracy of the model.

Training and Validation Accuracy

Forest Soil type prediction-Data Prediction: Training and Validation Accuracy TanH-Ayush

Model Accuracy

Forest Soil type prediction-Data Prediction: Visualizing/ Model Accuracy TanH-Ayush

Conclusion

In this study, we have used 3 conventional differentiable and monotonic activation functions for the evaluation of an MLP architecture trained with generalized delta rule learning. These well-known and effective activation functions are Tanh, ReLU & Softmax. Comparing their performances, the simulation results show that the ReLU (rectified linear unit) function achieves better recognition accuracy than the other functions. In other words, the neural network computed good results when the "ReLU-Softmax" combination of activation functions was used for the neurons (or nodes) of the hidden and output layers. Using softmax as a hidden layer is not feasible, as it would force all the nodes (hidden variables) to be linearly dependent, which may cause many problems and poor generalization.

· ReLU is less computationally expensive than tanh because it involves simpler mathematical operations.

· Softmax was used in the output layer rather than in the hidden layers.

· The 'Tanh-Softmax' combination gives better results than the 'Tanh-Tanh' combination, with an accuracy of 61%.

· The 'ReLU-Softmax' combination gives the best results, with a reported accuracy of 100%.

· Normalization and standardization help the algorithm achieve these accuracies.
