Simple Classification with a Neural Network

Muhamad Mustain · Published in Analytics Vidhya · Aug 29, 2020 · 5 min read


We know that many complex machine learning problems can easily be solved using neural networks. For example, in supervised learning (classification), we can use them to classify images or text.

Image credit: Frankfurt School

Now, what if we use one on a simple dataset that could actually be solved with “regular” machine learning?

In this post, we’ll use a simple dataset titled “Gender Classification” from Kaggle. It contains only 66 rows and 4 features (each representing a user preference), with the user’s gender as the target variable.

Our goal is to classify the gender of a user based only on their interests/preferences. Using data.info(), we can see that there are no NULL values in our dataset.
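A minimal sketch of loading the dataset and inspecting it (the CSV filename below is an assumption; use whatever name your Kaggle download has):

import pandas as pd

# Load the Kaggle "Gender Classification" dataset
# (the filename is an assumption)
data = pd.read_csv('gender_classification.csv')

data.info()  # 66 rows: 4 preference columns plus the Gender target, no nulls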

However, despite having no null values, since we only have 4 features and 2 classes, we still have a problem: we can’t be sure there are no inconsistent labels, i.e., rows with the same feature values but different genders. We can detect them by grouping the data as follows.

# Group by all feature columns (every column except the Gender target)
# and count how many distinct genders each feature combination maps to
grouping = data.groupby(list(data.columns)[:-1]).apply(lambda x: x.Gender.nunique())

# Feature combinations that map to both genders are inconsistent
grouping[grouping.eq(2)]

What should we do if we have different labels (outputs) for the same feature values (inputs)?

Well, in this case, there is nothing we can do for now. If the conflicts were caused by mistakes during data entry (human error), we could drop those rows. However, we can’t simply do that here, since it would introduce bias into our model.

In reality, it is perfectly plausible for both genders to share the same preferences in color, music genre, beverage, and soft drink. We could tackle this problem by adding more features, especially ones that capture characteristics unique to each gender. But for now, let’s take the data as it is.
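For reference only (we keep the rows in this post): if the conflicts had been data-entry errors, a sketch of dropping every conflicting row could reuse the grouping from above, like so:

# Feature combinations that map to both genders
conflicting = grouping[grouping.eq(2)].index

# Drop all rows whose feature values fall in a conflicting combination
# (for reference only; this post keeps those rows)
feature_cols = list(data.columns)[:-1]
mask = data.set_index(feature_cols).index.isin(conflicting)
data_clean = data[~mask]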

Preprocess the Data

The next thing to consider is that we can’t feed the data directly into a neural network, since it is still categorical text. Hence, we have to encode it using One-Hot Encoding (for the features) and Label Encoding (for the target).

Image credit: @michaeldelsole

Read more about both encodings at this link.

from sklearn.preprocessing import LabelEncoder

# Split the features and labels
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# One-hot encode the features and label-encode the target to 0/1
le = LabelEncoder()
X = pd.get_dummies(X)
y = le.fit_transform(y)

Then, since we have these anomalies in our data, I prefer to use K-Fold to evaluate the model and to find the train-test split that gives the best accuracy on both the train and the test set.

Image credit: Mingchao Li

Train and Test the Model

We wrap the model in a function so we can create a fresh model for every K-Fold iteration. For the input layer, we use an input shape of 20 (because we have 20 columns in total after preprocessing) and ‘float32’ as dtype (because I will export this model to TFJS, which supports float32).

We chose Adam as our optimizer since it is the most popular choice. You can find more about other optimizers at this link and learn more about Adam here.

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

def train_model(X_train, X_test, y_train, y_test):
    model = tf.keras.models.Sequential([
        tf.keras.Input(shape=(20,), dtype='float32'),
        tf.keras.layers.Dense(units=1024, activation='relu'),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(units=1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(learning_rate=0.0001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    # Callback to reduce learning rate if validation loss stops improving
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-8, verbose=0)
    # Callback to stop training if validation loss stops improving
    early_stop = EarlyStopping(monitor='val_loss', patience=20, verbose=0)
    history = model.fit(
        X_train, y_train,
        epochs=1000,
        validation_data=(X_test, y_test),
        callbacks=[reduce_lr, early_stop],
        verbose=0
    )
    tr_loss, tr_acc = model.evaluate(X_train, y_train)
    loss, accuracy = model.evaluate(X_test, y_test)
    return model, history, tr_loss, tr_acc, loss, accuracy

Then we run K-Fold and keep the best model across iterations.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, random_state=42, shuffle=True)

loss_arr = []
acc_arr = []
trloss_arr = []
tracc_arr = []
temp_acc = 0

for train, test in kfold.split(data):
    model, history, trloss_val, tracc_val, loss_val, acc_val = train_model(X.iloc[train], X.iloc[test], y[train], y[test])
    # If we get better accuracy on validation, save the split scenario and the model
    if acc_val > temp_acc:
        print("Model changed")
        temp_acc = acc_val
        model.save('best_model.h5')
        train_index = train
        test_index = test
        best_history = history
    # Collect the metrics of every iteration
    trloss_arr.append(trloss_val)
    tracc_arr.append(tracc_val)
    loss_arr.append(loss_val)
    acc_arr.append(acc_val)
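Once the loop finishes, best_model.h5 holds the weights of the best split, and train_index / test_index record that split. As a quick sanity check (a minimal sketch, not part of the original run), we can reload it and re-evaluate on its held-out fold:

# Reload the best model saved during K-Fold
best_model = tf.keras.models.load_model('best_model.h5')

# Re-check accuracy on the held-out split of the best iteration
best_model.evaluate(X.iloc[test_index].astype('float32'), y[test_index])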

These are the accuracy results for each iteration.

Below are the accuracy and loss plots for our best model (from the 5th iteration).
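The plots come straight from the Keras History object; here is a minimal matplotlib sketch to reproduce them (matplotlib is assumed to be installed):

import matplotlib.pyplot as plt

# Accuracy and loss curves for the best model, from its History object
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(best_history.history['accuracy'], label='train')
ax1.plot(best_history.history['val_accuracy'], label='validation')
ax1.set_title('Accuracy'); ax1.legend()
ax2.plot(best_history.history['loss'], label='train')
ax2.plot(best_history.history['val_loss'], label='validation')
ax2.set_title('Loss'); ax2.legend()
plt.show()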

Lastly, I built a ReactJS app and deployed it on GitHub Pages. You can try to predict the gender from chosen preferences interactively.

Screenshot from https://musmeong.github.io/gender-guess/
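The app runs the converted model in the browser, but the same prediction can be sketched in Python; the sample row below is just an illustration, assuming the column order produced by pd.get_dummies:

# Predict the gender for one user; any one-hot row with the same
# 20 columns as X works (here we simply reuse the first row of X)
sample = X.iloc[[0]].astype('float32')
prob = best_model.predict(sample)[0][0]            # sigmoid output in [0, 1]
print(le.inverse_transform([int(prob > 0.5)]))     # map 0/1 back to the label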

The model I used is the same as the one proposed in this post; it is converted to TFJS with the following command (after pip install tensorflowjs).

!tensorflowjs_converter --input_format keras best_model.h5 models/
