[CS231n]Exercise1.3 - Softmax

Before we begin…

This is my solution to CS231n Exercise 1.3: Softmax. The full code is available on GitHub.

Softmax exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

This exercise is analogous to the SVM exercise. You will:

  • implement a fully-vectorized loss function for the Softmax classifier
  • implement the fully-vectorized expression for its analytic gradient
  • check your implementation with numerical gradient
  • use a validation set to tune the learning rate and regularization strength
  • optimize the loss function with SGD
  • visualize the final learned weights
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for the linear classifier. These are the same steps as we used for the
    SVM, but condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

    # Cleaning up variables to prevent loading data multiple times (which may cause memory issues)
    try:
       del X_train, y_train
       del X_test, y_test
       print('Clear previously loaded data.')
    except:
       pass

    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

    # subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]
    mask = np.random.choice(num_training, num_dev, replace=False)
    X_dev = X_train[mask]
    y_dev = y_train[mask]

    # Preprocessing: reshape the image data into rows
    X_train = np.reshape(X_train, (X_train.shape[0], -1))
    X_val = np.reshape(X_val, (X_val.shape[0], -1))
    X_test = np.reshape(X_test, (X_test.shape[0], -1))
    X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))

    # Normalize the data: subtract the mean image
    mean_image = np.mean(X_train, axis = 0)
    X_train -= mean_image
    X_val -= mean_image
    X_test -= mean_image
    X_dev -= mean_image

    # add bias dimension and transform into columns
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])

    return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev


# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()
print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
print('dev data shape: ', X_dev.shape)
print('dev labels shape: ', y_dev.shape)
Train data shape:  (49000, 3073)
Train labels shape:  (49000,)
Validation data shape:  (1000, 3073)
Validation labels shape:  (1000,)
Test data shape:  (1000, 3073)
Test labels shape:  (1000,)
dev data shape:  (500, 3073)
dev labels shape:  (500,)

Softmax Classifier

Your code for this section will all be written inside cs231n/classifiers/softmax.py.

# First implement the naive softmax loss function with nested loops.
# Open the file cs231n/classifiers/softmax.py and implement the
# softmax_loss_naive function.

from cs231n.classifiers.softmax import softmax_loss_naive
import time

# Generate a random softmax weight matrix and use it to compute the loss.
W = np.random.randn(3073, 10) * 0.0001
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As a rough sanity check, our loss should be something close to -log(0.1).
print('loss: %f' % loss)
print('sanity check: %f' % (-np.log(0.1)))
loss: 2.356826
sanity check: 2.302585

Computing the gradient

\[\begin{align} \frac{\partial L_i}{\partial W_j} &= (p_j - \mathbf{1}_{j=y_i}) \cdot x^{(i)} \\ \text{correct class } (j = y_i) &: (p_j - 1) \cdot x^{(i)} \\ \text{other classes } (j \ne y_i) &: p_j \cdot x^{(i)} \end{align}\]

This follows from $\frac{\partial L_i}{\partial W_j} = (p_j - \delta_{j,y_i}) \cdot x^{(i)}$, where $\delta_{j,y_i} = 1$ if $j = y_i$ and $0$ otherwise.
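
For reference, here is a minimal sketch of how this formula translates into the naive, loop-based implementation. It follows the assignment's softmax_loss_naive(W, X, y, reg) signature, but it is an illustration of the idea rather than my exact softmax.py code.

import numpy as np

def softmax_loss_naive_sketch(W, X, y, reg):
    """Softmax loss and gradient with explicit loops (illustrative sketch)."""
    dW = np.zeros_like(W)
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)            # class scores for one example, shape (C,)
        scores -= np.max(scores)        # shift for numerical stability
        probs = np.exp(scores) / np.sum(np.exp(scores))
        loss += -np.log(probs[y[i]])
        for j in range(num_classes):
            # dL_i/dW_j = (p_j - 1{j == y_i}) * x^(i)
            dW[:, j] += (probs[j] - (j == y[i])) * X[i]
    loss = loss / num_train + reg * np.sum(W * W)
    dW = dW / num_train + 2 * reg * W
    return loss, dW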

Inline Question 1

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

$\color{blue}{\textit Your Answer:}$ The softmax loss is computed as \(L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\), where $f_j$ is the $j$-th element of the class score vector $f$, i.e. the score of class $j$ for the input sample. Since $W$ is initialized with tiny random values, all class scores are nearly equal, so the model predicts roughly the same probability $\frac{1}{C}$ for every class, and the loss simplifies to \(L_i = -\log\left(\frac{1}{C}\right) = \log(C)\). For CIFAR-10 with $C = 10$ classes, this is $-\log(0.1) = \log(10) \approx 2.302$.

# Complete the implementation of softmax_loss_naive and implement a (naive)
# version of the gradient that uses nested loops.
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)

# As we did for the SVM, use numeric gradient checking as a debugging tool.
# The numeric gradient should be close to the analytic gradient.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)
print('End of numeric gradient check. reg = 0.0\n')

# similar to SVM case, do another gradient check with regularization
loss, grad = softmax_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad, 10)
print('End of numeric gradient check. reg = 5e1')

numerical: 1.664401 analytic: 1.664401, relative error: 4.367385e-08
numerical: 3.735496 analytic: 3.735496, relative error: 1.282187e-08
numerical: -0.727268 analytic: -0.727268, relative error: 7.609598e-08
numerical: 1.459336 analytic: 1.459336, relative error: 2.002898e-08
numerical: 0.331639 analytic: 0.331639, relative error: 6.909905e-08
numerical: 0.296508 analytic: 0.296508, relative error: 2.724084e-08
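
For context, grad_check_sparse compares the analytic gradient against a centered finite difference at a handful of randomly sampled coordinates. A generic sketch of that idea (my own illustration, not the course's implementation) looks roughly like this:

def numeric_grad_check_sketch(f, W, analytic_grad, num_checks=10, h=1e-5):
    """Spot-check analytic gradient entries with centered finite differences."""
    for _ in range(num_checks):
        ix = tuple(np.random.randint(n) for n in W.shape)  # random coordinate
        old_value = W[ix]
        W[ix] = old_value + h
        fxph = f(W)                                        # f(W + h)
        W[ix] = old_value - h
        fxmh = f(W)                                        # f(W - h)
        W[ix] = old_value                                  # restore
        grad_numerical = (fxph - fxmh) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = (abs(grad_numerical - grad_analytic)
                     / (abs(grad_numerical) + abs(grad_analytic) + 1e-12))
        print('numerical: %f analytic: %f, relative error: %e'
              % (grad_numerical, grad_analytic, rel_error))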
# Now that we have a naive implementation of the softmax loss function and its gradient,
# implement a vectorized version in softmax_loss_vectorized.
# The two versions should compute the same results, but the vectorized version should be
# much faster.
tic = time.time()
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))

from cs231n.classifiers.softmax import softmax_loss_vectorized
tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))

# As we did for the SVM, we use the Frobenius norm to compare the two versions
# of the gradient.
grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))
print('Gradient difference: %f' % grad_difference)
naive loss: 2.356826e+00 computed in 0.382285s
vectorized loss: 2.356826e+00 computed in 0.106798s
Loss difference: 0.000000
Gradient difference: 0.000000
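
For reference, a minimal vectorized sketch in the spirit of softmax_loss_vectorized (again an illustration of the technique, not necessarily my exact submission):

def softmax_loss_vectorized_sketch(W, X, y, reg):
    """Softmax loss and gradient without explicit loops (illustrative sketch)."""
    num_train = X.shape[0]
    scores = X.dot(W)                                   # (N, C)
    scores -= np.max(scores, axis=1, keepdims=True)     # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    loss = -np.sum(np.log(probs[np.arange(num_train), y])) / num_train
    loss += reg * np.sum(W * W)

    # Gradient: subtract 1 at the correct-class entries, then backprop through X
    probs[np.arange(num_train), y] -= 1
    dW = X.T.dot(probs) / num_train + 2 * reg * W
    return loss, dW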
# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.

from cs231n.classifiers import Softmax
results = {}
best_val = -1
best_softmax = None

################################################################################
# TODO:                                                                        #
# Use the validation set to set the learning rate and regularization strength. #
# This should be identical to the validation that you did for the SVM; save    #
# the best trained softmax classifier in best_softmax.                         #
################################################################################

# Provided as a reference. You may or may not want to change these hyperparameters
learning_rates = [1e-7, 5e-7, 1, 3, 5]
regularization_strengths = [2.5e4, 5e4, 1, 3, 5]

# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
import itertools

for lr, reg in itertools.product(learning_rates, regularization_strengths):
  model = Softmax()
  model.train(X_train, y_train, learning_rate = lr, reg = reg, num_iters = 1000)

  y_train_pred = model.predict(X_train)
  y_val_pred = model.predict(X_val)

  train_acc = np.mean(y_train == y_train_pred)
  val_acc = np.mean(y_val_pred == y_val)

  results[(lr,reg)] = (train_acc, val_acc)

  if best_val < val_acc:
    best_val = val_acc
    best_softmax = model

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:96: RuntimeWarning: divide by zero encountered in log
  loss = -np.sum(np.log(correct_class_probs))/num_train
/usr/local/lib/python3.11/dist-packages/numpy/_core/fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:97: RuntimeWarning: overflow encountered in multiply
  loss += reg * np.sum(W * W)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:102: RuntimeWarning: overflow encountered in multiply
  dW += 2 * reg * W
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:97: RuntimeWarning: overflow encountered in scalar multiply
  loss += reg * np.sum(W * W)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:89: RuntimeWarning: overflow encountered in subtract
  scores -= np.max(scores, axis=1, keepdims=True)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/softmax.py:89: RuntimeWarning: invalid value encountered in subtract
  scores -= np.max(scores, axis=1, keepdims=True)
/content/drive/My Drive/cs231n/assignments/assignment1/cs231n/classifiers/linear_classifier.py:88: RuntimeWarning: overflow encountered in multiply
  self.W -= learning_rate * grad # Vanilla Gradient Descent 응용 step size == learning rate


lr 1.000000e-07 reg 1.000000e+00 train accuracy: 0.227816 val accuracy: 0.223000
lr 1.000000e-07 reg 3.000000e+00 train accuracy: 0.221816 val accuracy: 0.210000
lr 1.000000e-07 reg 5.000000e+00 train accuracy: 0.229306 val accuracy: 0.232000
lr 1.000000e-07 reg 2.500000e+04 train accuracy: 0.329143 val accuracy: 0.340000
lr 1.000000e-07 reg 5.000000e+04 train accuracy: 0.307000 val accuracy: 0.319000
lr 5.000000e-07 reg 1.000000e+00 train accuracy: 0.299776 val accuracy: 0.310000
lr 5.000000e-07 reg 3.000000e+00 train accuracy: 0.299429 val accuracy: 0.309000
lr 5.000000e-07 reg 5.000000e+00 train accuracy: 0.303020 val accuracy: 0.289000
lr 5.000000e-07 reg 2.500000e+04 train accuracy: 0.322020 val accuracy: 0.345000
lr 5.000000e-07 reg 5.000000e+04 train accuracy: 0.293918 val accuracy: 0.312000
lr 1.000000e+00 reg 1.000000e+00 train accuracy: 0.083204 val accuracy: 0.073000
lr 1.000000e+00 reg 3.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 1.000000e+00 reg 5.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 1.000000e+00 reg 2.500000e+04 train accuracy: 0.100265 val accuracy: 0.087000
lr 1.000000e+00 reg 5.000000e+04 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e+00 reg 1.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e+00 reg 3.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e+00 reg 5.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e+00 reg 2.500000e+04 train accuracy: 0.100265 val accuracy: 0.087000
lr 3.000000e+00 reg 5.000000e+04 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e+00 reg 1.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e+00 reg 3.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e+00 reg 5.000000e+00 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e+00 reg 2.500000e+04 train accuracy: 0.100265 val accuracy: 0.087000
lr 5.000000e+00 reg 5.000000e+04 train accuracy: 0.100265 val accuracy: 0.087000
best validation accuracy achieved during cross-validation: 0.345000
# evaluate on test set
# Evaluate the best softmax on test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))
softmax on raw pixels final test set accuracy: 0.334000

Inline Question 2 - True or False

Suppose the overall training loss is defined as the sum of the per-datapoint loss over all training examples. It is possible to add a new datapoint to a training set that would leave the SVM loss unchanged, but this is not the case with the Softmax classifier loss.

$\color{blue}{\textit Your Answer:}$ True

$\color{blue}{\textit Your Explanation:}$

The per-datapoint SVM loss is

\[L_i = \sum_{j\ne y_i}\max(0, f_j - f_{y_i} + \Delta)\]

If a new datapoint is added whose correct-class score exceeds every other class score by at least the margin $\Delta$, its loss is 0 and the total loss is unchanged.

The per-datapoint Softmax loss is

\[L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\]

The Softmax loss shrinks as the correct-class probability approaches 1, but it is strictly positive unless that probability is exactly 1. So no matter how confidently a new point is classified, its loss is greater than 0, and adding it always increases the total loss.
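
A tiny numeric check of this claim, using toy scores of my own:

import numpy as np

# Toy scores for one example: the correct class (index 0) wins by a large margin.
scores = np.array([10.0, 1.0, 1.0])
y, delta = 0, 1.0

# SVM (hinge) loss: all margins are satisfied, so the loss is exactly 0.
svm_loss = np.sum(np.maximum(0, np.delete(scores, y) - scores[y] + delta))

# Softmax loss: the correct-class probability is < 1, so the loss is > 0.
probs = np.exp(scores - np.max(scores))
probs /= probs.sum()
softmax_loss = -np.log(probs[y])

print('SVM loss:', svm_loss)          # 0.0
print('Softmax loss:', softmax_loss)  # small but strictly positive (~2.5e-4)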

# Visualize the learned weights for each class
w = best_softmax.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)

    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])


