

CS231n - Lec3. Loss Functions and Optimization


Loss Function

Loss function (cost function): a way of evaluating how well your machine learning algorithm models the given data set; a measurement of how good your model is at predicting the expected outcome.

Linear Classifier

To do:

1. Define a loss function that quantifies our unhappiness with the scores across the training data

2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization)

Multiclass SVM loss (Support Vector Machine):

L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)

 

+1 is the safety margin.

How do we choose the +1? It doesn't really matter: it washes out with the overall setting of the scale of W.
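A minimal numpy sketch of this per-example loss (my own illustration, not code from the lecture; scores is the score vector for one example and y its correct class index):

import numpy as np

def svm_loss_i(scores, y, margin=1.0):
    # hinge loss for one example: sum over wrong classes of max(0, s_j - s_y + margin)
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0  # the correct class is not counted
    return np.sum(margins)

# example: the correct class (index 0) already wins by more than the margin -> loss 0
print(svm_loss_i(np.array([5.0, 1.0, 2.0]), y=0))  # 0.0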

 

Q1. What happens to the loss if the car scores change a bit?

   If the car score already beats the other scores by more than the margin, jiggling it a little does not change the loss.

Q2. What is the min/max possible loss? 

   min = 0, max = infinity (the per-class loss has the hinge shape __/)

Q3. At initialization W is small, so all s ≈ 0. What is the loss?

    (number of classes) - 1: each of the C - 1 wrong classes contributes max(0, 0 - 0 + 1) = 1

   => a useful thing to check in practice (sanity check)

Q4. What if the sum was over all classes? (including j = y_i)

   the loss goes up by 1 (the correct class adds max(0, 0 + 1) = 1)

   nothing changes significantly, but the convention is to omit the correct class so that the minimum loss is 0

Q5. What if we used mean instead of sum? 

   doesn't change anything meaningful, it just rescales the loss by a constant

Q6. What if we used a squared term: sum_{j != y_i} max(0, s_j - s_{y_i} + 1)^2 ?

   this is a different algorithm: the penalty is now non-linear in the margin violation

   how to choose between linear and squared?

   squared hinge loss: use it when we really don't want any big mistakes but are okay with many small ones
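A tiny numeric illustration of the difference (made-up margin-violation values):

import numpy as np

# three margin violations: two small ones and one big one
violations = np.array([0.1, 0.1, 5.0])
print(np.sum(violations))       # hinge:         5.2
print(np.sum(violations ** 2))  # squared hinge: 25.02 -> the single big mistake dominates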

 

Suppose we found a W such that loss = 0. Is this W unique?

No! 2W, 3W, 4W, ... also work: scaling W scales every score difference, which only makes the already-satisfied margins larger.

 

minimizing the loss on the training data alone is not what we really want (it can overfit)

add regularization to push toward a simpler model (Occam's Razor: the simplest explanation is the best)

lambda: regularization strength

keeps the model (e.g., a polynomial fit) from growing too complex (L1 drives individual weights to exactly 0; L2 drives the sum of squared weights toward 0)

L2 is often preferred over L1 (L1 can remove features you actually wanted, while L2 keeps every feature but shrinks all of them)
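A sketch of the full regularized loss (assuming rows of X are examples, W is the weight matrix, and lam is the lambda above); note that among all W with zero data loss, the L2 term prefers the smallest one, which resolves the 2W / 3W ambiguity:

import numpy as np

def svm_loss_regularized(W, X, y, lam=1e-3, margin=1.0):
    # data loss: mean multiclass SVM loss over all N examples
    scores = X.dot(W)                                    # (N, C)
    correct = scores[np.arange(len(y)), y][:, None]      # score of the true class, (N, 1)
    margins = np.maximum(0, scores - correct + margin)
    margins[np.arange(len(y)), y] = 0
    data_loss = np.mean(np.sum(margins, axis=1))
    # regularization loss: L2 penalty on the weights, scaled by lambda
    reg_loss = lam * np.sum(W * W)
    return data_loss + reg_loss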

Softmax Classifier

in a plain linear classifier we never say what the scores actually mean

but for the multinomial logistic regression (softmax) classifier, the scores do have a meaning: they are unnormalized log-probabilities of the classes

exponentiate the scores and normalize them: P(class k) = e^{s_k} / sum_j e^{s_j}, and the loss is L_i = -log P(correct class)

probability = 1 -> -log = 0 -> loss = 0

probability = 0.01 -> -log(0.01) ≈ 4.6 -> large loss
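A minimal numpy sketch of this softmax (cross-entropy) loss for one example (again my own illustration; the max-shift keeps the exponentials numerically stable without changing the probabilities):

import numpy as np

def softmax_loss_i(scores, y):
    shifted = scores - np.max(scores)                   # stability shift, softmax is unchanged
    probs = np.exp(shifted) / np.sum(np.exp(shifted))   # exponentiate and normalize
    return -np.log(probs[y])                            # -log probability of the correct class

# with all scores equal (e.g., at initialization) the loss is -log(1/C)
print(softmax_loss_i(np.zeros(3), y=1))  # ~1.0986 = -log(1/3)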

Q1. What is the min/max possible loss L_i? 

    min : 0 (-log 1),  max : infinity (-log 0)

    to actually reach 0 loss, the correct-class score would have to go to +infinity and every wrong-class score to -infinity

    so in practice we never get exactly 0 loss

Q2. Usually at initialization W is small, so all s ≈ 0. What is the loss?

   -log(1/C) = log(C); also used as a sanity check

Q3. Suppose I take a datapoint and jiggle it a bit (changing its score slightly). What happens to the loss in each case?

   SVM only wants the correct score to stay higher than the others by the margin; once that holds, jiggling does not affect the loss. Softmax, in contrast, always wants to push the correct score toward +infinity and the wrong scores toward -infinity, so jiggling does change the loss.

 

Optimization

how to minimize loss? 

Strategy #1 : Random search

Strategy #2 : Follow the slope - GRADIENT DESCENT 

numerical gradient : easy to write but slow, approximate 

compute the gradient dW from how the loss changes when W is perturbed

W => W+h => dW  (sometimes used for debugging - gradient check)

analytic gradient : use calculus, fast, exact, but error-prone

 in practice : derive the analytic gradient, then check your implementation against the numerical gradient (gradient check)
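A sketch of the numerical side of that check, using centered finite differences (assuming loss_fn(W) returns the scalar loss):

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    # perturb each weight by +/- h and measure how the loss changes
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + h
        loss_plus = loss_fn(W)
        W[idx] = orig - h
        loss_minus = loss_fn(W)
        W[idx] = orig                                   # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad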

 

once we know the gradient, we use GRADIENT DESCENT

weight += - step_size * weight_grad

the minus sign steps toward the minimum; step_size is the learning rate
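Put together as a loop, vanilla gradient descent looks roughly like this (a sketch with hypothetical names: loss_and_grad(W, X, y) is assumed to return the loss and the gradient dW, and W, X, y are assumed to exist):

# vanilla gradient descent (sketch)
step_size = 1e-3                              # learning rate
for _ in range(1000):
    loss, weight_grad = loss_and_grad(W, X, y)
    W += -step_size * weight_grad             # step toward lower loss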

 

Stochastic Gradient Descent(SGD)

updating W from the full dataset every time is too slow; instead split the data into minibatches (32, 64, 128, ...), update W from one minibatch, and repeat

update W using the gradient computed on each minibatch (an estimate of the full-data gradient)
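The same loop with minibatches (sketch; loss_and_grad is the same hypothetical helper as above):

import numpy as np

batch_size = 128
for _ in range(1000):
    idx = np.random.choice(len(X), batch_size, replace=False)   # sample a minibatch
    loss, weight_grad = loss_and_grad(W, X[idx], y[idx])         # gradient on the minibatch only
    W += -step_size * weight_grad                                # approximates the full-data step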

 

For Images

1. Color Histogram (a rough sketch appears after this list)

2. Histogram of Oriented Gradient (HoG)

3. Bag of Words

cut images into patches and run unsupervised learning / clustering on them; features such as edge orientations and colors pop out => when a new image comes in, compare it against this learned vocabulary to see which features it contains
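For the first feature above, a crude color-histogram sketch (assuming img is an H x W x 3 uint8 array; spatial positions are thrown away, only color counts remain):

import numpy as np

def color_histogram(img, bins=8):
    # bin every pixel's RGB value and count how often each bin occurs
    hist, _ = np.histogramdd(img.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / (img.shape[0] * img.shape[1])   # normalize by pixel count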

 

CNN: instead of extracting features by hand and feeding them in, the network learns to extract the features itself from the raw input image.
