

CS231n- Lec6. Training Neural Networks 1


Part 1 

Activation Functions

what if the input to a neuron is always positive (e.g. sigmoid outputs)?

the neuron computes f = w·x + b and passes f through the activation function

the upstream gradient dL/df is a single scalar for that neuron, and the local gradient is df/dw = x

so dL/dw = (dL/df) * x; since every component of x is positive, every component of dL/dw shares the sign of dL/df

=> W can only move in one common direction per update (all components increase or all decrease), which gives zig-zag update paths; this is why we want zero-centered inputs
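A minimal numpy sketch of the point above (the upstream scalar dL/df and the input values are made up): with all-positive x, every component of dL/dw ends up with the same sign.

import numpy as np

# Minimal sketch: one neuron f = w.x with all-positive inputs x.
# dL/df is a single upstream scalar, so dL/dw = (dL/df) * x shares
# one sign across every component whenever x > 0.
np.random.seed(0)
x = np.abs(np.random.randn(5))        # all-positive inputs (e.g. sigmoid outputs)
dL_df = -0.7                          # made-up upstream gradient (scalar)

dL_dw = dL_df * x                     # local gradient df/dw = x
print(dL_dw)                          # every entry is negative here
print(np.unique(np.sign(dL_dw)))      # -> [-1.]: all weights pushed the same way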

Data Preprocessing

zero mean: subtract the per-dimension mean

normalize => every dimension contributes on a comparable scale

images: usually only zero-centering (subtract the mean image or a per-channel mean), not full normalization, since pixels already live on the same scale

can data preprocessing fix the sigmoid's non-zero-centered output problem?

only at the first layer; deeper layers receive sigmoid outputs, which are all positive again
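A short numpy sketch of these preprocessing steps; the toy data shapes and the placeholder image batch are assumptions for illustration.

import numpy as np

# Sketch of the preprocessing steps above, assuming X has shape (N, D).
X = np.random.randn(1000, 3) * [5.0, 0.1, 50.0] + [10.0, -2.0, 0.0]

X_zero = X - X.mean(axis=0)               # zero mean per dimension
X_norm = X_zero / X_zero.std(axis=0)      # unit variance per dimension

# For images we typically stop at zero-centering, e.g. subtracting a
# per-channel mean computed over the training set (placeholder data here).
images = np.random.randint(0, 256, size=(10, 32, 32, 3)).astype(np.float32)
channel_mean = images.mean(axis=(0, 1, 2))    # one mean per RGB channel
images_centered = images - channel_mean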

Weight Initialization

what happens if we initialize all parameters to 0 (or any identical constant)?

with w = 0 every neuron computes the same output, gets the same gradient, and receives the same update

=> all neurons stay identical forever; the symmetry breaking we want never happens

Q: but couldn't the loss make backprop affect different neurons differently?

A: no; because the neurons are wired with identical weights, they receive identical gradients
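A toy numpy sketch of this symmetry problem (the layer sizes and the constant 0.1 are arbitrary): every hidden neuron computes the same value and gets the same gradient, so nothing ever differentiates them.

import numpy as np

# Initialize every weight to the same constant. All hidden neurons then
# compute the same value and receive the same gradient, so they never
# diverge from each other. Zero init is the extreme case of this.
np.random.seed(0)
X = np.random.randn(4, 3)
y = np.random.randn(4, 1)

W1 = np.full((3, 5), 0.1)            # identical weights for every hidden neuron
W2 = np.full((5, 1), 0.1)

h = np.tanh(X @ W1)                  # every hidden column is identical
out = h @ W2
dout = 2 * (out - y) / len(X)        # gradient of mean squared error
dW2 = h.T @ dout
dW1 = X.T @ ((dout @ W2.T) * (1 - h ** 2))

print(np.allclose(h, h[:, :1]))      # True: all neurons output the same thing
print(np.allclose(dW1, dW1[:, :1]))  # True: all neurons get the same update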

 

Small random numbers

W = 0.01 * np.random.randn(D,H)

works okay for small networks, but causes problems in deeper networks

activations collapse toward 0 in deeper layers (each layer multiplies by tiny weights, and tanh keeps them small), so the gradients vanish as well

what if we use larger weights instead, e.g. 1.0 * np.random.randn(D,H)? activations get pushed to 1 or -1

with tanh this means saturation, and the gradients are again nearly zero
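A rough reproduction of the activation-statistics experiment from the lecture; the layer widths, depth, and the two scales are assumptions for illustration.

import numpy as np

# Push random data through a deep tanh stack and watch the hidden
# activations for a small (0.01) vs. large (1.0) random init.
np.random.seed(0)
x = np.random.randn(1000, 500)

for scale in (0.01, 1.0):
    h = x
    stds = []
    for _ in range(10):
        W = scale * np.random.randn(500, 500)
        h = np.tanh(h @ W)
        stds.append(h.std())
    print(f"scale={scale}: per-layer stds {np.round(stds, 3)}")
    # scale=0.01 -> stds shrink toward 0 (activations die, gradients vanish)
    # scale=1.0  -> |h| is pinned near 1 (tanh saturates, gradients also vanish)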

Xavier initialization (Glorot 2010)

sample the weights from a standard Gaussian and scale them by the number of inputs (divide by sqrt(fan_in)), so the variance of the output matches the variance of the input

W = np.random.randn(fan_in, fan_out)/np.sqrt(fan_in)

but it doesn't work well with ReLU: ReLU zeroes out half of the distribution, so the variance is halved at every layer, activations keep shrinking, and more and more units end up deactivated
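A small sketch comparing Xavier scaling with the ReLU-aware variant that divides by sqrt(fan_in/2) (the He et al. 2015 fix mentioned in the lecture); widths and depth are arbitrary.

import numpy as np

# Compare Xavier vs. He-style scaling on a deep ReLU stack. With Xavier,
# each ReLU halves the variance, so activations keep shrinking; dividing
# by sqrt(fan_in / 2) compensates for the halving.
np.random.seed(0)
x = np.random.randn(1000, 500)

for name, denom in (("xavier", np.sqrt(500)), ("he", np.sqrt(500 / 2))):
    h = x
    for _ in range(10):
        W = np.random.randn(500, 500) / denom
        h = np.maximum(0, h @ W)                  # ReLU
    print(name, "std after 10 layers:", round(float(h.std()), 4))
    # xavier -> activations shrink toward 0; he -> they keep a healthy spread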

Batch Normalization

consider a batch of activations at some layer

we want to make each dimension unit Gaussian

usually inserted after FC or conv layers and before the nonlinearity

improves gradient flow through the network

allows higher learning rates

reduces the strong dependence on initialization

acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe 

BN normalizes the input to the layer, not the layer's weights

unit Gaussian => x_hat = (x - batch mean) / batch std, followed by a learnable scale and shift: y = gamma * x_hat + beta

Q: if we add the learnable scale and shift (gamma, beta) and train them, couldn't the network just learn to undo the normalization (an identity mapping), so BN effectively disappears?

A: in practice this doesn't happen; the learned gamma and beta rarely recover the exact identity
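A minimal sketch of the BN forward pass described above, training mode only (at test time a running mean and variance would be used instead); the shapes and epsilon value are assumptions.

import numpy as np

# Training-mode batch norm forward pass: per-dimension unit-Gaussian
# normalization followed by the learnable scale (gamma) and shift (beta).
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # per-dimension batch mean
    var = x.var(axis=0)                       # per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # unit Gaussian per dimension
    return gamma * x_hat + beta               # learned scale and shift

x = 3.0 * np.random.randn(64, 100) + 5.0      # a batch of pre-activations
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())                  # roughly 0 and 1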

Babysitting the Learning Process

sanity check: take a small subset of the data; the model must be able to overfit it (training loss driven near zero)
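A rough sketch of this sanity check on made-up data (20 random samples, a tiny two-layer net, hand-picked learning rate and step count); on real data you would slice off a small piece of the training set the same way.

import numpy as np

# Overfit a tiny made-up dataset with a small 2-layer softmax classifier.
np.random.seed(0)
X = np.random.randn(20, 10)
y = np.random.randint(0, 3, size=20)                  # 3 fake classes

W1 = 0.1 * np.random.randn(10, 50); b1 = np.zeros(50)
W2 = 0.1 * np.random.randn(50, 3);  b2 = np.zeros(3)
lr = 0.5

for step in range(500):
    h = np.maximum(0, X @ W1 + b1)                    # ReLU hidden layer
    scores = h @ W2 + b2
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(20), y]).mean()        # softmax cross-entropy

    dscores = p.copy(); dscores[np.arange(20), y] -= 1; dscores /= 20
    dW2 = h.T @ dscores; db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T; dh[h <= 0] = 0
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("final training loss:", round(float(loss), 4))  # should fall far below log(3) ~= 1.1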

Hyperparameter Optimization

learning rate (search it on a log scale, coarse first, then fine)

cross-validation: train on the training set, evaluate on the validation set
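A sketch of coarse random search over hyperparameters on a log scale; train_and_eval is a hypothetical placeholder for a real training-plus-validation run, and the sampling ranges are assumptions.

import numpy as np

np.random.seed(0)

def train_and_eval(lr, reg):
    # hypothetical stand-in: train on the training set with these
    # hyperparameters and return accuracy on the validation set
    return np.random.rand()

results = []
for _ in range(20):
    lr = 10 ** np.random.uniform(-6, -2)      # sample learning rate on a log scale
    reg = 10 ** np.random.uniform(-5, 0)      # sample regularization strength too
    results.append((train_and_eval(lr, reg), lr, reg))

for val_acc, lr, reg in sorted(results, reverse=True)[:3]:
    print(f"val_acc={val_acc:.3f}  lr={lr:.2e}  reg={reg:.2e}")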

 
