Activation Functions

  • Limits output values
  • Squashing function

Threshold

  • For binary functions
  • Not differentiable
    • Sharp rise
  • Heaviside function
  • Unipolar
    • 0 <-> +1
  • Bipolar
    • -1 <-> +1

[Figure: threshold activation functions]
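
A minimal NumPy sketch of the two threshold variants above (function names are my own, not from the notes):

```python
import numpy as np

def unipolar_threshold(v):
    """Heaviside-style step: 0 for v < 0, +1 for v >= 0."""
    return np.where(v >= 0, 1.0, 0.0)

def bipolar_threshold(v):
    """Sign-style step: -1 for v < 0, +1 for v >= 0."""
    return np.where(v >= 0, 1.0, -1.0)

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(unipolar_threshold(v))  # [0. 0. 1. 1. 1.]
print(bipolar_threshold(v))   # [-1. -1.  1.  1.  1.]
```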

Sigmoid

  • Logistic function
  • Normalises output into the range (0, 1)
  • Introduces non-linearity
  • Alternative is $\tanh$
    • -1 <-> +1
  • Easy to take the derivative: $\frac{d}{dx}\sigma(x) = \frac{d}{dx}\left[\frac{1}{1+e^{-x}}\right] = \sigma(x)\cdot(1-\sigma(x))$

[Figure: sigmoid activation]
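
A short sketch (function name is mine) showing that the logistic sigmoid squashes into (0, 1) while the $\tanh$ alternative squashes into (-1, +1):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))   # approx [0.007 0.269 0.5   0.731 0.993]
print(np.tanh(x))   # approx [-1.0  -0.762 0.   0.762 1.0  ]
```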

Derivative

$$y_j(n) = \varphi_j(v_j(n)) = \frac{1}{1+e^{-v_j(n)}}$$

$$\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n)) = \frac{e^{-v_j(n)}}{\left(1+e^{-v_j(n)}\right)^2} = y_j(n)\,(1-y_j(n))$$

  • Nice derivative
  • Max value of $\varphi_j'(v_j(n))$ occurs when $y_j(n)=0.5$
  • Min value of 0 when $y_j(n)=0$ or $1$
  • Initial weights are chosen so that neurons are not saturated at 0 or 1
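
A quick numerical check of these properties (a sketch, with my own function names): the derivative $y(1-y)$ peaks at 0.25 when $y = 0.5$ and shrinks towards 0 as the output saturates at 0 or 1.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    """phi'(v) = y(1 - y), computed from the output y itself."""
    y = sigmoid(v)
    return y * (1.0 - y)

print(sigmoid_prime(0.0))    # 0.25     -> maximum, where y = 0.5
print(sigmoid_prime(10.0))   # ~4.5e-05 -> saturated near y = 1, gradient vanishes
print(sigmoid_prime(-10.0))  # ~4.5e-05 -> saturated near y = 0, gradient vanishes

# Finite-difference check that the closed form matches the slope
v, h = 1.3, 1e-6
print((sigmoid(v + h) - sigmoid(v - h)) / (2 * h), sigmoid_prime(v))
```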

If $y=\frac{u}{v}$, where $u$ and $v$ are differentiable functions:

$$\frac{dy}{dx} = \frac{d}{dx}\left(\frac{u}{v}\right) = \frac{v\,\frac{d}{dx}(u) - u\,\frac{d}{dx}(v)}{v^2}$$
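
Applying this quotient rule with $u = 1$ and $v = 1 + e^{-x}$ recovers the sigmoid derivative quoted above:

$$\frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) = \frac{(1+e^{-x})\cdot 0 - 1\cdot(-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\,(1-\sigma(x))$$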

ReLU

Rectified Linear Unit

  • For deep networks
  • $y=\max(0,x)$
  • CNNs
    • Breaks the associativity of successive convolutions, so they no longer collapse into a single linear filter
      • Critical for learning complex functions
    • Sometimes a small scalar is applied to negative inputs
      • Leaky ReLU

[Figure: ReLU activation]
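
A minimal sketch of ReLU and Leaky ReLU (the 0.01 negative slope is an assumed, commonly used default, not from the notes):

```python
import numpy as np

def relu(x):
    """y = max(0, x): passes positives through, zeroes out negatives."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs are scaled by a small slope alpha."""
    return np.where(x >= 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03  -0.005  0.  2.]
```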

SoftMax

  • Output is a per-class vector of likelihoods #classification
    • Should be normalised into a probability vector

AlexNet

$$f(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{1000}\exp(x_j)}$$
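
A sketch of this normalisation in NumPy (the 1000 in the AlexNet formula is the number of ImageNet classes; the max-subtraction for numerical stability is my addition and does not change the result):

```python
import numpy as np

def softmax(x):
    """Normalise a vector of class scores into a probability vector."""
    z = x - np.max(x)    # stabilise so exp() cannot overflow
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approx [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```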