Activation Functions
- Limits output values
- Squashing function
Threshold
- For binary functions
- Not differentiable
- Heaviside function
- Unipolar
- Bipolar
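
A minimal sketch (function names are my own, not from the notes) of the unipolar and bipolar threshold conventions:

```python
# Illustrative NumPy sketch of the two Heaviside/threshold conventions.
import numpy as np

def heaviside_unipolar(v):
    """Unipolar step: 1 for v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def heaviside_bipolar(v):
    """Bipolar step: +1 for v >= 0, else -1."""
    return np.where(v >= 0, 1.0, -1.0)

v = np.array([-2.0, -0.5, 0.0, 1.5])
print(heaviside_unipolar(v))  # [0. 0. 1. 1.]
print(heaviside_bipolar(v))   # [-1. -1.  1.  1.]
```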

Sigmoid
- Logistic function
- Normalises output into the range (0, 1)
- Introduces non-linearity
- Alternative is tanh
- Easy to take derivative
$$\frac{d\sigma(x)}{dx}=\frac{d}{dx}\left[\frac{1}{1+e^{-x}}\right]=\sigma(x)\,(1-\sigma(x))$$
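
A minimal NumPy sketch (names are illustrative, not from the notes) of the sigmoid and the derivative identity above, checked against a finite difference:

```python
# sigma'(x) = sigma(x) * (1 - sigma(x)); verify the closed form numerically.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-4, 4, 9)
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
print(np.allclose(numeric, sigmoid_derivative(x), atol=1e-6))  # True
```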

Derivative
$$y_j(n)=\varphi_j(v_j(n))=\frac{1}{1+e^{-v_j(n)}}$$
$$\frac{\partial y_j(n)}{\partial v_j(n)}=\varphi_j'(v_j(n))=\frac{e^{-v_j(n)}}{\left(1+e^{-v_j(n)}\right)^2}=y_j(n)\,\bigl(1-y_j(n)\bigr)$$
- Nice derivative
- Max value of $\varphi_j'(v_j(n))$ occurs when $y_j(n)=0.5$
- Min value of 0 when $y_j(n)=0$ or $1$
- Initial weights are chosen so units are not saturated at 0 or 1
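
A small illustrative check (my own, not from the notes) that the gradient $y_j(n)\,(1-y_j(n))$ peaks at $y_j(n)=0.5$ and vanishes as the unit saturates, which is why saturated units learn slowly:

```python
# phi'(v) = y * (1 - y) is largest at y = 0.5 (value 0.25)
# and approaches 0 as y approaches 0 or 1, so saturated units barely update.
for y in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"y = {y:4.2f}  ->  y * (1 - y) = {y * (1 - y):.4f}")
```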
If $y = \frac{u}{v}$, where $u$ and $v$ are differentiable functions:
$$\frac{dy}{dx}=\frac{d}{dx}\left(\frac{u}{v}\right)=\frac{v\,\frac{du}{dx}-u\,\frac{dv}{dx}}{v^2}$$
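
As a worked step (my own fill-in, applying the quotient rule above to the sigmoid with $u = 1$ and $v = 1+e^{-x}$):

$$\frac{d\sigma}{dx}=\frac{(1+e^{-x})\cdot 0 - 1\cdot(-e^{-x})}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2}=\sigma(x)\,(1-\sigma(x))$$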
ReLU
- Rectified Linear Unit
- For deep networks
- $y=\max(0,x)$
- CNNs
- Breaks up successive convolutions: without a non-linearity between them, stacked convolutions would compose into a single linear operation
- Critical for learning complex functions
- Sometimes a small scalar is applied to negative inputs (Leaky ReLU), as sketched below
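
A minimal sketch (names are my own) of ReLU and the Leaky ReLU variant with a small negative-side scalar:

```python
# ReLU and Leaky ReLU as elementwise NumPy functions.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(relu(x))        # negatives clipped to 0
print(leaky_relu(x))  # negatives scaled by alpha = 0.01
```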

SoftMax
- Output is per-class vector of likelihoods #classification
- Should be normalised into probability vector
AlexNet
$$f(x_i)=\frac{\exp(x_i)}{\sum_{j=1}^{1000}\exp(x_j)}$$
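
A minimal sketch (my own, not AlexNet's implementation) of the softmax above: exponentiate the logits and normalise so the per-class scores sum to 1.

```python
# Softmax over a vector of class logits.
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability; does not change the result
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())  # probability vector summing to 1.0
```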