Introduction

Neural networks are often described as mimicking the human brain. A neuron represents a small unit of computation: it takes input data and produces an output. The connections between neurons are weighted, and the neurons are organised into layers.

There can be multiple hidden layers; a neural network with 3 or more hidden layers is generally considered a deep learning network.

FNN - Feedforward Neural Network - data strictly moves in one direction (left to right); it never travels in a cycle (back and forth).
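A minimal NumPy sketch of a feedforward pass, assuming a tiny made-up 2-layer network with random placeholder weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Hypothetical 2-layer feedforward network: input -> hidden -> output.
# Weights are random placeholders; data only flows forward, never in a cycle.
rng = np.random.default_rng(0)
x = rng.normal(size=(3,))      # input vector
W1 = rng.normal(size=(4, 3))   # weighted connections into the hidden layer
W2 = rng.normal(size=(1, 4))   # weighted connections into the output layer

hidden = relu(W1 @ x)          # each hidden neuron: weighted sum + activation
output = W2 @ hidden           # output layer
print(output)
```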

Backpropagation - moves backward through the network and adjusts the weights; the goal is to reduce the loss by working out, for each weight, how much it contributed to the error.
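A minimal sketch of why the weights get adjusted, assuming a single made-up neuron with a squared-error loss and plain gradient descent:

```python
# One weight, one input: backpropagation computes dLoss/dw and the
# update nudges w in the direction that reduces the loss.
x, y_true = 2.0, 10.0
w = 1.0          # initial weight
lr = 0.1         # learning rate
for _ in range(20):
    y_pred = w * x                      # forward pass
    grad = 2 * (y_pred - y_true) * x    # dLoss/dw for loss = (y_pred - y_true)**2
    w -= lr * grad                      # weight update
print(w)  # converges toward 5.0, since 5 * 2 = 10
```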

Loss Function - measures the error by comparing the prediction against the ground truth.

Activation Function (the neuron's "algorithm") - determines the neuron's output from its inputs, and therefore what flows on to connected neurons, e.g. ReLU.

Dense - when most neurons fire and pass values on to the next layer; Sparse - when only a few neurons fire.

Deep Learning Algorithms (not activation functions)

Supervised

  • FNN - (Fully Connected) Feedforward Neural Network
  • RNN - Recurrent Neural Network
  • CNN - Convolutional Neural Network

Unsupervised

  • DBN - Deep Belief Network
  • SAE - Stacked Auto Encoder
  • RBM - Restricted Boltzmann Machine

Activation Functions

The output of an activation function is typically squashed into a range such as 0 to 1 or -1 to 1. The name "activation function" signifies whether the data moves forward or not; it acts like a gate.

Linear Activation Function - simply passes data forward unchanged (also known as the identity function)

  • The model is not really learning; it doesn't improve on the error term
  • Cannot usefully perform backpropagation (the gradient carries no information about the input)
  • Cannot benefit from multiple layers: stacked linear layers collapse into a single linear layer (see the sketch after this list)
  • Cannot handle complex, non-linear data
  • The derivative is 1: what you put in is what you get out
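A small NumPy sketch of the "cannot have layers" point: two stacked layers with the identity activation are equivalent to one linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5,))
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(3, 4))

# Two "layers" with the identity (linear) activation...
two_layers = W2 @ (W1 @ x)

# ...collapse into one layer with the combined weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```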

Non-linear activation functions - allow backpropagation to work usefully and let the network model non-linear data

Binary Step - Activate on threshold

  • Yes or No, {0, 1}
  • Not much used today
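A minimal sketch of the binary step, assuming a threshold of 0:

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Outputs 1 if the input reaches the threshold, otherwise 0.
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]
```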

Sigmoid - continuous logistic function, squashed between 0 and 1

  • Logistic curve that represents an S shape
  • Handles binary and multi-class classification
    • e.g. Cat, Dog and Donkey, why not
  • One of the most used activation functions
  • Responds less to changes in x at the extremes (the vanishing gradient problem, where the network stops learning further); see the sketch after this list
  • Range (0,1)
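A small sketch of the sigmoid and its derivative, showing how the gradient shrinks toward 0 for large |x| (the vanishing gradient problem):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigmoid(x) * (1 - sigmoid(x)); shrinks toward 0 for large |x|.
    s = sigmoid(x)
    return s * (1 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# At x = 10 the gradient is about 4.5e-05: neurons stuck in this region
# barely update, so the network stops learning there.
```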

Tanh - scaled version of sigmoid

  • The output scale is larger (-1 to 1), but the vanishing gradient problem remains
  • Sigmoid saturates earlier, whereas tanh is zero-centred (mean 0), which gives better gradients
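The "scaled sigmoid" claim can be checked numerically: tanh(x) = 2 * sigmoid(2x) - 1, which stretches the (0, 1) range to (-1, 1) and centres it at 0.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
# tanh is a shifted and rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```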

ReLU - treats any negative value as 0 (one of the most commonly used activation functions)

Rectified Linear Unit Activation Function

Linear on the positive axis; always 0 on the negative axis.

  • Range is [0, Infinity)
  • ReLU triggers neurons sparsely, unlike tanh and sigmoid, where the network becomes dense since almost all neurons fire (see the sketch after this list)
  • Computationally cheaper and more effective
  • A negative value often means the feature is not present
  • Dying ReLU: if a neuron keeps getting negative inputs, gradient = 0 → it stops learning
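A quick NumPy sketch of ReLU and the sparsity it produces on random inputs:

```python
import numpy as np

def relu(x):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(0, x)

rng = np.random.default_rng(2)
pre_activations = rng.normal(size=1000)   # made-up pre-activation values
activations = relu(pre_activations)

# Roughly half the neurons output exactly 0 -> sparse activations,
# unlike sigmoid/tanh where nearly every neuron fires a non-zero value.
print(np.mean(activations == 0))
```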

Leaky ReLU - counters the dying ReLU problem with a small slope for negative values

  • Reduces the risk of the "dying ReLU" problem
  • Saves some nodes from death

Parameterised ReLU (PReLU) - like Leaky ReLU, but the negative slope is a learnable parameter rather than being fixed (Leaky ReLU typically fixes it at 0.01x)

ReLU6 - the output is capped at a maximum value (6), i.e. bounded above
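Minimal sketches of Leaky ReLU (fixed small slope, assumed 0.01 here) and ReLU6 (capped at 6):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # A small fixed slope for negative inputs keeps a non-zero gradient alive.
    return np.where(x > 0, x, slope * x)

def relu6(x):
    # Like ReLU, but the output is bounded above at 6.
    return np.minimum(np.maximum(0, x), 6)

x = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.     2.    10.   ]
print(relu6(x))       # [0. 0. 0. 2. 6.]
```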

ELU - Exponential Linear Unit

  • Linear on the positive axis; the negative axis curves smoothly toward -1 (see the sketch after this list)
  • Something between ReLU and Leaky ReLU
  • Saturates at larger negative values
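A sketch of ELU, assuming alpha = 1 (so the negative side saturates at -1):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Linear for x > 0; a smooth curve that saturates at -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))  # approximately [-0.993 -0.632  0.     2.   ]
```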

Swish activation function

Suggested by a Google team as a replacement for ReLU. The idea: very negative values get zeroed out, while positive values pass through almost unchanged.
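Swish is defined as x * sigmoid(beta * x); a minimal sketch with beta = 1:

```python
import numpy as np

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x): very negative inputs go toward 0,
    # large positive inputs pass through almost unchanged.
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(swish(x))  # approx [-4.5e-04 -0.269  0.  0.731  9.9995]
```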

Maxout (n)

Has the benefits of ReLU but doesn't suffer from the dying ReLU problem. However, Maxout is expensive, as it doubles the number of parameters for each neuron (with k = 2 linear pieces).
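A minimal sketch of a Maxout layer with k = 2 linear pieces per neuron, which is where the doubled parameter count comes from:

```python
import numpy as np

def maxout(x, W, b):
    # Each output neuron computes k linear transforms of the input and keeps the max.
    # W has shape (k, out_dim, in_dim) and b has shape (k, out_dim), so the
    # parameter count per neuron is multiplied by k (doubled when k = 2).
    return np.max(np.einsum('koi,i->ko', W, x) + b, axis=0)

rng = np.random.default_rng(3)
x = rng.normal(size=(5,))
W = rng.normal(size=(2, 4, 5))  # k = 2 pieces, 4 output neurons, 5 inputs
b = rng.normal(size=(2, 4))
print(maxout(x, W, b))
```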