Machine Learning (study notes)

There is no studying without going crazy

Studying always drives us crazy

Notes from Andrew Ng's Machine Learning course series

Definition

Machine Learning

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

It really rhymes.

Supervised Learning

right answers given

Regression problem

trying to predict a continuous-valued output

In this kind of problem, you give the algorithm examples with the right values attached, and it learns to predict the value for new inputs.


Classification

discrete-valued output (for example, zero or one)

In this kind of problem you also give labeled examples. The difference from regression is the type of output data: discrete labels instead of continuous values.

Unsupervised Learning

Clustering

A clustering algorithm might be used to break the data into two separate clusters.

We do not know in advance what the data mean or what their labels are; the algorithm has to group the data into different clusters on its own.

The classic example here is the cocktail party problem: a recording mixes background music with a speaker's voice, and the algorithm separates the two sources.

Study

Model representation

We use the housing-price example.

And we are given a training set.
As you can see, we define:
m as the number of training examples,
x’s as the “input” variables / features,
y’s as the “output” variable / “target” variable,
(x, y) as one training example,
(x^(i), y^(i)) as the i-th training example. (The superscript i here is not exponentiation; the i in parentheses is just an index into the training set.)

We take a training set, like our training set of housing prices, and feed it to our learning algorithm. The job of the learning algorithm is to output a function which, by convention, is usually denoted lowercase h (the hypothesis).
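
As a rough sketch of this notation in Python with NumPy (the numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical training set: house sizes in square feet (x), prices in $1000s (y)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # "input" feature
y = np.array([460.0, 232.0, 315.0, 178.0])     # "output" / target variable
m = len(x)                                     # m = number of training examples

# (x^(i), y^(i)) is the i-th training example (0-indexed in code)
i = 1
print(f"example {i + 1}: x = {x[i]}, y = {y[i]}")

# The hypothesis h maps from x's to estimated y's
def h(x, theta0, theta1):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x
```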

Cost function

We want the difference between h(x) and y to be small. One thing we can do is try to minimize the squared difference between the output of the hypothesis and the actual price of the house.

What we want to do is minimize, over θ0 and θ1, the cost function J(θ0, θ1) = (1/(2m)) · Σᵢ (h_θ(x^(i)) − y^(i))².

This squared error cost function is probably the most commonly used cost function for regression problems.
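
Here is a minimal sketch of the cost computation, reusing the made-up housing numbers from above:

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared error cost: J = (1/(2m)) * sum over i of (h(x^(i)) - y^(i))^2."""
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Same made-up housing numbers as above
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(x, y, 0.0, 0.2))  # J for one particular (theta0, theta1)
```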

How to use it

Here is something we will use.

In order to figure out how this works, we set the parameter θ0 equal to 0, so we have only one parameter, θ1.

On the left, each line we fit (each choice of θ1) maps to a single point on the J(θ1) curve on the right.
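
A tiny sketch of that mapping, with θ0 fixed at 0 and a toy dataset where y = x exactly, so the minimum is at θ1 = 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])  # a perfect line through the origin, slope 1
m = len(x)

# Each choice of theta1 (a line through the origin) gives one value of J(theta1)
for theta1 in [0.0, 0.5, 1.0, 1.5]:
    J = np.sum((theta1 * x - y) ** 2) / (2 * m)
    print(f"theta1 = {theta1:.1f} -> J = {J:.3f}")  # J hits 0 at theta1 = 1
```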

Solving the problem

Here is our problem formulation as usual, with the hypothesis, parameters, cost function, and our optimization objective.

Plotting that cost function over θ0 and θ1 gives a bowl-shaped 3-D surface.

The same surface can also be drawn as a contour figure, where each ellipse is a set of (θ0, θ1) points with equal J.

Gradient descent

It turns out gradient descent is a more general algorithm, and is used not only in linear regression. It’s actually used all over the place in machine learning.

Here is the problem setup.
We have some function J(θ0, θ1); maybe it is a cost function from linear regression. We want to come up with an algorithm for minimizing J as a function of θ0 and θ1.

For the sake of brevity, and for succinctness of notation, we are going to pretend that we have only two parameters through the rest of this video.


The idea of gradient descent:

We start off with some initial guesses for θ0 and θ1 (a common choice is to set both to 0).
We then keep changing θ0 and θ1 a little bit to try to reduce J(θ0, θ1), until we hopefully end up at a minimum.

Summary of gradient descent

Here is the gradient descent algorithm that we saw last time: repeat until convergence, θj := θj − α · (∂/∂θj) J(θ0, θ1), for j = 0 and j = 1, where α is the learning rate and both parameters are updated simultaneously.
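
A minimal sketch of these updates on a toy bowl-shaped J (not yet the linear regression cost), using numerical partial derivatives; the point is the simultaneous update:

```python
def J(theta0, theta1):
    """A toy bowl-shaped cost, just to demonstrate the updates; minimum at (1, -2)."""
    return (theta0 - 1.0) ** 2 + (theta1 + 2.0) ** 2

alpha = 0.1   # learning rate
eps = 1e-6    # step size for the numerical derivative
theta0, theta1 = 0.0, 0.0

for _ in range(100):
    # Numerical partial derivatives of J at the current point
    d0 = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
    d1 = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)
    # Simultaneous update: compute both new values before assigning either
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    theta0, theta1 = temp0, temp1

print(theta0, theta1)  # converges toward the minimum at (1, -2)
```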

In order to convey these intuitions, we will use a slightly simpler example where we minimize a function of just one parameter.
So we have a cost function J of just one parameter, θ1.


Suppose we initialize θ1 at the red point, to the right of the minimum. The derivative (d/dθ1) J(θ1) is the slope of the tangent line at that point, and here it is definitely a positive number. So the update is θ1 := θ1 − α · (positive number), and since the learning rate α is always positive, θ1 decreases.
That is actually the right behavior: moving θ1 in this direction gets us closer to the minimum.

Similarly, when the point is initialized to the left of the minimum, the derivative is negative, so θ1 increases and eventually moves to the right.

Suppose you initialize θ1 at a local minimum, so it is already at a local optimum. It turns out that at a local optimum the derivative is equal to zero. So in the gradient descent update, θ1 := θ1 − α · 0,
which means that if you are already at a local optimum, gradient descent leaves θ1 unchanged.


Gradient descent for linear regression

Here is gradient descent for linear regression, which repeats until convergence:
θ0 := θ0 − α · (1/m) · Σᵢ (h_θ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) · Σᵢ (h_θ(x^(i)) − y^(i)) · x^(i)
with both parameters updated simultaneously.

This kind of algorithm is sometimes called batch gradient descent, because each step uses the whole batch of training examples.
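
A runnable sketch on small toy data (the data and the fixed iteration count are my own choices for illustration):

```python
import numpy as np

# Toy data: y is roughly 2*x, so we expect theta1 near 2 and theta0 near 0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m = len(x)

alpha = 0.05
theta0, theta1 = 0.0, 0.0

for _ in range(2000):  # "repeat until convergence" (fixed count for simplicity)
    error = theta0 + theta1 * x - y          # h(x^(i)) - y^(i) for every i at once
    temp0 = theta0 - alpha * error.mean()           # alpha * (1/m) * sum(error)
    temp1 = theta1 - alpha * (error * x).mean()     # alpha * (1/m) * sum(error * x)
    theta0, theta1 = temp0, temp1            # simultaneous update

print(theta0, theta1)  # approaches the least-squares fit (about 0.15, 1.94)
```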

Matrices and vectors (basic knowledge)

First, let us learn what a matrix is: a rectangular array of numbers, whose dimension is written as (number of rows) × (number of columns).
Next, let us talk about how to refer to specific elements of the matrix: A_ij is the entry in the i-th row and j-th column.

What is a vector?
A vector turns out to be a special case of a matrix.
A vector is a matrix that has only 1 column (an n × 1 matrix).

About indexing:
vectors can be 1-indexed (the usual math convention) or 0-indexed. In machine code, indices start from zero, so the 0-indexed vector is often the more convenient notation in practice.
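
A small NumPy sketch of these definitions (NumPy itself is always 0-indexed):

```python
import numpy as np

# A 3x2 matrix: 3 rows, 2 columns
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(A.shape)    # (3, 2) -> dimension "3 by 2"

# A_ij = entry in row i, column j; math is 1-indexed, NumPy is 0-indexed,
# so the math entry A_32 is A[2, 1] in code
print(A[2, 1])    # 6

# A vector is a matrix with only 1 column (here a 3x1 matrix)
v = np.array([[7], [8], [9]])
print(v.shape)    # (3, 1)
```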

Addition and scalar multiplication

It turns out you can add only two matrices that are of the same dimensions, and the addition is element-wise. Scalar multiplication multiplies every entry of the matrix by the scalar.
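
A quick sketch of both operations:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 5.0]])
B = np.array([[4.0, 0.5],
              [2.0, 5.0]])

print(A + B)   # element-wise addition; A and B must have the same dimensions
print(3 * A)   # scalar multiplication: every entry is multiplied by 3
```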


Matrix-vector multiplication

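The figures for this section are missing; the rule is that an m × n matrix times an n × 1 vector gives an m × 1 vector. A neat use from the course is computing all hypothesis predictions in one matrix-vector product (the hypothesis numbers here are made up):

```python
import numpy as np

# Predicting house prices with an example hypothesis h(x) = -40 + 0.25*x
sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])

# Build a 4x2 "design matrix": a column of ones (for theta0) and the sizes
X = np.column_stack([np.ones_like(sizes), sizes])
theta = np.array([-40.0, 0.25])

predictions = X @ theta   # (4x2 matrix) @ (2-vector) -> 4 predictions at once
print(predictions)
```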

Matrix-matrix multiplication

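The figures are missing here too; the rule is that an m × n matrix times an n × p matrix gives an m × p matrix, and column j of the result is the first matrix times column j of the second. A sketch:

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [2.0, 5.0]])        # 2x2
B = np.array([[0.0, 1.0, 2.0],
              [3.0, 0.0, 1.0]])   # 2x3

C = A @ B                         # (2x2) @ (2x3) -> 2x3
print(C)
print(A @ B[:, 1])                # column 1 of C is A times column 1 of B
```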

Matrix multiplication properties

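The figures are missing; the two properties worth remembering are that matrix multiplication is not commutative (A × B ≠ B × A in general) but is associative ((A × B) × C = A × (B × C)). A quick check:

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[2.0, 0.0], [1.0, 2.0]])
C = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.allclose(A @ B, B @ A))              # False: not commutative in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))  # True: associative
```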

Identity Matrix

The identity matrix has ones along the diagonal and zeros everywhere else. For any matrix A of compatible dimensions, A · I = I · A = A.

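A sketch of that property:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
I = np.eye(2)                    # 2x2 identity matrix

print(I)                         # ones on the diagonal, zeros elsewhere
print(np.allclose(A @ I, A))     # True
print(np.allclose(I @ A, A))     # True
```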

Inverse and transpose

It turns out only square matrices have inverses; the inverse A⁻¹ satisfies A · A⁻¹ = A⁻¹ · A = I.
Matrices that do not have an inverse are called “singular” or “degenerate”.

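A sketch using NumPy's built-in routines:

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A times its inverse is I

print(A.T)   # transpose: rows become columns

# A singular ("degenerate") matrix has no inverse:
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])    # second row is 2x the first
# np.linalg.inv(S) would raise LinAlgError: Singular matrix
```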

Multiple features

We had a single feature x, the size of the house, and we wanted to use it to predict y, the price of the house; the hypothesis took the form h_θ(x) = θ0 + θ1x.
But now imagine that we have not only the size of the house as a feature, but that we also know the number of bedrooms, the number of floors, and the age of the home in years. This would give us a lot more information with which to predict the price.

If we have n features, then rather than summing over our four features, we sum over all n of them: h_θ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn.
To simplify the notation, we define x0 = 1; the hypothesis can then finally be written as the inner product h_θ(x) = θᵀx.
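
A sketch of the vectorized hypothesis, with made-up feature values (size, bedrooms, floors, age) and made-up parameters:

```python
import numpy as np

# One training example: size, number of bedrooms, number of floors, age
features = np.array([2104.0, 5.0, 1.0, 45.0])

# Prepend x0 = 1 so theta0 gets picked up by the inner product
x = np.concatenate([[1.0], features])           # x = [1, x1, x2, x3, x4]
theta = np.array([80.0, 0.1, 10.0, 3.0, -2.0])  # hypothetical parameters

h = theta @ x                                   # h_theta(x) = theta^T x
print(h)
```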

Gradient descent for multiple variables

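The figures for this section are missing; the update rule generalizes to θj := θj − α · (1/m) · Σᵢ (h_θ(x^(i)) − y^(i)) · xj^(i), for every j simultaneously. A vectorized sketch on made-up data:

```python
import numpy as np

# Made-up data: 4 examples, 2 features each (plus the x0 = 1 column in front)
X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])
m = len(y)

# Scale the non-constant features first (see the next section) so that
# a reasonable learning rate works
X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

alpha = 0.1
theta = np.zeros(3)

for _ in range(1000):
    gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
    theta = theta - alpha * gradient       # simultaneous update of every theta_j

print(theta)
```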

Gradient descent in practice

I: Feature scaling

We repeatedly update each parameter θj according to θj minus α times the derivative term.
A useful thing to do is to scale the features. Concretely, if you instead define the feature x1 to be the size of the house divided by 2000, and define x2 to be the number of bedrooms divided by five, then the contours of the cost function J become much less skewed and look more like circles. If you run gradient descent on a cost function like this, it can be shown mathematically that it finds a much more direct path to the global minimum, rather than taking a convoluted path.

The feature ranges should be roughly within −3 to 3; ranges much larger or much smaller than that should be rescaled. A common recipe is mean normalization: replace xi with (xi − μi)/si, where μi is the average of the feature and si is its range (max − min) or standard deviation.
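
A sketch of mean normalization on made-up feature columns:

```python
import numpy as np

# Made-up raw features: house size and number of bedrooms
size = np.array([2104.0, 1416.0, 1534.0, 852.0])
bedrooms = np.array([5.0, 3.0, 3.0, 2.0])

def mean_normalize(x):
    """Replace x_i with (x_i - mean) / range so values land near [-0.5, 0.5]."""
    return (x - x.mean()) / (x.max() - x.min())

print(mean_normalize(size))
print(mean_normalize(bedrooms))
```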

II: Learning rate

The goal when tuning the learning rate is to make sure that gradient descent is working correctly.
The debugging plot shows the value of the cost function J(θ) after each iteration of gradient descent. If gradient descent is working properly, J(θ) should decrease after every iteration.
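
A sketch of that check (the data and α are made up); if J ever increases, α is probably too large:

```python
import numpy as np

X = np.array([[1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])  # scaled feature
y = np.array([1.0, 2.0, 3.0, 4.0])
m = len(y)

def cost(theta):
    return np.sum((X @ theta - y) ** 2) / (2 * m)

theta = np.zeros(2)
alpha = 0.3
history = []

for _ in range(50):
    theta = theta - alpha * (X.T @ (X @ theta - y) / m)
    history.append(cost(theta))   # record J after every iteration

# J should be non-increasing; a rising J usually means alpha is too large
print(all(a >= b for a, b in zip(history, history[1:])))  # True
```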

The summary for choosing α: if α is too small, convergence is slow; if α is too large, J(θ) may fail to decrease on every iteration and may never converge. In practice, try a range of values such as 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, and pick the largest α that still makes J decrease steadily.

Features and polynomial regression

Using the house-sale example:

We have two features called frontage and depth. You might build a linear regression model like h_θ(x) = θ0 + θ1 · frontage + θ2 · depth,
where frontage is your first feature x1 and depth is your second feature x2. But when you are applying linear regression, you do not necessarily have to use just the features x1 and x2 that you are given: you can create new features yourself. If I want to predict the price of a house, what really determines it is the area of the land I own, so I might create a new feature x = frontage × depth, since the area of a rectangle is the product of the lengths of its sides. I might then select a hypothesis that uses just this one feature. Depending on what insight you have into a particular problem, defining new features, rather than just taking the frontage and depth you happened to start with, can sometimes give you a better model.

Closely related to the idea of choosing your features is this idea called polynomial regression.
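
A sketch of both ideas, hand-crafting the area feature and then building polynomial powers of it (with made-up numbers); note the powers have wildly different ranges, so feature scaling matters even more here:

```python
import numpy as np

frontage = np.array([50.0, 30.0, 40.0])
depth = np.array([60.0, 40.0, 35.0])

# A hand-crafted feature: land area = frontage * depth
area = frontage * depth

# Polynomial regression reuses the linear machinery on powers of a feature:
# h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3
X_poly = np.column_stack([np.ones_like(area), area, area**2, area**3])
print(X_poly.shape)   # (3, 4): ready for the same linear regression code
```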

Normal equation

Its essence is to take the partial derivatives of J with respect to each θj and set them to zero, so that the minimum can be determined directly. It is the same idea as in single-variable calculus, just extended from one variable to many.

Why do we add x0? It simply carries the constant term: in a line ax + b, the parameter θ0 plays the role of b, and x0 = 1 is what it multiplies.
The features are rearranged so that each training example becomes one row; together they form the design matrix X, and the targets form the vector y.
The equation θ = (XᵀX)⁻¹Xᵀy then computes the minimizing parameters in one shot. (The principle: setting the gradient of J to zero yields the linear system XᵀXθ = Xᵀy, and this is its solution.)
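
A sketch of the normal equation on the made-up design matrix from earlier; solving the linear system is numerically nicer than forming the inverse explicitly:

```python
import numpy as np

X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])   # made-up examples, x0 = 1 in front
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X^T X)^(-1) X^T y, solved as a linear system for numerical stability
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)

print(X @ theta)  # predictions on the training set
```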

The advantage of the normal equation is that the features can take any values you want; you do not have to think about their ranges (no feature scaling is needed).
Then, when should you choose gradient descent and when the normal equation? Roughly: gradient descent needs you to pick α and run many iterations, but it still works well when the number of features n is very large; the normal equation needs no α and no iterations, but it has to compute (XᵀX)⁻¹, which costs on the order of n³, so it gets slow once n is large (on the order of 10,000 or more).