A machine learning algorithm is a data analysis method that searches a dataset for patterns and characteristic structures. Typical tasks are data classification, automatic regression, and unsupervised model fitting. Machine learning has emerged from computer science and artificial intelligence, and it borrows methods from related disciplines, including statistics, applied mathematics, pattern recognition, and neural computation. Its application areas include image and speech analysis, medical imaging, bioinformatics, and exploratory data analysis.

This course is intended as an introduction to machine learning. It reviews statistical preliminaries and provides an overview of commonly used machine learning methods. Further and more advanced topics are covered in a separate course, Statistical Learning Theory, taught in the spring semester by Prof. Buhmann.

Lectures are held on Mondays 14-15 and Tuesdays 10-12. The textbooks include C. Bishop's Pattern Recognition and Machine Learning, which covers almost all of the topics and contains many exercises; R. Duda's Pattern Classification, a classic introduction to the field; and T. Hastie's The Elements of Statistical Learning: Data Mining, Inference, and Prediction, which covers additive models and boosting.

Lecture 1

example: classifying documents; problems with high-dimensional datasets; false positives; false negatives; model selection and validation; overfitting; linear classification; computational considerations for linear classification; easy to train; the number of misclassified points; good fit: compromise and tradeoff; overfitting: too complex; goal: balance goodness of fit and complexity; address these tradeoffs; validate the methods; inspect the decision; statistical basis; training error; minimize generalization error; is the training data sampled from a distribution? cartoon picture; sometimes the prediction error / generalization error increases with complexity; navigate the training curve; compact representation; dimension reduction; plane; infer labels; pixels in a face image approximated in low dimension; unusual data as anomalies? infer labels or a coefficient vector; infer who is most influential; unsupervised learning; less well-defined problem; model fitting and generation, but without a ground-truth training set; huge amounts of data; other models: online learning; reinforcement learning; mathematical representation; object space; feature vector; data representation; measurement; curvature; different kinds of feature vectors; a good feature is informative about the label; computational cost of feature extraction; nominal, ordinal;

Lecture 2

two tutorial sessions; grading: project 30%, exam 70%; recitations start this week; broad view of the machine learning course; supervised learning; classification; given training data; tradeoff between model complexity and goodness of fit; predict a real value; represent the features; vector basis; there are a number of considerations; what types of functions should be considered; linear and nonlinear functions; candidate models; quantify the offset; vertical offset; perpendicular offset; linear regression; assumption on the type of functions; one technique; how do we quantify goodness of fit? linear, quadratic; penalize the points that are far away from the line; squared loss; method 1: closed-form solution; x transpose; method 2: optimization; the objective function; convex or not convex? gradient descent;
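
The two methods named above (closed-form solution and gradient descent on the squared loss) can be sketched for a one-dimensional line fit. This is a minimal illustration, not the lecture's own code: the data points, the step size, and the iteration count are all made-up assumptions.

```python
def fit_closed_form(xs, ys):
    """Least-squares line y = w*x + b via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def fit_gradient_descent(xs, ys, step=0.05, iters=5000):
    """Minimize the mean squared loss by batch gradient descent."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(iters):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of (1/n) * sum of squared residuals w.r.t. w and b.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= step * grad_w
        b -= step * grad_b
    return w, b


# Hypothetical toy data, roughly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
w_cf, b_cf = fit_closed_form(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)
```

Because the squared loss is convex, both routes should land on the same line; the step size has to be small enough for the iteration to stay stable on this data.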

regression; the regression problem as an optimization problem; minimize the loss with respect to the true label; choose the step size for gradient descent; the derivative of a sum equals the sum of the derivatives; the quadratic function is the loss function; compute the derivatives in the different directions and combine them to form the gradient; shift a little bit in the direction of the vector; the loss function shifts toward the data, which is actually the direction of the data point; the gradient is just a vector; the gradient of the loss function is built from the data points; computing the closed-form solution may become expensive as the dimension increases; gradient descent has n*d computational complexity; other loss functions; two-dimensional least-squares fit; linear regression function; fit a linear function; different considerations for fitting a linear function to the data; polynomial functions and polynomial degree; Taylor expansion; the degree of the polynomial affects the training error; automatically choose the best model; the dataset is generated independently and identically distributed (i.i.d.); the expected error should be minimized under the probability p; probability; variable; expectation; independent samples; approximate the expectation; the density integrates to one from minus infinity to infinity; variance and mean of the Gaussian distribution; one-dimensional Gaussian distribution; multivariate distribution; expectation; probability density; the variance equals the expectation of the square minus the square of the expectation; probabilistic model; minimize the error on the data you have seen; just solve the optimization problem; R hat of w; take more samples and compute the average over these samples; if we have access to samples from the distribution, this converges to the true expectation; the sample average converges to the true value; how overfitting arises; just find the least-squares solution; optimize the training error; the empirical risk is going to be less than the prediction error; test set purely for evaluation; optimize on the training set; the expectation over the test set should equal the expectation over the training set; choose the degree of the polynomial; model selection; training set and test set from the same distribution; use the test set to evaluate these models; underestimate the true risk; training set and test set; the test error is a random quantity; an ever more complicated model can be avoided by using a separate test set; random fluctuation; test error; training error;
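
The point about polynomial degree, training error, and a separate test set can be tried out with a small experiment. The sketch below is only an illustration under stated assumptions: the data-generating function (a noisy cubic), the sample sizes, the seed, and all helper names are hypothetical, and the fit uses the normal equations rather than any particular method from the lecture.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    k = degree + 1
    feats = [[x ** p for p in range(k)] for x in xs]
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(k)] for i in range(k)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    return solve_linear_system(gram, rhs)


def mean_squared_error(coeffs, xs, ys):
    """Average squared error of the polynomial on the given points."""
    preds = [sum(c * x ** p for p, c in enumerate(coeffs)) for x in xs]
    return sum((pred - y) ** 2 for pred, y in zip(preds, ys)) / len(xs)


def sample(n, rng):
    """Draw n i.i.d. points from a hypothetical noisy cubic."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [x ** 3 - x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = sample(15, rng)   # used for fitting only
test_x, test_y = sample(200, rng)    # used for evaluation only

train_errors, test_errors = [], []
for degree in range(0, 7):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_errors.append(mean_squared_error(coeffs, train_x, train_y))
    test_errors.append(mean_squared_error(coeffs, test_x, test_y))
```

Since each higher-degree model contains the lower-degree ones, the training error cannot increase with the degree. The test error carries no such guarantee, which is why it is the quantity to compare when choosing the degree.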

Exercise 1:

random variable; space of elements; complex; probability space; the probability of a random variable; space and sample space; continuous variables; if this representation holds true; joint probability; the amount of interest normalized by the whole amount; conditional distribution; normalized according to the condition; marginal probability; conditional probability; decompose the marginal table into a conditional table; the joint probability leads to Bayes' rule; independence; expectation; what value do these variables take on average; group the outcomes with the same value together; the probability of this function; the mean is just the average value; one-dimensional random variables; if X and Y are independent, the covariance of X and Y is zero; increase together; Gaussian distribution; nice properties; empirical distribution; as more and more samples are drawn; converges to the distribution; what is convergence; what I want to communicate to you is; law of large numbers; training error; expected error; shrink the gap between the two errors;
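
The chain from joint table to marginals, conditionals, and Bayes' rule can be checked on a tiny example. The joint probabilities and the variable names below are invented for illustration; only the normalization steps follow the notes above.

```python
# Hypothetical joint distribution P(X, Y): X is the email class,
# Y says whether the email contains a link.
joint = {
    ("spam", "link"): 0.30,
    ("spam", "no_link"): 0.10,
    ("ham", "link"): 0.12,
    ("ham", "no_link"): 0.48,
}


def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 = X, 1 = Y)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out


def conditional(joint, given_axis):
    """Normalize each joint entry by the marginal of the conditioning variable."""
    given = marginal(joint, given_axis)
    return {outcome: p / given[outcome[given_axis]] for outcome, p in joint.items()}


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
p_y_given_x = conditional(joint, 0)  # entry (x, y) holds P(Y = y | X = x)
p_x_given_y = conditional(joint, 1)  # entry (x, y) holds P(X = x | Y = y)

# Bayes' rule: P(spam | link) = P(link | spam) * P(spam) / P(link).
bayes = p_y_given_x[("spam", "link")] * p_x["spam"] / p_y["link"]
```

The value from Bayes' rule should agree with the entry obtained by normalizing the joint table directly, since both are the same ratio written two ways.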

Lecture 3

the project includes a dataset with labels and a dataset without labels; online site for submission; a certain performance earns a certain grade; two baselines; the strong baseline corresponds to 100%; form teams of 3 students; determine different functions to predict the test dataset; this is the new slide! split the data into a training set and a validation set; LOOCV: leave-one-out cross-validation; regularization?; penalize the cost; how large should we pick k; k-fold cross-validation; the larger the degree, the more complex the model; exponential function; polynomial function; model validation; regularization; cross-validation; the tradeoff; the regularization parameter controls the complexity; gradient descent or analytical form; using nonlinear basis functions for linear regression; using a regularization coefficient; this was regression; fit a function in a space;
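
A rough sketch of k-fold cross-validation for picking the regularization parameter follows. To keep it short it assumes a one-dimensional ridge regression without intercept, which has a one-line closed form; the data, the candidate values, the seed, and the function names are hypothetical.

```python
import random


def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_error(xs, ys, lam, k):
    """Mean squared error over all held-out points across the k folds."""
    n = len(xs)
    total = 0.0
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge(train_x, train_y, lam)
        total += sum((w * xs[i] - ys[i]) ** 2 for i in fold)
    return total / n


# Hypothetical i.i.d. data around the line y = 2x, so contiguous folds are fine.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cross_validation_error(xs, ys, lam, k=5) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)

# Leave-one-out cross-validation is the special case k = n.
loocv_error = cross_validation_error(xs, ys, best_lam, k=len(xs))
```

Larger k means more training data per fold but more fits to compute; k = n (LOOCV) is the extreme case. If the data were ordered rather than i.i.d., the indices would need shuffling before splitting.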

linear classification; when you just want to find yes or no; spam classification; instead of a function; training examples for classification; the basic example: distinguishing with two features; decision boundary; many features classify an email, including IP and bag of words; a linear separator could have different equations; using nonlinear basis functions for linear classification; least squares is not the only loss function; in linear regression we have a convex function, but in classification the 0/1 loss function is not convex; approximate the 0/1 loss with a convex function; solve a different problem; find the argmin of the new loss function; still using gradient descent; stochastic gradient descent; pick a data point at random; move in that direction; the perceptron algorithm; the gradient is zero when the point is not misclassified; the perceptron never moves away if the dataset has a perfect solution; x prime; if there is no perfect classification, it jumps around over and over; for a convex function a local minimum is the global minimum; convex functions with the gradient descent method; using nonlinear functions to transform the features;
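
The perceptron as stochastic gradient descent on the perceptron loss max(0, -y * w.x) can be sketched as follows. This is a hedged illustration, not the lecture's code: it visits the points in a shuffled order each epoch (one common variant of picking points at random), and the toy data, the constant bias feature, and the function name are assumptions.

```python
import random


def perceptron(points, labels, epochs=100, seed=0):
    """Perceptron updates: move w toward y * x whenever a point is misclassified."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for i in order:
            x, y = points[i], labels[i]
            score = sum(wj * xj for wj, xj in zip(w, x))
            if y * score <= 0:  # misclassified (or on the boundary)
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # gradient is zero everywhere: w no longer moves
            break
    return w


# Hypothetical linearly separable data; the last feature is a constant 1 (bias).
points = [
    [2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [2.5, 3.0, 1.0],
    [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0], [0.0, -2.0, 1.0],
]
labels = [1, 1, 1, -1, -1, -1]
w = perceptron(points, labels)
```

On separable data like this the loop stops once an epoch passes without a mistake. On non-separable data it would keep jumping until the epoch limit is reached.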

Exercise 2:

this week, form groups using your ETH account; solutions go online one week after the exercise; Matrix Laboratory (MATLAB); interactive environment; visualization; .* element-wise multiplication; .^ element-wise power; ' transpose; initialization with zeros or ones, or with concrete values; all indices start from 1, not 0, in MATLAB; hold all; equations, syntax, and a detailed description of each parameter in help; any questions so far? some tricks: C*ones(N,1)', ones(N,1)*C';

Lecture 4

maximize the margin while at the same time minimizing the number of mistakes; how to select the parameter C; try C at different orders of magnitude; cross-validation; the margin doesn't change as in the perceptron algorithm; the parameter C controls the loss; a lower C widens the margin; variables; constraints; hinge loss function; the perceptron loss shifted; compute the gradient of the hinge loss function; the gradient is zero if the point is classified well; regularizer; margin; pick (xi, yi) from the training set uniformly at random; dataset: Reuters RCV1; libraries; convergence; kernels as similarity functions
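
The hinge loss, its (sub)gradient, and the "pick (xi, yi) uniformly at random" update can be put together in a small sketch. The objective used here, 0.5 * ||w||^2 + c * mean hinge loss with step size 1/t, is one common formulation and an assumption on my part, as are the toy data, the seed, and the function names.

```python
import random


def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>) for a label y in {-1, +1}."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return max(0.0, 1.0 - y * score)


def hinge_subgradient(w, x, y):
    """Subgradient of the hinge loss with respect to w."""
    if hinge_loss(w, x, y) > 0.0:
        return [-y * xj for xj in x]
    return [0.0] * len(w)  # zero once the point is beyond the margin


def train_svm_sgd(points, labels, c=1.0, iters=2000, seed=0):
    """SGD on 0.5 * ||w||^2 + c * mean hinge loss with step size 1 / t."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for t in range(1, iters + 1):
        i = rng.randrange(len(points))  # pick (x_i, y_i) uniformly at random
        g = hinge_subgradient(w, points[i], labels[i])
        step = 1.0 / t
        # Gradient of the regularizer is w; add c times the hinge subgradient.
        w = [wj - step * (wj + c * gj) for wj, gj in zip(w, g)]
    return w


# Hypothetical separable toy data (no bias term, for brevity).
points = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0], [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]]
labels = [1, 1, 1, -1, -1, -1]
w = train_svm_sgd(points, labels, c=1.0)
```

Unlike the perceptron update, a correctly classified point still contributes a gradient while it sits inside the margin, and the regularizer keeps shrinking w at every step. Trying c over several orders of magnitude, scored by cross-validation, is the selection strategy the notes mention.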

http://courses.cms.caltech.edu/cs253/

https://blog.sciencenet.cn/blog-942948-724351.html
