Chapter 22
Learning, Linear Separability and Linear Programming
CS 573: Algorithms, Fall 2013
November 12, 2013
22.1 The Perceptron algorithm
22.1.0.1 Labeling...
(A) Given examples: a database of cars.
(B) We would like to determine which cars are sports cars.
(C) Each car record is interpreted as a point in high dimensions.
(D) Example: a sports car with 4 doors, manufactured in 1997 by Quaky (manufacturer ID 6):
(4, 1997, 6).
Labeled as a sports car.
(E) A tractor by General Mess (manufacturer ID 3), manufactured in 1998: (0, 1998, 3).
Labeled as not a sports car.
(F) Real world: hundreds of attributes. In some cases even millions of attributes!
(G) Goal: automate this classification process, labeling sports/regular cars automatically.
22.1.0.2 Automatic classification...
(A) A learning algorithm:
    (a) is given several (or many) classified examples...
    (b) ...develops its own conjecture for a classification rule...
    (c) ...and can then use it to classify new data.
(B) Learning = training + classifying.
(C) Learn a function f : IR^d → {−1, 1}.
(D) Challenge: f might have infinite complexity...
(E) ...a rare situation in the real world. Assume learnable functions.
(F) Example: red and blue points that are linearly separable.
(G) We try to learn a line that separates the red points from the blue points.

22.1.0.3 Linear separability example...

[Figure: red and blue points in the plane, separated by a line.]

22.1.0.4 Learning linear separation

(A) Given red and blue points – how do we compute the separating line?
(B) A line/plane/hyperplane is the zero set of a linear function.
(C) Form: ∀x ∈ IR^d, f(x) = ⟨a, x⟩ + b, where a = (a_1, ..., a_d) ∈ IR^d and b ∈ IR.
    ⟨a, x⟩ = Σ_i a_i x_i is the dot product of a and x.
(D) Classification is done by computing the sign of f(x): sign(f(x)).
(E) If sign(f(x)) is negative: x is not in the class. If positive: x is in the class.
(F) A set of training examples:

    S = { (x_1, y_1), ..., (x_n, y_n) },

    where x_i ∈ IR^d and y_i ∈ {−1, 1}, for i = 1, ..., n.

22.1.0.5 Classification...

(A) A linear classifier h: a pair (w, b), where w ∈ IR^d and b ∈ IR.
(B) The classification of x ∈ IR^d is sign(⟨w, x⟩ + b).
(C) For a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y.
(D) Assume a linear classifier exists.
(E) Given n labeled examples, how do we compute a linear classifier for them?
(F) Use linear programming...
(G) We look for (w, b) such that for all (x_i, y_i) we have sign(⟨w, x_i⟩ + b) = y_i, which is

    ⟨w, x_i⟩ + b ≥ 0 if y_i = 1,  and  ⟨w, x_i⟩ + b ≤ 0 if y_i = −1.

    (See the sketch below.)
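To make the linear-programming step concrete, here is a minimal sketch (not from the notes) using scipy.optimize.linprog. The function names, the margin trick of asking for y_i(⟨w, x_i⟩ + b) ≥ 1 instead of ≥ 0 (which rules out the trivial solution w = 0, b = 0), and the toy car data are all assumptions for illustration.

```python
# A minimal sketch: learn a linear classifier (w, b) with an LP feasibility problem.
# Assumes the labeled examples are linearly separable; the points below are made up.
import numpy as np
from scipy.optimize import linprog

def learn_linear_classifier(X, y):
    """X: (n, d) array of points, y: (n,) array of labels in {-1, +1}.
    Returns (w, b) with y_i * (<w, x_i> + b) >= 1 for every example, or None."""
    n, d = X.shape
    # Variables: z = (w_1, ..., w_d, b).  Constraints: -y_i * (<w, x_i> + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)                      # feasibility only: any feasible point will do
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return (res.x[:d], res.x[d]) if res.success else None

def classify(w, b, x):
    """Classification rule from the notes: sign(<w, x> + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

if __name__ == "__main__":
    X = np.array([[4.0, 1997, 6], [2.0, 1995, 6], [0.0, 1998, 3], [0.0, 1990, 3]])
    y = np.array([1, 1, -1, -1])             # +1: sports car, -1: not a sports car
    w, b = learn_linear_classifier(X, y)
    print([classify(w, b, x) for x in X])    # should reproduce the training labels
```

Any feasible point of this LP is a valid separator; no objective is needed beyond feasibility.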

22.1.0.11 Claim by figure...

[Figure: two linearly separable point sets of radius R with optimal separator w_opt. Left ("hard"): small margin γ, # errors: (R/γ)^2. Right ("easy"): larger margin γ′, # errors: (R/γ′)^2.]

22.1.0.12 Proof of Perceptron convergence...

(A) Idea of proof: the Perceptron weight vector converges to w_opt.
(B) Distance between w_opt and the kth update vector:

    α_k = ‖ w_k − (R^2/γ) w_opt ‖^2.

(C) Quantify the change between α_k and α_{k+1}.
(D) The example being misclassified is (x, y).

22.1.0.13 Proof of Perceptron convergence...

(A) The example being misclassified is (x, y) (both are constants).
(B) Update rule: w_{k+1} ← w_k + y x.
(C) Then (using y^2 = 1):

    α_{k+1} = ‖ w_{k+1} − (R^2/γ) w_opt ‖^2
            = ‖ w_k + y x − (R^2/γ) w_opt ‖^2
            = ‖ (w_k − (R^2/γ) w_opt) + y x ‖^2
            = ⟨ (w_k − (R^2/γ) w_opt) + y x , (w_k − (R^2/γ) w_opt) + y x ⟩
            = ⟨ w_k − (R^2/γ) w_opt , w_k − (R^2/γ) w_opt ⟩ + 2y ⟨ w_k − (R^2/γ) w_opt , x ⟩ + ⟨ x, x ⟩
            = α_k + 2y ⟨ w_k − (R^2/γ) w_opt , x ⟩ + ‖x‖^2.

22.1.0.14 Proof of Perceptron convergence...

(A) We proved: α_{k+1} = α_k + 2y ⟨ w_k − (R^2/γ) w_opt , x ⟩ + ‖x‖^2.
(B) (x, y) is misclassified: sign(⟨w_k, x⟩) ≠ y
(C) =⇒ sign(y ⟨w_k, x⟩) = −1
(D) =⇒ y ⟨w_k, x⟩ < 0.
(E) ‖x‖ ≤ R =⇒

    α_{k+1} ≤ α_k + R^2 + 2y ⟨w_k, x⟩ − 2y (R^2/γ) ⟨w_opt, x⟩
            ≤ α_k + R^2 − 2 (R^2/γ) y ⟨w_opt, x⟩,

(F) ...since 2y ⟨w_k, x⟩ < 0.

22.1.0.15 Proof of Perceptron convergence...

(A) Proved: α_{k+1} ≤ α_k + R^2 − 2 (R^2/γ) y ⟨w_opt, x⟩.
(B) sign(⟨w_opt, x⟩) = y.
(C) By the margin assumption: y ⟨w_opt, x⟩ ≥ γ, ∀(x, y) ∈ S.
(D) Hence

    α_{k+1} ≤ α_k + R^2 − 2 (R^2/γ) y ⟨w_opt, x⟩ ≤ α_k + R^2 − 2 (R^2/γ) γ = α_k + R^2 − 2R^2 = α_k − R^2.

22.1.0.16 Proof of Perceptron convergence...

(A) We have: α_{k+1} ≤ α_k − R^2.
(B) α_0 = ‖ 0 − (R^2/γ) w_opt ‖^2 = (R^4/γ^2) ‖w_opt‖^2 = R^4/γ^2 (since ‖w_opt‖ = 1).
(C) ∀i: α_i ≥ 0.
(D) Q: What is the maximum number of classification errors the algorithm can make?
(E) ...it equals the number of updates...
(F) ...and the number of updates is ≤ α_0 / R^2...
(G) A: ≤ R^2 / γ^2. (See the sketch below.)
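The convergence argument above analyzes the update rule w_{k+1} ← w_k + y x, applied whenever an example is misclassified. Below is a minimal sketch of that loop, not taken from the notes: as in the analysis, the separator passes through the origin (a bias term is not handled here), and the data points, the function name, and the update cap are assumptions for illustration.

```python
# A minimal sketch of the Perceptron update analyzed above (w_{k+1} <- w_k + y x).
# As in the proof, the separator passes through the origin; the data are made up
# and assumed linearly separable with some margin gamma > 0.
import numpy as np

def perceptron(X, y, max_updates=10_000):
    """X: (n, d) points, y: (n,) labels in {-1, +1}.
    Repeatedly fixes a misclassified example; the analysis bounds the number of
    updates (hence mistakes) by (R / gamma)^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_updates):
        mistakes = [i for i in range(n) if np.sign(w @ X[i]) != y[i]]
        if not mistakes:
            return w                      # every example is classified correctly
        i = mistakes[0]
        w = w + y[i] * X[i]               # the update rule from the notes
    return w

if __name__ == "__main__":
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.5]])
    y = np.array([1, 1, -1, -1])
    w = perceptron(X, y)
    print(w, [int(np.sign(w @ x)) for x in X])
```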

(C) ∀(x, y, x^2 + y^2) ∈ B: a x + b y + c (x^2 + y^2) + d ≥ 0.
(D) U(h) = { (x, y) | h((x, y, x^2 + y^2)) ≤ 0 }.
(E) If U(h) is a circle =⇒ R ⊆ U(h) and B ∩ U(h) = ∅.
(F) U(h) ≡ a x + b y + c (x^2 + y^2) ≤ −d
(G) ⇐⇒ (dividing by c, for c > 0) (x^2 + (a/c) x) + (y^2 + (b/c) y) ≤ −d/c
(H) ⇐⇒ (x + a/(2c))^2 + (y + b/(2c))^2 ≤ (a^2 + b^2)/(4c^2) − d/c.
(I) This is a disk in the plane, as claimed.

22.2.0.22 A closing comment...

Linear separability is a powerful technique that can be used to learn concepts considerably more complicated than separation by a hyperplane. The lifting technique shown above is known as the kernel technique, or linearization. A small sketch of the idea in code follows.
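As an illustration of linearization, the sketch below (not from the notes) lifts planar points to (x, y, x^2 + y^2) and looks for a separating plane with scipy.optimize.linprog; fixing the lifted coordinate's coefficient to 1 forces the corresponding planar region to be a disk, matching the derivation above with c = 1. The point sets, the function name, and the strictness constant eps are assumptions.

```python
# A minimal sketch of linearization: separate red from blue points by a circle
# by lifting (x, y) -> (x, y, x^2 + y^2) and finding a separating plane.
# The point sets below are made up; eps is an arbitrary strictness margin.
import numpy as np
from scipy.optimize import linprog

def separating_disk(red, blue, eps=1e-3):
    """Find (a, b, d) such that x^2 + y^2 + a x + b y + d <= 0 on red points and
    >= eps on blue points, i.e. a disk containing red and excluding blue.
    (The lifted coordinate's coefficient is fixed to 1, so the region is a disk.)
    Returns (center, radius) or None if no such disk exists."""
    red, blue = np.asarray(red, float), np.asarray(blue, float)
    rows, rhs = [], []
    for x, yy in red:                     # inside:  a*x + b*yy + d <= -(x^2 + yy^2)
        rows.append([x, yy, 1.0]);   rhs.append(-(x * x + yy * yy))
    for x, yy in blue:                    # outside: -(a*x + b*yy + d) <= x^2 + yy^2 - eps
        rows.append([-x, -yy, -1.0]); rhs.append(x * x + yy * yy - eps)
    res = linprog(np.zeros(3), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * 3)
    if not res.success:
        return None
    a, b, d = res.x
    center = np.array([-a / 2.0, -b / 2.0])           # matches (x + a/2)^2 + (y + b/2)^2
    radius = np.sqrt(max((a * a + b * b) / 4.0 - d, 0.0))
    return center, radius

if __name__ == "__main__":
    red = [(0.0, 0.0), (0.5, 0.3), (-0.4, 0.2)]        # points to enclose
    blue = [(2.0, 0.0), (0.0, 2.2), (-2.1, -0.1)]      # points to exclude
    print(separating_disk(red, blue))
```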

22.3 A Little Bit On VC Dimension

22.3.0.23 A Little Bit On VC Dimension

(A) Q: How complex is the function we are trying to learn?
(B) The VC dimension is one way of capturing this notion. (VC = Vapnik–Chervonenkis, 1971.)
(C) A matter of expressivity – what is harder to learn:
    (a) a rectangle in the plane,
    (b) a halfplane, or
    (c) a convex polygon with k sides?

22.3.0.24 Thinking about concepts as binary functions...

(A) X = {p_1, p_2, ..., p_m}: points in the plane.
(B) H: the set of all halfplanes.
(C) A halfplane r ∈ H defines a binary vector
    r(X) = (b_1, ..., b_m), where b_i = 1 if and only if p_i is inside r.
(D) The possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }.
(E) A set X of m elements is shattered by a set of ranges R if
    |U(X, R)| = 2^m.
(F) What does this mean?
(G) The VC dimension of a set of ranges R is the size of the largest set that it can shatter. (See the sketch below.)
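To make U(X, H) and shattering concrete, here is a minimal sketch (not from the notes): it enumerates all 2^m labelings of a small made-up point set and uses a linear program to test which labelings are realized by a halfplane, i.e. it computes |U(X, H)|. The function names, the point coordinates, and the strictness constant eps are assumptions.

```python
# A minimal sketch of U(X, H) and shattering for halfplanes, using an LP to test
# whether a given 0/1 labeling of the points is realized by some halfplane.
# The point sets are made up; eps is an arbitrary strictness margin.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def realizable_by_halfplane(points, labels, eps=1e-3):
    """True if there is a halfplane <w, p> + b >= 0 containing exactly the
    points with label 1 (the others are pushed strictly to the other side)."""
    rows, rhs = [], []
    for (x, y), lab in zip(points, labels):
        if lab == 1:                      # inside:  -( w1*x + w2*y + b ) <= 0
            rows.append([-x, -y, -1.0]); rhs.append(0.0)
        else:                             # outside: w1*x + w2*y + b <= -eps
            rows.append([x, y, 1.0]);    rhs.append(-eps)
    res = linprog(np.zeros(3), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * 3)
    return res.success

def count_realized_labelings(points):
    """|U(X, H)|: number of distinct 0/1 vectors r(X) generated by halfplanes."""
    m = len(points)
    return sum(realizable_by_halfplane(points, labels)
               for labels in product([0, 1], repeat=m))

if __name__ == "__main__":
    three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]          # non-collinear: shattered
    four = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    print(count_realized_labelings(three), "of", 2 ** 3)  # expect 8 of 8
    print(count_realized_labelings(four), "of", 2 ** 4)   # fewer than 16: not shattered
```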

22.3.1 Examples

22.3.1.1 Examples

What is the VC dimension of circles in the plane? X is a set of n points in the plane, and C is the set of all circles (i.e., disks). Consider X = {p, q, r, s}.

What subsets of X can we generate by a circle?

[Figure: four points p, q, r, s in the plane.]

22.3.1.2 Subsets realized by disks

[Figure: the four points p, q, r, s, with disks realizing various subsets.]

{}, {r}, {p}, {q}, {s}, {p, s}, {p, q}, {p, r}, {r, q}, {q, s}, {r, p, q}, {p, r, s}, {p, s, q}, {s, q, r}, and {r, p, q, s}.

We got only 15 sets; one subset is missing. Which one? No set of four points can be shattered by disks, while any three points in general position can be, so the VC dimension of circles in the plane is 3. (A small computational check of this count appears below.)
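Here is a minimal sketch (not from the notes) that counts which subsets of a 4-point set are realizable as "the points inside some disk", reusing the lifting (x, y) → (x, y, x^2 + y^2) and an LP as in the linearization section. The coordinates are made up (they are not necessarily the configuration drawn in the notes), and the function names and eps are assumptions.

```python
# A minimal sketch: count the subsets of a 4-point set that are realizable as
# "the points inside some disk", via lifting (x, y) -> (x, y, x^2 + y^2) and an LP.
# The coordinates below are made up; eps is an arbitrary strictness margin.
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def realizable_by_disk(inside, outside, eps=1e-3):
    """True if some disk x^2 + y^2 + a x + b y + d <= 0 contains every point of
    `inside` and none of `outside` (their values must be >= eps).
    (For the empty subset, any small far-away disk also works.)"""
    rows, rhs = [], []
    for x, y in inside:
        rows.append([x, y, 1.0]);    rhs.append(-(x * x + y * y))
    for x, y in outside:
        rows.append([-x, -y, -1.0]); rhs.append(x * x + y * y - eps)
    res = linprog(np.zeros(3), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * 3)
    return res.success

if __name__ == "__main__":
    pts = {"p": (0.0, 0.3), "q": (0.0, -0.3), "r": (-2.0, 0.0), "s": (2.0, 0.0)}
    realized = []
    for k in range(5):
        for names in combinations(sorted(pts), k):
            inside = [pts[n] for n in names]
            outside = [pts[n] for n in pts if n not in names]
            if realizable_by_disk(inside, outside):
                realized.append(set(names))
    # For this made-up configuration the only missing subset should be the pair
    # {r, s} (15 of 16), so this 4-point set is not shattered by disks.
    print(len(realized), "of 16 subsets realized")
```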

22.3.1.3 Sauer’s Lemma

Lemma 22.3.1 (Sauer's Lemma). If R has VC dimension d, then |U(X, R)| = O(m^d), where m is the size of X.