Need this Python work to be done by Feb 14.
Problem Set 4
In this problem set you will get some practice with the proximal gradient algorithm and with acceleration. Specifically, you will be implementing ISTA and FISTA.
Problem 1: Gradient Descent and Acceleration
In this problem you will explore the impact of ill-conditioning on gradient descent, and will then see how acceleration can improve the situation. This exercise walks you through a situation very similar to the one in the lecture videos that illustrate the performance of gradient descent versus accelerated gradient descent as the condition number (the ratio of the largest to smallest eigenvalue of the Hessian) increases. This is a "toy" problem, but it is still instructive regarding the performance of these two algorithms.
You will work with the following simple function:
$f(x) = \frac{1}{2} x^{\top} Q x,$
where $Q$ is a 2 by 2 matrix, as defined below.
# We create the data for this simple problem. We will create three quadratics.
# Q_wc: a well-conditioned matrix
# Q_sic: a somewhat-ill-conditioned matrix (a technical term!)
# Q_ic: an ill-conditioned matrix
import numpy as np

q = np.array([0, 0])
Q_wc = np.array([[1, 0.3], [0.3, 1]])
Q_sic = np.array([[1, 0.85], [0.85, 1]])
Q_ic = np.array([[1, 0.99], [0.99, 1]])
Part (A):
Consider the quadratic functions $f_{wc}$, $f_{sic}$, and $f_{ic}$ defined by the quadratic matrices above. For each of these, say whether they are $\beta$-smooth and/or $\alpha$-strongly convex, and if so, compute the value of the condition number $\kappa = \beta/\alpha$ for each function.
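As a starting point, here is a minimal sketch of how you might check these quantities numerically for the matrices defined above. It relies only on the fact that for $f(x) = \frac{1}{2} x^{\top} Q x$ with symmetric $Q$, the Hessian is $Q$, so $\beta$ is the largest eigenvalue and $\alpha$ the smallest; the loop itself is just a convenience, not part of the assignment.

# Sketch: beta = largest eigenvalue, alpha = smallest eigenvalue, kappa = beta / alpha
# (strong convexity requires alpha > 0).
import numpy as np

for name, Q in [("Q_wc", Q_wc), ("Q_sic", Q_sic), ("Q_ic", Q_ic)]:
    eigs = np.linalg.eigvalsh(Q)          # eigenvalues of a symmetric matrix
    alpha, beta = eigs.min(), eigs.max()
    print(f"{name}: alpha = {alpha:.4f}, beta = {beta:.4f}, kappa = {beta / alpha:.2f}")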
Part (B):
Compute the best fixed step size for gradient descent, and the best parameters for accelerated gradient
descent. For each function, plot the error as a function of the number of iterations. For
each function, plot these on the same plot so you can compare — so you should have 3 plots total.
(π( )βπ( )π₯π‘ π₯
β
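A minimal sketch of the two update rules, assuming the standard choices for a $\beta$-smooth, $\alpha$-strongly convex quadratic: fixed step size $1/\beta$ for gradient descent, and Nesterov's constant-momentum scheme with step $1/\beta$ and momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$. The helper names (run_gd, run_agd) and the starting point are our assumptions, not part of the assignment.

# Sketch: gradient descent and accelerated gradient descent on f(x) = 0.5 * x^T Q x.
# grad f(x) = Q x and the minimizer is x* = 0, so the error is simply f(x_t).
import numpy as np

def f(Q, x):
    return 0.5 * x @ Q @ x

def run_gd(Q, x0, n_iters):
    beta = np.linalg.eigvalsh(Q).max()
    x, errs = x0.copy(), []
    for _ in range(n_iters):
        errs.append(f(Q, x))
        x = x - (1.0 / beta) * (Q @ x)
    return np.array(errs)

def run_agd(Q, x0, n_iters):
    eigs = np.linalg.eigvalsh(Q)
    alpha, beta = eigs.min(), eigs.max()
    kappa = beta / alpha
    momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x, x_prev, errs = x0.copy(), x0.copy(), []
    for _ in range(n_iters):
        errs.append(f(Q, x))
        y = x + momentum * (x - x_prev)            # momentum (look-ahead) step
        x_prev, x = x, y - (1.0 / beta) * (Q @ y)  # gradient step at the look-ahead point
    return np.array(errs)

x0 = np.array([1.0, -1.0])   # any nonzero starting point; an assumption, not given in the problem
# e.g. plt.semilogy(run_gd(Q_ic, x0, 200)); plt.semilogy(run_agd(Q_ic, x0, 200))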
Problem 2: ISTA and FISTA
Recall the least squares problem with $\ell^1$ regularization from the previous homework:
$\min_x \left[ f(x) = \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1 \right].$
Recall key characteristics of this problem: it is nonsmooth due to the regularization term, and it is not strongly convex when $A$ has more columns than rows. This is why you used the sub-gradient method on the previous problem set, rather than gradient descent.
Recall the goal of the proximal gradient algorithm: when we have a composite function, i.e., a function of the form $f(x) = g(x) + h(x)$, if $g(x)$ is $\beta$-smooth and $h(x)$ is "simple" in the sense that it has a simple prox function, then rather than using the subgradient method we can get much better results by using proximal gradient, which takes advantage of the fact that $g(x)$ is smooth. We can improve this further by combining the proximal gradient method with acceleration.
Using the same data (same $A$ and $b$) as in Problem Set 3, minimize $f(x)$ using $10^4$ iterations with $t = 0$ and $x_0 = 0$.
Use the proximal gradient algorithm, also known as ISTA for the case where $f$ is the LASSO objective. Now use the accelerated proximal gradient algorithm, also known as FISTA. Plot these results on the same plot as your results for sub-gradient descent from the previous problem set.
import numpy as np
import numpy.random as rn
import numpy.linalg as la
import matplotlib.pyplot as plt
import time

A = np.load("A.npy")
b = np.load("b.npy")
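For reference, here is a minimal sketch of ISTA and FISTA for this objective, assuming $A$ and $b$ as loaded above, the step size $1/\beta$ with $\beta = \lambda_{\max}(A^{\top}A)$ (the smoothness constant of the quadratic part), and soft-thresholding as the prox of $\lambda\|\cdot\|_1$. The function names and the placeholder value of lam are illustrative assumptions; use the $\lambda$ from Problem Set 3.

# Sketch of ISTA / FISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1.
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: shrink each coordinate toward zero by tau
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iters):
    step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / largest eigenvalue of A^T A
    x, objs = np.zeros(A.shape[1]), []
    for _ in range(n_iters):
        objs.append(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1))
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)
    return x, np.array(objs)

def fista(A, b, lam, n_iters):
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    y, t, objs = x.copy(), 1.0, []
    for _ in range(n_iters):
        objs.append(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1))
        x_new = soft_threshold(y - step * A.T @ (A @ y - b), step * lam)
        t_new = 0.5 * (1 + np.sqrt(1 + 4 * t ** 2))
        y = x_new + ((t - 1) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x, np.array(objs)

# x_ista, obj_ista = ista(A, b, lam=1.0, n_iters=10**4)    # lam here is a placeholder
# x_fista, obj_fista = fista(A, b, lam=1.0, n_iters=10**4)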
Problem 3 (Optional): Why we use LASSO
As an optional exercise, you may want to play around with Lasso and explore its properties in the context of machine learning problems.
To do this, generate data for yourself at random: create an $n \times d$ matrix $A$ where each entry comes from a standard Gaussian. Choose $d$ much larger than $n$; say, $d = 1000$ and $n = 100$. Now choose the true solution $x^{\ast}$ to be a $k$-sparse vector. You can do this in many ways; one simple approach is to let $x^{\ast}$ equal 10 on $k = 5$ randomly chosen entries, and zero everywhere else. Finally, generate $y$ according to
$y = A x^{\ast} + \epsilon,$
where $\epsilon$ is zero-mean Gaussian noise with variance $0.1$.
Now solve the Lasso problem via an algorithm of your choice. Note that you will have to search for a good value of $\lambda$. Compare the solution you get, $\hat{x}_{\text{lasso}}$, with the true solution as you vary $\lambda$. You may also want to compare it to the solution when you do not have any regularization.
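A minimal sketch of the data-generation step and one way to run the comparison, reusing the fista routine sketched in Problem 2. The noise is drawn with standard deviation $\sqrt{0.1}$ to match a variance of $0.1$; the matrix name A3 (to avoid clobbering the $A$ from Problem 2), the random seed, and the $\lambda$ grid are illustrative assumptions.

# Sketch: generate a sparse recovery instance and compare lasso solutions across lambda.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 5
A3 = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, size=k, replace=False)] = 10.0
y = A3 @ x_true + np.sqrt(0.1) * rng.standard_normal(n)   # noise variance 0.1

for lam in [0.1, 1.0, 10.0, 100.0]:                       # illustrative grid; tune as needed
    x_hat, _ = fista(A3, y, lam=lam, n_iters=2000)        # fista from the Problem 2 sketch
    err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    nnz = np.sum(np.abs(x_hat) > 1e-6)
    print(f"lambda = {lam:7.1f}: relative error = {err:.3f}, nonzeros = {nnz}")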
Problem 4: Logistic Regression
Logistic regression is a simple statistical classification method which models the conditional distribution of the class variable $y$ being equal to class $c$ given an input $x \in \mathbb{R}^n$. We will examine two classification tasks, one classifying newsgroup posts and the other classifying digits. In these tasks the input $x$ is some description of the sample (e.g., word counts in the news case) and $y$ is the category the sample belongs to (e.g., sports, politics). The logistic regression model assumes the class distribution conditioned on $x$ is log-linear:
$p(y = c \mid x, b_{1:C}) = \frac{e^{-b_c^{\top} x}}{\sum_{j=1}^{C} e^{-b_j^{\top} x}},$
where $C$ is the total number of classes, and the denominator sums over all classes to ensure that $p(y \mid x)$ is a proper probability distribution. Each class $c \in \{1, 2, \dots, C\}$ has a parameter $b_c$, and $\mathbf{b} \in \mathbb{R}^{nC}$ is the vector of concatenated parameters $\mathbf{b} = [b_1^{\top}, b_2^{\top}, \dots, b_C^{\top}]^{\top}$. Let $X \in \mathbb{R}^{N \times n}$ be the data matrix where each sample $x_i^{\top}$ is a row and $N$ is the number of samples. The maximum likelihood approach seeks to find the parameter $\mathbf{b}$ which maximizes the likelihood of the classes given the input data and the model:
$\max_{b_{1:C}} \; p(\mathbf{y} \mid X, b_{1:C}) = \prod_{i=1}^{N} p(y_i \mid x_i, b_{1:C}) = \prod_{i=1}^{N} \frac{e^{-b_{y_i}^{\top} x_i}}{\sum_{j=1}^{C} e^{-b_j^{\top} x_i}}.$
For the purposes of optimization, we can equivalently minimize the negative log likelihood:
$\min_{\boldsymbol{\beta}} \; \ell(\boldsymbol{\beta}) = -\log p(\mathbf{y} \mid X, \boldsymbol{\beta}) = \sum_{i=1}^{N} \left( \beta_{y_i}^{\top} x_i + \log \sum_{j=1}^{C} e^{-\beta_j^{\top} x_i} \right).$
After optimization, the model can be used to classify a new input by choosing the class that the model predicts as having the highest likelihood; note that we do not have to compute the normalizing quantity $\sum_{j=1}^{C} e^{-b_j^{\top} x}$, as it is constant across all classes:
$y = \arg\max_j \, p(y = j \mid x, \boldsymbol{\beta}) = \arg\min_j \, \beta_j^{\top} x.$
In this problem, you will optimize the logistic regression model for the two classification tasks mentioned above, which vary in dimension and number of classes. The newsgroup dataset that we consider here has $C = 20$.
We will compare the performance of gradient descent and Nesterov's accelerated gradient method on the $\ell^2$-regularized version of the logistic regression model:
$\min_{\boldsymbol{\beta}} \; \frac{1}{N} \sum_{i=1}^{N} \left( \beta_{y_i}^{\top} x_i + \log \sum_{j=1}^{C} e^{-\beta_j^{\top} x_i} \right) + \mu \|\boldsymbol{\beta}\|_2^2.$
In this homework, we will use the training and testing data contained in the four csv files in logistic_news.zip. In a later homework, we will look into the digits dataset (MNIST).
# Sample code to load in logistic_news.zip.
# We also create a Z matrix, useful for multiclass logistic regression optimization.
import io
import csv
import zipfile

import numpy as np

class Data:
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

def loaddata(filename):
    data = []
    with zipfile.ZipFile(filename) as z:
        for name in z.namelist():
            if name[0] == 'X' or name[0] == 'y':
                each = []
                with z.open(name) as csvDataFile:
                    csvDataFile_utf8 = io.TextIOWrapper(csvDataFile, 'utf-8')
                    csvReader = csv.reader(csvDataFile_utf8)
                    for row in csvReader:
                        each.append(row)
                each = [[float(string) for string in row] for row in each]
                each = np.asarray(each)
                data.append(each)
    X_te = data[0].astype(float)
    X_tr = data[1].astype(float)
    y_te = data[2][0].astype(int)
    y_tr = data[3][0].astype(int)
    Z_tr, Z_te = [], []
    for j in range(len(np.unique(y_tr))):
        Z_tr.append(np.sum(X_tr[np.where(y_tr == j)[0], :], axis=0))
        Z_te.append(np.sum(X_te[np.where(y_te == j)[0], :], axis=0))
    Z_tr = np.asarray(Z_tr).T
    Z_te = np.asarray(Z_te).T
    train = Data(X_tr, y_tr)
    train.Z = Z_tr
    test = Data(X_te, y_te)
    test.Z = Z_te
    return train, test

train, test = loaddata('./logistic_news.zip')
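Before the parts below, it may help to see the regularized objective and its gradient written out in code. Here is a minimal sketch; the helper name loss_and_grad and the matrix-shaped parameterization B (an $n \times C$ matrix whose columns are the $\beta_j$) are our conventions, not part of the assignment. It uses the fact that $\sum_i \beta_{y_i}^{\top} x_i$ can be read off from the Z matrix built above, whose column $j$ is the sum of the feature vectors of all samples in class $j$.

# Sketch of the mu-regularized multiclass logistic loss and its gradient.
# Assumes labels are 0, ..., C-1, X has shape (N, n), Z has shape (n, C), B has shape (n, C).
import numpy as np

def loss_and_grad(B, X, Z, mu):
    N = X.shape[0]
    S = -X @ B                                   # S[i, j] = -beta_j^T x_i
    S_max = S.max(axis=1, keepdims=True)         # for a numerically stable log-sum-exp
    logZ = S_max[:, 0] + np.log(np.exp(S - S_max).sum(axis=1))
    loss = (np.sum(B * Z) + np.sum(logZ)) / N + mu * np.sum(B ** 2)
    P = np.exp(S - S_max)
    P /= P.sum(axis=1, keepdims=True)            # P[i, j] = p(y = j | x_i, B)
    grad = (Z - X.T @ P) / N + 2 * mu * B
    return loss, grad

# e.g. loss0, g0 = loss_and_grad(np.zeros((train.X.shape[1], train.Z.shape[1])),
#                                train.X, train.Z, mu=0.001)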
Part (A) (Optional):
Find the value of $\mu$ that gives you (approximately) the best generalization performance (error on the test set). You obtain this by solving the above optimization problem for different values of $\mu$, and then checking the performance of the solution on the test set, using the unregularized logistic regression loss. Note that this is not a question about an optimization method.
What value do you get for the test loss after convergence?
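A minimal sketch of the model-selection loop, assuming the loss_and_grad helper from the sketch above; the $\mu$ grid, step size, and iteration count are illustrative assumptions, and the trainer here is plain gradient descent only for brevity.

# Sketch for Part (A): train for each candidate mu, then score on the test set
# with the unregularized loss (mu = 0).
import numpy as np

def train_logreg(X, Z, mu, step, n_iters):
    B = np.zeros((X.shape[1], Z.shape[1]))
    for _ in range(n_iters):
        _, g = loss_and_grad(B, X, Z, mu)        # from the sketch above
        B -= step * g
    return B

for mu in [1e-4, 1e-3, 1e-2, 1e-1]:              # illustrative grid
    B = train_logreg(train.X, train.Z, mu, step=1e-3, n_iters=2000)
    test_loss, _ = loss_and_grad(B, test.X, test.Z, mu=0.0)
    print(f"mu = {mu:g}: unregularized test loss = {test_loss:.4f}")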
Part (B):
If you did Part (A), use the value of $\mu$ you found there. If you did not, use $\mu = 0.001$.
Plot the loss against iterations for both the test and training data using this value of $\mu$.
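A minimal sketch of the comparison for Parts (B) and (C), assuming the loss_and_grad helper defined above. The step size, iteration count, and the $t/(t+3)$ momentum schedule for Nesterov's method are illustrative assumptions; in practice you would set the step from an estimate of the smoothness constant.

# Sketch: plain gradient descent vs. Nesterov's accelerated gradient method,
# recording the regularized training loss and the unregularized test loss per iteration.
import numpy as np
import matplotlib.pyplot as plt

def run(method, mu, step, n_iters):
    n, C = train.X.shape[1], train.Z.shape[1]
    B = np.zeros((n, C))
    B_prev = B.copy()
    train_losses, test_losses = [], []
    for t in range(n_iters):
        if method == "nesterov":
            Y = B + (t / (t + 3)) * (B - B_prev)   # momentum / look-ahead step
        else:
            Y = B
        _, g = loss_and_grad(Y, train.X, train.Z, mu)
        B_prev, B = B, Y - step * g
        train_losses.append(loss_and_grad(B, train.X, train.Z, mu)[0])
        test_losses.append(loss_and_grad(B, test.X, test.Z, 0.0)[0])
    return np.array(train_losses), np.array(test_losses)

mu = 0.001
for method in ["gd", "nesterov"]:
    tr, te = run(method, mu, step=1e-3, n_iters=1000)
    plt.plot(tr, label=f"{method} (train)")
    plt.plot(te, "--", label=f"{method} (test)")
plt.xlabel("iteration"); plt.ylabel("loss"); plt.legend(); plt.show()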
Part (C)
How do the two algorithms differ in performance, and how does this change as you decrease $\mu$?