Mastering Step Size and Learning Rate in Optimization
Understanding Step Size and Learning Rate
In the realm of Machine Learning optimization algorithms, the step size, commonly referred to as the learning rate, plays a pivotal role. It dictates how much the model parameters are adjusted in each iteration of the optimization process.
What are the implications of selecting an inappropriate step size or learning rate?
→ A step size that is too small can lead to slow convergence, requiring numerous iterations to approach the minimum of the cost function.
→ Conversely, a step size that is excessively large may cause the algorithm to overshoot the minimum or oscillate without achieving convergence.
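To see these effects concretely, here is a minimal sketch (my own toy illustration using the quadratic f(x) = x**2, which is separate from the example developed later in this article) of plain gradient descent run with a step size that is too small, well chosen, and too large:
def gradient_descent(df, x0, lr, n_iters=50):
    # Repeatedly step in the negative-gradient direction with a fixed learning rate lr
    x = x0
    for _ in range(n_iters):
        x = x - lr * df(x)
    return x

df = lambda x: 2 * x  # derivative of the toy objective f(x) = x**2, whose minimum is at x = 0

print(gradient_descent(df, x0=5.0, lr=0.01))  # too small: still far from 0 after 50 iterations
print(gradient_descent(df, x0=5.0, lr=0.5))   # well chosen: reaches the minimum immediately
print(gradient_descent(df, x0=5.0, lr=1.1))   # too large: iterates oscillate and blow up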
Line Search Algorithm: An Essential Tool
The Line Search algorithm is crucial in Machine Learning optimization. It assists in identifying the optimal step size for adjusting model parameters during each iteration of the optimization process. This method can be utilized for both univariate and multivariate optimization tasks.
To implement the Line Search algorithm, a starting point and a search direction are necessary. Throughout the optimization journey, this algorithm automatically chooses a scaling factor, denoted as alpha, to establish the step size in the specified direction. Typically, the search direction is opposite to the gradient of the objective function, which indicates the steepest descent.
Defining the Search Direction
The alpha value serves as a scaling factor for the search direction, constrained between 0 and 1. During the line search, the objective is to minimize the function by adjusting the current position with a scaled direction:
minimize f(x + alpha * direction)
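Viewed this way, the line search is simply a one-dimensional minimization over alpha. As a quick sketch of that idea (using scipy.optimize.minimize_scalar and the quadratic f(x) = x**2 + 2*x + 1 that appears later in this article; the bounded search over [0, 1] is an illustrative shortcut, not the algorithm built below):
from scipy.optimize import minimize_scalar

def f(x):
    return x**2 + 2*x + 1

def df(x):
    return 2*x + 2

x = 10.0          # current point
d = -df(x)        # search direction: the negative gradient
phi = lambda alpha: f(x + alpha * d)  # the one-dimensional function the line search minimizes

res = minimize_scalar(phi, bounds=(0, 1), method='bounded')
print(res.x)      # best alpha in [0, 1]; roughly 0.5 for this example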
Implementing the Line Search Algorithm
To effectively execute the line search algorithm, knowledge of the first derivative of the objective function is essential. Additionally, a starting point and a defined search range are required. It is feasible to conduct multiple searches with varying directions (both positive and negative with different magnitudes) to identify the optimal solution.
The line search function necessitates eight initial inputs:
- Objective function
- Derivative of the objective function
- Initial evaluation point
- A direction (often the negative of the derivative)
- Initial step size (alpha) with a default value of 1
- Shrinkage factor (rho) with a default value of 0.5
- A small positive constant to validate the Armijo condition (more on this later)
- Maximum number of iterations with a default value of 100
def line_search_function(f, df, x, d, alpha_init=1.0, rho=0.5, c=1e-4, max_iter=100):
Next, several variables must be initialized:
- Set alpha as the initial step size
- Define phi0 as the objective function value at the initial evaluation point x
- Calculate dphi0 as the directional derivative of the objective function f at the evaluation point x along the direction d
- Initialize i as a counter variable to track iterations
alpha = alpha_init
phi0 = f(x)
dphi0 = np.dot(df(x), d)
i = 0
The algorithm enters a loop that continues until the maximum number of iterations is reached or a suitable step size is found:
while i < max_iter:
The next step involves calculating the value of the objective function at the point x + alpha * d:
phi = f(x + alpha * d)
Checking the Armijo Condition
The Armijo condition, also known as the first Wolfe condition, ensures that the chosen step size sufficiently reduces the objective function. It requires the new objective value to be no larger than the current value plus a fraction c of the decrease predicted by the first-order Taylor approximation at the current point. This can be expressed as:
f(x + alpha * d) ≤ f(x) + c * alpha * dphi0
Where c is a small constant between 0 and 1, dphi0 is the directional derivative at the current point (negative for a descent direction), and alpha is the step size.
To check for the Armijo condition:
if phi > phi0 + c * alpha * dphi0 or (phi >= f(x + (alpha / 2) * d) and i > 0):
If not satisfied, we call another function, the zoom function, to refine the step size:
return zoom(f, df, x, d, phi0, dphi0, alpha, alpha / 2, rho, c)
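As a quick numeric sanity check of the inequality itself (my own worked example, reusing the quadratic f(x) = x**2 + 2*x + 1 and the starting point x = 10 that appear later in this article):
import numpy as np

def f(x):
    return x**2 + 2*x + 1

def df(x):
    return 2*x + 2

x, c = 10.0, 1e-4
d = -df(x)                    # steepest-descent direction, here -22
dphi0 = np.dot(df(x), d)      # slope of f(x + alpha * d) at alpha = 0, here -484

for alpha in (1.0, 0.5):
    lhs = f(x + alpha * d)             # actual value after the trial step
    rhs = f(x) + c * alpha * dphi0     # largest value the Armijo bound allows
    print(alpha, lhs, rhs, lhs <= rhs)
# alpha = 1.0 overshoots the minimum (121 vs about 120.95, so the check fails),
# while alpha = 0.5 lands on it (0 vs about 120.98, so the check passes).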
Curvature Condition: Wolfe Second Condition
This condition rules out steps that are unacceptably short and, in the strong form used here, also steps that overshoot far past the minimum. It requires the magnitude of the directional derivative at the new point to be no larger than a fraction of its magnitude at the current point. This condition is verified as:
abs(np.dot(df(x + alpha * d), d)) <= -c * dphi0
Where c is the same small constant used in the Armijo check (many references use a separate, larger constant, often denoted c2, for this condition), and the right-hand side is positive because dphi0 is negative along a descent direction.
In code, the directional derivative at the candidate point is computed as dphi = np.dot(df(x + alpha * d), d), and if abs(dphi) <= -c * dphi0 holds, alpha is accepted and returned. If instead dphi is non-negative, the slope has changed sign, meaning a minimum has been bracketed, and the zoom function is invoked with the interval reversed to refine the step size:
if dphi >= 0:
    return zoom(f, df, x, d, phi0, dphi0, alpha / 2.0, alpha, rho, c)
If neither branch fires, alpha is shrunk by the factor rho (alpha = alpha * rho), the counter i is incremented, and the loop repeats; if the loop exhausts max_iter, the current alpha is returned.
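For reference, the two Wolfe conditions can be bundled into one small checker. The sketch below is my own illustration: the helper name wolfe_conditions and the separate, larger curvature constant c2 (a common choice in the literature) are assumptions, whereas the implementation in this article reuses the single constant c for both checks:
import numpy as np

def wolfe_conditions(f, df, x, d, alpha, c1=1e-4, c2=0.9):
    # Return (armijo_ok, curvature_ok) for a candidate step size alpha
    dphi0 = np.dot(df(x), d)                 # slope at alpha = 0; negative along a descent direction
    armijo_ok = f(x + alpha * d) <= f(x) + c1 * alpha * dphi0
    dphi = np.dot(df(x + alpha * d), d)      # slope at the candidate point
    curvature_ok = abs(dphi) <= -c2 * dphi0  # strong Wolfe curvature condition
    return armijo_ok, curvature_ok
With the example used later in this article (x = 10, d = -df(x)), alpha = 0.5 passes both checks, while alpha = 1.0 fails the Armijo check.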
Zoom Function: Refining the Step Size
The zoom function is invoked when the candidate step fails the Armijo check, or when the directional derivative at the candidate point is non-negative (meaning a minimum has been bracketed). It takes in eleven arguments:
- Objective function f
- Derivative of the objective function df
- Current point x
- Search direction d
- Objective function value at the current point, phi0
- Directional derivative at the current point, dphi0
- Two candidate step sizes, alpha_lo and alpha_hi
- Shrinkage factor rho
- Small positive constant c
- Maximum allowed iterations, max_iter (default 100)
Initialize a counter i to zero and run a loop for at most max_iter iterations:
i = 0
while i < max_iter:
Calculate the midpoint between the two candidate step sizes:
alpha = (alpha_lo + alpha_hi) / 2.0
Next, compute the objective function value at the point x + alpha * d:
phi = f(x + alpha * d)
Adjust the bounds on the step size. If the candidate alpha fails the sufficient decrease condition, or the objective value there is not below the value at alpha_lo, shrink the interval from above by setting alpha_hi to alpha:
if phi > phi0 + c * alpha * dphi0 or phi >= f(x + alpha_lo * d):
alpha_hi = alpha
else:
Compute the directional derivative at x + alpha * d. If it meets the curvature condition, return the value of alpha:
dphi = np.dot(df(x + alpha * d), d)
if abs(dphi) <= -c * dphi0:
return alpha
Otherwise, update the bounds on the step size:
if dphi * (alpha_hi - alpha_lo) >= 0:
alpha_hi = alpha_lo
alpha_lo = alpha
Increment the counter i and repeat the loop while i < max_iter:
i += 1
If the loop concludes without returning a value, return alpha, which is the current best estimate of the optimal step size.
Bringing It All Together
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import line_search
def line_search_function(f, df, x, d, alpha_init=1.0, rho=0.5, c=1e-4, max_iter=100):
    alpha = alpha_init
    phi0 = f(x)
    dphi0 = np.dot(df(x), d)
    i = 0
    while i < max_iter:
        phi = f(x + alpha * d)
        # Armijo (sufficient decrease) check
        if phi > phi0 + c * alpha * dphi0 or (phi >= f(x + (alpha / 2) * d) and i > 0):
            return zoom(f, df, x, d, phi0, dphi0, alpha, alpha / 2.0, rho, c)
        # Curvature (strong Wolfe) check
        dphi = np.dot(df(x + alpha * d), d)
        if abs(dphi) <= -c * dphi0:
            return alpha
        # A non-negative slope means the minimum has been bracketed
        if dphi >= 0:
            return zoom(f, df, x, d, phi0, dphi0, alpha / 2.0, alpha, rho, c)
        # Otherwise shrink the step size and try again
        alpha = alpha * rho
        i += 1
    return alpha

def zoom(f, df, x, d, phi0, dphi0, alpha_lo, alpha_hi, rho, c, max_iter=100):
    i = 0
    while i < max_iter:
        # Bisect the current interval of candidate step sizes
        alpha = (alpha_lo + alpha_hi) / 2.0
        phi = f(x + alpha * d)
        if phi > phi0 + c * alpha * dphi0 or phi >= f(x + alpha_lo * d):
            alpha_hi = alpha
        else:
            dphi = np.dot(df(x + alpha * d), d)
            if abs(dphi) <= -c * dphi0:
                return alpha
            if dphi * (alpha_hi - alpha_lo) >= 0:
                alpha_hi = alpha_lo
            alpha_lo = alpha
        i += 1
    return alpha
# Define the objective function and its derivative
def f(x):
return x**2 + 2*x + 1
def df(x):
return 2*x + 2
# Define the range of x values to plot
x = np.linspace(-5, 5, 100)
# Plot the objective function in blue
plt.plot(x, f(x), 'b', label='f(x)')
# Plot the derivative in red
plt.plot(x, df(x), 'r', label="f'(x)")
# Add axis labels and a legend
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
# Show the plot
plt.show()
# Define the initial point and direction
x = 10
d = -df(x)
# Call the custom line search function and print the step size it finds
alpha_custom = line_search_function(f, df, x, d)
print(alpha_custom)
# Find the step size with SciPy's line_search for comparison; it returns a tuple whose first element is alpha
alpha = line_search(f, df, x, d)
# Print the step size
print(alpha[0])
Conclusion
Thank you for taking the time to read this article! If you have any suggestions for additional content, please let me know. Don't forget to subscribe to receive updates on my future publications. If you enjoyed this piece, consider following me for further insights. For those looking to delve deeper into this topic, my book "Data-Driven Decisions: A Practical Introduction to Machine Learning" is available at an affordable price, providing a comprehensive starting point in Machine Learning. Thank you once again!