Consider the following **generalization curve**, which shows the loss
for both the training set and validation set against the number of
training iterations.

**Figure 1. Loss on training set and validation set.**

Figure 1 shows a model in which training loss gradually decreases,
but validation loss eventually goes up. In other words, this generalization
curve shows that the model is overfitting to the data in the training set.
Channeling our inner Ockham, perhaps we could prevent overfitting by
penalizing complex models, a principle called **regularization**.
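The pattern described above can be checked numerically. The following is a minimal sketch, assuming made-up loss values (they are not read from Figure 1) and a hypothetical helper name `lowest_validation_step`; it locates the point after which validation loss rises while training loss keeps falling.

```python
# Hypothetical losses recorded at regular intervals (not taken from Figure 1).
training_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
validation_loss = [0.95, 0.70, 0.58, 0.55, 0.57, 0.62, 0.70]


def lowest_validation_step(validation):
    """Return the index at which validation loss is lowest."""
    return min(range(len(validation)), key=lambda i: validation[i])


best_step = lowest_validation_step(validation_loss)

# Gap between validation and training loss at each recorded step.
gap = [v - t for t, v in zip(training_loss, validation_loss)]

print("Validation loss is lowest at step", best_step)
print("Gap at first step:", round(gap[0], 2), "gap at last step:", round(gap[-1], 2))
```

A widening gap after the lowest validation point is the numeric signature of the overfitting that the curve shows visually.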

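In other words, instead of simply aiming to minimize loss (empirical risk minimization):

$$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}))$$

we'll now minimize loss + complexity, which is called **structural
risk minimization**:

$$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}) + \text{complexity}(\text{Model}))$$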

Our training optimization algorithm is now a function of
two terms: the **loss term**, which measures how well the
model fits the data, and the **regularization term**,
which measures model complexity.
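As a rough illustration of the two-term objective, here is a minimal sketch for a linear model. The choice of mean squared error as the loss term, the sum of squared weights as the complexity measure, and all function names and data values are assumptions made for this example.

```python
def mean_squared_error(weights, features, labels):
    """Loss term: average squared prediction error of a linear model."""
    total = 0.0
    for x, y in zip(features, labels):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - y) ** 2
    return total / len(labels)


def sum_of_squares(weights):
    """Regularization term: one possible (assumed) measure of complexity."""
    return sum(w * w for w in weights)


def structural_risk(weights, features, labels):
    """Objective to minimize: loss term + regularization term."""
    return mean_squared_error(weights, features, labels) + sum_of_squares(weights)


# Hypothetical data that the weights [1.0, 2.0] fit exactly.
features = [[1.0, 2.0], [3.0, 4.0]]
labels = [5.0, 11.0]
weights = [1.0, 2.0]

print(mean_squared_error(weights, features, labels))  # 0.0
print(structural_risk(weights, features, labels))     # 5.0
```

Even with a perfect fit (zero loss), the objective is nonzero because the weights themselves carry a complexity cost.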

Machine Learning Crash Course focuses on two common (and somewhat related) ways to think of model complexity:

- Model complexity as a function of the *weights* of all the features in the model.
- Model complexity as a function of the *total number of features* with nonzero weights. (A later module covers this approach.)

If model complexity is a function of weights, a feature weight with a high absolute value is more complex than a feature weight with a low absolute value.

We can quantify complexity using the **\(L_2\) regularization**
formula, which defines the regularization term as the sum of the squares of all
the feature weights:

$$L_2\text{ regularization term} = \|\boldsymbol{w}\|_2^2 = w_1^2 + w_2^2 + \dots + w_n^2$$

In this formula, weights close to zero have little effect on model complexity, while outlier weights can have a huge impact.
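Because each weight is squared, its contribution grows quadratically and its sign does not matter. A minimal sketch of that behavior, using a hypothetical helper `l2_term` and made-up weight values:

```python
def l2_term(weights):
    """Sum of the squares of the weights."""
    return sum(w ** 2 for w in weights)


# Hypothetical weight vectors: all small, and the same with one outlier.
small = [0.1, -0.2, 0.1]
with_outlier = [0.1, -0.2, 4.0]

# Doubling a weight quadruples its contribution.
print(l2_term([3.0]), l2_term([6.0]))  # 9.0 36.0

# The sign of a weight does not matter.
print(l2_term([-2.0]) == l2_term([2.0]))  # True

# A single outlier dominates the term.
print(l2_term(with_outlier) > 100 * l2_term(small))  # True
```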

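For example, a linear model with the following weights:

$$\{w_1 = 0.2,\; w_2 = 0.5,\; \mathbf{w_3 = 5},\; w_4 = 1,\; w_5 = 0.25,\; w_6 = 0.75\}$$

has an \(L_2\) regularization term of 26.915:

$$w_1^2 + w_2^2 + \mathbf{w_3^2} + w_4^2 + w_5^2 + w_6^2 = 0.04 + 0.25 + \mathbf{25} + 1 + 0.0625 + 0.5625 = 26.915$$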

But \(w_3\) (bolded above), with a squared value of 25, contributes
nearly all the complexity. The sum of the squares of the other five weights
adds just 1.915 to the \(L_2\) regularization term.
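The arithmetic can be verified with a short script. This sketch assumes six weights that reproduce the totals quoted in the text (\(w_3 = 5\) plus five small weights whose squares sum to 1.915); the dictionary keys and variable names are illustrative only.

```python
import math

# Assumed weights, chosen to be consistent with the totals quoted in the text.
weights = {"w1": 0.2, "w2": 0.5, "w3": 5.0, "w4": 1.0, "w5": 0.25, "w6": 0.75}

squares = {name: w ** 2 for name, w in weights.items()}
l2_total = sum(squares.values())
others = sum(sq for name, sq in squares.items() if name != "w3")
share_w3 = squares["w3"] / l2_total

print(f"L2 regularization term: {l2_total:.3f}")  # 26.915
print(f"All weights except w3:  {others:.3f}")    # 1.915
print(f"Share contributed by w3: {share_w3:.1%}")
```

Under these assumed values, \(w_3\) alone accounts for roughly 93% of the regularization term.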