Vector spaces#

Vector spaces are the basic setting in which linear algebra happens. A vector space \(V\) is a set (the elements of which are called vectors) on which two operations are defined: vectors can be added together, and vectors can be multiplied by real numbers called scalars. \(V\) must satisfy

(i) There exists an additive identity (written \(\mathbf{0}\)) in \(V\) such that \(\mathbf{x}+\mathbf{0} = \mathbf{x}\) for all \(\mathbf{x} \in V\)

(ii) For each \(\mathbf{x} \in V\), there exists an additive inverse (written \(-\mathbf{x}\)) such that \(\mathbf{x}+(-\mathbf{x}) = \mathbf{0}\)

(iii) There exists a multiplicative identity (written \(1\)) in \(\mathbb{R}\) such that \(1\mathbf{x} = \mathbf{x}\) for all \(\mathbf{x} \in V\)

(iv) Commutativity: \(\mathbf{x}+\mathbf{y} = \mathbf{y}+\mathbf{x}\) for all \(\mathbf{x}, \mathbf{y} \in V\)

(v) Associativity: \((\mathbf{x}+\mathbf{y})+\mathbf{z} = \mathbf{x}+(\mathbf{y}+\mathbf{z})\) and \(\alpha(\beta\mathbf{x}) = (\alpha\beta)\mathbf{x}\) for all \(\mathbf{x}, \mathbf{y}, \mathbf{z} \in V\) and \(\alpha, \beta \in \mathbb{R}\)

(vi) Distributivity: \(\alpha(\mathbf{x}+\mathbf{y}) = \alpha\mathbf{x} + \alpha\mathbf{y}\) and \((\alpha+\beta)\mathbf{x} = \alpha\mathbf{x} + \beta\mathbf{x}\) for all \(\mathbf{x}, \mathbf{y} \in V\) and \(\alpha, \beta \in \mathbb{R}\)

Euclidean space#

The quintessential vector space is Euclidean space, which we denote \(\mathbb{R}^n\). The vectors in this space consist of \(n\)-tuples of real numbers:

\[\mathbf{x} = (x_1, x_2, \dots, x_n)\]

For our purposes, it will be useful to think of them as \(n \times 1\) matrices, or column vectors:

\[\begin{split}\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix}\end{split}\]

Addition and scalar multiplication are defined component-wise on vectors in \(\mathbb{R}^n\):

\[\begin{split}\mathbf{x} + \mathbf{y} = \begin{bmatrix}x_1 + y_1 \\ \vdots \\ x_n + y_n\end{bmatrix}, \hspace{0.5cm} \alpha\mathbf{x} = \begin{bmatrix}\alpha x_1 \\ \vdots \\ \alpha x_n\end{bmatrix}\end{split}\]
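To see these definitions in action, here is a minimal NumPy sketch (the specific vectors and scalars are arbitrary choices for illustration) that performs the component-wise operations and numerically spot-checks two of the axioms from the previous section:

import numpy as np

# Two arbitrary vectors in R^3 and two arbitrary scalars
x = np.array([1.0, -2.0, 0.5])
y = np.array([3.0, 0.0, 4.0])
alpha, beta = 2.0, -0.5

print(x + y)      # component-wise addition: [4., -2., 4.5]
print(alpha * x)  # component-wise scaling:  [2., -4., 1.]

# Spot-check the two distributivity axioms numerically
print(np.allclose(alpha * (x + y), alpha * x + alpha * y))    # True
print(np.allclose((alpha + beta) * x, alpha * x + beta * x))  # True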

Euclidean space is used to mathematically represent physical space, with notions such as distance, length, and angles. Although it becomes hard to visualize for \(n > 3\), these concepts generalize mathematically in obvious ways. Even when you’re working in more general settings than \(\mathbb{R}^n\), it is often useful to visualize vector addition and scalar multiplication in terms of 2D vectors in the plane or 3D vectors in space.

import matplotlib.pyplot as plt
import numpy as np

# Visualize vector addition in 2D: draw a, then b placed tip-to-tail at a, then the sum a + b

# Define vectors
vector_a = np.array([2, 3])
vector_b = np.array([4, 1])

# Vector addition
vector_sum = vector_a + vector_b

# Plotting vectors
plt.figure(figsize=(6, 6))
ax = plt.gca()

# Plot vector a
ax.quiver(0, 0, vector_a[0], vector_a[1], angles='xy', scale_units='xy', scale=1, color='blue', label='$\\mathbf{a}$')
ax.text(vector_a[0]/2, vector_a[1]/2, '$\\mathbf{a}$', color='blue', fontsize=14)

# Plot vector b starting from the tip of vector a
ax.quiver(vector_a[0], vector_a[1], vector_b[0], vector_b[1], angles='xy', scale_units='xy', scale=1, color='green', label='$\\mathbf{b}$')
ax.text(vector_a[0] + vector_b[0]/2, vector_a[1] + vector_b[1]/2, '$\\mathbf{b}$', color='green', fontsize=14)

# Plot resultant vector
ax.quiver(0, 0, vector_sum[0], vector_sum[1], angles='xy', scale_units='xy', scale=1, color='red', label='$\\mathbf{a} + \\mathbf{b}$')
ax.text(vector_sum[0]/2, vector_sum[1]/2, '$\\mathbf{a}+\\mathbf{b}$', color='red', fontsize=14)

# Set limits and grid
ax.set_xlim(0, 7)
ax.set_ylim(0, 5)
plt.grid()

# Axes labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Vector Addition')

# Aspect ratio
plt.gca().set_aspect('equal', adjustable='box')

plt.legend(loc='lower right')
plt.show()
[Figure: vector addition, showing \(\mathbf{a}\), \(\mathbf{b}\) placed tip-to-tail, and \(\mathbf{a}+\mathbf{b}\)]

This visualization intuitively demonstrates how vectors combine to produce a resultant vector in Euclidean space by vector addition. The blue arrow represents vector \(\mathbf{a}\), the green arrow represents vector \(\mathbf{b}\) placed at the tip of vector \(\mathbf{a}\), and the red arrow shows the resulting vector \(\mathbf{a} + \mathbf{b}\).

import matplotlib.pyplot as plt
import numpy as np

# Define original vector
vector_a = np.array([1.5, 2])

# Scalars to multiply with
scalars = [-1, 1.3]

# Colors for different scalars
colors = ['purple', 'green']
labels = [r'$-1 \cdot \mathbf{a}$', r'$1.3 \cdot \mathbf{a}$']

# Plotting
plt.figure(figsize=(6, 6))
ax = plt.gca()

# Plot original vector
ax.quiver(0, 0, vector_a[0], vector_a[1], angles='xy', scale_units='xy', scale=1, color='blue', label=r'$\mathbf{a}$')
ax.text(vector_a[0]/2, vector_a[1]/2, r'$\mathbf{a}$', color='blue', fontsize=14)

# Plot scaled vectors
for i, scalar in enumerate(scalars):
    scaled_vector = scalar * vector_a
    ax.quiver(0, 0, scaled_vector[0], scaled_vector[1], angles='xy', scale_units='xy', scale=1, color=colors[i], label=labels[i], alpha=0.5)
    ax.text(scaled_vector[0]/2, scaled_vector[1]/2, labels[i], color=colors[i], fontsize=14)

# Set limits and grid
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
plt.grid()

# Axes labels and title
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Scalar Multiplication of a Vector')

# Aspect ratio
ax.set_aspect('equal', adjustable='box')

plt.legend(loc='upper left')
plt.show()
[Figure: scalar multiplication of a vector, showing \(\mathbf{a}\) and its scaled versions]

This plot shows the original vector \(\mathbf{a}\) (in blue) and two scaled versions using the scalars \(-1\) and \(1.3\). The scaled vectors demonstrate reflection through the origin and stretching, respectively.

\(k\)-means Clustering in Euclidean space#

Now that we have explored the basic properties of Euclidean space, we can apply these concepts to machine learning tasks. We will discuss the \(k\)-means clustering algorithm in Euclidean space, a popular unsupervised learning method that partitions data into distinct groups based on their feature vectors. It uses only vector addition and scalar multiplication, the basic operations of a vector space.

The algorithm works as follows:

  1. Initialization: Randomly select \(k\) initial cluster centroids from the dataset.

  2. Iterate over the following steps until convergence:

  • Assignment Step: For each data point, assign it to the nearest cluster centroid based on the Euclidean distance.

\[ \text{argmin}_k \|\mathbf{x} - \mathbf{c}_k\|^2 \]

where \(\mathbf{c}_k\) is the centroid of cluster \(k\) and \(\mathbf{x}\) is the data point.

  • Update Step: Recalculate the centroid vectors of the clusters by taking the mean of all data vectors assigned to each cluster. This uses the vector addition and scalar multiplication operations:

\[ \mathbf{c}_k = \frac{1}{N_k}\sum_{i:y_i=k} \mathbf{x}_i \]

where \(N_k\) is the number of points assigned to cluster \(k\) and \(y_i\) is the label of the data point \(\mathbf{x}_i\).
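To make the two steps concrete before wrapping them in a class, here is a minimal NumPy sketch of a single \(k\)-means iteration on a tiny made-up dataset (the arrays and values are assumptions for illustration only):

import numpy as np

# Toy data: one point per row, and two current centroids
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centers = np.array([[0.0, 0.5], [5.5, 5.0]])

# Assignment step: distance from every point to every centroid, then argmin
distances = np.linalg.norm(X[:, np.newaxis] - centers, axis=2)
labels = np.argmin(distances, axis=1)

# Update step: each new centroid is the mean (a vector sum scaled by 1/N_k)
# of the points assigned to it
new_centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
print(labels)       # [0 0 1 1]
print(new_centers)  # one centroid per row: [0, 0.5] and [5.5, 5]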

We will implement a Python class for the \(k\)-means algorithm, which includes methods for fitting the model to the data and predicting cluster assignments for new data points.

We will implement \(k\)-means Clustering with two methods:

  1. fit() – This method performs the clustering by the iterative procedure above.

  2. predict() – This method returns cluster assignments for data points based on their proximity to the cluster centers.

import numpy as np

class KMeans:
    def __init__(self, n_clusters=3):
        self.n_clusters = n_clusters

    def fit(self, X, num_iterations=10):
        # Randomly pick distinct data points as the initial cluster centers
        idx = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centers = X[idx].astype(float)
        labels = np.zeros(X.shape[0])

        for _ in range(num_iterations):  # iterate at most a fixed number of times
            old_labels = labels.copy()

            # Assignment step: assign each point to its closest center
            labels = self.predict(X)

            # Update step: each center becomes the mean of its assigned points
            for i in range(self.n_clusters):
                if np.any(labels == i):  # guard against empty clusters
                    self.centers[i] = X[labels == i].mean(axis=0)

            # Stop early once the assignments no longer change
            if np.all(labels == old_labels):
                break

    def predict(self, X):
        # Distance from every point to every center, then pick the closest
        distances = np.linalg.norm(X[:, np.newaxis] - self.centers, axis=2)
        return np.argmin(distances, axis=1)

Now we can generate a dataset of random vectors in \(\mathbb{R}^2\) and apply the \(k\)-means algorithm to it to find cluster centers \(\mathbf{c}_k\) and the corresponding cluster assignments for all data vectors.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 50  # samples per cluster
centers = [[3, 4], [8, 3], [2, 10], [9, 9]]
sigmas = [0.5, 1, 1.5, 2]  # per-cluster variance
dataset = np.zeros((0, 3))

# Generate clusters: draw n points around each center from a 2D Gaussian
for i in range(len(centers)):
    sigma = sigmas[i]
    # random covariance term, kept smaller than the variance so the matrix stays positive definite
    covariance = (np.random.rand() - 0.5) * sigma
    cov = [[sigma, covariance], [covariance, sigma]]
    cluster = np.random.multivariate_normal(centers[i], cov, n)
    label = np.zeros((n, 1)) + i  # true cluster index, kept as a third column
    dataset = np.vstack([dataset, np.hstack([cluster, label])])

# Create a KMeans instance and fit it to the dataset
kmeans = KMeans(n_clusters=4)
kmeans.fit(dataset[:,:2], num_iterations=10)
cluster_assignments = kmeans.predict(dataset[:,:2])

# plot the cluster centers and the clustering of the dataset
plt.figure(figsize=(5, 5))
plt.scatter(kmeans.centers[:, 0], kmeans.centers[:, 1], c='red', s=200, alpha=0.5)
plt.scatter(dataset[:, 0], dataset[:, 1], c=cluster_assignments, s=100, alpha=0.5)
plt.title("$k$-means clustering of a dataset and cluster centers")
plt.show()
[Figure: \(k\)-means clustering of the dataset with the cluster centers marked in red]

In the plot, the red dots represent the cluster centers \(\mathbf{c}_k\) found by the \(k\)-means algorithm, while the colored points represent the data vectors \(\mathbf{x}_n\) assigned to each cluster. The colors indicate which cluster each point belongs to.

Nearest Centroid Classifier in Euclidean space#

As another example of a machine learning algorithm that uses only simple vector operations, let's look at the Nearest Centroid Classifier. As mentioned earlier, classification is the task of predicting a class label \(y\) for a given feature vector \(\mathbf{x}\). The Nearest Centroid Classifier is a straightforward method that assigns each data point to the class whose centroid lies closest to it.

Training the Algorithm#

Training the algorithm involves calculating the centroid for each class. The centroid is the mean of the feature vectors for each class, and it can be calculated using the formula:

\[ \mathbf{c}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \mathbf{x}_i \quad \text{for class} \ k \]

Where:

  • \(\mathbf{c}_k\) is the centroid for class \(k\)

  • \(N_k\) is the number of samples in class \(k\)

  • \(\mathbf{x}_i\) is the feature vector of the \(i\)-th sample in class \(k\)
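A minimal NumPy sketch of this training step, assuming a feature matrix X with one sample per row and a label vector y (the values below are made up for illustration):

import numpy as np

# Toy data: one feature vector per row, with a class label for each row
X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

# One centroid per class: the mean of the feature vectors belonging to that class
centroids = {k: X[y == k].mean(axis=0) for k in np.unique(y)}
print(centroids)  # class 0 -> [1.5, 2.0], class 1 -> [8.5, 8.5]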

Prediction#

Prediction involves assigning the class to the observation \(\mathbf{x}\) by measuring the distance between the observation and the centroids. The class is assigned to the centroid that is closest to the observation. The prediction is made using the formula:

\[ \hat{y} = \arg\min_k \, \|\mathbf{x} - \mathbf{c}_k\| \]

Where:

  • \(\hat{y}\) is the predicted class label

  • \(\mathbf{x}\) is the feature vector of the observation

  • \(\mathbf{c}_k\) is the centroid of class \(k\)

  • \(\|\mathbf{x} - \mathbf{c}_k\|\) is the distance between the observation and the centroid (usually Euclidean distance)
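Because the prediction only compares Euclidean distances to the centroids, the decision boundary between two classes is linear: a point \(\mathbf{x}\) lies on the boundary exactly when it is equally far from both centroids, and expanding the squared distances gives

\[ \|\mathbf{x} - \mathbf{c}_0\|^2 = \|\mathbf{x} - \mathbf{c}_1\|^2 \;\Longleftrightarrow\; 2(\mathbf{c}_1 - \mathbf{c}_0)^\top \mathbf{x} = \|\mathbf{c}_1\|^2 - \|\mathbf{c}_0\|^2, \]

which is a linear equation in \(\mathbf{x}\). Geometrically, the boundary is the perpendicular bisector of the segment joining the two centroids; we will see this straight boundary in the plot below.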

Implementing the Nearest Centroid Classifier#

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

We will implement the Nearest Centroid Classifier with two methods:

  1. fit() – This method trains the model by calculating the centroid for each class.

  2. predict() – This method makes predictions based on the trained centroids.

class NearestCentroidClassifier:
    def __init__(self):
        self.centroid_0 = None
        self.centroid_1 = None
        self.class_0 = None
        self.class_1 = None

    def fit(self, X, y):
        """
        Fit the model using binary-labeled data X and y.
        Assumes only two unique class labels.
        """
        classes = np.unique(y)
        assert len(classes) == 2, "Only binary classification supported."

        self.class_0, self.class_1 = classes
        self.centroid_0 = X[y == self.class_0].mean(axis=0)
        self.centroid_1 = X[y == self.class_1].mean(axis=0)

    def predict(self, X):
        """
        Predict labels for X based on closest centroid (Euclidean distance).
        """
        dist_0 = np.linalg.norm(X - self.centroid_0, axis=1)
        dist_1 = np.linalg.norm(X - self.centroid_1, axis=1)
        return np.where(dist_1 < dist_0, self.class_1, self.class_0)

Breast Cancer Diagnosis as a Binary Classification Problem#

We will again use the Wisconsin Diagnostic Breast Cancer (WDBC, 1993) dataset. The data consists of two numerical features that describe the cells visible under the microscope in breast tissue samples, together with the diagnosis of whether the tissue is benign (B) or malignant (M). The two features represent the average concavity and texture of the nuclei and have been determined using image processing techniques [1].

# fetch dataset from Kaggle
import kagglehub
path = kagglehub.dataset_download("uciml/breast-cancer-wisconsin-data/versions/2")
data = pd.read_csv(path+"/data.csv")
x = data[["concavity_mean", "texture_mean"]] # pick two features
# normalize the columns of x individually
x = (x - x.min()) / (x.max() - x.min())

# y holds the labels (M or B), x holds the two features selected above
y = data.diagnosis

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
# Create and train the Nearest Centroid Classifier
classifier = NearestCentroidClassifier()
classifier.fit(x_train.values,y_train.values)
print("Centroids: ", classifier.centroid_0, classifier.centroid_1)
# Predict the classes for the test data
y_pred = classifier.predict(x_test.values)
# Calculate and print the accuracy
print(("Accuracy: %.2f" % accuracy_score(y_test, y_pred)))

# Create meshgrid for plotting decision boundaries
xx, yy = np.meshgrid(np.linspace(0, 1, 300),
                     np.linspace(0, 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
# Predict the class for each point in the meshgrid
y_vals = np.unique(y)
Z = classifier.predict(grid)
Z_bin = Z==y_vals[1]
zz = Z_bin.reshape(xx.shape)
# Plot the decision boundary
plt.figure(figsize=(10, 10))
plt.contourf(xx, yy, zz, alpha=0.8)
# Plot also the training points
y_bin = y_train==y_vals[1]
plt.scatter(x_train.concavity_mean[y_bin], x_train.texture_mean[y_bin], alpha=0.8, color="r")
plt.scatter(x_train.concavity_mean[~y_bin], x_train.texture_mean[~y_bin], alpha=0.8, color="b")
legend = ["$c_1$ (M)", "$c_2$ (B)"]
if x_test is not None:
  plt.scatter(x_test.concavity_mean, x_test.texture_mean, alpha=1, color="w",marker='o',edgecolors='k', s=50)
  legend = ["$c_1$ (M)", "$c_2$ (B)", "?"]
plt.scatter(classifier.centroid_0[0], classifier.centroid_0[1], color="b",marker='o',edgecolors='k', s=250)
plt.scatter(classifier.centroid_1[0], classifier.centroid_1[1], color="r",marker='o',edgecolors='k', s=250)
plt.title("Breast Cancer Diagnosis")
plt.xlabel("concavity (normalized to range [0,1])")
plt.ylabel("texture (normalized to range [0,1])")
plt.legend(legend)
plt.show()
Centroids:  [0.10661361 0.27544877] [0.38011234 0.4145867 ]
Accuracy: 0.86
[Figure: breast cancer diagnosis, decision boundary with training points, test points, and class centroids]

In the plot, the red and blue dots represent the training data points for malignant and benign samples, respectively. The white dots represent the test data points. The decision boundary is shown in the background, where the color indicates the predicted class for each point in the feature space. The large circles indicate the centroids of each class.

Summary#

We have introduced the concept of vector spaces and their defining properties, with Euclidean space \(\mathbb{R}^n\) as the most important example. We have also discussed the \(k\)-means clustering algorithm and the Nearest Centroid Classifier, both of which operate in Euclidean space using only basic vector operations.