In this article, I will present the concept of data vectorization using the NumPy library. We will benchmark several approaches to computing the Euclidean distance efficiently.
NumPy is a Python library for manipulating multidimensional arrays in a very efficient way. It is also the base for scientific libraries (like pandas or SciPy) that are commonly used by Data Scientists in their daily work. However, sometimes you find yourself in a situation in which these libraries are not sufficient, and you need to go deeper, right down to NumPy. So how can we exploit NumPy to boost our application's performance and save a lot of time? The answer is vectorization!
What is vectorization?
The power that dwells within NumPy is that it performs looping over elements in the ‘C layer’ instead of the ‘Python layer.’ The Python interpreter is an order of magnitude slower than compiled C code, so it makes sense to replace any explicit looping over elements with NumPy’s built-in functions; this is called vectorization. Let’s see NumPy in action.
Euclidean distance is a well-established mathematical term; therefore I won’t discuss it at length. Generally speaking, it is the straight-line distance between two points in Euclidean space. The formula for the Euclidean distance between two vectors v, u ∈ Rⁿ is:
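(The original formula image is not reproduced here; in LaTeX notation, the standard definition reads:)

```latex
d(v, u) = \sqrt{\sum_{i=1}^{n} (v_i - u_i)^2}
```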
Let’s write some algorithms for calculating this distance and compare them.
Let’s start with a simple (pure) python algorithm that would most probably come to mind:
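The original listing isn’t reproduced here; a minimal sketch of such a pure-Python implementation (the function name is my own) could be:

```python
import math

def dummy_euclidean(v, u):
    """Naive Euclidean distance: explicit Python loop over elements."""
    total = 0.0
    for v_i, u_i in zip(v, u):
        total += (v_i - u_i) ** 2
    return math.sqrt(total)

print(dummy_euclidean([0, 0], [3, 4]))  # 5.0
```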
This algorithm is an exact reflection of the definition. It iterates over every vector element, does some computation, and sums the intermediate results. Easy-peasy…
To be more pro, we can replace it with a one-liner:
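A one-liner version of the same loop, using a generator expression, might look like this:

```python
import math

def oneliner_euclidean(v, u):
    # Same computation as the loop version, compressed into one expression
    return math.sqrt(sum((v_i - u_i) ** 2 for v_i, u_i in zip(v, u)))
```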
Let’s rewrite the dummy algorithm using numpy to benefit from the vectorization.
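A sketch of the vectorized NumPy version (function name is my own):

```python
import numpy as np

def bare_numpy_euclidean(v, u):
    # v and u are 1D numpy arrays; the element loop runs in C, not Python
    z = v - u
    return np.sqrt(np.sum(z ** 2))
```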
As we can see, the solution is straightforward and concise. ‘numpy.array’ overloads the arithmetic operators (+, -, *, /, **) so that we can operate on whole arrays instead of their individual coefficients. Notice that we use ‘numpy.sum’ instead of Python’s built-in sum, so the implicit loop inside the sum function is optimized as well.
It turns out that the SciPy library already has a built-in function for calculating the Euclidean distance. Fantabulous!
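That built-in is ‘scipy.spatial.distance.euclidean’:

```python
from scipy.spatial import distance

# SciPy's built-in Euclidean distance
d = distance.euclidean([0, 0], [3, 4])
print(d)  # 5.0
```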
Sharp-eyed mathematicians will notice that it is nothing other than the L2 norm of the difference of the vectors.
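NumPy exposes this directly through ‘numpy.linalg.norm’; a sketch of the L2-norm version:

```python
import numpy as np

def l2_norm_euclidean(v, u):
    # Euclidean distance as the L2 norm of the difference vector
    return np.linalg.norm(v - u)
```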
The Einstein summation convention is an elegant way to express common operations on matrices, like a dot product, a sum over indices, or a matrix transposition. At first, it may look impractical due to its complex syntax, but it turns out to be implemented very efficiently.
In this case the ‘i,i->’ expression translates to dot(z, z).
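A sketch of the Einsum-based version using ‘numpy.einsum’ (function name is my own):

```python
import numpy as np

def einsum_euclidean(v, u):
    z = v - u
    # 'i,i->' multiplies z[i] * z[i] and sums over i, i.e. dot(z, z)
    return np.sqrt(np.einsum('i,i->', z, z))
```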
I’ve found a nice python library perfplot for benchmarking and plotting at once. I will now present the benchmark of three common real-life cases.
High Dimensional Vector
Sometimes we need to operate on very long arrays. Let’s compare the algorithms depending on the length of the vectors v and u.
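The perfplot setup itself isn’t shown here; as a simpler illustration of the same idea, here is a self-contained comparison using the standard-library ‘timeit’ module (perfplot additionally sweeps over input sizes and plots the curves):

```python
import math
import timeit

import numpy as np

def dummy(v, u):
    # Pure-Python baseline
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))

def bare_numpy(v, u):
    # Vectorized NumPy version
    z = v - u
    return np.sqrt(np.sum(z ** 2))

n = 10_000
v, u = np.random.rand(n), np.random.rand(n)

t_dummy = timeit.timeit(lambda: dummy(v, u), number=20)
t_numpy = timeit.timeit(lambda: bare_numpy(v, u), number=20)
print(f"dummy: {t_dummy:.4f}s  bare numpy: {t_numpy:.4f}s")
```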
Surprised? It seems that the dummy function can be even 100 times slower than any other approach. The curves of the remaining functions have a similar shape and lie close to each other for larger vectors.
In real applications, there is often a need to calculate distances from many points at once, for instance in map applications or clustering problems. So, instead of computing the distance between a single pair of vectors, we must compute it for many pairs. How to do that? Again, using vectorization!
We can pack all the vectors into matrices and perform operations over matrices. Voilà!
We need to slightly change our algorithm to operate on 2D-arrays instead of 1D-arrays:
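A sketch of the 2D variants, where each row of V and U holds one vector (function names are my own):

```python
import numpy as np

def bare_numpy_euclidean_2d(V, U):
    # V and U have shape (n_points, dim); summing over axis=1
    # yields one distance per row
    Z = V - U
    return np.sqrt(np.sum(Z ** 2, axis=1))

def einsum_euclidean_2d(V, U):
    Z = V - U
    # 'ij,ij->i' computes a row-wise dot product
    return np.sqrt(np.einsum('ij,ij->i', Z, Z))
```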
Many 2D points
Omg! Our dummy algorithm is faster than SciPy’s ‘distance’ function. Wow! Such win!
That’s true for 2D points, and also for 3D. Whether it also holds for higher-dimensional vectors remains to be seen. But the more important conclusions are:
- The fastest algorithm is Einsum, and it is about 150 times faster than Dummy!
- The Bare NumPy is almost as fast as Einsum, although Bare NumPy seems to be less complex!
Let’s now compare the results for many vectors that are also much longer…
Many 100D Points
High-dimensional vectors appear in many clustering problems. For example, we may have a model with a large number of features and need to perform k-nearest-neighbor (k-NN) classification. Naturally, we don’t want to wait forever for the results, so every optimization plays a role. Let’s see how the algorithms performed.
And again, Einsum turns out to be the fastest. Bare NumPy breathes down the winner’s neck. Dummy comes last.
NumPy supports vectorization natively, so we can perform arithmetic computations on data faster. It has a very user-friendly syntax, and that alone can boost your application’s performance dramatically. But for the best results, I recommend learning more about exotic concepts like Einsum.
Thank you for reading! If you enjoyed this article don’t hesitate to like, share and comment 😉 .
Written by: Piotr Skoroszewski, Big Data Engineer at Semantive.