Closest Pair Problem For Beginners

SPOJ: CPP - Closest Pair Problem ( Closest Pair,  Divide and Conquer)

Problem Summary:
According to problem, We will be given n points on the plane, each represented by (x, y) coordinates,  our task is to find a pair of points with the smallest distance between them and print only distance between them.

This is a standard closest pair of points problem or closest pair problem ( Wiki Link ).

So, Closest pair problem is a problem in computational geometry where we will be provided an array of points as an input set and we have to find out two points among them whose distance is minimum and find out distance between these points as an output. Particularly we consider all the points lies in two dimensional space and we euclidean distance as a measure for the distance among points.

Solution Idea:
One obvious brute-force solution could be picking all possible pairs of points and finding out euclidean distance among them and hence selecting a pair with minimum distance among them. The sample code for brute-force approach will be something like below:


#include <bits/stdc++.h>
#define X first
#define Y second
#define inf LLONG_MAX
#define sqr(x) ((x)*(x))
#define pdd pair<double,double>
using namespace std;

// calculates euclidean distance between two points
double euclideanDistance(pdd &p1, pdd &p2) {
    return sqrt( sqr(p1.X-p2.X) + sqr(p1.Y-p2.Y) );
}

// calculates closest pair distance between two points and these two points
double closestPair(vector<pdd>&points, pdd &p1, pdd &p2) {
    double closestDistance = LLONG_MAX * 1.0; // max possible distance
    for (int i = 0; i < points.size(); ++i) {
        for (int j = i - 1; j >= 0; --j) {
            double dis = euclideanDistance(points[i], points[j]);
            if( dis < closestDistance ) {
                closestDistance = dis;
                p1 = points[i], p2 = points[j];
            }
        }
    }
    return closestDistance;
}

int main() {
    int n; // total points
    pdd p1, p2;  // closest pair points will be stored
    cin>>n;
    vector<pdd>points;
    for (int i = 0; i < n; ++i) {
        double x, y;
        cin>>x>>y; //scanning cordinates of each points
        points.push_back( make_pair(x, y) );
    }
    double closestDistance = closestPair(points, p1, p2);

    cout<< "Closest pair points are:\n";
    cout<< "One point is : "<< "( "<<p1.X<< " , "<<p1.Y<< " )\n";
    cout<< "Other point is : "<< "( "<<p2.X<< " , "<<p2.Y<< " )\n";
    cout<<fixed<<setprecision(6)<<closestDistance<<endl;
    return 0;
}

In above code, lets look at the ClosestPair method, which accepts four arguments in which , points represent input array of points and remaining parameters will store the results of Closest pair points computation. p1 and p2 will store one of closest pair points.
Here, we used two for loops , one for loop for iterating over n points and fixing the first one and another one for fixing another one so that we will go through all possible pairs and calculate euclidean distance among them and find out the minimum distance among all. Basically, if there are n points in input data-set then we need to check n*(n-1)/2 pairs of points and find out the minimum distance among them. The overall complexity of this approach is O( n * n )

Lets imagine, What happens if input data-set is huge such that total points are n = 100000 (10^5).

We need to calculate distances approximately ( 10^5 )^2 = 10^10 number of times. If we assume that 10^8 operations executes in 1 sec time, then it will take 100 seconds (1 minute 40 seconds) to executes above code. Therefore, in case of large input brute-force method fails. 😟 😟

Is there any time optimization we can do ? 😕 

For simplicity, lets assume that all of the given points lies in same axis. Then we can simply sort points according to increasing values of that axis and find the minimum distance among all adjacent pair of points. Since,one of the coordinate value is fixed in all points, simply sorting will help us as we do not need to worry about fixed coordinate's value i.e. only one coordinate is our distance deciding factor.
For example , if all points lies in origin x-axis, say (0,0), (6,0), (9,0), (20,0), (4,0).
Then, after sorting based on x-axis: (0,0), (4,0), (6,0), (9,0), (20,0).
Now, closest pair will be adjacent pairs (4,0) and (6,0) with distance 2.
(Why this works?)
As we know the euclidean distance formula for two points (x1,y1) and (x2,y2) is:
            __________________________
distance = √ (x1 - x2)^2 + (y1 - y2)^2

since, y-coordinate is constant for every points, our distance formula will be reduced to:
            __________________________      _________
distance = √ (x1 - x2)^2 + (0 - 0)^2    = √ (x1 - x2)^2

Since, distance is only dependent on changeable coordinate so definitely one of the adjacent pair of points after sorting based on changeable axis will be the closest pair points. We can perform sorting using standard algorithms like merge-sort, quick-sort or heap-sort and store values in linear array and then we can calculate the adjacent distance so as to find out the closest pair points. Moreover, we can perform such calculation during the conquering part of divide and conquer based merge-sort algorithm itself as well. The calculation with divide and conquer based approach works here because, for a range we divide it into two halves ( say, left half and right half with some middle index - mid ). Then, while conquering our closest distance, it can either closest distance of left half or closest distance of right half or the distance between left half's end point and right half's start point.

Now, lets consider what happens when both coordinates are changeable?
In this case, the above applied method of sorting based upon only one axis won't work, since our euclidean distance formula is dependent on both x and y coordinates. 

But, we can we still apply the divide and conquer logic assuming that points are sorted based on x-ordinates. So, each time some mid index point divides the whole range of points into left half and right half each time using recursion and finally results from both sides are conquered to get the final result in order to find the closest pair points for a given range. For each range say [start, end] if we divide it using some index say mid, then closest pair of points can be obtained from either of the following three criteria:
  1. Both closest pair points are in left half i.e [start, mid] having distance dl
  2. Both closest pair points are in right half i.e [mid + 1, end] having distance dr
  3. Two points are in different halves ( i.e one from left half and another from right half ) and they are in a boundary box having width of d = min(dl, dr) to the both sides from the mid point i.e [(mid - d) distance, (mid + d) distance].

Since, we can retrieve closest pair distance for left half and right half recursively, lets think about finding closest pair distance in a boundary box for a particular range [start, end] with middle point - mid. For finding the closest pair point there, we may store only those points which are at most d absolute distance away from the mid point. Say there are m such points, then to find the closest pair between them we can perform either of the following two ways:
  1. For each m points calculate the distances with other remaining points and find out the closest distance among them, which will cost us O(m * m) time complexity to find out the closest pair point in a range.
  2. Sort all those m points based on y -coordinate and calculate the distances between only those pairs which are at most d - distance apart, which will reduce our time complexity in greater extent.
Although in way number- 2, it may seems that even after sorting those m points, we may end up checking lots of different points for each of those m points, but eventually we will have to check at most 6 different points for each of those m points, which leads to the time complexity of O(6 * m) ⇄ O(m) for overall checking. As we are concerning for a rectangle with one side d and other side 2*d, we can find the proof for our conclusion here.

The sample code for our divide and conquer method will be like this:

#include <bits/stdc++.h>
#define X first
#define Y second
#define inf LLONG_MAX
#define sqr(x) ((x)*(x))
#define pdd pair<double,double>
using namespace std;

// find eulidean distance between two points
double euclideanDistance(pdd &p1, pdd &p2) {
    return sqrt( sqr(p1.X-p2.X) + sqr(p1.Y-p2.Y) );
}

// calculates closest pair distance between two points and these two points
double closestPair(vector<pdd>&pointsX, vector<pdd>&pointsY, int lo, int hi, pdd &p1, pdd &p2) {
    if (lo >= hi) return inf; // Base case for only 1 point
    int md = lo + (hi - lo) / 2;
    double dl = closestPair(pointsX, pointsY, lo, md, p1, p2); // left side's minimum distance
    double dr = closestPair(pointsX, pointsY, md + 1, hi, p1, p2);// right side's min distance
    double d = (dl >= dr)? dr : dl; //take min among two sides
    sort(pointsY.begin() + lo, pointsY.begin() + hi + 1); // sort based on y cordinate
    vector<pdd>points(hi - lo + 1);
    int m = 0; // total current points
    for (int i = lo; i <= hi; ++i) {
        if (abs(pointsY[i].X - pointsX[md].X) < d ) {
            points[m++] = pointsY[i]; //consider points which are in valid boundary
        }
    }
    for (int i = 0; i < m ; ++i) {
        for (int j = i + 1; j < m && (points[j].Y - points[i].Y) < d; ++j) {
            //inner loop goest at max 6 times
            double dis = euclideanDistance(points[i], points[j]);
            if (dis < d) {
                d = dis;
                p1 = points[i], p2 = points[j];
            }
        }
    }
    return d;
}

int main() {
    int n;
    pdd p1, p2;
    cin>>n;
    vector<pdd>pointsX(n);
    for (int i = 0; i < n; ++i) {
        double x, y;
        cin>>x>>y;
        pointsX[i] = make_pair(x, y);
    }
    sort( pointsX.begin(), pointsX.end() ); ///sorting on the basis of X
    vector<pdd>pointsY = pointsX;
    double closestDistance  = closestPair(pointsX, pointsY, 0, n - 1, p1, p2);
    cout<< "Closest pair points are:\n";
    cout<< "One point is : "<< "( "<<p1.X<< " , "<<p1.Y<< " )\n";
    cout<< "Other point is : "<< "( "<<p2.X<< " , "<<p2.Y<< " )\n";
    cout<<fixed<<setprecision(6)<<closestDistance<<endl;
    return 0;
}

In above closestPair method, we are dividing the given initial range of length n to upto logn times by dividing it by almost same length of left half and right half each time. So, overall depth of recursion is logn and and each depth, we are performing n*logn operations in total. So, overall time complexity of our above solution is:  O( n * logn * log n)  =  O( n log^2 n).

Can we optimize further ??

Yes. Instead of simply sorting range of points in each depth of recursion, we can apply the idea of merging in one linear pass, as we will have sorted array of points in both left half and right half. So, intermediate step of  O( n * logn ) can be reduced to O(n) by merging already sorted left and right halves of points. Hence, we can reduce our overall time complexity to O(n log n). 😃 😃

The sample code of our final optimized solution for closes pair problem can be like below:


#include <bits/stdc++.h>
#define X first
#define Y second
#define inf LLONG_MAX
#define sqr(x) ((x)*(x))
#define pdd pair<double,double>
using namespace std;

// find eulidean distance between two points
double euclideanDistance(pdd &p1, pdd &p2) {
    return sqrt( sqr(p1.X-p2.X) + sqr(p1.Y-p2.Y) );
}

// combine two sorted pointsY array to make it sorted in linear time
void combineYpoints(vector<pdd>&pointsY, int lo, int md, int hi) {
    int lenA = md - lo + 1, lenB = hi - md, i, j, k = lo;
    vector<pdd>tempA(lenA);
    vector<pdd>tempB(lenB);
    for (i = lo; i <= md ; i++) tempA[i - lo] = pointsY[i];
    for (i = md + 1; i <= hi ; i++) tempB[i - md -1] = pointsY[i];
    for (i = 0, j = 0; i < lenA && j < lenB;  ) {
        if (tempA[i].Y < tempB[j].Y) {
            pointsY[k++] = tempA[i++];
        } else {
            pointsY[k++] = tempB[j++];
        }
    }
    while (i < lenA) pointsY[k++] = tempA[i++];
    while (j < lenB) pointsY[k++] = tempB[j++];
}

// calculates closest pair distance between two points and these two points
double closestPair(vector<pdd>&pointsX, vector<pdd>&pointsY, int lo, int hi, pdd &p1, pdd &p2) {
    if (lo >= hi) return inf; // Base case for only 1 point
    int md = lo + (hi - lo) / 2;
    double dl = closestPair(pointsX, pointsY, lo, md, p1, p2); // left side's minimum distance
    double dr = closestPair(pointsX, pointsY, md + 1, hi, p1, p2);// right side's min distance
    double d = (dl >= dr)? dr : dl; //take min among two sides
    combineYpoints(pointsY, lo, md, hi); 
    vector<pdd>points(hi - lo + 1);
    int m = 0; // total current points
    for (int i = lo; i <= hi; ++i) {
        if (abs(pointsY[i].X - pointsX[md].X) < d ) {
            points[m++] = pointsY[i]; //consider points which are in valid boundary
        }
    }
    for (int i = 0; i < m ; ++i) {
        for (int j = i + 1; j < m && (points[j].Y - points[i].Y) < d; ++j) {
            //inner loop goes at max 6 times
            double dis = euclideanDistance(points[i], points[j]);
            if (dis < d) {
                d = dis;
                p1 = points[i], p2 = points[j];
            }
        }
    }
    return d;
}

int main() {
    int n;
    pdd p1, p2;
    cin>>n;
    vector<pdd>pointsX(n);
    for (int i = 0; i < n; ++i) {
        double x, y;
        cin>>x>>y;
        pointsX[i] = make_pair(x, y);
    }
    sort( pointsX.begin(), pointsX.end() ); ///sorting on the basis of X
    vector<pdd>pointsY = pointsX;
    double closestDistance  = closestPair(pointsX, pointsY, 0, n - 1, p1, p2);
//    cout<< "Closest pair points are:\n";
//    cout<< "One point is : "<< "( "<<p1.X<< " , "<<p1.Y<< " )\n";
//    cout<< "Other point is : "<< "( "<<p2.X<< " , "<<p2.Y<< " )\n";
    cout<<fixed<<setprecision(6)<<closestDistance<<endl;
    return 0;
}

Now, lets dive into the depth of Closest Pair roblem.
Lets take an example for the following sample:
Input:
5
0 0
-4 1
-7 -2
4 5
1 1
Lets assume indices are in the range from 1 to 5. If we represent element information in stack as two middle brackets then, Stack structure throughout the overall recursive calls will be like below:

[]bottom - - - top[]

[1,5]                        -- initial recursive call is made in range of [1,5]        
[1,5][1,3]                   -- [1,5] add left range [1,3] on the top of stack
[1,5][1,3][1,2]              -- [1,3] add left range [1,2] on the top of stack
[1,5][1,3][1,2][1,1]         -- [1,2] add left range [1,1] on the top of stack
[1,5][1,3][1,2]              -- [1,1] now returns inf and control goes to [1,2]
[1,5][1,3][1,2][2,2]         -- [1,2] add another range [2,2] on the top of stack
[1,5][1,3][1,2]              -- [2,2] returns inf and control goes to [1,2] again
[1,5][1,3]                   -- [1,2] perform boundary box calculation and returns (4.24264) and control goes to [1,3]
[1,5][1,3][3,3]              -- [1,2] add another range [3,3] on the top of stack
[1,5][1,3]                   -- [3,3] returns inf and control gies to [1,3] again
[1,5]                        -- [1,3] perform boundary box calculation and returns (4.12311) and control goes to [1,5]
[1,5][4,5]                   -- [1,5] add another range [4,5] on the top of stack
[1,5][4,5][4,4]              -- [4,5] add left range [4,4] on the top of stack
[1,5][4,5]                   -- [4,4] returns inf and control goes to [4,5] again
[1,5][4,5][5,5]              -- [4,5] add another right range [5,5] on the top of stack
[1,5][4,5]                   -- [5,5] returns inf and control goes to [4,5] again
[1,5]                        -- [4,5] performs boundary box calculation and returns (5) and control goes to [1,5]

// Now, [1,5] performs bounday box calculation and returns (1.414214) as an output, which is the final answer !!


We can apply similar idea to solve following problems.
1.)UVA 10245 - The Closest Pair Problem
2.) URI -1295 - The Closest Pair Problem
3.) Codechef - Closest Point queries

(Please let me know if you know any other similar problems !!)
(Also, Feel free to share this post if you think that anyone could be beneficial by this!!)
(Any corrections or suggestions or doubts regarding the post will be highly appreciated!!)


Comments

Popular posts from this blog

Two Pointers Technique & Binary Search For Beginners

HackerRank: Poisonous Plants ( Stacks, DS )

HackerRank: Find Maximum Index Product ( Stack, DS, Histogram Logic )