Its edges are straight lines which connect the outermost points. The convex hull of points in 2D is the smallest convex polygon that encloses all the points. Polygons: convex hulls The convex hull of a set of points. This is the main reason the number of tests for the polygon algorithm is much higher. 2Īnother issue is that it is harder to account for all cases.Įdge cases can be isolated directly in the mathematics for ellipses, but for polygons it requires actually testing each case. This means operations require looping through all the edges and so are of order $O(n)$ or higher, such as finding the area, the perimeter or whether or not a point lies within the polygon.Ĭompare this to ellipses which are defined by a single equation, so all the previous operations can be found directly via formulas in $O(1)$ time. This method returns an IoU of 0 for any distribution that does not intersect it so we have no information about the third distribution.Īnother difficulty with the polygon method is that polygons are described by $n$ edges. For example, in the following image is distribution 2 or distribution 3 more similar to distribution 1? While further statistical techniques can be applied to mitigate the outlier problem, a fundamental issue is that it does not balance the contributions of means and variance well. It has a time complexity $O(nh)$ for $n$ points and $h$ points on each convex hull.Ĭalculating the intersections once they are found is much faster: $O(h_1 h_2)$.Ī problem with this approach is that outliers can have a dramatic effect: The outlier on the top left greatly reduces the IoU while the outlier on the far right greatly increases it. This gives a value that varies between 0 and 1.įinding the convex hulls is the bottleneck in this algorithm. The widget above allows you to adjust these five variables to create many variants of this distribution.Ī bivariate normal distribution with correlation $\rho$ can be constructed from two normal distributions $Z_1$ and $Z_2$ as follows: The correlation $\rho$ between the $x$ and $y$ direction.The standard deviations $\sigma_x$ and $\sigma_y$ in the $x$ and $y$ directions respectively.The means $\mu_x$ and $\mu_y$ in the $x$ and $y$ directions respectively.This is a very common type of distribution that arises naturally in many different situation thanks to the central limit theorem.Ī bivariate (2D) normal distribution is described by five variables: The data that I was working with was normally distributed. There are many different kinds of 2D distributions. Preliminaries Bivariate normal distributions If one bases the effort solely on LOC, the original polygon method required more than 30× the effort to the Wasserstein metric.Ī more detailed breakdown is available at the end in the Code analysis section.įor those who just want to compare scatter plots the right way, jump to How to compare scatter plots. The above table shows where the code can be viewed, the time complexity and the approximate number lines of code (LOC). Gnimuc/Hungarian.jl, /LiorSinai/AssignmentProblem.jl Instead I provide a high level overview and references are given for further information. There are too many algorithms to go into proper detail for each one. This post is the outcome of that investigation. We also brainstormed using circles and ellipses. How much more difficult was it to do the polygon method? Later in my own capacity I challenged myself to try the other methods. (Users weren’t shown the scores just similar plots.) Thankfully the client was open to my suggestions and we did implement the Wasserstein metric. However the other two points remain valid. I’ve crossed out the third point because the convex hull method can be made fast. The convex hull method would be much slower than the Wasserstein metric.It would take too much development time and this was only one of many tasks.For example, it does not handle outliers well. The statistical properties of the polygon method are not adequate.A comparison of different similarity metrics for 2D scatter plots. In more technical terms, they wanted to find convex hulls of the points and then calculate the areas of intersection between them. They wanted to enclose the points in shapes and then calculate the overlap of those shapes. Unfortunately for me, the client came with a solution in mind already. This is the “statistical distance” in the widget. I was drawn to the Wasserstein metric which is used in the popular Fréchet Inception Distance (FID) in machine learning. ( Source code.)įurthermore, there are standard statistical techniques for this sort of problem. The above is a demo of the problem at hand. It is an unusual request but not unheard of - see this question or this one. A while back I was given an intriguing task: rank scatter plots by similarity.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |