Monday, 4 May 2015

SQL: How to sort overlapping clusters efficiently

I'm trying to build clusters in a database with 10,000+ rows. It needs to be fast and efficient, so I'm using one binary (0/1) column per cluster. In the example below, One, Two, Four, Five and Six are in Cluster1.

But 'Two' might also be in Cluster2, because of errors I cannot avoid: my dataset comes from a web scrape. I try to assign each row to exactly one cluster, but it is basically impossible to avoid mistakes if I want to stay fast and efficient.

ID   Title    Cluster1    Cluster2   Cluster3    Unclustered
1    One      1           0          0           0
2    Two      1           1          0           0
3    Three    0           1          1           0
4    Four     1           0          1           0
5    Five     1           0          0           0
6    Six      1           1          1           0
7    Seven    0           0          0           1

My idea for a solution:

  1. Assign clusters (1s) until every row is clustered one or more times.
  2. Query for every row that has more than one cluster assigned (IDs 2, 3, 4 and 6 in the example).
  3. Manually decide which 1s to remove, until each row has only one cluster assigned.

(It's actually a good idea to do the third step manually, because it requires content analysis of the documents.)
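Step 3 could be sketched roughly like this. It is only an illustration using Python's built-in sqlite3 module with an in-memory database: the table name `docs` and the helpers `clear_flag` and `cluster_count` are made up for the example, and removing 'Two' from Cluster2 stands in for whatever the manual content analysis would actually decide.

```python
import sqlite3

# Hypothetical table `docs`; the columns mirror the example table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (ID INTEGER PRIMARY KEY, Title TEXT, "
    "Cluster1 INTEGER, Cluster2 INTEGER, Cluster3 INTEGER, Unclustered INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "One", 1, 0, 0, 0),
        (2, "Two", 1, 1, 0, 0),
        (3, "Three", 0, 1, 1, 0),
        (4, "Four", 1, 0, 1, 0),
        (5, "Five", 1, 0, 0, 0),
        (6, "Six", 1, 1, 1, 0),
        (7, "Seven", 0, 0, 0, 1),
    ],
)

ALLOWED_COLUMNS = {"Cluster1", "Cluster2", "Cluster3"}


def clear_flag(conn, doc_id, column):
    """Set one cluster flag back to 0 for a single row (one manual decision)."""
    # Column names cannot be bound as parameters, so check against a whitelist.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown cluster column: {column}")
    conn.execute(f"UPDATE docs SET {column} = 0 WHERE ID = ?", (doc_id,))


def cluster_count(conn, doc_id):
    """Return how many cluster flags are currently set for one row."""
    row = conn.execute(
        "SELECT Cluster1 + Cluster2 + Cluster3 FROM docs WHERE ID = ?",
        (doc_id,),
    ).fetchone()
    return row[0]


# Example manual decision: 'Two' (ID 2) stays in Cluster1 only.
clear_flag(conn, 2, "Cluster2")
count_after = cluster_count(conn, 2)
print(count_after)
```

Repeating the manual decision for each remaining row until `cluster_count` is 1 everywhere would complete the cleanup.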

My question:

How do I specify that I need to see everything with more than one cluster? Does it have something to do with constraints and unique values, or is there a simpler, more obvious way that I'm not seeing?
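One simple possibility, assuming the flags are stored as 0/1 integers, is to add the cluster columns together and keep the rows where the sum is greater than 1; no constraint is needed for that. The sketch below runs such a query through Python's built-in sqlite3 module on the example data. The table name `docs` and the alias `n_clusters` are assumptions made for the illustration.

```python
import sqlite3

# Hypothetical table `docs`; the columns mirror the example table in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (ID INTEGER PRIMARY KEY, Title TEXT, "
    "Cluster1 INTEGER, Cluster2 INTEGER, Cluster3 INTEGER, Unclustered INTEGER)"
)
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "One", 1, 0, 0, 0),
        (2, "Two", 1, 1, 0, 0),
        (3, "Three", 0, 1, 1, 0),
        (4, "Four", 1, 0, 1, 0),
        (5, "Five", 1, 0, 0, 0),
        (6, "Six", 1, 1, 1, 0),
        (7, "Seven", 0, 0, 0, 1),
    ],
)

# Because each flag is 0 or 1, the sum is the number of clusters per row.
query = """
    SELECT ID, Title, Cluster1 + Cluster2 + Cluster3 AS n_clusters
    FROM docs
    WHERE Cluster1 + Cluster2 + Cluster3 > 1
    ORDER BY ID
"""
overlapping = conn.execute(query).fetchall()

for row in overlapping:
    print(row)
# (2, 'Two', 2)
# (3, 'Three', 2)
# (4, 'Four', 2)
# (6, 'Six', 3)
```

If a flag can be NULL rather than 0, the sum becomes NULL and the row is silently dropped, so wrapping each column in `COALESCE(column, 0)` would be a safer variant.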
