Selecting SQL Data with Non-Duplicate Column Values

I ran into an interesting SQL problem this week. I needed to select “unambiguous” data in a sense — only rows that did not contain duplicate values in one of the columns. Really the only way to explain is with an example. Consider this hypothetical data in a table named tblSectorData:

pid  sector
==========
1111  1
2222  2
2222  3
3333  4
5555  5
5555  6
6666  2
7777  2
8888  2
8888  4

The PIDs represent users and sector is a geo-location. Most users have just one location sector. But some users have two or more sectors. For example user 2222 is in sector 2 and also sector 3. Given a particular sector value, I want to select all the users that are in that sector, but only those users that are associated with just one sector. For example, if my target sector is 2, then a simple SELECT pid FROM tblSectorData WHERE sector=2 would give me users 2222, 6666, 7777, 8888. But I only want users 6666 and 7777 because 2222 and 8888 have multiple sectors.

This is a cousin to, but significantly different from, the well-known problem of identifying and removing duplicate rows.

As usual with SQL, there are about a bazillion different ways to attack this problem. Here is one possibility I finally got working:

select A.pid
from tblSectorData A
join tblSectorData B
on A.pid=B.pid
where B.sector=2
group by A.pid
having COUNT(A.pid) = 1

To be honest, the statement is really ugly to me as a procedural paradigm programmer, but sometimes with SQL you’re satisfied just to get anything that works. I also hate SQL because I know in my heart there must be some super-efficient, uber-cool trick I don’t know about. Curse you SQL!

The key is the self-join. Consider this statement:

select A.pid, A.sector, B.pid, B.sector
from tblSectorData A
join tblSectorData B
on A.pid=B.pid
where B.sector=2

The result is:

pid   sector  pid   sector
--------------------------
2222   2      2222   2
2222   3      2222   2
6666   2      6666   2
7777   2      7777   2
8888   2      8888   2
8888   4      8888   2

I could also have used A.sector=2 in the WHERE clause. Notice that if we group by pid, we can find instances where there are more than just a single associated sector by using COUNT(pid), or in other words, by selecting just rows where COUNT(A.pid) = 1 we get the desired result.

2 Responses to Selecting SQL Data with Non-Duplicate Column Values

mvaneerde says:

September 20, 2012 at 8:39 am

This technique works for columns that admit for the < operator:

SELECT
pid
FROM
tblSectorData
GROUP BY
pid
HAVING
COUNT(*) = 1 AND MAX(sector) = @sector

This version is slightly more general. Note the inner query does not depend on the outer, so this is not really a JOIN.a

SELECT
pid
FROM
tblSectorData
WHERE
sector = @sector AND
pid IN (
SELECT
pid
FROM
tblSectorData
GROUP BY
pid
HAVING
COUNT(*) = 1
)

Loading...
- jamesdmccaffrey says:
  
  September 27, 2012 at 10:03 am
  
  Very cool; thank you very much for the comments. I prefer your two techniques because they are a bit cleaner and easier to understand.
  
  Loading...