Distant subsets.
Version of Thursday 7 February 2019.
Dave Barber's other pages.

Section 1. This report looks at finding a collection of equal-size subsets of a master set, each subset being "sufficiently distant" from all the others. Exactly what this means will be made clear as the explanation progresses.

We are dealing with small finite sets, and naïve set theory suffices.

Elements of the sets are represented by majuscules from the earlier part of the alphabet, such as A, B, and C. These elements cannot be assumed to have any properties, except that they exist and that distinct letters stand for assuredly unequal elements.

The names of subsets of the master set are majuscules from the latter part of the alphabet, such as W, X, and Y. An exception is Z, sometimes subscripted with its size, which denotes the master set itself. For example, U might equal {C, F, H}, which will usually be written as the more compact CFH when no confusion will result. A collection of subsets of Z9 = {A, B, C, D, E, F, G, H, I} = ABCDEFGHI might be written equivalently in any of the three following formats:

{{B, C, G}, {C, F, H}, {D, E, H}}
{BCG, CFH, DEH}
BCG CFH DEH

For standardization, everything is written in alphabetical order unless there is a reason not to.

The usual term for the quantity of elements in a set is cardinality, although the shorter term size is adequate here. If two sets are the same size, we define the distance between them to be the size of either set minus the size of their intersection; or equivalently, half the size of the symmetric difference between them. Examples:

• ABC and DEF — distance 3
• ABC and CDE — distance 2
• ABC and BCD — distance 1
• ABC and ABC — distance 0; in other words, equality

If two sets are of different sizes, we do not define a distance.

A briefer way of saying "less distant from" is "closer to".

Section 2. Now it can be more precisely asked: Given a master set, what is the largest collection of its subsets, all of a specified size, such that the distance between any pair of subsets satisfies a specified minimum?

A concise notation is DSC [m, s, d], where:

• "DSC" stands for "distant subset collection";
• m is the size of the master set;
• s is the size of each subset;
• d is the minimum distance between subsets.

In explaining the subject, it helps to start with a concrete example, namely DSC [7, 3, 2]: Given the master set Z7 = ABCDEFG, what is the largest collection of subsets of size 3 such that the distance between any pair of them is at least 2?

For reference, here is a table of all the 35 size-3 subsets of Z:

 ABC ABD ACD BCD ABE ACE ADE BCE BDE CDE ABF ACF ADF AEF BCF BDF BEF CDF CEF DEF ABG ACG ADG AEG AFG BCG BDG BEG BFG CDG CEG CFG DEG DFG EFG

The 30 solutions of DSC [7, 3, 2] are listed later, but here is one of them:

ABC ADE AFG BDF BEG CDG CEF

Note for instance that ABC and ABD cannot co-exist because the distance between them is only 1.

In all nontrivial cases, the DSC problems have multiple solutions, often related by permutation. For example, exchanging A and B and re-alphabetizing gives this:

ABC ADF AEG BDE BFG CDG CEF

Perhaps surprisingly, if E and F had been exchanged instead of A and B, the result would have turned out the same. This behavior helps to illustrate why there are 30 solutions, and not 7! = 5040.

Here is a recommendation for DSC [m, s, d]:

• s should be less than m, so that the subsets are proper;
• d should not be greater than s, or no collection will have a size greater than 1;
• d should be greater than 1, or the collection will contain every possible subset.

Section 3. There is a greedy algorithm that produces a collection, but not necessarily of maximum size. Here it is, applied to DSC [7, 3, 2]:

• Step 0 in the table below lists the 35 subsets, now written vertically to make the table more graphically manageable. Each subset is preceded by a question mark, because it is not yet determined which subsets will end up in the final collection.

• Step 1 arbitrarily selects ABC (plus sign), and marks ABD, BCD, and 10 others for deletion (minus sign) because their distance from ABC is less than 2.

• Step 2 arbitrarily selects ADE and marks 8 others for deletion because their distance from ADE is less than 2.

• Step 3 selects AFG and marks 4 for deletion.

• Step 4 selects BDF and marks 3 for deletion.

• Step 5 selects BEG and marks 1 for deletion.

• Step 6 selects CDG and marks 0 for deletion.

• Step 7 selects CDG. Nothing is undetermined, so the algorithm is complete.

The final collection is ABC ADE AFG BDF BEG CDG CEF.

The next table shows all 30 collections, obtained by a non-greedy algorithm to be revealed in the next section:

Although the greedy algorithm turns out to produce a maximum-size collection for DSC [7, 3, 2], it comes up short with DSC [8, 3, 2] among others.

Section 4. There is an algorithm to find all solutions to the DSC [m, s, d] problem, and it uses recursion.

All the subsets of Z are placed in a list in arbitrary order. At each level of the recursion two branches are executed:

• greedy branch:
• Select the first remaining subset.
• Reject all other subsets too close to it.
• Recourse one level deeper.
• After returning from recursion, restore the computational state prior to this branch.
• non-greedy branch:
• Reject the first remaining subset.
• Recourse one level deeper.
• After returning from recursion, restore the computational state prior to this branch.

Whenever the recursion reaches bottom, a collection will have been produced. Particularly in the latter stages of computation, the collections generated will often be of less than maximum size and will be discarded. In fact, the final collection will contain precisely zero subsets. As an alternative, there might be some criterion that a highly efficient program could employ to detect when no further maximum-size collections can exist, and so to terminate early, saving time.

This algorithm can be characterized as a depth-first search of a binary tree.

In the case of DSC [8, 3, 2], this algorithm produces 18 collections of size 7 before it starts finding any collections of size 8, of which there are 840. Hence the results must be examined carefully.

Section 5. Here are examples of maximum-size collections for selected values of m, s, and d. Each is whatever appeared first in the search of section 4, above. The qty column is the total number of maximum-size collections.

DSC [m, s, d]
m  s  d qtyexample
62215 AB CD EF
310 ABC DEF
4215 ABCD ABEF CDEF
722105 AB CD EF
3230 ABC ADE AFG BDF BEG CDG CEF
370 ABC DEF
435 ABC
4230 ABCD ABEF ACEG ADFG BCFG BDEG CDEF
370 ABCD AEFG
321 ABCDE
822105 AB CD EF GH
32840 ABC ADE AFG BDF BEH CEG CFH DGH
3280 ABC DEF
4230 ABCD ABEF ABGH ACEG ACFH ADEH ADFG BCEH BCFG BDEG BDFH CDEF CDGH EFGH
3595 ABCD AEFG
52840 ABCDE ABCFG ABDFH ACDGH AEFGH BCEFH BDEGH CDEFG
3280 ABCDE ABFGH
62105 ABCDEF ABCDGH ABEFGH CDEFGH
932840 ABC ADE AFG AHI BDF BEH BGI CDI CEG CFH DGH EFI
3280 ABC DEF GHI
437560 ABCD AEFG BEHI
4315 ABCD EFGH
537560 ABCDE ABFGH CDFGI
4315 ABCDE AFGHI
63280 ABCDEF ABCGHI DEFGHI