Document number: Dxxxx=xx-xxxx

Howard Hinnant
2011-02-12

Combinations and Permutations

Introduction

It doesn't happen very often, but every once in awhile one needs to iterate over all of the combinations or permutations of a set of objects. Or more specifically, given a set of N objects, you want to consider r of them at a time (for each combination or permutation). The standard library offers next_permutation, but this offering alone has a few drawbacks:

Below is a solution to each of these problems and more. The following generic algorithms permit a client to visit every combination or permuation of a sequence of length N, r items at time.

template <class BidirIter, class Function>
Function
for_each_permutation(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

template <class BidirIter, class Function>
Function
for_each_reversible_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last,
                                Function f);

template <class BidirIter, class Function>
Function
for_each_circular_permutation(BidirIter first,
                              BidirIter mid,
                              BidirIter last,
                              Function f);

template <class BidirIter, class Function>
Function
for_each_reversible_circular_permutation(BidirIter first,
                                         BidirIter mid,
                                         BidirIter last,
                                         Function f);

template <class BidirIter, class Function>
Function
for_each_combination(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

These each follow a for_each style: The algorithm calls a user supplied function object for each combination/permutation in the sequence: f(begin, end). That function object can of course maintain state. The sequence need not be sorted, nor even contain unique objects. The algorithms do not consider the value of the sequence elements at all. That is, the element type need not support LessThanComparable nor EqualityComparable. Furthermore the algorithms follow three additional rules:

  1. On normal (non-exceptional) completion, the sequence is always left in the original order.
  2. The functor is always called with [first, mid). This enables the functor to also access the elements not in the sequence if it is aware of the sequence: [mid, last). This can come in handy when dealing with nested combination/permutation problems where for each permutation you also need to compute combinations and/or permutations on those elements not selected.

  3. The functor should return true or false:

The following generic algorithms take a (preferably unsigned) integral type and return the number of combinations or permutations that exist. Traditionally these functions represent functions such as nCr and nPr. This API breaks with tradition concerning the definition of n. These functions compute: d1+d2Cd1, and d1+d2Pd1, and the related variations for reversible and circular permuations. The rationale for this API is to mesh well with the aforementioned for_each_* algorithms. d1 is distance(first, mid) and d2 is distance(mid, last).

template <class UInt>
UInt
count_each_combination(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_circular_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_reversible_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_reversible_circular_permutation(UInt d1, UInt d2);

Each of the above algorithms avoids intermediate overflow. That is, if the final answer is representable as a UInt, then they will return the correct answer. Otherwise they will throw a std::overflow_error. This noisy overflow protection is necessary as these functions can easily overflow. For example count_each_combination(33ull, 34ull) returns 14,226,520,737,620,288,370. But count_each_combination(33ull, 35ull) overflows for a 64 bit unsigned long long. Without overflow protection, the returned result will look quite reasonable to the casual observer and overflow is likely to happen accidentally, go unnoticed, and cause further problems in program logic which will be more difficult to diagnose.

Furthermore the following convenience functions are provided in order to more easily "preflight" ranges sent to the for_each_* algorithms. The reason that the UInt overloads of these algorithms exist is so that you don't have to create a range to compute 3000C2. Instead you can simply call count_each_combination(2ull, 2998ull) (for example).

template <class BidirIter>
std::uintmax_t
count_each_combination(BidirIter first,
                       BidirIter mid,
                       BidirIter last);


template <class BidirIter>
std::uintmax_t
count_each_permutation(BidirIter first,
                       BidirIter mid,
                       BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_circular_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_reversible_permutation(BidirIter first,
                                  BidirIter mid,
                                  BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_reversible_circular_permutation(BidirIter first,
                                           BidirIter mid,
                                           BidirIter last);

Examples

Below an example simply fills a vector with consecutive integers, and calls for_each_permutation on a functor that will print out the permutation. The functor will also print out items not in the permutation after a trailing '|'. The functor also counts the number of permutations visited so that the client can query that amount after the call to for_each_permutation.

#include <iostream>
#include <vector>
#include <numeric>
#include <cstdint>
#include <cassert>

// print out a range separated by commas,
//    return number of values printed.
template <class It>
unsigned
display(It begin, It end)
{
    unsigned r = 0;
    if (begin != end)
    {
        std::cout << *begin;
        ++r;
        for (++begin; begin != end; ++begin)
        {
            std::cout << ", " << *begin;
            ++r;
        }
    }
    return r;
}

// functor called for each permutation
class f
{
    unsigned len;
    std::uint64_t count;
public:
    explicit f(unsigned l) : len(l), count(0) {}

    template <class It>
        bool operator()(It first, It last)  // called for each permutation
        {
            // count the number of times this is called
            ++count;
            // print out [first, mid) surrounded with [ ... ]
            std::cout << "[ ";
            unsigned r = display(first, last);
            // If [mid, last) is not empty, then print it out too
            //     prefixed by " | "
            if (r < len)
            {
                std::cout << " | ";
                display(last, std::next(last, len - r));
            }
            std::cout << " ]\n";
            return false;
        }

    operator std::uint64_t() const {return count;}
};

int main()
{
    const int r = 3;
    const int n = 5;
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::uint64_t count = for_each_permutation(v.begin(),
                                               v.begin() + r,
                                               v.end(),
                                               f(v.size()));
    // print out "---" to the correct length for the above output
    unsigned e = 3 * r + 2;
    if (r < v.size())
        e += 1 + 3 * (v.size() - r);
    for (unsigned i = 0; i < e; ++i)
        std::cout << '-';
    // print out the permuted vector to show that it has the original order
    std::cout << "\n[ ";
    display(v.begin(), v.end());
    std::cout << " ]\n";
    // sanity check
    assert(count == count_each_permutation(v.begin(), v.begin() + r, v.end()));
    // print out summary of what has happened,
    //   using 'count' from functor state returned from for_each_permutation algorithm.
    std::cout << "Found " << count << " permutations of " << v.size()
              << " objects taken " << r << " at a time.\n";
}

And this is the output of this program:

[ 0, 1, 2 | 3, 4 ]
[ 0, 2, 1 | 3, 4 ]
[ 1, 0, 2 | 3, 4 ]
[ 1, 2, 0 | 3, 4 ]
[ 2, 0, 1 | 3, 4 ]
[ 2, 1, 0 | 3, 4 ]
[ 0, 1, 3 | 2, 4 ]
[ 0, 3, 1 | 2, 4 ]
[ 1, 0, 3 | 2, 4 ]
[ 1, 3, 0 | 2, 4 ]
[ 3, 0, 1 | 2, 4 ]
[ 3, 1, 0 | 2, 4 ]
[ 0, 1, 4 | 2, 3 ]
[ 0, 4, 1 | 2, 3 ]
[ 1, 0, 4 | 2, 3 ]
[ 1, 4, 0 | 2, 3 ]
[ 4, 0, 1 | 2, 3 ]
[ 4, 1, 0 | 2, 3 ]
[ 0, 2, 3 | 1, 4 ]
[ 0, 3, 2 | 1, 4 ]
[ 2, 0, 3 | 1, 4 ]
[ 2, 3, 0 | 1, 4 ]
[ 3, 0, 2 | 1, 4 ]
[ 3, 2, 0 | 1, 4 ]
[ 0, 2, 4 | 1, 3 ]
[ 0, 4, 2 | 1, 3 ]
[ 2, 0, 4 | 1, 3 ]
[ 2, 4, 0 | 1, 3 ]
[ 4, 0, 2 | 1, 3 ]
[ 4, 2, 0 | 1, 3 ]
[ 0, 3, 4 | 1, 2 ]
[ 0, 4, 3 | 1, 2 ]
[ 3, 0, 4 | 1, 2 ]
[ 3, 4, 0 | 1, 2 ]
[ 4, 0, 3 | 1, 2 ]
[ 4, 3, 0 | 1, 2 ]
[ 1, 2, 3 | 0, 4 ]
[ 1, 3, 2 | 0, 4 ]
[ 2, 1, 3 | 0, 4 ]
[ 2, 3, 1 | 0, 4 ]
[ 3, 1, 2 | 0, 4 ]
[ 3, 2, 1 | 0, 4 ]
[ 1, 2, 4 | 0, 3 ]
[ 1, 4, 2 | 0, 3 ]
[ 2, 1, 4 | 0, 3 ]
[ 2, 4, 1 | 0, 3 ]
[ 4, 1, 2 | 0, 3 ]
[ 4, 2, 1 | 0, 3 ]
[ 1, 3, 4 | 0, 2 ]
[ 1, 4, 3 | 0, 2 ]
[ 3, 1, 4 | 0, 2 ]
[ 3, 4, 1 | 0, 2 ]
[ 4, 1, 3 | 0, 2 ]
[ 4, 3, 1 | 0, 2 ]
[ 2, 3, 4 | 0, 1 ]
[ 2, 4, 3 | 0, 1 ]
[ 3, 2, 4 | 0, 1 ]
[ 3, 4, 2 | 0, 1 ]
[ 4, 2, 3 | 0, 1 ]
[ 4, 3, 2 | 0, 1 ]
------------------
[ 0, 1, 2, 3, 4 ]
Found 60 permutations of 5 objects taken 3 at a time.

The above list can be cut in half by instead calling:

for_each_reversible_permutation(v.begin(), v.begin() + r, v.end(), f(v.size()));

[ 0, 1, 2 | 3, 4 ]
[ 0, 2, 1 | 3, 4 ]
[ 1, 0, 2 | 3, 4 ]
[ 0, 1, 3 | 2, 4 ]
[ 0, 3, 1 | 2, 4 ]
[ 1, 0, 3 | 2, 4 ]
[ 0, 1, 4 | 2, 3 ]
[ 0, 4, 1 | 2, 3 ]
[ 1, 0, 4 | 2, 3 ]
[ 0, 2, 3 | 1, 4 ]
[ 0, 3, 2 | 1, 4 ]
[ 2, 0, 3 | 1, 4 ]
[ 0, 2, 4 | 1, 3 ]
[ 0, 4, 2 | 1, 3 ]
[ 2, 0, 4 | 1, 3 ]
[ 0, 3, 4 | 1, 2 ]
[ 0, 4, 3 | 1, 2 ]
[ 3, 0, 4 | 1, 2 ]
[ 1, 2, 3 | 0, 4 ]
[ 1, 3, 2 | 0, 4 ]
[ 2, 1, 3 | 0, 4 ]
[ 1, 2, 4 | 0, 3 ]
[ 1, 4, 2 | 0, 3 ]
[ 2, 1, 4 | 0, 3 ]
[ 1, 3, 4 | 0, 2 ]
[ 1, 4, 3 | 0, 2 ]
[ 3, 1, 4 | 0, 2 ]
[ 2, 3, 4 | 0, 1 ]
[ 2, 4, 3 | 0, 1 ]
[ 3, 2, 4 | 0, 1 ]
------------------
[ 0, 1, 2, 3, 4 ]
Found 30 permutations of 5 objects taken 3 at a time.

For example [ 2, 1, 0 ] is not found in the above list.

By instead calling:

for_each_circular_permutation(v.begin(), v.begin() + r, v.end(), f(v.size()));

the original list can be cut by a third:

[ 0, 1, 2 | 3, 4 ]
[ 0, 2, 1 | 3, 4 ]
[ 0, 1, 3 | 2, 4 ]
[ 0, 3, 1 | 2, 4 ]
[ 0, 1, 4 | 2, 3 ]
[ 0, 4, 1 | 2, 3 ]
[ 0, 2, 3 | 1, 4 ]
[ 0, 3, 2 | 1, 4 ]
[ 0, 2, 4 | 1, 3 ]
[ 0, 4, 2 | 1, 3 ]
[ 0, 3, 4 | 1, 2 ]
[ 0, 4, 3 | 1, 2 ]
[ 1, 2, 3 | 0, 4 ]
[ 1, 3, 2 | 0, 4 ]
[ 1, 2, 4 | 0, 3 ]
[ 1, 4, 2 | 0, 3 ]
[ 1, 3, 4 | 0, 2 ]
[ 1, 4, 3 | 0, 2 ]
[ 2, 3, 4 | 0, 1 ]
[ 2, 4, 3 | 0, 1 ]
------------------
[ 0, 1, 2, 3, 4 ]
Found 20 permutations of 5 objects taken 3 at a time.

For example you will not find the following in the above list:

[ 1, 2, 0 ]  // rotate(0, 1, 3) of permutation 1
[ 2, 0, 1 ]  // rotate(0, 2, 3) of permutation 1

And if your circular permutations are also reversible, your list (and time spent) can be halved again:

for_each_reversible_circular_permutation(v.begin(), v.begin() + r, v.end(), f(v.size()));

[ 0, 1, 2 | 3, 4 ]
[ 0, 1, 3 | 2, 4 ]
[ 0, 1, 4 | 2, 3 ]
[ 0, 2, 3 | 1, 4 ]
[ 0, 2, 4 | 1, 3 ]
[ 0, 3, 4 | 1, 2 ]
[ 1, 2, 3 | 0, 4 ]
[ 1, 2, 4 | 0, 3 ]
[ 1, 3, 4 | 0, 2 ]
[ 2, 3, 4 | 0, 1 ]
------------------
[ 0, 1, 2, 3, 4 ]
Found 10 permutations of 5 objects taken 3 at a time.

For example you will not find the following in the above list:

[ 2, 1, 0 ]  // reverse(0, 3) of permutation(1)
[ 1, 2, 0 ]  // rotate(0, 1, 3) of permutation(1)
[ 0, 2, 1 ]  // reverse of rotate(0, 1, 3) of permutation(1)
[ 2, 0, 1 ]  // rotate(0, 2, 3) of permutation(1)
[ 1, 0, 2 ]  // reverse of rotate(0, 2, 3) of permutation(1)

When r is 3 or less, then reversible circular permutations are the exact same as combinations. In this case use of for_each_combination should be preferred as it will be a little more efficient. Additionally when r is 2 or less, for_each_combination produces the same result as for_each_circular_permutation and for_each_reversible_permutation. When r == 1, all five algorithms produce the same N permutations. When r == 0, all five algorithms call the functor once with the empty range [first, mid).

A more complex example

You may be wondering: when would I ever need to use any of this stuff in a real world example?

Wikipedia documents a real-world problems such as the cutting-stock problem. The idea is to optimally cut a long roll of paper to meet customers orders of varying length and quantity of rolls. While the complete solution is outside of the scope of this paper, these algorithms can easily be used for parts of this problem. For example the article gives a specific problem and lets us know that there are 308 possible ways to cut the roll, some of which can be chosen to meet the customer's orders. for_each_combination can be easily used to enumerate those 308 possibilities in the following code.

#include <iostream>
#include <vector>
#include <numeric>
#include <map>

// collect unique solutions
struct cost
{
    std::map<std::vector<int>, int> set_;

    bool
    operator()(int* first, int* last)
    {
        if (first != last)
        {
            int len = std::accumulate(first, last, 0);
            if (len <= 5600)  // do not collect non-solutions
            {
                std::vector<int> s(first, last);
                std::sort(s.begin(), s.end());
                // reject duplicate solutions
                set_.insert(std::make_pair(s, 5600-len));
            }
        }
        return false;
    }
};

int main()
{
    int widths[] = {1380, 1380, 1380, 1380, 1520, 1520, 1520, 1560, 1560, 1560,
                    1710, 1710, 1710, 1820, 1820, 1820, 1880, 1880, 1930, 1930,
                    2000, 2000, 2050, 2050, 2100, 2100, 2140, 2140, 2150, 2150,
                    2200, 2200};
    int N = sizeof(widths)/sizeof(widths[0]);
    // Collect solutions of lengths 1 - 4.  Other solutions are not possible
    cost c = for_each_combination(widths, widths+1, widths+N, cost());
    c = for_each_combination(widths, widths+2, widths+N, std::move(c));
    c = for_each_combination(widths, widths+3, widths+N, std::move(c));
    c = for_each_combination(widths, widths+4, widths+N, std::move(c));
    // Sort the solutions by waste
    std::multimap<int, std::vector<int> > mm;
    for (std::map<std::vector<int>, int>::iterator i = c.set_.begin(); i != c.set_.end(); ++i)
        mm.insert(std::make_pair(i->second, i->first));
    // Output solutions
    for (std::multimap<int, std::vector<int> >::iterator i = mm.begin(); i != mm.end(); ++i)
    {
        std::cout << *i->second.begin();
        if (i->second.size() > 1)
        {
            for (std::vector<int>::const_iterator j = i->second.begin()+1; j != i->second.end(); ++j)
                std::cout << ", " << *j;
        }
        std::cout << ":  waste = " << i->first << '\n';
    }
}

The output is 308 lines long:

1520, 1880, 2200:  waste = 0
1520, 1930, 2150:  waste = 0
1520, 1930, 2140:  waste = 10
1560, 1880, 2150:  waste = 10
1560, 1930, 2100:  waste = 10
1710, 1880, 2000:  waste = 10
1380, 2000, 2200:  waste = 20
1380, 2050, 2150:  waste = 20
1380, 2100, 2100:  waste = 20
1560, 1820, 2200:  waste = 20
1560, 1880, 2140:  waste = 20
1710, 1820, 2050:  waste = 20
1820, 1880, 1880:  waste = 20
1380, 2050, 2140:  waste = 30
1520, 2000, 2050:  waste = 30
1710, 1710, 2150:  waste = 30
1710, 1930, 1930:  waste = 30
1820, 1820, 1930:  waste = 30
1560, 2000, 2000:  waste = 40
1710, 1710, 2140:  waste = 40
1520, 1880, 2150:  waste = 50
1520, 1930, 2100:  waste = 50
1520, 1820, 2200:  waste = 60
1520, 1880, 2140:  waste = 60
1560, 1880, 2100:  waste = 60
1560, 1930, 2050:  waste = 60
1380, 2000, 2150:  waste = 70
1380, 2050, 2100:  waste = 70
1560, 1820, 2150:  waste = 70
1710, 1820, 2000:  waste = 70
1380, 1380, 1380, 1380:  waste = 80
1380, 2000, 2140:  waste = 80
...
2200, 2200:  waste = 1200
2150, 2200:  waste = 1250
2140, 2200:  waste = 1260
1380, 1380, 1560:  waste = 1280
2100, 2200:  waste = 1300
2150, 2150:  waste = 1300
2140, 2150:  waste = 1310
1380, 1380, 1520:  waste = 1320
2140, 2140:  waste = 1320
2050, 2200:  waste = 1350
2100, 2150:  waste = 1350
...
2200:  waste = 3400
2150:  waste = 3450
2140:  waste = 3460
2100:  waste = 3500
2050:  waste = 3550
2000:  waste = 3600
1930:  waste = 3670
1880:  waste = 3720
1820:  waste = 3780
1710:  waste = 3890
1560:  waste = 4040
1520:  waste = 4080
1380:  waste = 4220

Obviously this isn't a final solution to the cutting-stock problem. However it is an important and instructive intermediate step. The subsequent steps (including finding the minimum number of knife changes) involve searching permutations - the order of the cutting patterns, and the order of the cuts in each pattern.

Performance Considerations

Hervé Brönnimann published N2639 proposing functionality somewhat similar as to what is outlined herein, including:

template <class BidirIter>
bool
next_partial_permutation(BidirIter first,
                         BidirIter mid,
                         BidirIter last);

template <class BidirIter >
bool
next_combination(BidirIter first,
                 BidirIter mid,
                 BidirIter last);

This interface follows the style of the existing std::next_permutation algorithm. It is the inverse of the for_each style: The client calls the functor directly, and then calls the algorithm to find the next iteration.

The problem with this interface is that it can get expensive to find the next iteration. For example when calling for_each_permutation and next_partial_permutation with r == 4 and varying N from 4 to 100, the time to iterate through the permutations skyrockets for next_partial_permutation, but not for for_each_permutation. By the time N == 100, for_each_permutation is running nearly 45 times faster than next_partial_permutation, taking 20 seconds to do what for_each_permutation is accomplishing in under half a second. This trend rapidly gets worse as N grows.

The same performance comparison can be made between next_combination and for_each_combination. It should be stressed that this isn't a lack of quality or care put into the N2639 implementation. I've studied that implementation carefully and consider it very high quality. It is simply that the number of comparisons that need to be done to find out which swaps need to be done gets outrageously expensive. The number of swaps actually performed in both algorithms is approximately the same. In the case of combinations, N has to grow much higher before the iteration time starts to swamp the algorithm, but that is because the number of generated permutations is low with combinations, compared to permutations. At N == 4000 for_each_combination is running well over 1000 times faster than next_combination.

If the next_* algorithms are to be put to use, it should be made clear that these algorithms should only be employed when it is expected that the client code will break out of the loop early. When it is expected that all permutations will be visited then the for_each algorithms are always faster, often orders of magnitude so.

Synopsis

// Integral-based count_each_*

template <class UInt>
UInt
count_each_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_reversible_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_circular_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_reversible_circular_permutation(UInt d1, UInt d2);

template <class UInt>
UInt
count_each_combination(UInt d1, UInt d2);

// Iterator-based count_each_*

template <class BidirIter>
std::uintmax_t
count_each_permutation(BidirIter first,
                       BidirIter mid,
                       BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_reversible_permutation(BidirIter first,
                                  BidirIter mid,
                                  BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_circular_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_reversible_circular_permutation(BidirIter first,
                                           BidirIter mid,
                                           BidirIter last);

template <class BidirIter>
std::uintmax_t
count_each_combination(BidirIter first,
                       BidirIter mid,
                       BidirIter last);

// Iterator-based for_each_* algorithms

template <class BidirIter, class Function>
Function
for_each_permutation(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

template <class BidirIter, class Function>
Function
for_each_reversible_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last,
                                Function f);

template <class BidirIter, class Function>
Function
for_each_circular_permutation(BidirIter first,
                              BidirIter mid,
                              BidirIter last,
                              Function f);

template <class BidirIter, class Function>
Function
for_each_reversible_circular_permutation(BidirIter first,
                                         BidirIter mid,
                                         BidirIter last,
                                         Function f);

template <class BidirIter, class Function>
Function
for_each_combination(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

Specification

Remarks: In this section a post-fix ! indicates a factorial operator and has higher precedence than any other operator.

template <class UInt>
UInt
count_each_permutation(UInt d1, UInt d2);

Requires: Uint is an integral type or a type emulating an integral type. If Uint is a signed integral type (or emulation of a signed integral type), then d1 and d2 are non-negative.

Returns: (d1 + d2)!/d2!.

Throws: If the computed value is not representable in the type UInt, throws std::overflow_error.

Remarks: If the computed value is representable in the type UInt, returns the correct value. This algorithm avoids intermediate overflow.

template <class UInt>
UInt
count_each_reversible_permutation(UInt d1, UInt d2);

Requires: Uint is an integral type or a type emulating an integral type. If Uint is a signed integral type (or emulation of a signed integral type), then d1 and d2 are non-negative.

Returns: If d1 <= 1 returns (d1 + d2)!/d2!. Else returns (d1 + d2)!/(2*(d2!)).

Throws: If the computed value is not representable in the type UInt, throws std::overflow_error.

Remarks: If the computed value is representable in the type UInt, returns the correct value. This algorithm avoids intermediate overflow.

template <class UInt>
UInt
count_each_circular_permutation(UInt d1, UInt d2);

Requires: Uint is an integral type or a type emulating an integral type. If Uint is a signed integral type (or emulation of a signed integral type), then d1 and d2 are non-negative.

Returns: If d1 == 0 returns 1. Else returns (d1 + d2)!/(d1*(d2!)).

Throws: If the computed value is not representable in the type UInt, throws std::overflow_error.

Remarks: If the computed value is representable in the type UInt, returns the correct value. This algorithm avoids intermediate overflow.

template <class UInt>
UInt
count_each_reversible_circular_permutation(UInt d1, UInt d2);

Requires: Uint is an integral type or a type emulating an integral type. If Uint is a signed integral type (or emulation of a signed integral type), then d1 and d2 are non-negative.

Returns: If d1 == 0 returns 1. Else if d1 <= 2 returns (d1 + d2)!/(d1*(d2!)). Else returns (d1 + d2)!/(2*d1*(d2!)).

Throws: If the computed value is not representable in the type UInt, throws std::overflow_error.

Remarks: If the computed value is representable in the type UInt, returns the correct value. This algorithm avoids intermediate overflow.

template <class UInt>
UInt
count_each_combination(UInt d1, UInt d2);

Requires: Uint is an integral type or a type emulating an integral type. If Uint is a signed integral type (or emulation of a signed integral type), then d1 and d2 are non-negative.

Returns: (d1 + d2)!/((d1!)*(d2!)).

Throws: If the computed value is not representable in the type UInt, throws std::overflow_error.

Remarks: If the computed value is representable in the type UInt, returns the correct value. This algorithm avoids intermediate overflow.

template <class BidirIter>
std::uintmax_t
count_each_permutation(BidirIter first,
                           BidirIter mid,
                           BidirIter last);

Requires: [first, mid) and [mid, last) are valid ranges.

Returns: count_each_permutation<std::uintmax_t>(std::distance(first, mid), std::distance(mid, last)).

template <class BidirIter>
std::uintmax_t
count_each_reversible_permutation(BidirIter first,
                                      BidirIter mid,
                                      BidirIter last);

Requires: [first, mid) and [mid, last) are valid ranges.

Returns: count_each_reversible_permutation<std::uintmax_t>(std::distance(first, mid), std::distance(mid, last)).

template <class BidirIter>
std::uintmax_t
count_each_circular_permutation(BidirIter first,
                                    BidirIter mid,
                                    BidirIter last);

Requires: [first, mid) and [mid, last) are valid ranges.

Returns: count_each_circular_permutation<std::uintmax_t>(std::distance(first, mid), std::distance(mid, last)).

template <class BidirIter>
std::uintmax_t
count_each_reversible_circular_permutation(BidirIter first,
                                               BidirIter mid,
                                               BidirIter last);

Requires: [first, mid) and [mid, last) are valid ranges.

Returns: count_each_reversible_circular_permutation<std::uintmax_t>(std::distance(first, mid), std::distance(mid, last)).

template <class BidirIter>
std::uintmax_t
count_each_combination(BidirIter first,
                           BidirIter mid,
                           BidirIter last);

Requires: [first, mid) and [mid, last) are valid ranges.

Returns: count_each_combination<std::uintmax_t>(std::distance(first, mid), std::distance(mid, last)).

template <class BidirIter, class Function>
Function
for_each_permutation(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

Requires:

Effects: Repeatedly permutes the range [first, last) such that the range [first, mid) represents each permutation of the values in [first, last) taken distance(first, mid) at a time. For each permutation calls f(first, mid). On each call, the range [mid, last) holds the values not in the current permutation. If f returns true then returns immediately without permuting the sequence any futher. Otherwise, after the last call to f, and prior to returning, the range [first, last) is restored to its original order. [Note: If f always returns false it is called count_each_permutation(first, mid, last) times. — end note]

Returns: std::move(f).

Notes: The type referenced by *first need not be EqualityComparable nor LessThanComparable. The input range need not be sorted. The algorithm does not take the values in the range [first, last) into account in any way.

template <class BidirIter, class Function>
Function
for_each_reversible_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last,
                                Function f);

Requires:

Effects: Repeatedly permutes the range [first, last) such that the range [first, mid) represents each permutation of the values in [first, last) taken distance(first, mid) at a time, except that f is never called with the reverse of a permutation which has been previously called, assuming that all values in the range are unique. For each permutation calls f(first, mid). On each call, the range [mid, last) holds the values not in the current permutation. If f returns true then returns immediately without permuting the sequence any futher. Otherwise, after the last call to f, and prior to returning, the range [first, last) is restored to its original order. [Note: If f always returns false it is called count_each_reversible_permutation(first, mid, last) times. — end note]

Returns: std::move(f).

Notes: The type referenced by *first need not be EqualityComparable nor LessThanComparable. The input range need not be sorted. The algorithm does not take the values in the range [first, last) into account in any way.

template <class BidirIter, class Function>
Function
for_each_circular_permutation(BidirIter first,
                              BidirIter mid,
                              BidirIter last,
                              Function f);

Requires:

Effects: Repeatedly permutes the range [first, last) such that the range [first, mid) represents each permutation of the values in [first, last) taken distance(first, mid) at a time, except that f is never called with a circular permutation which has been previously called, assuming that all values in the range are unique. For each permutation calls f(first, mid). On each call, the range [mid, last) holds the values not in the current permutation. If f returns true then returns immediately without permuting the sequence any futher. Otherwise, after the last call to f, and prior to returning, the range [first, last) is restored to its original order. [Note: If f always returns false it is called count_each_circular_permutation(first, mid, last) times. — end note]

Returns: std::move(f).

Notes: The type referenced by *first need not be EqualityComparable nor LessThanComparable. The input range need not be sorted. The algorithm does not take the values in the range [first, last) into account in any way.

template <class BidirIter, class Function>
Function
for_each_reversible_circular_permutation(BidirIter first,
                                         BidirIter mid,
                                         BidirIter last,
                                         Function f);

Requires:

Effects: Repeatedly permutes the range [first, last) such that the range [first, mid) represents each permutation of the values in [first, last) taken distance(first, mid) at a time, except that f is never called with a circular permutation which has been previously called, or the reverse of that permutation, assuming that all values in the range are unique. For each permutation calls f(first, mid). On each call, the range [mid, last) holds the values not in the current permutation. If f returns true then returns immediately without permuting the sequence any futher. Otherwise, after the last call to f, and prior to returning, the range [first, last) is restored to its original order. [Note: If f always returns false it is called count_each_reversible_circular_permutation(first, mid, last) times. — end note]

Returns: std::move(f).

Notes: The type referenced by *first need not be EqualityComparable nor LessThanComparable. The input range need not be sorted. The algorithm does not take the values in the range [first, last) into account in any way.

template <class BidirIter, class Function>
Function
for_each_combination(BidirIter first,
                     BidirIter mid,
                     BidirIter last,
                     Function f);

Requires:

Effects: Repeatedly permutes the range [first, last) such that the range [first, mid) represents each combination of the values in [first, last) taken distance(first, mid) at a time. For each permutation calls f(first, mid). On each call, the range [mid, last) holds the values not in the current permutation. If f returns true then returns immediately without permuting the sequence any futher. Otherwise, after the last call to f, and prior to returning, the range [first, last) is restored to its original order. [Note: If f always returns false it is called count_each_combination(first, mid, last) times. — end note]

Returns: std::move(f).

Notes: The type referenced by *first need not be EqualityComparable nor LessThanComparable. The input range need not be sorted. The algorithm does not take the values in the range [first, last) into account in any way.

Oh, and gcd. I'm tired of reinventing it.

template 
UInt
gcd(UInt x, UInt y) noexcept;

Requires: Uint is an integral type or a type emulating an integral type. x and y are positive.

Returns: The greatest common divisor of x and y.

Implementation

Here is the code.

//  (C) Copyright Howard Hinnant 2005-2011.
//  Use, modification and distribution are subject to the Boost Software License,
//  Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt).
//
//  See http://www.boost.org/libs/type_traits for most recent version including documentation.

//  Details are in namespace detail.  Every effort has been made to make
//  combine_discontinuous and permute as fast as possible.  They minimize the number
//  of swaps that are performed. Everything else builds on these two primitives. 
//  The most complicated algorithm is for_each_reversible_permutation.  But it
//  builds on combine_discontinuous and permute and I believe represents a minimum
//  number of swaps.  Without care, algorithms such as for_each_reversible_permutation
//  will take longer than for_each_permutation instead of the intended half the time.

//  Speed is everything.  Lest you could just use std::next_permutation and manually
//  eliminate duplicate permutations.  If the implementation fails in being orders
//  of magnitude faster than that, then it has failed miserably.

#include <iterator>
#include <algorithm>

namespace detail
{

// Rotates two discontinuous ranges to put *first2 where *first1 is.
//     If last1 == first2 this would be equivalent to rotate(first1, first2, last2),
//     but instead the rotate "jumps" over the discontinuity [last1, first2) -
//     which need not be a valid range.
//     In order to make it faster, the length of [first1, last1) is passed in as d1,
//     and d2 must be the length of [first2, last2).
//  In a perfect world the d1 > d2 case would have used swap_ranges and
//     reverse_iterator, but reverse_iterator is too inefficient.
template <class BidirIter>
void
rotate_discontinuous(BidirIter first1, BidirIter last1,
                     typename std::iterator_traits<BidirIter>::difference_type d1,
                     BidirIter first2, BidirIter last2,
                     typename std::iterator_traits<BidirIter>::difference_type d2)
{
    using std::swap;
    if (d1 <= d2)
        std::rotate(first2, std::swap_ranges(first1, last1, first2), last2);
    else
    {
        BidirIter i1 = last1;
        while (first2 != last2)
            swap(*--i1, *--last2);
        std::rotate(first1, i1, last1);
    }
}

// Rotates the three discontinuous ranges to put *first2 where *first1 is.
// Just like rotate_discontinuous, except the second range is now represented by
//    two discontinuous ranges: [first2, last2) + [first3, last3).
template <class BidirIter>
void
rotate_discontinuous3(BidirIter first1, BidirIter last1,
                      typename std::iterator_traits<BidirIter>::difference_type d1,
                      BidirIter first2, BidirIter last2,
                      typename std::iterator_traits<BidirIter>::difference_type d2,
                      BidirIter first3, BidirIter last3,
                      typename std::iterator_traits<BidirIter>::difference_type d3)
{
    rotate_discontinuous(first1, last1, d1, first2, last2, d2);
    if (d1 <= d2)
        rotate_discontinuous(std::next(first2, d2 - d1), last2, d1, first3, last3, d3);
    else
    {
        rotate_discontinuous(std::next(first1, d2), last1, d1 - d2, first3, last3, d3);
        rotate_discontinuous(first2, last2, d2, first3, last3, d3);
    }
}

// Call f() for each combination of the elements [first1, last1) + [first2, last2)
//    swapped/rotated into the range [first1, last1).  As long as f() returns
//    false, continue for every combination and then return [first1, last1) and
//    [first2, last2) to their original state.  If f() returns true, return
//    immediately.
//  Does the absolute mininum amount of swapping to accomplish its task.
//  If f() always returns false it will be called (d1+d2)!/(d1!*d2!) times.
template <class BidirIter, class Function>
bool
combine_discontinuous(BidirIter first1, BidirIter last1,
                      typename std::iterator_traits<BidirIter>::difference_type d1,
                      BidirIter first2, BidirIter last2,
                      typename std::iterator_traits<BidirIter>::difference_type d2,
                      Function& f,
                      typename std::iterator_traits<BidirIter>::difference_type d = 0)
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;
    using std::swap;
    if (d1 == 0 || d2 == 0)
        return f();
    if (d1 == 1)
    {
        for (BidirIter i2 = first2; i2 != last2; ++i2)
        {
            if (f())
                return true;
            swap(*first1, *i2);
        }
    }
    else
    {
        BidirIter f1p = std::next(first1);
        BidirIter i2 = first2;
        for (D d22 = d2; i2 != last2; ++i2, --d22)
        {
            if (combine_discontinuous(f1p, last1, d1-1, i2, last2, d22, f, d+1))
                return true;
            swap(*first1, *i2);
        }
    }
    if (f())
        return true;
    if (d != 0)
        rotate_discontinuous(first1, last1, d1, std::next(first2), last2, d2-1);
    else
        rotate_discontinuous(first1, last1, d1, first2, last2, d2);
    return false;
}

// A binder for binding arguments to call combine_discontinuous
template <class Function, class BidirIter>
class call_combine_discontinuous
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;
    Function f_;
    BidirIter first1_;
    BidirIter last1_;
    D d1_;
    BidirIter first2_;
    BidirIter last2_;
    D d2_;

public:
    call_combine_discontinuous(
                      BidirIter first1, BidirIter last1,
                      D d1,
                      BidirIter first2, BidirIter last2,
                      D d2,
                      Function& f)
        : f_(f), first1_(first1), last1_(last1), d1_(d1),
                 first2_(first2), last2_(last2), d2_(d2) {}

    bool operator()()
    {
        return combine_discontinuous(first1_, last1_, d1_, first2_, last2_, d2_, f_);
    }
};

// See combine_discontinuous3
template <class BidirIter, class Function>
bool
combine_discontinuous3_(BidirIter first1, BidirIter last1,
                        typename std::iterator_traits<BidirIter>::difference_type d1,
                        BidirIter first2, BidirIter last2,
                        typename std::iterator_traits<BidirIter>::difference_type d2,
                        BidirIter first3, BidirIter last3,
                        typename std::iterator_traits<BidirIter>::difference_type d3,
                        Function& f,
                        typename std::iterator_traits<BidirIter>::difference_type d = 0)
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;
    using std::swap;
    if (d1 == 1)
    {
        for (BidirIter i2 = first2; i2 != last2; ++i2)
        {
            if (f())
                return true;
            swap(*first1, *i2);
        }
        if (f())
            return true;
        swap(*first1, *std::prev(last2));
        swap(*first1, *first3);
        for (BidirIter i2 = std::next(first3); i2 != last3; ++i2)
        {
            if (f())
                return true;
            swap(*first1, *i2);
        }
    }
    else
    {
        BidirIter f1p = std::next(first1);
        BidirIter i2 = first2;
        for (D d22 = d2; i2 != last2; ++i2, --d22)
        {
            if (combine_discontinuous3_(f1p, last1, d1-1, i2, last2, d22, first3,
                                        last3, d3, f, d+1))
                return true;
            swap(*first1, *i2);
        }
        i2 = first3;
        for (D d22 = d3; i2 != last3; ++i2, --d22)
        {
            if (combine_discontinuous(f1p, last1, d1-1, i2, last3, d22, f, d+1))
                return true;
            swap(*first1, *i2);
        }
    }
    if (f())
        return true;
    if (d1 == 1)
        swap(*std::prev(last2), *first3);
    if (d != 0)
    {
        if (d2 > 1)
            rotate_discontinuous3(first1, last1, d1, std::next(first2), last2, d2-1, first3, last3, d3);
        else
            rotate_discontinuous(first1, last1, d1, first3, last3, d3);
    }
    else
        rotate_discontinuous3(first1, last1, d1, first2, last2, d2, first3, last3, d3);
    return false;
}

// Like combine_discontinuous, but swaps/rotates each combination out of
//    [first1, last1) + [first2, last2) + [first3, last3) into [first1, last1).
//    If f() always returns false, it is called (d1+d2+d3)!/(d1!*(d2+d3)!) times.
template <class BidirIter, class Function>
bool
combine_discontinuous3(BidirIter first1, BidirIter last1,
                       typename std::iterator_traits<BidirIter>::difference_type d1,
                       BidirIter first2, BidirIter last2,
                       typename std::iterator_traits<BidirIter>::difference_type d2,
                       BidirIter first3, BidirIter last3,
                       typename std::iterator_traits<BidirIter>::difference_type d3,
                       Function& f)
{
    typedef call_combine_discontinuous<Function&, BidirIter> F;
    F fbc(first2, last2, d2, first3, last3, d3, f);  // BC
    return combine_discontinuous3_(first1, last1, d1, first2, last2, d2, first3, last3, d3, fbc);
}

// See permute
template <class BidirIter, class Function>
bool
permute_(BidirIter first1, BidirIter last1,
         typename std::iterator_traits<BidirIter>::difference_type d1,
         Function& f)
{
    using std::swap;
    switch (d1)
    {
    case 0:
    case 1:
        return f();
    case 2:
        if (f())
            return true;
        swap(*first1, *std::next(first1));
        return f();
    case 3:
        {
        if (f())
            return true;
        BidirIter f2 = std::next(first1);
        BidirIter f3 = std::next(f2);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*first1, *f3);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*first1, *f2);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*f2, *f3);
        return f();
        }
    }
    BidirIter fp1 = std::next(first1);
    for (BidirIter p = fp1; p != last1; ++p)
    {
        if (permute_(fp1, last1, d1-1, f))
            return true;
        std::reverse(fp1, last1);
        swap(*first1, *p);
    }
    return permute_(fp1, last1, d1-1, f);
}

// Calls f() for each permutation of [first1, last1)
// Divided into permute and permute_ in a (perhaps futile) attempt to
//    squeeze a little more performance out of it.
template <class BidirIter, class Function>
bool
permute(BidirIter first1, BidirIter last1,
        typename std::iterator_traits<BidirIter>::difference_type d1,
        Function& f)
{
    using std::swap;
    switch (d1)
    {
    case 0:
    case 1:
        return f();
    case 2:
        {
        if (f())
            return true;
        BidirIter i = std::next(first1);
        swap(*first1, *i);
        if (f())
            return true;
        swap(*first1, *i);
        }
        break;
    case 3:
        {
        if (f())
            return true;
        BidirIter f2 = std::next(first1);
        BidirIter f3 = std::next(f2);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*first1, *f3);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*first1, *f2);
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*f2, *f3);
        if (f())
            return true;
        swap(*first1, *f3);
        }
        break;
    default:
        BidirIter fp1 = std::next(first1);
        for (BidirIter p = fp1; p != last1; ++p)
        {
            if (permute_(fp1, last1, d1-1, f))
                return true;
            std::reverse(fp1, last1);
            swap(*first1, *p);
        }
        if (permute_(fp1, last1, d1-1, f))
            return true;
        std::reverse(first1, last1);
        break;
    }
    return false;
}

// Creates a functor with no arguments which calls f_(first_, last_).
//   Also has a variant that takes two It and ignores them.
template <class Function, class It>
class bound_range
{
    Function f_;
    It first_;
    It last_;
public:
    bound_range(Function f, It first, It last)
        : f_(f), first_(first), last_(last) {}

    bool
    operator()()
    {
        return f_(first_, last_);
    }

    bool
    operator()(It, It)
    {
        return f_(first_, last_);
    }
};

// A binder for binding arguments to call permute
template <class Function, class It>
class call_permute
{
    typedef typename std::iterator_traits<It>::difference_type D;
    Function f_;
    It first_;
    It last_;
    D d_;
public:
    call_permute(Function f, It first, It last, D d)
        : f_(f), first_(first), last_(last), d_(d) {}

    bool
    operator()()
    {
        return permute(first_, last_, d_, f_);
    }
};

}  // detail

template <class BidirIter, class Function>
Function
for_each_combination(BidirIter first, BidirIter mid,
                     BidirIter last, Function f)
{
    detail::bound_range<Function&, BidirIter> wfunc(f, first, mid);
    detail::combine_discontinuous(first, mid, std::distance(first, mid),
                                  mid, last, std::distance(mid, last),
                                  wfunc);
    return std::move(f);
}

template <class UInt>
UInt
gcd(UInt x, UInt y)
{
    while (y != 0)
    {
        UInt t = x % y;
        x = y;
        y = t;
    }
    return x;
}

template <class UInt>
UInt
count_each_combination(UInt d1, UInt d2)
{
    if (d2 < d1)
        std::swap(d1, d2);
    if (d1 == 0)
        return 1;
    if (d1 > std::numeric_limits<UInt>::max() - d2)
        throw std::overflow_error("overflow in count_each_combination");
    UInt n = d1 + d2;
    UInt r = n;
    --n;
    for (UInt k = 2; k <= d1; ++k, --n)
    {
        // r = r * n / k, known to not not have truncation error
        UInt g = gcd(r, k);
        r /= g;
        UInt t = n / (k / g);
        if (r > std::numeric_limits<UInt>::max() / t)
            throw std::overflow_error("overflow in count_each_combination");
        r *= t;
    }
    return r;
}

template <class BidirIter>
std::uintmax_t
count_each_combination(BidirIter first, BidirIter mid, BidirIter last)
{
    return count_each_combination<std::uintmax_t>
                          (std::distance(first, mid), std::distance(mid, last));
}

// For each of the permutation algorithms, use for_each_combination (or
//    combine_discontinuous) to handle the "r out of N" part of the algorithm.
//    Thus each permutation algorithm has to deal only with an "N out of N"
//    problem.  I.e. For each combination of r out of N items, permute it thusly.
template <class BidirIter, class Function>
Function
for_each_permutation(BidirIter first, BidirIter mid,
                     BidirIter last, Function f)
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;
    typedef detail::bound_range<Function&, BidirIter> Wf;
    typedef detail::call_permute<Wf, BidirIter> PF;
    Wf wfunc(f, first, mid);
    D d1 = std::distance(first, mid);
    PF pf(wfunc, first, mid, d1);
    detail::combine_discontinuous(first, mid, d1,
                                  mid, last, std::distance(mid, last),
                                  pf);
    return std::move(f);
}

template <class UInt>
UInt
count_each_permutation(UInt d1, UInt d2)
{
    // return (d1+d2)!/d2!
    if (d1 > std::numeric_limits<UInt>::max() - d2)
        throw std::overflow_error("overflow in count_each_permutation");
    UInt n = d1 + d2;
    UInt r = 1;
    for (; n > d2; --n)
    {
        if (r > std::numeric_limits<UInt>::max() / n)
            throw std::overflow_error("overflow in count_each_permutation");
        r *= n;
    }
    return r;
}

template <class BidirIter>
std::uintmax_t
count_each_permutation(BidirIter first, BidirIter mid, BidirIter last)
{
    return count_each_permutation<std::uintmax_t>
                          (std::distance(first, mid), std::distance(mid, last));
}

namespace detail
{

// Adapt functor to permute over [first+1, last)
//   A circular permutation of N items is done by holding the first item and
//   permuting [first+1, last).
template <class Function, class BidirIter>
class circular_permutation
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;

    Function f_;
    D s_;

public:
    explicit circular_permutation(Function f, D s) : f_(f), s_(s) {}

    bool
    operator()(BidirIter first, BidirIter last)
    {
        if (s_ <= 1)
            return f_(first, last);
        bound_range<Function, BidirIter> f(f_, first, last);
        return permute(std::next(first), last, s_ - 1, f);
    }
};

}  // detail

template <class BidirIter, class Function>
Function
for_each_circular_permutation(BidirIter first,
                              BidirIter mid,
                              BidirIter last, Function f)
{
    for_each_combination(first, mid, last, detail::circular_permutation<Function&,
                          BidirIter>(f, std::distance(first, mid)));
    return std::move(f);
}    

template <class UInt>
UInt
count_each_circular_permutation(UInt d1, UInt d2)
{
    // return d1 > 0 ? (d1+d2)!/(d1*d2!) : 1
    if (d1 == 0)
        return 1;
    UInt r;
    if (d1 <= d2)
    {
        try
        {
            r = count_each_combination(d1, d2);
        }
        catch (const std::overflow_error&)
        {
            throw std::overflow_error("overflow in count_each_circular_permutation");
        }
        for (--d1; d1 > 1; --d1)
        {
            if (r > std::numeric_limits<UInt>::max()/d1)
                throw std::overflow_error("overflow in count_each_circular_permutation");
            r *= d1;
        }
    }
    else
    {   // functionally equivalent but faster algorithm
        if (d1 > std::numeric_limits<UInt>::max() - d2)
            throw std::overflow_error("overflow in count_each_circular_permutation");
        UInt n = d1 + d2;
        r = 1;
        for (; n > d1; --n)
        {
            if (r > std::numeric_limits<UInt>::max()/n)
                throw std::overflow_error("overflow in count_each_circular_permutation");
            r *= n;
        }
        for (--n; n > d2; --n)
        {
            if (r > std::numeric_limits<UInt>::max()/n)
                throw std::overflow_error("overflow in count_each_circular_permutation");
            r *= n;
        }
    }
    return r;
}

template <class BidirIter>
std::uintmax_t
count_each_circular_permutation(BidirIter first, BidirIter mid, BidirIter last)
{
    return count_each_circular_permutation<std::uintmax_t>
                          (std::distance(first, mid), std::distance(mid, last));
}

namespace detail
{

// Difficult!!!  See notes for operator().
template <class Function, class Size>
class reversible_permutation
{
    Function f_;
    Size s_;

public:
    reversible_permutation(Function f, Size s) : f_(f), s_(s) {}

    template <class BidirIter>
    bool
    operator()(BidirIter first, BidirIter last);
};

// rev1 looks like call_permute
template <class Function, class BidirIter>
class rev1
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;

    Function f_;
    BidirIter first1_;
    BidirIter last1_;
    D d1_;

public:
    rev1(Function f, BidirIter first, BidirIter last, D d)
        : f_(f), first1_(first), last1_(last), d1_(d) {}

    bool operator()()
    {
        return permute(first1_, last1_, d1_, f_);
    }
};

// For each permutation in [first1, last1),
//     call f() for each permutation of [first2, last2).
template <class Function, class BidirIter>
class rev2
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;

    Function f_;
    BidirIter first1_;
    BidirIter last1_;
    D d1_;
    BidirIter first2_;
    BidirIter last2_;
    D d2_;

public:
    rev2(Function f, BidirIter first1, BidirIter last1, D d1,
                     BidirIter first2, BidirIter last2, D d2)
        : f_(f), first1_(first1), last1_(last1), d1_(d1),
                 first2_(first2), last2_(last2), d2_(d2) {}

    bool operator()()
    {
        call_permute<Function, BidirIter> f(f_, first2_, last2_, d2_);
        return permute(first1_, last1_, d1_, f);
    }
};

// For each permutation in [first1, last1),
//     and for each permutation of [first2, last2)
//     call f() for each permutation of [first3, last3).
template <class Function, class BidirIter>
class rev3
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;

    Function f_;
    BidirIter first1_;
    BidirIter last1_;
    D d1_;
    BidirIter first2_;
    BidirIter last2_;
    D d2_;
    BidirIter first3_;
    BidirIter last3_;
    D d3_;

public:
    rev3(Function f, BidirIter first1, BidirIter last1, D d1,
                     BidirIter first2, BidirIter last2, D d2,
                     BidirIter first3, BidirIter last3, D d3)
        : f_(f), first1_(first1), last1_(last1), d1_(d1),
                 first2_(first2), last2_(last2), d2_(d2),
                 first3_(first3), last3_(last3), d3_(d3) {}

    bool operator()()
    {
        rev2<Function, BidirIter> f(f_, first2_, last2_, d2_, first3_, last3_, d3_);
        return permute(first1_, last1_, d1_, f);
    }
};

// There are simpler implementations.  I believe the simpler ones are far more
//     expensive.
template <class Function, class Size>
template <class BidirIter>
bool
reversible_permutation<Function, Size>::operator()(BidirIter first,
                                                   BidirIter last)
{
    typedef typename std::iterator_traits<BidirIter>::difference_type difference_type;
    typedef rev2<bound_range<Function&, BidirIter>, BidirIter> F2;
    typedef rev3<bound_range<Function&, BidirIter>, BidirIter> F3;
    // When the range is 0 - 2, then this is just a combination of N out of N
    //   elements.
    if (s_ < 3)
        return f_(first, last);
    using std::swap;
    // Hold the first element steady and call f_(first, last) for each
    //    permutation in [first+1, last).
    BidirIter a = std::next(first);
    bound_range<Function&, BidirIter> f(f_, first, last);
    if (permute(a, last, s_-1, f))
        return true;
    // Beginning with the first element, swap the previous element with the
    //    next element.  For each swap, call f_(first, last) for each
    //    permutation of the discontinuous range:
    //    [prior to the orignal element] + [after the original element].
    Size s2 = s_ / 2;
    BidirIter am1 = first;
    BidirIter ap1 = std::next(a);
    for (Size i = 1; i < s2; ++i, ++am1, ++a, ++ap1)
    {
        swap(*am1, *a);
        F2 f2(f, first, a, i, ap1, last, s_ - i - 1);
        if (combine_discontinuous(first, a, i, ap1, last, s_ - i - 1, f2))
            return true;
    }
    // If [first, last) has an even number of elements, then fix it up to the
    //     original permutation.
    if (2 * s2 == s_)
    {
        std::rotate(first, am1, a);
    }
    // else if the range has length 3, we need one more call and the fix is easy.
    else if (s_ == 3)
    {
        swap(*am1, *a);
        if (f_(first, last))
            return true;
        swap(*am1, *a);
    }
    // else the range is an odd number greater than 3.  We need to permute
    //     through exactly half of the permuations with the original element in
    //     the middle.
    else
    {
        // swap the original first element into the middle, and hold the current
        //   first element steady.  This creates a discontinuous range:
        //     [first+1, middle) + [middle+1, last).  Run through all permutations
        //     of that discontinuous range.
        swap(*am1, *a);
        BidirIter b = first;
        BidirIter bp1 = std::next(b);
        F2 f2(f, bp1, a, s2-1, ap1, last, s_ - s2 - 1);
        if (combine_discontinuous(bp1, a, s2-1, ap1, last, s_ - s2 - 1, f2))
            return true;
        // Swap the current first element into every place from first+1 to middle-1.
        //   For each location, hold it steady to create the following discontinuous
        //   range (made of 3 ranges): [first, b-1) + [b+1, middle) + [middle+1, last).
        //   For each b in [first+1, middle-1), run through all permutations of
        //      the discontinuous ranges.
        b = bp1;
        ++bp1;
        BidirIter bm1 = first;
        for (Size i = 1; i < s2-1; ++i, ++bm1, ++b, ++bp1)
        {
            swap(*bm1, *b);
            F3 f3(f, first, b, i, bp1, a, s2-i-1, ap1, last, s_ - s2 - 1);
            if (combine_discontinuous3(first, b, i, bp1, a, s2-i-1, ap1, last, s_-s2-1, f3))
                return true;
        }
        // swap b into into middle-1, creates a discontinuous range:
        //     [first, middle-1) + [middle+1, last).  Run through all permutations
        //     of that discontinuous range.
        swap(*bm1, *b);
        F2 f21(f, first, b, s2-1, ap1, last, s_ - s2 - 1);
        if (combine_discontinuous(first, b, s2-1, ap1, last, s_ - s2 - 1, f21))
            return true;
        // Revert [first, last) to original order
        std::reverse(first, b);
        std::reverse(first, ap1);
    }
    return false;
}

}  // detail

template <class BidirIter, class Function>
Function
for_each_reversible_permutation(BidirIter first,
                                BidirIter mid,
                                BidirIter last, Function f)
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;
    for_each_combination(first, mid, last,
                  detail::reversible_permutation<Function&, D>(f,
                                              std::distance(first, mid)));
    return std::move(f);
}    

template <class UInt>
UInt
count_each_reversible_permutation(UInt d1, UInt d2)
{
    // return d1 > 1 ? (d1+d2)!/(2*d2!) : (d1+d2)!/d2!
    if (d1 > std::numeric_limits<UInt>::max() - d2)
        throw std::overflow_error("overflow in count_each_reversible_permutation");
    UInt n = d1 + d2;
    UInt r = 1;
    if (d1 > 1)
    {
        r = n;
        if ((n & 1) == 0)
            r /= 2;
        --n;
        UInt t = n;
        if ((t & 1) == 0)
            t /= 2;
        if (r > std::numeric_limits<UInt>::max() / t)
            throw std::overflow_error("overflow in count_each_reversible_permutation");
        r *= t;
        --n;
    }
    for (; n > d2; --n)
    {
        if (r > std::numeric_limits<UInt>::max() / n)
            throw std::overflow_error("overflow in count_each_reversible_permutation");
        r *= n;
    }
    return r;
}

template <class BidirIter>
std::uintmax_t
count_each_reversible_permutation(BidirIter first, BidirIter mid, BidirIter last)
{
    return count_each_reversible_permutation<std::uintmax_t>
                          (std::distance(first, mid), std::distance(mid, last));
}

namespace detail
{

// Adapt functor to permute over [first+1, last)
//   A reversible circular permutation of N items is done by holding the first
//   item and reverse-permuting [first+1, last).
template <class Function, class BidirIter>
class reverse_circular_permutation
{
    typedef typename std::iterator_traits<BidirIter>::difference_type D;

    Function f_;
    D s_;

public:
    explicit reverse_circular_permutation(Function f, D s) : f_(f), s_(s) {}

    bool
    operator()(BidirIter first, BidirIter last)
    {
        if (s_ == 1)
            return f_(first, last);
        typedef typename std::iterator_traits<BidirIter>::difference_type D;
        typedef bound_range<Function, BidirIter> BoundFunc;
        BoundFunc f(f_, first, last);
        BidirIter n = std::next(first);
        return reversible_permutation<BoundFunc, D>(f, std::distance(n, last))(n, last);
    }
};

}  // detail

template <class BidirIter, class Function>
Function
for_each_reversible_circular_permutation(BidirIter first,
                                         BidirIter mid,
                                         BidirIter last, Function f)
{
    for_each_combination(first, mid, last, detail::reverse_circular_permutation<Function&,
                          BidirIter>(f, std::distance(first, mid)));
    return std::move(f);
}    

template <class UInt>
UInt
count_each_reversible_circular_permutation(UInt d1, UInt d2)
{
    // return d1 == 0 ? 1 : d1 <= 2 ? (d1+d2)!/(d1*d2!) : (d1+d2)!/(2*d1*d2!)
    UInt r;
    try
    {
        r = count_each_combination(d1, d2);
    }
    catch (const std::overflow_error&)
    {
        throw std::overflow_error("overflow in count_each_reversible_circular_permutation");
    }
    if (d1 > 3)
    {
        for (--d1; d1 > 2; --d1)
        {
            if (r > std::numeric_limits<UInt>::max()/d1)
                throw std::overflow_error("overflow in count_each_reversible_circular_permutation");
            r *= d1;
        }
    }
    return r;
}

template <class BidirIter>
std::uintmax_t
count_each_reversible_circular_permutation(BidirIter first, BidirIter mid,
                                               BidirIter last)
{
    return count_each_reversible_circular_permutation<std::uintmax_t>
                          (std::distance(first, mid), std::distance(mid, last));
}