The dining philosophers problem has been a famous computer science problem for half a century. When first presented, the challenge was simply to find a solution that did not deadlock, i.e. a solution in which none of the diners starves to death while waiting for a chance to eat.
This paper presents four solutions to the problem, and compares the performance of each solution. Additionally, the impact of the number of diners and the number of processors is explored to see how these factors affect overall performance. And finally, an alternate topology, where each diner requires 3 forks to eat, is explored with these same 4 algorithms. It will be shown that there is a single algorithm that consistently performs as well as, or better than, the other 3 algorithms as all factors are varied (number of processors, number of diners, topology).
The complete C++11 source code for this study is included in this paper so that others may easily confirm or refute the results presented herein.
A practical application of the results of this study is the algorithm underlying the std::lock(L1, L2, ...) function in section 30.4.3 [thread.lock.algorithm] of the C++11 and C++14 standards:

    template <class L1, class L2, class... L3> void lock(L1&, L2&, L3&...);

Requires: Each template parameter type shall meet the Lockable requirements. [Note: The unique_lock class template meets these requirements when suitably instantiated. — end note]

Effects: All arguments are locked via a sequence of calls to lock(), try_lock(), or unlock() on each argument. The sequence of calls shall not result in deadlock, but is otherwise unspecified. [Note: A deadlock avoidance algorithm such as try-and-back-off must be used, but the specific algorithm is not specified to avoid over-constraining implementations. — end note] If a call to lock() or try_lock() throws an exception, unlock() shall be called for any argument that had been locked by a call to lock() or try_lock().
Indeed, all of the test software used by this paper will call ::lock to obtain "forks" (lock two or three std::mutexes). One can safely assume that ::lock() in this paper represents nothing but example implementations of std::lock().
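For readers who have not used std::lock before, here is a minimal usage sketch (the mutexes m1 and m2 and the function transfer are illustrative names, not part of the test code):

    #include <mutex>

    std::mutex m1, m2;

    void
    transfer()
    {
        // Lock both mutexes without risk of deadlock, in an unspecified order.
        std::lock(m1, m2);
        // Adopt the already-held locks so they are released on scope exit.
        std::lock_guard<std::mutex> g1(m1, std::adopt_lock);
        std::lock_guard<std::mutex> g2(m2, std::adopt_lock);
        // ... operate on the data protected by both mutexes ...
    }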
The Wikipedia page on the dining philosophers problem (https://en.wikipedia.org/wiki/Dining_philosophers_problem) has a great description of the problem. If you haven't read it yet, read it now. I'll wait...
Ok, now that we're all on the same page (and back on this one), I'll explain the minor deviations that this paper takes, and why.
As it turns out, any algorithm that solves this problem without deadlock will perform wonderfully when the philosophers spend any appreciable amount of time thinking compared to eating. After all, they can each think independently (that is what makes them philosophers). When they do as much thinking as they do eating, we call this a low-contention scenario, because the philosophers spend relatively little time contending for the forks.
However, when the philosophers are all really hungry, they tend to eat more and think less (just like the rest of us). This means they spend more time contending for the forks, and consequently the nature of the algorithm which arbitrates who gets what fork, and when, becomes more and more important. In the limit, as thinking time goes to zero, the algorithm which hands out the forks becomes paramount, and dominates the performance of the entire meal (the application).
Therefore, in the tests included herein (each point below is reflected in the test code given at the end of this paper):

- Thinking time is reduced to zero; this is a maximum-contention stress test.
- Each philosopher eats for a random duration of 1 to 10 milliseconds at each sitting.
- Each philosopher is full once he has accumulated 30 seconds of eating.
- The order in which each philosopher requests his forks is randomized, and is disassociated from the ::lock() algorithm.
- The actual order in which the forks are obtained is encapsulated within the ::lock() algorithm.

These parameters have been chosen to highlight the performance differences between the 4 algorithms presented herein, and to somewhat approximate a real-world high-contention scenario. If the scenario is low-contention, the algorithm does not matter, as long as it avoids deadlock.
For each solution, the forks are represented as a vector<mutex>:

    std::vector<std::mutex> table(nt);

There is a class Philosopher which will reference two (or three) mutexes, and the table setting is represented as a vector<Philosopher>:

    std::vector<Philosopher> diners;
The ordered solution is the one originally proposed by Edsger Dijkstra in 1965. One assigns an order to the forks (we will use their address in the vector), and each Philosopher will first lock the lower-numbered mutex, and then lock the higher-numbered mutex.
The two-mutex variant of lock can be coded like so, if we assume that the template parameter L0 is always a std::unique_lock (which is not a good assumption for std::lock, but we overlook that detail here):

    template <class L0>
    void
    lock(L0& l0, L0& l1)
    {
        if (l0.mutex() < l1.mutex())
        {
            std::unique_lock<L0> u0(l0);
            l1.lock();
            u0.release();
        }
        else
        {
            std::unique_lock<L0> u1(l1);
            l0.lock();
            u1.release();
        }
    }
The purpose of the internal use of std::unique_lock<L0> is to make the code exception safe. If the second lock() throws, the local std::unique_lock<L0> ensures that the first lock is unlocked while the exception propagates out.
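As an aside, unique_lock's release() member gives up ownership of the lock without unlocking it, leaving the caller responsible. A minimal sketch (the mutex m is illustrative):

    std::mutex m;
    std::unique_lock<std::mutex> u(m); // u owns the lock on m
    u.release();                       // u relinquishes ownership; m stays locked
    m.unlock();                        // the caller must now unlock m itself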
Note that this algorithm has a very predictable code flow. There are only two ways to get the job done, and there are no loops.
No one that I'm aware of has seriously proposed this algorithm as a good solution. However, it is included here as there is at least one shipping implementation of std::lock that implements this exact algorithm. Furthermore, a naive reading of the C++ specification for std::lock has led more than one person to believe that this is the only algorithm possible which will conform to the C++ specification. Nothing could be further from the truth.
This algorithm is called "persistent" because the Philosopher simply doggedly locks one mutex and then try_lock()s the others. If the try_lock()s succeed, he is done; else he unlocks everything and immediately tries the exact same thing again, and again, and again, until he succeeds.
This is simple to code for the two-mutex variant as:

    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            std::unique_lock<L0> u0(l0);
            if (l1.try_lock())
            {
                u0.release();
                break;
            }
        }
    }
Results will later show that this algorithm, while not ideal, actually performs relatively well when there is plenty of spare processing power. Or said differently, the algorithm works, but is power hungry.
A simple but important variation of the persistent algorithm: when a try_lock() fails, block on that mutex. This eliminates a lot of useless spinning, consuming less power and thus leaving more processing power available to the other diners. As it turns out, a diner needs not only forks to eat; he also needs an available processor. If you don't have the forks, don't waste power.
This is simple to code for the two-mutex variant as:

    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            {
                std::unique_lock<L0> u0(l0);
                if (l1.try_lock())
                {
                    u0.release();
                    break;
                }
            }
            {
                std::unique_lock<L1> u1(l1);
                if (l0.try_lock())
                {
                    u1.release();
                    break;
                }
            }
        }
    }
This algorithm still contains a theoretical infinite loop. However, it will go through this loop rather slowly, as each mutex locked besides the very first is one for which the thread has just failed a call to try_lock() (and so the subsequent lock() is likely to block).
Some people have expressed a great deal of concern about livelock with this algorithm. And indeed, experiments have revealed that a very small amount of temporary livelock can occur under high-contention conditions such as those emphasized by the tests herein. The 4th algorithm is designed to reduce this small amount of experimentally observed livelock to zero.
This algorithm is a minor variant of the smart algorithm, and has experimentally shown small performance advantages over it on OS X: after the thread has failed a try_lock() on a mutex, but prior to blocking on that mutex, yield the processor to anyone else who might need it. Results may be different on other OS's. But on OS X, it is just the polite thing to do. And the system as a whole (though not the individual Philosopher) benefits under conditions where processing power is already maxed out.
    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            {
                std::unique_lock<L0> u0(l0);
                if (l1.try_lock())
                {
                    u0.release();
                    break;
                }
            }
            std::this_thread::yield();
            {
                std::unique_lock<L1> u1(l1);
                if (l0.try_lock())
                {
                    u1.release();
                    break;
                }
            }
            std::this_thread::yield();
        }
    }
In the figure to the right I have set up 31 tests. Each test has a number of philosophers seated at a round table. The number of philosophers varies from 2 to 32. There is one fork (mutex) between every two adjacent philosophers. Each philosopher must obtain his two adjacent forks in order to eat. This test shows timing results for a 4-processor Intel Core i5 running Apple OS X 10.9.2.
The vertical axis represents the total number of seconds it takes for all philosophers at the table to get a total of 30 seconds each of "quality fork time."
When there are only two philosophers at the table, it seems clear that only one of them can eat at a time, and thus it will take 60 seconds for each of them to eat for 30 seconds. This is true whether they each eat for 30 seconds at a time, or each eat for a few milliseconds at a time.
The chart shows in blue a theoretical minimum amount of time the meal must take for everyone to get 30 seconds. When there is an even number of philosophers at the table, it is easy to see that every other philosopher can eat at one time (say the even-numbered philosophers, if you numbered them), and then the odd-numbered philosophers can eat. Whether they all eat for a few milliseconds or 30 seconds each, it should take no more than 60 seconds for everyone to get a full meal, assuming one has a sufficient number of processors to supply to each philosopher whenever he needs one (has obtained two forks).
When the number of philosophers at the table is odd, it can be shown that the theoretical minimum meal time is 30 * N /((N-1)/2) seconds. When N is 3, it is easy to see that there is no room for concurrency. Just like the N == 2 case, only one philosopher can eat at a time. Thus the minimum meal time is 90 seconds. For higher odd numbers of philosophers at the table, the minimum time actually drops, and approaches 60 seconds as N goes to infinity.
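Both cases can be captured in one formula (a restatement of the reasoning above, not a new measurement): at most floor(N/2) philosophers can hold two adjacent forks at once, and the whole meal requires 30*N philosopher-seconds of eating, so

    T_min(N) = 30*N / floor(N/2)

which equals 60 seconds for even N, and 60*N/(N-1) seconds for odd N: 90 seconds at N == 3, approaching 60 seconds as N grows.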
All 4 algorithms nail the N == 2 and N == 3 cases. However, with N >= 4, the ordered algorithm breaks away from the theoretical minimum and climbs in a close-to-linear fashion with the number of dining philosophers. This essentially represents the algorithm's inability to distribute processing power evenly among the philosophers. To the right of the chart are CPU load graphics, one for each algorithm at the N == 32 case. The second from the top is for the ordered algorithm. In this figure one can see that the CPU load tapers off somewhat at the end of the test, indicating that towards the end there are fewer and fewer philosophers still hungry. This also indicates that those still-hungry philosophers were experiencing mild starvation earlier in the test (else they would not still be hungry).
The persistent algorithm starts off outperforming the ordered algorithm. Up through N == 5, the results are virtually indistinguishable from the theoretical minimum. And up through N == 7, the results are only slightly worse than the theoretical minimum. But by N == 8, the cost of this algorithm starts to rise sharply as more philosophers come to the table. The rise is so sharp that the algorithm begins to be outperformed by the ordered algorithm by N == 13. The top CPU load graphic indicates that a significant portion of the CPU is spent in system calls (the red area), which is consistent with the notion that this algorithm is needlessly locking and unlocking mutexes without getting any real work (eating) done.
The smart and smart & polite algorithms have very similar performance. They both perform identically to the theoretical minimum up through N == 9. This is the point at which, even with a perfectly fair distribution of processing power, the philosophers start collectively asking for more than 4 CPUs. As the number of philosophers increases, the slope of the smart & polite algorithm is slightly less than that of the smart algorithm, and appears equal to the slope of the ordered algorithm. Note that in the CPU load graphics for these two algorithms, the meal ends relatively abruptly, with all philosophers finishing at about the same time. Indeed, the success of these algorithms resides in the fact that all philosophers are treated equally during the meal, none being starved any more than the others.
The question naturally arises now:
How do these results scale with the number of processors?
As luck would have it, I have an 8 core Intel Core i7 running Apple OS X 10.9.2 available to run the exact same code.
The results are qualitatively the same.
The ordered algorithm again breaks from the minimum at N == 4, and rises. The rise is not as rapid as on the 4 core machine, but neither is the performance twice as fast.
The persistent algorithm sticks to the theoretical minimum up to about N == 13, and then sharply rises. The rise is sharp enough that the ordered algorithm would eventually outperform it, but the crossover no longer happens below N == 32. The CPU load graphic again shows excessive CPU devoted to system calls.
The smart and smart & polite algorithms stay on the theoretical minimum out to 17 philosophers, and then rise approximately linearly.
The smart & polite algorithm is consistently the highest performing algorithm, or at the very least, never outperformed by any of the other 3 algorithms. Concerns of any significant livelock are not supported by these results. Though please feel free to run this test on your own machine and publicly report your results.
The next question to naturally arise is:
How do these results scale with the number of mutexes one needs to lock?
Though not a complete answer, I have created a partial answer to the above question which should shed some light.
Instead of the philosophers sitting at a round table, imagine they are each seated at a vertex of a Platonic solid. And further imagine that we restrict ourselves to those solids for which each vertex is the intersection of exactly 3 faces (and thus 3 edges). And finally imagine that there is a fork at the middle of each edge.
Due to the limits of both Euclidean geometry and my visualization abilities, this experiment will take on only 3 forms (the philosopher and fork counts below match the test code at the end of this paper):

- a tetrahedron: 4 philosophers and 6 forks,
- a cube: 8 philosophers and 12 forks,
- a dodecahedron: 20 philosophers and 30 forks.

In order to eat, each philosopher must obtain all 3 of his adjacent forks.
The tetrahedron case is like the 2- and 3-philosopher round-table cases. It is impossible for two or more philosophers to eat at once, and so the best that can be done is for the 4 philosophers to each eat by themselves (30 seconds each), for a total of 120 seconds. All 4 algorithms nail this degenerate case.
The cube case is easy to analyze. Four of the eight philosophers can eat at any one time. Thus in two 30-second time slots, everyone can get a full tummy. However, only the smart and smart & polite algorithms are able to achieve this minimum (with 4 cores). The persistent and ordered algorithms are both significantly slower than this. As we saw in the 2-D tests, the ordered algorithm is never faster than the degenerate sequential case.
As we move to the dodecahedron, 4 cores are not enough for even the smart and smart & polite algorithms to stay at the theoretical minimum. But the smart & polite algorithm shows a small advantage over the smart algorithm. A small amount of livelock is visible on the smart algorithm's CPU load graphic (3rd from the top), visualized by the bits of red.
The ordered algorithm running on the dodecahedron shows a relatively gradual "trailing off" slope in the CPU load graphic, indicating mild starvation, and thus penalizing the performance of the algorithm.
The persistent algorithm running on the dodecahedron performs the worst. In the CPU load graphic a significant amount of cpu wasted on system calls is quite evident.
Running the same test on an 8 core machine is qualitatively the same but with a few differences worth pointing out:
The persistent algorithm now has enough processing power to nail the theoretical minimum on the cube.
The slope of the expense of the persistent algorithm from the cube to the dodecahedron appears greater than that of the ordered algorithm, and thus would likely eventually become more expensive than the ordered algorithm as more philosophers are added.
The smart and smart & polite algorithms now have enough processing power to very nearly nail the theoretical minimum on the dodecahedron. This theoretical minimum requires 8 processors, so this machine is just on the edge of delivering that minimum.
Concerns of any significant livelock appear to be 100% addressed by the smart & polite algorithm. Please feel free to run this test on your own machine and publicly report your results.
Four algorithms have been presented and performance tested on 4- and 8-core machines, under a variety of tests. Each of these four algorithms is a different implementation of std::lock() for the two- and three-mutex locking cases. The algorithm termed herein smart & polite has been shown to never be worse than any other algorithm, and to often be significantly superior.
The smart & polite algorithm attempts to lock the first mutex in the group and then try-lock the rest. If any mutex fails the try-lock, everything is unlocked, a call to yield is made, and then the algorithm repeats except that the mutex that previously failed the try-lock is the first one locked. Code is given for the 2 and 3 mutex cases. Expanding this algorithm to N variadic mutexes is left as an exercise for the reader, but has been implemented in libc++ (i.e. it is not unreasonably difficult).
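For the curious, here is one way such a variadic expansion might look. This is a rough sketch under simplifying assumptions, not libc++'s actual implementation: upon failure it simply rotates the argument pack so a different lockable is blocked on next time (the algorithm described above blocks on the exact lockable that failed), and it recurses where a production version would iterate:

    // Sketch only: generalizes the smart & polite strategy to N lockables.
    template <class L0, class L1, class L2, class... L3>
    void
    lock(L0& l0, L1& l1, L2& l2, L3&... l3)
    {
        {
            std::unique_lock<L0> u0(l0);            // block on the first lockable
            if (std::try_lock(l1, l2, l3...) == -1) // try the rest without blocking
            {
                u0.release();                       // success: keep everything locked
                return;
            }
        }                                           // failure: u0 unlocks l0 here
        std::this_thread::yield();                  // be polite before retrying
        ::lock(l1, l2, l3..., l0);                  // rotate and try again
    }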
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    // Select one of the four locking algorithms at compile time, e.g. with
    // -DSMART_POLITE, -DSMART, -DPERSISTENT, or -DORDERED.

    #ifdef SMART_POLITE

    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            {
                std::unique_lock<L0> u0(l0);
                if (l1.try_lock())
                {
                    u0.release();
                    break;
                }
            }
            std::this_thread::yield();
            {
                std::unique_lock<L1> u1(l1);
                if (l0.try_lock())
                {
                    u1.release();
                    break;
                }
            }
            std::this_thread::yield();
        }
    }

    #elif defined(SMART)

    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            {
                std::unique_lock<L0> u0(l0);
                if (l1.try_lock())
                {
                    u0.release();
                    break;
                }
            }
            {
                std::unique_lock<L1> u1(l1);
                if (l0.try_lock())
                {
                    u1.release();
                    break;
                }
            }
        }
    }

    #elif defined(PERSISTENT)

    template <class L0, class L1>
    void
    lock(L0& l0, L1& l1)
    {
        while (true)
        {
            std::unique_lock<L0> u0(l0);
            if (l1.try_lock())
            {
                u0.release();
                break;
            }
        }
    }

    #elif defined(ORDERED)

    template <class L0>
    void
    lock(L0& l0, L0& l1)
    {
        if (l0.mutex() < l1.mutex())
        {
            std::unique_lock<L0> u0(l0);
            l1.lock();
            u0.release();
        }
        else
        {
            std::unique_lock<L0> u1(l1);
            l0.lock();
            u1.release();
        }
    }

    #endif

    class Philosopher
    {
        std::mt19937_64 eng_{std::random_device{}()};

        std::mutex& left_fork_;
        std::mutex& right_fork_;
        std::chrono::milliseconds eat_time_{0};
        static constexpr std::chrono::seconds full_{30};

    public:
        Philosopher(std::mutex& left, std::mutex& right);
        void dine();

    private:
        void eat();
        bool flip_coin();
        std::chrono::milliseconds get_eat_duration();
    };

    constexpr std::chrono::seconds Philosopher::full_;

    Philosopher::Philosopher(std::mutex& left, std::mutex& right)
        : left_fork_(left)
        , right_fork_(right)
    {}

    void
    Philosopher::dine()
    {
        while (eat_time_ < full_)
            eat();
    }

    void
    Philosopher::eat()
    {
        using Lock = std::unique_lock<std::mutex>;
        Lock first;
        Lock second;
        // Randomize the order in which the forks are requested so that the
        // request order is disassociated from the ::lock algorithm under test.
        if (flip_coin())
        {
            first = Lock(left_fork_, std::defer_lock);
            second = Lock(right_fork_, std::defer_lock);
        }
        else
        {
            first = Lock(right_fork_, std::defer_lock);
            second = Lock(left_fork_, std::defer_lock);
        }
        auto d = get_eat_duration();
        ::lock(first, second);
        // Busy-wait while holding both forks to simulate eating under full load.
        auto end = std::chrono::steady_clock::now() + d;
        while (std::chrono::steady_clock::now() < end)
            ;
        eat_time_ += d;
    }

    bool
    Philosopher::flip_coin()
    {
        std::bernoulli_distribution d;
        return d(eng_);
    }

    std::chrono::milliseconds
    Philosopher::get_eat_duration()
    {
        // Eat for 1-10 ms at each sitting, but never beyond a full tummy.
        std::uniform_int_distribution<> ms(1, 10);
        return std::min(std::chrono::milliseconds(ms(eng_)), full_ - eat_time_);
    }

    int
    main()
    {
    #ifdef SMART_POLITE
        std::cout << "SMART_POLITE\n";
    #elif defined(SMART)
        std::cout << "SMART\n";
    #elif defined(PERSISTENT)
        std::cout << "PERSISTENT\n";
    #elif defined(ORDERED)
        std::cout << "ORDERED\n";
    #endif
        for (unsigned nt = 2; nt <= 32; ++nt)
        {
            // One fork (mutex) between each pair of adjacent philosophers.
            std::vector<std::mutex> table(nt);
            std::vector<Philosopher> diners;
            for (unsigned j = 0; j < table.size(); ++j)
            {
                unsigned k = j < table.size() - 1 ? j + 1 : 0;
                diners.push_back(Philosopher(table[j], table[k]));
            }
            std::vector<std::thread> threads(diners.size());
            unsigned i = 0;
            auto t0 = std::chrono::high_resolution_clock::now();
            for (auto& t : threads)
            {
                t = std::thread(&Philosopher::dine, &diners[i]);
                ++i;
            }
            for (auto& t : threads)
                t.join();
            auto t1 = std::chrono::high_resolution_clock::now();
            using secs = std::chrono::duration<float>;
            std::cout << "nt = " << nt << " : " << secs(t1 - t0).count()
                      << std::endl;
        }
    }
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    #ifdef SMART_POLITE

    template <class L0, class L1, class L2>
    void
    lock(L0& l0, L1& l1, L2& l2)
    {
        // i is the index of the lockable to block on next.  std::try_lock
        // returns -1 on success, else the index of the lockable that failed.
        int i = 0;
        while (true)
        {
            switch (i)
            {
            case 0:
                {
                    std::unique_lock<L0> u0(l0);
                    i = std::try_lock(l1, l2);
                    if (i == -1)
                    {
                        u0.release();
                        return;
                    }
                }
                ++i;  // map the failed index (0 or 1) to l1 or l2
                std::this_thread::yield();
                break;
            case 1:
                {
                    std::unique_lock<L1> u1(l1);
                    i = std::try_lock(l2, l0);
                    if (i == -1)
                    {
                        u1.release();
                        return;
                    }
                }
                if (i == 1)
                    i = 0;
                else
                    i = 2;
                std::this_thread::yield();
                break;
            case 2:
                {
                    std::unique_lock<L2> u2(l2);
                    i = std::try_lock(l0, l1);
                    if (i == -1)
                    {
                        u2.release();
                        return;
                    }
                }
                std::this_thread::yield();
                break;
            }
        }
    }

    #elif defined(SMART)

    template <class L0, class L1, class L2>
    void
    lock(L0& l0, L1& l1, L2& l2)
    {
        int i = 0;
        while (true)
        {
            switch (i)
            {
            case 0:
                {
                    std::unique_lock<L0> u0(l0);
                    i = std::try_lock(l1, l2);
                    if (i == -1)
                    {
                        u0.release();
                        return;
                    }
                }
                ++i;
                break;
            case 1:
                {
                    std::unique_lock<L1> u1(l1);
                    i = std::try_lock(l2, l0);
                    if (i == -1)
                    {
                        u1.release();
                        return;
                    }
                }
                if (i == 1)
                    i = 0;
                else
                    i = 2;
                break;
            case 2:
                {
                    std::unique_lock<L2> u2(l2);
                    i = std::try_lock(l0, l1);
                    if (i == -1)
                    {
                        u2.release();
                        return;
                    }
                }
                break;
            }
        }
    }

    #elif defined(PERSISTENT)

    template <class L0, class L1, class L2>
    void
    lock(L0& l0, L1& l1, L2& l2)
    {
        while (true)
        {
            std::unique_lock<L0> u0(l0);
            if (std::try_lock(l1, l2) == -1)
            {
                u0.release();
                break;
            }
        }
    }

    #elif defined(ORDERED)

    template <class Compare, class ForwardIterator>
    void
    sort3(ForwardIterator x, ForwardIterator y, ForwardIterator z, Compare c)
    {
        using std::swap;
        if (!c(*y, *x))          // if x <= y
        {
            if (!c(*z, *y))      // if y <= z
                return;          // x <= y && y <= z
                                 // x <= y && y > z
            swap(*y, *z);        // x <= z && y < z
            if (c(*y, *x))       // if x > y
                swap(*x, *y);    // x < y && y <= z
            return;              // x <= y && y < z
        }
        if (c(*z, *y))           // x > y, if y > z
        {
            swap(*x, *z);        // x < y && y < z
            return;
        }
        swap(*x, *y);            // x > y && y <= z
        if (c(*z, *y))           // if y > z
            swap(*y, *z);        // x <= y && y < z
    }                            // x <= y && y <= z

    template <class L0>
    void
    lock(L0& l0, L0& l1, L0& l2)
    {
        // Lock in a globally consistent order: sort by mutex address first.
        L0* locks[] = {&l0, &l1, &l2};
        sort3(&locks[0], &locks[1], &locks[2],
              [](L0* x, L0* y) { return x->mutex() < y->mutex(); });
        std::unique_lock<L0> u0(*locks[0]);
        std::unique_lock<L0> u1(*locks[1]);
        locks[2]->lock();
        u1.release();
        u0.release();
    }

    #endif

    class Philosopher
    {
        std::mt19937_64 eng_{std::random_device{}()};

        std::mutex& left_fork_;
        std::mutex& center_fork_;
        std::mutex& right_fork_;
        std::chrono::milliseconds eat_time_{0};
        static constexpr std::chrono::seconds full_{30};

    public:
        Philosopher(std::mutex& left, std::mutex& center, std::mutex& right);
        void dine();

    private:
        void eat();
        std::chrono::milliseconds get_eat_duration();
    };

    constexpr std::chrono::seconds Philosopher::full_;

    Philosopher::Philosopher(std::mutex& left, std::mutex& center,
                             std::mutex& right)
        : left_fork_(left)
        , center_fork_(center)
        , right_fork_(right)
    {}

    void
    Philosopher::dine()
    {
        while (eat_time_ < full_)
            eat();
    }

    void
    Philosopher::eat()
    {
        using Lock = std::unique_lock<std::mutex>;
        // Randomize the order in which the three forks are requested.
        std::mutex* mutexes[] = {&left_fork_, &center_fork_, &right_fork_};
        std::shuffle(std::begin(mutexes), std::end(mutexes), eng_);
        Lock first(*mutexes[0], std::defer_lock);
        Lock second(*mutexes[1], std::defer_lock);
        Lock third(*mutexes[2], std::defer_lock);
        auto d = get_eat_duration();
        ::lock(first, second, third);
        // Busy-wait while holding all three forks to simulate eating under load.
        auto end = std::chrono::steady_clock::now() + d;
        while (std::chrono::steady_clock::now() < end)
            ;
        eat_time_ += d;
    }

    std::chrono::milliseconds
    Philosopher::get_eat_duration()
    {
        std::uniform_int_distribution<> ms(1, 10);
        return std::min(std::chrono::milliseconds(ms(eng_)), full_ - eat_time_);
    }

    int
    main()
    {
    #ifdef SMART_POLITE
        std::cout << "SMART_POLITE\n";
    #elif defined(SMART)
        std::cout << "SMART\n";
    #elif defined(PERSISTENT)
        std::cout << "PERSISTENT\n";
    #elif defined(ORDERED)
        std::cout << "ORDERED\n";
    #endif
        // Select the topology by enabling exactly one of the branches below.
        // Philosophers sit at vertices; each edge holds one fork (mutex).
    #if 0
        std::cout << "tetrahedron\n";
        std::vector<std::mutex> table(6);
        std::vector<Philosopher> diners;
        diners.push_back(Philosopher(table[0], table[2], table[3]));
        diners.push_back(Philosopher(table[0], table[1], table[4]));
        diners.push_back(Philosopher(table[1], table[2], table[5]));
        diners.push_back(Philosopher(table[3], table[4], table[5]));
    #elif 0
        std::cout << "cube\n";
        std::vector<std::mutex> table(12);
        std::vector<Philosopher> diners;
        diners.push_back(Philosopher(table[0], table[3], table[4]));
        diners.push_back(Philosopher(table[0], table[1], table[5]));
        diners.push_back(Philosopher(table[1], table[2], table[6]));
        diners.push_back(Philosopher(table[2], table[3], table[7]));
        diners.push_back(Philosopher(table[4], table[8], table[11]));
        diners.push_back(Philosopher(table[5], table[8], table[9]));
        diners.push_back(Philosopher(table[6], table[9], table[10]));
        diners.push_back(Philosopher(table[7], table[10], table[11]));
    #elif 1
        std::cout << "dodecahedron\n";
        std::vector<std::mutex> table(30);
        std::vector<Philosopher> diners;
        diners.push_back(Philosopher(table[0], table[4], table[5]));
        diners.push_back(Philosopher(table[0], table[1], table[6]));
        diners.push_back(Philosopher(table[1], table[2], table[7]));
        diners.push_back(Philosopher(table[2], table[3], table[8]));
        diners.push_back(Philosopher(table[3], table[4], table[9]));
        diners.push_back(Philosopher(table[5], table[10], table[19]));
        diners.push_back(Philosopher(table[10], table[11], table[20]));
        diners.push_back(Philosopher(table[6], table[11], table[12]));
        diners.push_back(Philosopher(table[12], table[13], table[21]));
        diners.push_back(Philosopher(table[7], table[13], table[14]));
        diners.push_back(Philosopher(table[14], table[15], table[22]));
        diners.push_back(Philosopher(table[8], table[15], table[16]));
        diners.push_back(Philosopher(table[16], table[17], table[23]));
        diners.push_back(Philosopher(table[9], table[17], table[18]));
        diners.push_back(Philosopher(table[18], table[19], table[24]));
        diners.push_back(Philosopher(table[20], table[25], table[29]));
        diners.push_back(Philosopher(table[21], table[25], table[26]));
        diners.push_back(Philosopher(table[22], table[26], table[27]));
        diners.push_back(Philosopher(table[23], table[27], table[28]));
        diners.push_back(Philosopher(table[24], table[28], table[29]));
    #endif
        std::vector<std::thread> threads(diners.size());
        unsigned i = 0;
        auto t0 = std::chrono::high_resolution_clock::now();
        for (auto& t : threads)
        {
            t = std::thread(&Philosopher::dine, &diners[i]);
            ++i;
        }
        for (auto& t : threads)
            t.join();
        auto t1 = std::chrono::high_resolution_clock::now();
        using secs = std::chrono::duration<float>;
        std::cout << secs(t1 - t0).count() << '\n';
    }
Thanks to Jonathan Wakely for his careful review.