1. Foundations of Property-Based Testing

Testing tools and techniques are rarely considered exciting; many are promising, but ultimately, testing is often perceived as a necessary chore on the way to solving the problem at hand. Solving that problem is the actual interesting objective, and the more of the testing that can be automated away, the better.

Property-based testing changes that. It is one of the most exciting practices in software development from the last few years. It promises better, more solid tests than nearly any other tool out there, with very little code. This means, similarly, that the software developed with its help should also get better.[1] It offers a lot of automation to keep the boring parts out, but at the cost of a steep learning curve and requiring a lot of thinking to get it right. This is what is so exciting about it! It’s a high-investment, high-reward kind of deal.

1.1. Promises of Property-Based Testing

What are these promises? At the Erlang Factory in March 2016, a slide set[2] from Thomas Arts reported that QuickCheck (the canonical property-based testing tool) had been used to run tests on Project FIFO, an open source cloud project.[3] The end results were:

  • 60,000 lines of production code

  • 460 lines of QuickCheck tests

  • 25 bugs uncovered including: timing errors, race conditions, type errors, incorrect use of library API, errors in documentation, errors in the logic, system limits errors, errors in fault handling, and a hardware error.

In another example, Joseph Wayne Norton ran a QuickCheck suite of under 600 lines over Google’s leveldb to find sequences of 17 and 31 calls that could corrupt databases with ghost keys,[4] which probably nobody would have found easy to test or replicate otherwise.

Those are some amazing results; that’s a surprisingly small amount of code to find a very high number of errors in software that was already otherwise tested and running in production.

Property-based testing ended up wedging itself into multiple industries, including mission-critical telecommunication components,[5] databases,[6] components of Heroku’s routing and certificate-management layer,[7] and even cars.[8]

The reward is good, but what is the investment? Property tests have to be written differently from traditional ones, and tend to require more thinking. A lot more. Good property-based testing is a learned and practiced skill, much like playing a musical instrument or getting good at sports: there are plenty of friendly ways to get started and get benefits from it, but the skill ceiling is very high. People can write property tests for years and still find ways to improve and get more out of them.

In the same manner, more advanced users can be impressive to the point of involuntarily instilling a feeling of inadequacy in newcomers who cannot imagine themselves doing what veterans are doing. To add insult to injury, a lot of tutorials about property-based testing tend to show very simple programs tested in a complex manner. The effort-to-reward ratio appears very high, and it becomes easy to set property tests aside as one of the many things labelled 'wizardry' on the programming shelf: nice to know about in the academic sense, but never to be used in practice. This is important to keep in mind. Effort is continuous, but progress is gradual and stepwise. Each wall hit reveals an opportunity for improvement.

1.2. Properties First

The first step is to stop thinking that property-based testing is about tests. It’s about properties. The difference comes to light when the two are compared.

Let’s assume there is a function whose goal is to act the same way a cash register does. Given bills and coins already in the register, an amount due to be paid by a customer, and money handed by the customer to the cashier, the function should return bills and coins in an amount equal to the change due to the customer.

An approach based on unit tests may look like the following:

%% Money in the cash register
Register = [{20.00, 1}, {10.00, 2}, {5.00, 4}, {1.00, 10}, {0.25, 10}, {0.01, 100}],
%% Change               = cash(Register, Price, MoneyPaid),
[{10.00, 1}]            = cash(Register, 10.00, 20.00),  %% (1)
[{10.00, 1}, {0.25, 1}] = cash(Register,  9.75, 20.00),  %% (2)
[{0.01, 18}]            = cash(Register,  0.82,  1.00),
[{10.00, 1}, {5.00, 1}, {1.00, 3}, {0.25, 2}, {0.01, 13}]
                        = cash(Register,  1.37, 20.00).  %% (3)
1 If the customer is paying for a $10 item with a $20 bill, $10 is expected back
2 For a $9.75 purchase paid with $20, a $10 bill and a quarter should be returned, for a total of $10.25
3 A $1.37 item paid with a $20 bill similarly returns $18.63, with the specific cuts shown

This test-based approach may be familiar. Changing to a properties mindset is, surprisingly, also familiar. When coming up with regular tests, guiding principles and expectations about the code at hand are already around. That’s where the examples come from. Property-based testing aims to harness these guiding principles and expectations and use these directly as a test, rather than specific examples. The difficult part is figuring out how to translate vague ideas into general rules expressed as code.

In this example case, the properties could probably be:

  • The amount of change is always going to sum up to the money paid minus the price charged.

  • The bills and coins handed back for change are going to start from the biggest bill possible first, down to the smallest coin possible. This could alternatively be defined as trying to hand the customer as few individual pieces of money as possible. Those seem to be equivalent, just approached differently.[9]
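As a rough illustration, the two rules above could be written as plain Erlang predicates. This is only a sketch under stated assumptions: the module name cash_props and its function names are hypothetical, and amounts are held as integer cents rather than the floats of the earlier example so that sums compare exactly. The second predicate checks only the ordering half of the second rule, not that each piece is the biggest one possible.

```erlang
-module(cash_props).
-export([total/1, sums_to_change_due/3, largest_first/1]).

%% Assumption: money is a list of {Denomination, Count} pairs in integer
%% cents, not the float amounts used in the unit-test example.

%% Total value, in cents, of a list of {Denomination, Count} pairs.
total(Pieces) ->
    lists:sum([Denom * Count || {Denom, Count} <- Pieces]).

%% Rule 1: the change sums to the money paid minus the price charged.
sums_to_change_due(Change, Price, Paid) ->
    total(Change) =:= Paid - Price.

%% Part of rule 2: denominations are listed from largest to smallest,
%% each one appearing at most once.
largest_first(Change) ->
    Denoms = [Denom || {Denom, _Count} <- Change],
    Denoms =:= lists:reverse(lists:usort(Denoms)).
```

Unlike the unit tests, neither predicate names a specific register or price; both are meant to hold for whatever input is supplied.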

Property-based testing is the act of using a framework with which these rules can be encoded and tested to see if they hold.

A property-based testing framework will automate what would otherwise be tedious manual labour: it will generate a series of possible inputs and pass them to the properties, which must say whether they pass or fail. Then it will look into any failing test case and try to find exactly what made it break. For this example, it would therefore have to generate an amount of money in the register, a price to pay, and an amount of money paid. It would call the function, and check that the two properties above remain true in all cases.
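To make that loop concrete, here is a hand-rolled sketch of the generate-and-check cycle using only the standard library. It is not how any particular framework is implemented. The module cash_check, its function names, the fixed denominations, and the naive greedy cash/3 are all assumptions made for illustration (the real function is not shown in the text), and amounts are integer cents. Only the first property is checked, and the narrowing-down of a failing case that real frameworks perform is left out.

```erlang
-module(cash_check).
-export([cash/3, prop_exact_change/1, random_case/0, check/1]).

%% Assumption: a fixed set of denominations, in integer cents.
-define(DENOMS, [2000, 1000, 500, 100, 25, 1]).

%% Hypothetical stand-in for the function under test: hand back change
%% from the register, largest denomination first. Raises an error when
%% the register cannot cover the full amount.
cash(Register, Price, Paid) when Paid >= Price ->
    Sorted = lists:reverse(lists:keysort(1, Register)),
    give(Sorted, Paid - Price).

give(_Register, 0) -> [];
give([], Left) -> error({cannot_make_change, Left});
give([{Denom, Count} | Rest], Left) ->
    case min(Count, Left div Denom) of
        0 -> give(Rest, Left);
        N -> [{Denom, N} | give(Rest, Left - N * Denom)]
    end.

%% The property: the change sums to the money paid minus the price.
prop_exact_change({Register, Price, Paid}) ->
    Change = cash(Register, Price, Paid),
    lists:sum([D * N || {D, N} <- Change]) =:= Paid - Price.

%% Generate one random input: a register holding 0 to 4 pieces of each
%% denomination, a price, and an amount paid that is at least the price.
random_case() ->
    Register = [{Denom, rand:uniform(5) - 1} || Denom <- ?DENOMS],
    Price = rand:uniform(5000),
    Paid = Price + rand:uniform(5000) - 1,
    {Register, Price, Paid}.

%% Run the property against up to N random inputs, stopping at the first
%% input for which it returns false or raises an error.
check(0) -> ok;
check(N) when N > 0 ->
    Case = random_case(),
    try prop_exact_change(Case) of
        true -> check(N - 1);
        false -> {failed, Case, property_false}
    catch
        error:Reason -> {failed, Case, Reason}
    end.
```

Calling cash_check:check(100) returns ok if every generated case passes, or {failed, Input, Reason} for the first input that does not. With registers this small, the greedy stand-in is likely to run out of the right pieces within a handful of cases.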

Running the cash register function through such a framework would sooner or later generate an input such as cash([{5.00, 1}], 20.00, 30.00), and then the program would crash. Paying for a $20 purchase with $30, even if the register holds only $5, is entirely doable: take $10 from the $30 and give it back to the customer. Is that specific amount possible to give back though?

Because the money taken from the customer does not come in as bills or coins, there is no way to use part of the input money to form the output. The approach taken for this function is wrong.[10]
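One way to see what is missing is to sketch what would happen if the payment did arrive as bills and coins. The helper below is purely illustrative and is not the design the text goes on to adopt: the module cash_pieces and the function add_payment/2 are hypothetical names, and amounts are assumed to be integer cents in the same {Denomination, Count} shape as the register.

```erlang
-module(cash_pieces).
-export([add_payment/2]).

%% Assumption: the payment arrives as {Denomination, Count} pairs in
%% integer cents, like the register. This is not the cash/3 signature
%% from the text, which receives the payment as a single amount.

%% Add the customer's bills and coins to the register, summing the counts
%% per denomination and returning the result largest denomination first.
add_payment(Register, Payment) ->
    Merged = lists:foldl(
        fun({Denom, Count}, Acc) ->
            maps:update_with(Denom, fun(C) -> C + Count end, Count, Acc)
        end,
        #{},
        Register ++ Payment),
    lists:reverse(lists:keysort(1, maps:to_list(Merged))).
```

If the $30 in the failing input were handed over as a $20 bill and a $10 bill (an assumption, since the input only says 30.00), then add_payment([{500, 1}], [{2000, 1}, {1000, 1}]) would yield a register that contains the $10 bill owed back to the customer.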

The reasonable-looking test cases from earlier turned out to show too few of the possibilities. The true weaknesses of the design were not those that could easily be imagined, but those that came from exercising and exploring the problem space.

Whereas TDD helps design code to have it conform to expectations and demands, trying a similar approach with property-based testing forces the exploration of the program’s behaviour to see what it can or cannot do.

The reality is that when doing property-based testing, the design and growth of tests tend to require an equal amount of growth and design of the program itself. Half the time will be spent debugging programs so they better solve the problem at hand, while the other half will be spent correcting tests so they better represent reality. A hundred percent of the time will be spent fixing bugs in the understanding of the problem space as a whole, as properties nobody could suspect were there get discovered.

That’s the power of property-based testing.

1. which is good, since in the author’s opinion, overall software quality is fairly dreadful (including his own software)
2. see slides, video
3. https://project-fifo.net
4. see slides and issue
5. These include Ericsson 4G base stations and Motorola gateways
6. Basho’s Riak is the biggest example
7. At least it did while I worked there
8. the AUTOSAR standard has been tested with QuickCheck
9. In fact they are not! The first does not give the same result as the second for every set of currency denominations, and the latter is equivalent to the knapsack problem, which is NP-complete
10. When I wrote this example, I was only intending to show how we could find edge cases, but just thinking like a property-based testing framework showed me my initial approach was just not good enough