Properties are general rules that describe a program’s behaviour. One single rule should logically be applicable to all kinds of input and output for a given piece of code. By opposition, a traditional test suite is made up of multiple specific examples which, if all goes well, should be representative of the overall program’s desired behaviour. Properties are operating at a higher level, generically describing the expected outcome of an operation.
As an example, a function may have the role of finding the biggest element in a list. With regular tests, which usually consist of enumerated sample runs, the tests may look like:
5 = biggest([1, 2, 3, 4, 5])
8 = biggest([3, 8, 7, -1])
-5 = biggest([-10, -5, -901])
0 = biggest([0])
The list of assertions put together by the developer will ideally cover all edge cases. The four assertions here cover lists that are pre-sorted, lists without any specific order, single-element lists, and lists containing negative and/or positive numbers. Any other edge cases that exist will only be tested if the programmer both knows about them and remembers to test them.
The quality of such tests is purely determined by their author’s ability to think of all kinds of cases, and depends on their experience, attention to detail, and overall knowledge. It is as unreliable as the number of "unknown unknowns" surrounding it. The way to frame these tests is that they "show the presence, not the absence, of bugs."
With property-based testing, the expectation is to be able to define a property such as "for any given list, the function returns the biggest number possible", and then pass many lists to it. The more creative the list generation is, the more reliable the test should be. The underlying assumption is that through this kind of dynamic exploration, "unknown unknowns" can be discovered. The test would still show the presence (and not the absence) of bugs, but it should find a lot more of them.
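The idea can be sketched in Python, with the standard library's random module standing in for a real property-based testing framework; the biggest() implementation here is hypothetical, written only so the property has something to run against:

```python
import random

def biggest(numbers):
    # Hypothetical implementation under test: single-pass maximum.
    largest = numbers[0]
    for n in numbers[1:]:
        if n > largest:
            largest = n
    return largest

# Property: for any non-empty list, the result is an element of the list
# and no element of the list is larger than it.
for _ in range(1000):
    numbers = [random.randint(-1000, 1000) for _ in range(random.randint(1, 50))]
    result = biggest(numbers)
    assert result in numbers
    assert all(n <= result for n in numbers)
```

A real framework would additionally shrink any failing list down to a minimal counterexample, which this hand-rolled loop does not do.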
The tricky aspect for this specific test case is that the property is mostly identical to what the code should already do: return the biggest number in the list. This is a frequent problem when property-based testing is used for unit testing; the function is so simple that it is hard to come up with a general rule that differs from the existing code.
Instead of falling back to standard testing, modeling can be a very effective approach. Modeling essentially requires the developer to use an indirect, very simple (and often algorithmically inefficient) implementation to validate the property. That is the model. It should be so simple that it is obviously correct. The actual implementation of the code (which usually tries to be efficient) can then be compared with the model. If both implementations consistently behave the same way, there is a good chance that they are equivalent.
In the case of a function like biggest, the usual efficient implementation runs in O(n) complexity, using a single pass: the largest value seen yet is held in memory, and replaced any time a larger one is spotted. Once the list has been fully scanned, the largest value seen so far is also the largest value of the list, and it can be returned as the correct answer. An alternative, slower implementation would be to sort the list in increasing order, likely through a trusted function in the standard library, and then take the last entry of the list as the biggest one. The complexity of such a function is O(n log n), which is not as good.
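As an illustration, the two implementations could look like the following Python sketch (names are hypothetical):

```python
def biggest(numbers):
    # Efficient O(n) version: hold the largest value seen so far in
    # memory, replacing it any time a larger one is spotted.
    largest = numbers[0]
    for n in numbers[1:]:
        if n > largest:
            largest = n
    return largest

def model_biggest(numbers):
    # O(n log n) model: sort ascending with the trusted standard-library
    # sort, then take the last entry.
    return sorted(numbers)[-1]
```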
The latter implementation can however act as an acceptable model:
biggest(List) =:= last(sort(List))
With the assumption that sort(List) is correct, a property-based testing framework will then be able to generate all kinds of lists to check that this is always true. In the cases where the property does not hold, the test run can abort. Potential errors could include:
- the list is empty, and a max value is undefined
- [0,0]: the list has two values that compare equal; if the implementation used > as a comparator rather than >=, it would then crash
The first one could be considered forgetfulness of an edge case, whereas the second one could only be found with traditional tests if their author had thought of repeating the same value twice.
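A hand-rolled comparison run can be sketched in Python; the random module stands in for a real framework, which would also shrink any failing case to a minimal counterexample:

```python
import random

def biggest(numbers):
    largest = numbers[0]           # raises IndexError on an empty list
    for n in numbers[1:]:
        if n > largest:
            largest = n
    return largest

def model_biggest(numbers):
    return sorted(numbers)[-1]     # also undefined for an empty list

# Generate lists (including the empty one) and compare the implementation
# against the model.
failures = []
for _ in range(1000):
    numbers = [random.randint(-100, 100) for _ in range(random.randint(0, 10))]
    try:
        assert biggest(numbers) == model_biggest(numbers)
    except IndexError:
        failures.append(numbers)   # only the empty list ends up here

# Every failure is the empty-list edge case, for which the contract must
# still be decided by the developer.
assert all(f == [] for f in failures)
```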
Modeling is a particularly useful approach when implementing data structures, since nearly all data structures aim to improve over a simpler, less efficient one based on some usage pattern. Nearly all dictionaries or maps can be compared to simple lists of keys and values, and those can also be used to check special operations that are more typical of trees and heaps (such as finding the next smallest element). This allows developers to slowly but surely grow a zoo of data structures that increase in complexity, but remain simple in their test cases.
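As a sketch of the idea in Python (names hypothetical), a built-in dict can play the role of the "real" structure while an association list of pairs acts as the obviously correct model:

```python
import random

# Model: an association list of (key, value) pairs, where inserting a key
# replaces any previous binding. Obviously correct, but O(n) per operation.
def model_put(pairs, key, value):
    return [(k, v) for (k, v) in pairs if k != key] + [(key, value)]

def model_get(pairs, key):
    for k, v in pairs:
        if k == key:
            return v
    return None

# Implementation under test: Python's built-in dict stands in here for a
# fancier structure such as a hash map or a balanced tree.
for _ in range(500):
    d, pairs = {}, []
    for _ in range(random.randint(0, 30)):
        key = random.randint(0, 5)        # small key space forces overwrites
        value = random.randint(0, 100)
        d[key] = value
        pairs = model_put(pairs, key, value)
    for key in range(6):
        assert d.get(key) == model_get(pairs, key)
```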
Modeling is also useful for integration tests of stateful systems with lots of side effects or dependencies, where "how the system does something" is complex, but "what the user perceives" is simple. Good libraries and systems often hide such complexity from the user in order to be usable at all. Databases, for example, can do a lot of fancy operations to maintain proper transactional semantics, avoid loss of data, and keep good performance, but a lot of their operations can be modeled with simple in-memory data structures.
Finally, there’s a rare but great type of modeling that may be available, called "the oracle". An oracle is a reference implementation, possibly coming from a different language or software package, and can therefore act as a pre-written model. The only thing required of the programmer to test the system is to compare their implementation with the oracle and ensure the same results are given.
Modeling tends to work well, with the caveat that it requires writing the same program multiple times, with one of the implementations so simple it is obviously correct. This is not always practical, and sometimes not possible. Going back to properties then becomes a necessity.
The first step then is to find better properties, a task often significantly harder than finding any property at all, which can already prove difficult and requires a solid understanding of the problem space. A good trick for finding a property is simply to word what the piece of code should do, but differently.
For a function like sort(), the basic behaviour could be "a sorted list has all the numbers in it ordered from smallest to largest". This property is often circular, since that’s exactly what the function tries to do. Wording the property differently may result in something like "each number in a sorted list is smaller than (or equal to) the one that follows". The difference is small, but important. In the first case, the property states the final state of the entire list: the intended outcome. In the second case, the property states an invariant, a fact that should remain true at all times and can be observed on any part of the list at any given time.
By applying that rule to every element of the list, the entire output can be validated. The implementation is however almost guaranteed to be very different from the test.
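The invariant version of the property is short to state in code; here is a Python sketch against the standard library's sorted(), with random lists standing in for generated ones:

```python
import random

def is_ordered(numbers):
    # Invariant: each number is smaller than (or equal to) the one that
    # follows, checked pairwise over the whole list.
    return all(a <= b for a, b in zip(numbers, numbers[1:]))

for _ in range(1000):
    numbers = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    assert is_ordered(sorted(numbers))
```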
Additional properties could then be found by looking for other invariants or things we expect to always be true, such as:
- the sorted and unsorted lists should both remain the same size
- any element in the sorted list has to have its equivalent in the unsorted list
- any element in the unsorted list has to have its equivalent in the sorted list (this is different!)
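These invariants can be gathered into one Python helper (a sketch, with hypothetical names); note that the three checks alone still let mismatched duplicates slip through, which a multiset comparison would catch:

```python
import random
from collections import Counter

def check_sort_properties(unsorted, sorted_list):
    # The three invariants listed above:
    assert len(sorted_list) == len(unsorted)           # same size
    assert all(x in unsorted for x in sorted_list)     # sorted -> unsorted
    assert all(x in sorted_list for x in unsorted)     # unsorted -> sorted
    # The three checks above still allow duplicates to be swapped for other
    # existing values (e.g. [1,1,2] against [1,2,2]); comparing multisets
    # closes that hole.
    assert Counter(sorted_list) == Counter(unsorted)

for _ in range(500):
    numbers = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    check_sort_properties(numbers, sorted(numbers))
```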
Eventually though, a point will be reached where the functions to test are so basic it is too hard to come up with properties. These functions could be considered axiomatic (you just trust them to be true and correct), acting as fundamental blocks of the system. In other cases, falling back to traditional unit testing as a supplementary insurance policy around property-based testing is entirely acceptable. Property-based testing is not a replacement for all testing, just a new tool that can often improve results.
In fact, traditional tests can also be useful in figuring out properties. The last() function could have the following unit tests:
5 = last([1,2,3,4,5])
3 = last([5,4,3])
When writing such tests by hand, the author must:
- construct a list
- pick the value desired to be last
- add that value at the end of the list
- check that the value expected is the one obtained
In doing so, one could write a property that looks like:
given a generated list L:    (1)
    N = random_number()      (2)
    NewL = append(L, N)      (3)
    N =:= last(NewL)         (4)

(1) a list is constructed (generated by the framework)
(2) a value chosen to be last is generated
(3) the value is added to the end of the list
(4) the value is checked to be what was expected
This is a valid property test, although there is now a dependency on the append() function. The same choice must be repeated: is it possible to make a model out of it? Can it be turned into a simpler property, or just tested with traditional unit tests?
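The property reads almost identically in executable form; here is a Python sketch with hypothetical last() and append() helpers:

```python
import random

def last(items):
    return items[-1]

def append(items, item):
    return items + [item]

# Property from the steps above: appending N to any generated list makes
# N the last element.
for _ in range(1000):
    items = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    n = random.randint(-100, 100)
    assert last(append(items, n)) == n
```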
There is a third type of property: symmetric properties. These tend to work well whenever two pieces of code are written that do opposite actions, such as an encoder and a decoder. The trick is to test them together; if one action is the opposite of the other, then applying both operations should yield the initial input as the final output. Of course, the chains of operations can be longer than just two items.
A pseudocode example of such a property could look like:
given a generated map or dictionary MapOrDict:
MapOrDict =:= decode(encode(MapOrDict))
This will demonstrate that MapOrDict can make a round trip of encoding and decoding without error. Proponents of Test-Driven Development who are fans of the "make the test pass really simply, and then refactor" approach will be able to defeat this test by writing an implementation like:
encode(T) -> T.
decode(T) -> T.
The property will pass all the time. The problem is that the property chosen is useful, but not sufficient on its own to be a good test of encoding and decoding.
An additional property of encoding could be added:
given a generated map or dictionary MapOrDict:
    true =:= is_binary(encode(MapOrDict))
Assuming that the encoder should yield string-encoded data, this property can combine with the round-trip one to ensure at least some transformation has taken place during the round trip. Other properties may include ideas like "only ASCII characters are used", or "the returned binary value has proper byte alignment". Each one of them would be very short to write, yet provide enormous coverage for all kinds of errors. A few example cases with traditional tests to double-check formatting could be added on top, and the result would be a solid test suite.
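As a sketch of how such properties combine, Python's json module can stand in for the encoder/decoder pair (any real codec would do), with the random module generating the maps:

```python
import json
import random
import string

def random_map():
    # Generate a small dict with random short ASCII keys and integer values.
    return {''.join(random.choices(string.ascii_letters, k=5)): random.randint(-100, 100)
            for _ in range(random.randint(0, 10))}

for _ in range(200):
    m = random_map()
    encoded = json.dumps(m)
    # The round-trip property alone is defeated by an identity codec...
    assert json.loads(encoded) == m
    # ...so add the second property: the encoded form must be a string,
    # forcing an actual transformation to have taken place.
    assert isinstance(encoded, str)
    # A further idea from the text: only ASCII characters are used
    # (json.dumps escapes non-ASCII by default).
    assert encoded.isascii()
```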