Case Study: Stateless Properties with a Checkout

7. Case Study: Stateless Properties with a Checkout

Test-Driven Development proponents frequently like the Red-Green-Refactor approach^[1], which implies developing in a cycle of thinking, then writing a test, then seeing it fail, then making it pass, and then refactoring (and then rinse and repeat). This chapter puts these principles into practice through a code kata called back to the checkout.

7.1. The Specification

The exercise asks to implement a checkout counter that can give the price of items scanned, with the caveat that some items are cheaper per unit if they are bought in bulk. The following sample table is given:

Item   Unit      Special
       Price     Price
--------------------------
  A     50       3 for 130
  B     30       2 for 45
  C     20
  D     15

The checkout function should accept items in any order, so that the sequence ["B","A","B"] would give the 2 for 45 rebate for the Bs being bought.

The only call that needs to be exposed is checkout:total(ListOfItemsBought, UnitPrices, SpecialPrices) → Price.

7.2. Planning the Tests

The requirements are vague, but the following can be extracted:

items are identified by unique names (non-empty strings)
the list of items has no set length (though it appears non-empty), and the items it contains are in no specific order
the unit prices appear to be integers
the special prices are triggered only once the right number of items of the proper type can match a given special
all items have a unit price
it is not mandatory for a given item to have a special price.

The rest is seemingly undefined or underspecified.

Some properties that can be extracted from the requirements include:

without any special prices defined, the total is just the sum of each item’s price
with a known count for each item, the total comes from two parts added together:
1. the special price times the number of times it can be applied to a given set of items
2. the remainder (not evenly divisible by the special’s required number of items) is just summed up

7.3. Writing a Test

The first test will require to count the total for the checkout without specials. To avoid a test definition such as sum(ItemList, PriceList) =:= checkout:total(ItemList, PriceList, []), which has the test doing roughly the same thing as the implementation, a generator will be written to return the expected price so that the property is not too circular:

prop_no_special1() ->
    ?FORALL({ItemList, ExpectedPrice, PriceList}, item_price_list(),
            ExpectedPrice =:= checkout:total(ItemList, PriceList, [])).

The generator for the property will need to generate the three expected arguments: a list of items bought by the customer (ItemList), the expected price of those items, and then the list of items with their prices as expected by the register itself (PriceList). Since the price list is required to generate the item list and expected prices, the generator will need to come in layers with ?LET macros:

item_price_list() ->
    ?LET(PriceList, price_list(),
         ?LET({ItemList, ExpectedPrice}, item_list(PriceList),
              {ItemList, ExpectedPrice, PriceList})).

The price list itself is a list of tuples; the list is actualized through the ?LET macro so that the item_list/1 generator can use it as the actual Erlang data structure rather than the abstract intermediary format PropEr uses:

%% generate a list of {ItemName, Price} to configure the checkout
price_list() ->
    ?LET(PriceList, non_empty(list({non_empty(string()), integer()})),
         lists:ukeysort(1, PriceList)). % remove duplicates

The item_list/1 generator just needs to pick one item from the list at a time, and add their prices together:

%% set up recursive generator for the purchased item list along with
%% its expected price, based off the price list.
item_list(PriceList) ->
    ?SIZED(Size, item_list(Size, PriceList, {[], 0})).

item_list(0, _, Acc) -> Acc;
item_list(N, PriceList, {ItemAcc, PriceAcc}) ->
    ?LET({Item, Price}, elements(PriceList),
         item_list(N-1, PriceList, {[Item|ItemAcc], Price+PriceAcc})).

Running the tests will show them to fail. A fix for it is a basic implementation:

-module(checkout).
-export([total/3]).

-type item() :: string().
-type price() :: integer().

-spec total(list(item()), list({item(), price()}), any()) -> price().
total(ItemList, PriceList, _Specials) ->
    lists:sum([proplists:get_value(Item, PriceList) || Item <- ItemList]).

Which passes successfully.

7.4. Validating Generators

To validate generators, using the collect/2 function to gather metrics is a good way to start:

prop_no_special2() ->
    ?FORALL({ItemList, ExpectedPrice, PriceList}, item_price_list(),
            collect(bucket(length(ItemList), 10),
                    ExpectedPrice =:= checkout:total(ItemList, PriceList, []))).

bucket(N, Unit) ->
    (N div Unit) * Unit.

Revealing the metrics:

This appears reasonable. Alternatively running the call with collect(lists:usort(ItemList), <Property>) shows good results as well.

7.5. Handling Specials

The TODO list looks like this at this point:

items are identified by unique names (non-empty strings)
the list of items has no set length (though it appears non-empty), and the items it contains is in no specific order
the unit prices appear to be integers
the special prices are triggered only once the right number of items of the proper type can match a given special
all items have a unit price
it is not mandatory for a given item to have a special price. (tested implicitly)

The special prices property is the last one left to check, but it is a more complex one. The way the existing properties have been written means they should not require changes for this phase of the project, and a new one can be added instead:

prop_special() ->
    ?FORALL({ItemList, ExpectedPrice, PriceList, SpecialList},
            item_price_special(),
            ExpectedPrice =:= checkout:total(ItemList, PriceList, SpecialList)).

The property is similar to the one before, but expects a fourth generated term, which is a list of special prices (SpecialList). Generating the list should be done dynamically, but it will need to rely on the price list to work. That may just take one more ?LET nested into the structure to work. The challenge will be to properly generate the expected price while accounting for specials and while ensuring non-special items keep working in a mixed mode.

A generator that does not pay attention to that mixed mode would create problems. For example, if the checkout counter sells 3 donuts for the price of 2, but the generator created a list of 4 donuts without expecting a special, a correct implementation would account for them and the test would fail. Not a problem since tests can be corrected, but better realize it sooner than later.

An interesting work-around can be discovered by looking at how lists are being generated. As long as a given type of item is generated in a quantity smaller than required for the special price, the special won’t trigger. Similarly, as long as the items are generated in an even multiple of the quantity required for a special, the special will trigger. It follows that both types of lists will never clash even if they are generated independently, and yet they will still merge fine. If a list of 2 donuts at a non-special price is generated, and then a list of 6 donuts (twice the special quantity) is generated for the specials, then a list containing 8 donuts and the sum of both expected prices remains correct.

This should work fine. The generator should therefore have this kind of flow:

Diagram showing control flow of the generator

Giving the following generator:

The last two steps can be done within the same ?LET expression:

item_price_special() ->
    %% first LET: PriceList
    ?LET(PriceList, price_list(), (1)
         %% second LET: SpecialList
         ?LET(SpecialList, special_list(PriceList), (2)
              %% third LET: Regular + Special items and prices
              ?LET({{RegularItems, RegularExpected},
                    {SpecialItems, SpecialExpected}},
                   {regular_gen(PriceList, SpecialList), (3)
                    special_gen(PriceList, SpecialList)}, (4)
                   %% And merge + return initial lists:
                   {RegularItems ++ SpecialItems, (5)
                    RegularExpected + SpecialExpected,
                    PriceList, SpecialList}))). (6)

1	Generate a list of prices for items and make it discrete with a `?LET` so it can be reused
2	Generate a list of specials that may hit some of these items and make it discrete with a `?LET` so it can be reused
3	Generate a list of items that will not qualify for specials and account their expected prices
4	Generate a list of items that will definitely qualify for specials and sum up their expected prices
5	Merge both lists of items and sum up both expected prices
6	Return the merged list and combined expected price, along with the list of acceptable items and specials

The last two steps can be combined in a single ?LET. There are also three generators used within this one that are new (the item list is being reused from the previous section). The first one is special_list:

%% Generates specials in a list of the form
%% [{Name, Count, SpecialPrice}]
special_list(PriceList) ->
    Items = [Name || {Name, _} <- PriceList],
    ?LET(Specials, list({elements(Items), choose(2,5), integer()}),
         lists:ukeysort(1, Specials)). % no dupes

The chosen format for the list of specials contains the name of the item, how many are required for the special to start counting, and the price for the whole group. The specials are generated from randomly selected (and deduplicated) entries from the item list so that there is always a matching non-special item to any special one. Note that it is possible for the special price to be higher than the non-special price with this generator.

The two other generators are those having to do with generating lists composed strictly of items that are not on special, and then strictly of items that are on special. The first one follows:

%% Generates lists of regular items, at a price below the special value.
regular_gen(PriceList, SpecialList) ->
    regular_gen(PriceList, SpecialList, [], 0).

regular_gen([], _, List, Price) ->
    {List, Price}; (7)
regular_gen([{Item, Cost}|PriceList], SpecialList, Items, Price) -> (1)
    CountGen = case lists:keyfind(Item, 1, SpecialList) of
        {_, CountReq, _} ->
            %% The item can be on special. Generate fewer than needed
            choose(0, CountReq-1); (2)
        _ ->
            %% Item not on specials. Generate at will
            non_neg_integer() (3)
    end,
    %% Use the conditional generator to generate items
    ?LET(Count, CountGen, (4)
         regular_gen(PriceList, SpecialList,
                     ?LET(V, vector(Count, Item), V ++ Items), (5)
                     Cost * Count + Price)). (6)

1	The generator strictly iterates over the entire list to generate items
2	If the item is found to be in the list of specials, the maximal value generated is the special requirement minus one
3	If the item is not in the special lists, 0..N are generated. It is always possible for an item not to be bought
4	The counter generator returned for points 2 and 3 is actualized
5	The counter generated is used to create a vector (a list) of that many entries, which is appended to the accumulator of bought items
6	The cost is calculated through the same counter, also added to the accumulator.
7	The accumulator is returned.

The specials generator is a bit simpler, although it is based on the same model:

special_gen(_, SpecialList) ->
    %% actually do not need the item list
    special_gen(SpecialList, [], 0).

special_gen([], Items, Price) ->
    {Items, Price};
special_gen([{Item, Count, Cost} | SpecialList], Items, Price) ->
    %% Generate sequences of items equal to the special, based on a
    %% multiplier. If we have a need for 3 items for a special, we can
    %% generate 0, 3, 6, 9, ... of such items at once
    ?LET(Multiplier, non_neg_integer(),
         special_gen(SpecialList,
                     ?LET(V, vector(Count * Multiplier, Item), V ++ Items),
                     Cost * Multiplier + Price)).

The major difference is that it only needs to care about the list of specials, and can return any number of matching items as long as it’s a multiple of the specials count. Running the property unsurprisingly yields a failure:

===> Testing prop_checkout:prop_special()
...!
Failed: After 4 test(s).
{[[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0],[0]],0,[{[0],1}],[{[0],4,0}]}

Shrinking ....(4 time(s))
{[[0],[0]],0,[{[0],1}],[{[0],2,0}]}

The tests fail whenever a list of specials is passed to the function. Implementing the specials accounting should help. A simple way to do it can be to count how many of each items are in the list. Account for the specials first, reducing the count every time the specials apply, and then run the list over the items that are not on sale:

-type item() :: string().
-type price() :: integer().
-type special() :: {item(), pos_integer(), price()}.

-spec total(list(item()), list({item(), price()}), any()) -> price().
total(ItemList, PriceList, Specials) ->
    Counts = count_seen(ItemList),
    {CountsLeft, Prices} = apply_specials(Counts, Specials),
    Prices + apply_regular(CountsLeft, PriceList).

The three helper functions are as follows:

-spec count_seen([item()]) -> [{item(), pos_integer()}].
count_seen(ItemList) ->
    Counter = fun(X) -> X+1 end, (1)
    maps:to_list( (3)
      lists:foldl(fun(Item, Map) -> maps:update_with(Item, Counter, 1, Map) end, (2)
                  maps:new(), ItemList)
    ).

1	A counter function is defined, and it only increments a value with one
2	The function is passed to `maps:update_with/4`, which allows to count each item seen
3	The map is returned as a key/value list for convenience

-spec apply_specials([{item(), pos_integer()}], [special()]) ->
    {[{item(), pos_integer()}], price()}.
apply_specials(Items, Specials) ->
    lists:mapfoldl(fun({Name, Count}, Price) -> (1)
        case lists:keyfind(Name, 1, Specials) of
            false -> % not found
                {{Name, Count}, Price}; (2)
            {_, Needed, Value} ->
                {{Name, Count rem Needed}, (3)
                 Value * (Count div Needed) + Price}
        end
    end, 0, Items).

1	The mapfold function allows to both map over a list (apply a function for every value of the list and build a new list from them), and to fold over it (return a single accumulator)
2	The returned value to the left is the one to be used as a map operation, the one on the right of the tuple as a fold accumulator
3	The special is applied; the remaining items are put back into the list of counters after being decremented, and the price is incremented to show the price of specials

-spec apply_regular([{item(), pos_integer()}], [{item(), price()}]) -> price().
apply_regular(Items, PriceList) ->
    lists:sum([Count * proplists:get_value(Name, PriceList)
               || {Name, Count} <- Items]).

The last function simply sums up the cost of leftover items, without specials. The property passes, and the full check list is covered:

items are identified by unique names (non-empty strings)
the list of items has no set length (though it appears non-empty), and the items it contains is in no specific order
the unit prices appear to be integers
the special prices are triggered only once the right number of items of the proper type can match a given special
all items have a unit price
it is not mandatory for a given item to have a special price.

7.6. Sanity Checks

Aside from looking at the distribution of results with collect/2 and aggregate/2, a few more tools can be interesting as a way to ensure the tests are exercising code properly. The first of them is code coverage.

Code coverage is simply a measure of how many lines of code have been executed during the test run, in order to get an idea of which areas of the code base might be under-tested. For property-based testing with PropEr, metrics can be obtained by calling:

→ rebar3 do proper -m prop_checkout -c, cover -v
[...]
===> Performing cover analysis...
  |------------------------|------------|
  |                module  |  coverage  |
  |------------------------|------------|
  |              checkout  |      100%  |
  |------------------------|------------|
  |                 total  |      100%  |
  |------------------------|------------|

100% coverage always looks good, but isn’t necessarily so. Lines of code can be executed in ways where they always run in the same manner even though alternative execution flows are possible on each line. Again, coverage should not be seen as a way to figure out whether everything is tested, but rather as a way to identify whether some parts of the code base are definitely untested.

A fairly efficient and simple way to gauge the quality of properties on top of the other methods seen is to willingly introduce a bug in the code base, and then seeing whether the tests can find about it. There will be a surprising number of times where replacing a + with a - or a < with a > will seemingly have no impact on the outcome of tests, which can be worrisome. Figuring out if the code is dead or the tests woefully inadequate is a worthwhile exercise in such cases.

7.7. Negative Tests

An interesting aspect of the code and tests so far is that the list of items being bought is generated through the list of supported items. It means that by the virtue of having designed the test case from the price list, there is a lack of tests looking for unexpected usage. Such tests try to exercise specifically underspecified paths rather than only well-defined ones.

Rather than generalizing the current properties and making them more complex, it is a good idea to add specific wide-reaching properties that validate different aspects of the code. This also avoids searching only for expectable problems the way it is usually done with traditional tests: limitations to this approach are those of the developer writing the tests, which is underwhelming. Instead, a good idea is to take a fuzz-like approach and to throw a lot of reasonable-looking garbage at the system to see what happens.

Broad properties are not too useful on their own, but as a supplement they’re great. The basic property looks like:

prop_expected_result() ->
    ?FORALL({ItemList, PriceList, SpecialList}, lax_lists(),
        try checkout:total(ItemList, PriceList, SpecialList) of
            N when is_integer(N) -> true
        catch
            _:_ -> false
        end).

The only property it enforces is "the function returns an integer and does not crash in unexpected ways". A good way to start is through permissive generators and explore from there:

lax_lists() ->
     {list(string()), list({string(), integer()}),
      list({string(), integer(), integer()})}.

Running the property returns an early failure:

===> Testing prop_checkout:prop_expected_result()
.!
Failed: After 2 test(s).
{[[]],[],[]}

Shrinking (0 time(s))
{[[]],[],[]}

The PropEr representation of strings is lacking, but if a single item is used (with name ""), then the lookups fail. The price of an unknown item cannot be used in the calculation, and the function bails out. This is fine, but could give a clearer exception. A programmer error could be raised instead, alerting to the true nature of the problem:

prop_expected_result() ->
    ?FORALL({ItemList, PriceList, SpecialList}, lax_lists(),
        try checkout:total(ItemList, PriceList, SpecialList) of
            N when is_integer(N) -> true
        catch
            error:{unknown_item, _} -> true;
            _:_ -> false
        end).

With the attached patch:

-spec apply_regular([{item(), pos_integer()}], [{item(), price()}]) -> price().
apply_regular(Items, PriceList) ->
    lists:sum([Count * cost_of_item(Name, PriceList)
               || {Name, Count} <- Items]).

cost_of_item(Name, PriceList) ->
    case proplists:get_value(Name, PriceList) of
        undefined -> error({unknown_item, Name});
        Price -> Price
    end.

The bug is fixed, and running the property again reveals no more bugs. Before claiming victory, validating the distribution of generators is required. Looking at the distribution of results with collect/2 will shine some light. Gathering metrics for the error case at hand can be done with collect(lists:all(fun(X) → proplists:get_value(X, PriceList) == undefined end, ItemList), Property), and it reveals that 92% of all test runs contain no overlap between the item list and the price list, meaning they all test the case recently fixed.

With the current properties, the view represented is:

[Perfect happy case] <-a----------------------b-> [nothing works]

Right now, the positive properties give a view of a, and the negative one shows b, but not necessarily much on the gradient in-between. A trick for such cases is to make a kind of 'hybrid' generator that has some constraints, but also is allowed to err. This can be done by mixing expected cases:

lax_lists() ->
    KnownItems = ["A", "B", "C"],
    MaybeKnownItemGen = elements(KnownItems ++ [string()]),
    {list(MaybeKnownItemGen), list({MaybeKnownItemGen, integer()}),
     list({MaybeKnownItemGen, integer(), integer()})}.

Using this generator ensures that some known items and some unknown items will be there, but not always. Running it fortunately reveals a bug:

................!
Failed: After 17 test(s).
{[[66],[65],[67],[65],[67]],[{[0,1,18],2},{[66],7}],[{[66],-4,-7},{[65],0,-6},{[66],2,-1},{[66],7,-5},{[65],0,-7},{[65],-1,-1}]}

Shrinking ...........(11 time(s))
{[[65]],[],[{[65],0,0}]}

Whenever a special requires 0 items, a division by zero error happens. It’s surprising that it does not happen in other properties, but at least it is caught in one. Fixing it will require validating the specials. First the property, and then the code:

prop_expected_result() ->
    ?FORALL({ItemList, PriceList, SpecialList}, lax_lists(),
        try checkout:total(ItemList, PriceList, SpecialList) of
            N when is_integer(N) -> true
        catch
            error:{unknown_item, _} -> true;
            error:invalid_special_list -> true; (1)
            _:_ -> false
        end).

1	Adding the expected error when a 0 count is submitted

valid_special_list(List) ->
    lists:all(fun({_,X,_}) -> X =/= 0 end, List).

-spec total(list(item()), list({item(), price()}), any()) -> price().
total(ItemList, PriceList, Specials) ->
    valid_special_list(Specials) orelse error(invalid_special_list), (1)
    Counts = count_seen(ItemList),
    {CountsLeft, Prices} = apply_specials(Counts, Specials),
    Prices + apply_regular(CountsLeft, PriceList).

1	If the list is invalid, the error is raised.

Finally, the bugs stop cropping up. There are still potential problems though. The space between a and b is still wide to unknown extents. An easy sanity check is to relax the constraints made in the requirements planning (in bold) to figure out what happens when this breaks:

items are identified by unique names (non-empty strings)
the list of items has no set length (though it appears non-empty), and the items it contains is in no specific order
the unit prices appear to be integers
the special prices are triggered only once the right number of items of the proper type can match a given special
all items have a unit price
it is not mandatory for a given item to have a special price.

Those are three constraints to relax. The first one can be easy to verify by relaxing the existing generators:

%% generate a list of {ItemName, Price} to configure the checkout
price_list() ->
    ?LET(PriceList, non_empty(list({non_empty(string()), integer()})),
         lists:keysort(1, PriceList)). % allow duplicates
              (1)

1	`ukeysort` was replaced by `keysort`, removing the uniqueness constraint

This modification ends up killing one of the properties real fast. Duplicates are not supported, but the generator implementation hid that fact. These assumptions are baked into the model, but not necessarily visible to the user. In fact it would be nice to let them know if the list is valid without needing to buy anything. A quick property can be added to check it:

prop_dupe_list_invalid() ->
    ?FORALL(PriceList, dupe_list(),
        false =:= checkout:valid_price_list(PriceList)). (1)

dupe_list() ->
    ?LET(Items, non_empty(list(string())),
         vector(length(Items)+1, {elements(Items), integer()})).

1	A new function is expected to be added to the interface for the users' sake

And an even simpler function can be written to check it:

-spec valid_price_list([{item(), price()}]) -> boolean().
valid_price_list(List) ->
    length(List) =:= length(lists:ukeysort(1, List)).

If a list that sorts and removes duplicate keys has a different length from the original one, the list is invalid since it had to contain dupes. The validator can be wired up inside the main function to avoid accidentally miscalculating prices (better safe than sorry):

-spec total(list(item()), list({item(), price()}), any()) -> price().
total(ItemList, PriceList, Specials) ->
    valid_price_list(PriceList) orelse error(invalid_price_list),
    valid_special_list(Specials) orelse error(invalid_special_list),
    [...]

And all of a sudden, the tests fail!

===> Testing prop_checkout:prop_expected_result()
.........!
Failed: After 10 test(s).
{[[65],[1,4],[66],[67],[4]],[{[67],3},{[65],-9},{[67],0},{[65],-18}],[{[65],-9,-2},{[65],-1,-3},{[66],-2,-2},{[65],4,-9}]}

Shrinking ..............(14 time(s))
{[],[{[65],0},{[65],0}],[]}

The test can be fixed by adding a handler for the same kind of error, and the same anti-duplicate test can also be added for the special list:

prop_expected_result() ->
    ?FORALL({ItemList, PriceList, SpecialList}, lax_lists(),
        try checkout:total(ItemList, PriceList, SpecialList) of
            N when is_integer(N) -> true
        catch
            error:{unknown_item, _} -> true;
            error:invalid_price_list -> true;
            error:invalid_special_list -> true;
            _:_ -> false
        end).

prop_dupe_specials_invalid() ->
    ?FORALL(SpecialList, dupe_special_list(),
        false =:= checkout:valid_special_list(SpecialList)).

dupe_special_list() ->
    ?LET(Items, non_empty(list(string())),
         vector(length(Items)+1, {elements(Items), integer(), integer()})).

And the matching code:

valid_special_list(List) ->
    lists:all(fun({_,X,_}) -> X =/= 0 end, List) andalso
    length(List) =:= length(lists:ukeysort(1, List)).

Everything now passes, and users can validate their inputs before running the code.

Even then, more digging could be done. That was just one of three assumptions being tested. Relaxing the other properties could also reveal interesting things:

Not handling unit prices for some items (or specials) creates unexpected crashes, and more interestingly, replacing integers with numbers (specifically floats) yield multiple failures because the current implementation uses rem and div, two integer-specific operators.
Negative numbers allow people to get paid to walk away with items, and currently no validation is done for that
Passing in non-numeric values in the price list or specials list is considered valid by the code

The spec was very vague though, so those discoveries would have to be discussed with stakeholders to figure out what is acceptable or not before the code hits production. A strict interpretation of the specification will mean the implementation is sufficient. A lax one will find it exploding for multiple cases that may or may not be preposterous.

Of course, in a statically typed language (or in the case of Erlang, with Dialyzer’s type analysis), the strict interpretation of the spec is the only one accepted. The three potential bugs above do not even register as a possibility, and the current state of affairs is very likely acceptable. If strict assumptions are made, and that they are being enforced by manual checks, the compiler, and/or static code analysis, then the program should not get into unexpected states. It may be inflexible but it will not be wrong. At least not on these bugs.

7.8. Weaving of Test Types

This chapter went through three types of tests in order: regular positive tests, tests that aim to look at broad cases to find unexpected bugs, and negative tests, designed ot check the reliability of the suite. In practice there should be no need for such a clear cut between phases; doing all of them iteratively and continuously tends to work very well. Waiting late for negative testing could mean too many tests are not actually useful. Testing and validating hypothesis early and as frequently as possible is never a bad idea. Looking for broad cases early can help identify unexpected problems that prove a design wrong; finding mistakes early is cheaper than finding them late since less code has been written in the wrong direction.

Property-based testing is a very useful tool to do exploration of what code actually does. It is a worthwhile exercise to start poking at things broadly and to then narrow down the results of generators. It is also worthwhile to start narrow and then widen the generators. The order does not matter, but exploring does.

1. http://www.jamesshore.com/Blog/Red-Green-Refactor.html

PropEr Testing