Errors when thinking about degrees of certainty
by Matthew Leitch, 17 May 2002
The previous section argued that in many everyday situations we should think of many things as being uncertain to some extent. Some more common errors arise in this sort of reasoning. Some are the result of the limitations of ordinary language.
Quantitive vagueness is a big problem if you need to be precise and worse if you have had to define a phrase specifically for some special purpose such as when discussing a business decision. The two main problems are:
Vagueness as to the boundaries of the range the phrase is intended to indicate. How high is "high"? How often is "often"? And so on.
Excessive quantisation. The categories can cover such a wide range that they are not precise enough for the purpose to which they are being put.
Example: Risk assessment. Various methods of assessing and managing risks in organisations and projects have been invented and many involve assessing risks as "High", "Medium", or "Low". These categories are sometimes defined in a quantitive way using probabilities or frequency of occurrence, but this is often missed out. Consequently each person has their own idea of what is meant and this will vary from one risk to another as described above. For example, someone may say there is a high risk of a project being completed later than planned (90% certainty in mind because this is normal for projects generally) and a high risk of the project sponsor being arrested by the police (40% in mind because the sponsor is a notoriously dodgy character making this risk unusually high for this project). It may also be the case that risks put into the same range are actually orders of magnitude different in their likelihood e.g. risk of reactor meltdown might be "Low" and so is risk of industrial action, but the first is 1,000 times less likely than the second. The problems of quantitive vagueness are often compounded by asking questions like "What is the risk that sales will be low in the first year?" which combines the vagueness of risk categories with the vagueness of what "low" means in relation to sales. This kind of thing is usually excused on the grounds that it is "simpler" than using numbers, but on that logic the simplest approach would be not to think at all.
Floating numerical scales: Just as a word can be quantitively vague if not defined with some sensible measurements, so invented numerical scales can be treacherous if not tied to some established measurement system. For example, asking people to rate risks as to likelihood on a scale from 1 to 10 without linking those numbers to probabilities usually causes problems:
For the scale to relate directly and easily to probabilities it would have to be 0 to 10, not 1 to 10.
Many people do not assume that the numbers on the scale are 10 x probability and their ratings can be bizarre. Some people tend to group their ratings in the middle, while others prefer the extremes of the scale. When a group of people are trying to agree some ratings this makes the ratings very confusing. [This can be reduced by mathematically normalising the spread of each individual's ratings, if each person makes enough of them to get a good idea of their tendencies.]
Ratings of other quantities do not have the advantage of an existing scale i.e. probability, which ranges from 0 (definitely no) to 1 (definitely yes).
Other errors arise when we rely on judgements of quantities such as likelihood, or try to combine those judgements using yet more judgement. This kind of judgement is often necessary in the absence of more objective information, and it is possible to learn to give reasonably accurate probability judgements in a narrow field with long practice, good feedback, and some skill. However, there are also several errors that can undermine reasoning.
This area has been studied intensively for several decades by psychologists. There are many, many scientific articles on the subject and debate about why people make the errors they do is particularly complex and hard to follow. The research has also been picked up by authors who have taken the names of the theories and misunderstood and over-generalised them in a highly misleading way. Instead of leading with the theories and risking over-generalisation, I present below some examples of questions used in experiments, the answers people usually give, and why they are wrong.
If you've never come across this research before, prepare your ego for a battering. If you get to the end of these examples and you're still thinking "Surely this is some mistake?", forget it. Thousands of people have been through the same thoughts you are having, but there have been so many studies now that you can be pretty confident that anything you can think of as an objection has been tried out and failed. The fact is we're mostly stupid. Get used to it.
| Question | Typical answers | Why they are wrong |
|---|---|---|
| Imagine we have a group of 100 professional men, a mixture of 70 lawyers and 30 engineers. Now we select one at random. What is the probability that he is an engineer? Now suppose we are told that the man is 30 years old, married with no children, has high ability and motivation and promises to do well in his field. He is liked by his colleagues. What would you say was the probability that he is an engineer? Finally, we learn that he builds and races model cars in his spare time. What now is your opinion as to the probability of his being an engineer? | Most people answer the first of the three probability questions with "30%", which is correct. Given some information but none that helps decide if the person is an engineer or a lawyer people tend to say it's "50:50". Given some information that suggests an engineer, they give a number derived purely from the extent to which the information seems to represent an engineer rather than a lawyer, without regard to the proportion of engineers in the group. For example, building and racing model cars seems to suggest an engineer to me, so I would say "70%". | The first answer is quite correct. The second answer should be the same as the first, but the snippet seems to trigger us into judging probabilities purely on the basis of how much the information seems to represent a stereotypical engineer rather than a stereotypical lawyer. The problem is not so much the use of stereotypes but the failure to consider the proportion of engineers in the group as a whole, which is still an important factor when specific facts about the person selected are not conclusive. The third answer should reflect the original proportion of engineers as well as the biographical clues. In this case it should be lower than the answer based just on model cars. |
| Imagine a town has two hospitals with maternity wards. In the larger hospital about 45 babies are born daily, but in the smaller hospital only about 15 are born each day. Of course about 50% of babies born are boys, though the exact percentage varies from day to day. For a year, each hospital recorded the days on which more than 60 percent of the babies born were boys. Which hospital do you think recorded the most such days? | Most people say "about the same" and other answers are evenly split between the large and small hospital. In other words people generally don't think size matters. | Size does matter. An elementary conclusion of statistics is that the larger a sample is the more closely it typically resembles the population from which it is taken. The larger hospital will have far fewer days when more than 60% of babies born are boys. In general people seem to have no concept of the importance of sample size. |
| Imagine a huge bag into which we cannot see, filled with lego bricks of two colours, blue and yellow. 2/3 are one colour and 1/3 are the other colour but we don't know which is which. One person pulls out five bricks at random and finds 4 are blue and one yellow. Another person grabs a bigger handful at random and has 20 bricks, of which 12 are blue and 8 are yellow. Which person should be more confident that the bag contains 2/3 blue bricks and 1/3 yellow bricks, rather than the opposite? | Most people say the handful with 4 blues and one yellow is more conclusive. | Again, sample size matters but is ignored. The actual odds are twice as strong from the larger handful because of the value of a larger sample, even though it does not contain such a high ratio of blue bricks. Furthermore, when asked to give the actual odds people usually underestimate the confidence given by the larger sample. |
| A coin is to be tossed six times. Which sequence of outcomes is more likely: H-T-H-T-T-H or H-H-H-T-T-T ? | The second sequence does not look random so people say the first sequence is more likely. | The sequences are equally likely. However, we have definite views about what a random result looks like. We expect it to be irregular and representative of the sample from which it is drawn or of the process that generated it. Since the second sequence violates those expectations it seems less likely. |
| Write down the results of a sequence of twelve imaginary coin tosses, assuming the coin is fair i.e. heads are as likely as tails. | A typical result looks pretty random. | Because people expect the results to look irregular and to be representative of the process that generated them, their sequences tend to include far too few runs and to be too close to 50:50 heads vs tails for a small sample like 12 tosses. |
| A coin is tossed 20 times and each time the result is heads! What is the probability of tails next time? | Gamblers tend to think that surely it must be time for tails, "by the laws of probability". | If you trust the coin the odds are the same as before at 50:50. If there is doubt about the coin the odds should actually favour another head. Those are the real laws of probability. |
| Ten trainee teachers each give a half hour lesson which is observed by an expert teacher and a short report is written on the quality of the lesson. You see only the ten short reports and must decide for each teacher where their lesson ranks against others in percentile terms (e.g. are they in the top x% of trainees?), and where they are likely to be after 5 years of teaching relative to other teachers of the same experience. | Most people's ratings of lesson quality and future prospects are basically the same. | The ratings of future prospects should be less extreme than the evaluations of lesson quality. Observing a half hour lesson is not a reliable guide to performance in 5 years time! The evidence of the lesson should be influential but not the only factor. Without that evidence you would say their prospects were about average; with the evidence you should say they are somewhere between average and the relative quality of the lesson. |
| You are a sports coach and want to find out, scientifically, what effect your words have on your coachees. Your records show that when you give praise for a particularly good performance they tend to do less well next time. The good news is that when you give them a hard time for a poor performance they tend to do better next time. | Sadly some people conclude that this shows praise does not work but a verbal beating does. | Actually it just means that after a particularly good performance the next one is likely to be less good, and a particularly bad performance is likely to be followed by a better one even if you say nothing. Performance varies, in part randomly, and that is the reason for this. |
| You have the academic results of two undergraduates for their first year of two different three year degree courses. One has scored six B grades while the other has a mixture of As, Bs, and Cs across 6 papers. Whose eventual degree result after three years would you feel most confident of predicting correctly? | Most people are more confident they can predict the result of the consistent B person. | They should have chosen the other person. Highly consistent patterns are most often observed when the input variables are highly redundant or correlated. The straight Bs probably contain less information than the mixture of As, Bs, and Cs. |
| You listen to someone read out the names of a number of male and female celebrities, some more famous than others. At the end, to your surprise, you are asked to say the proportion that were female. You hadn't been keeping score so have to use your judgement and memory. | People tend to overestimate the proportion of women if there were more big female celebrities than big male celebrities. | More famous people are easier to remember so they come to mind more easily. If more of those really big names are female it influences your judgement much more than the minor celebrities. |
| Are there more words in normal English text that start with R than have R as the third letter (ignoring words of less than 3 letters)? | Most people guess that more begin with R. | In fact more have R as the third letter, but it is far easier to call to mind words that start with R, so we think they are more numerous. |
| Imagine you have 10 people to choose from to form a committee. How many different committees can you form from this group with 2 members, 3 members, 4 members, and so on? | Since most people don't know the maths to do this they make a judgement. In one study the median estimate of the number of 2 person committees was 70, while the estimate for committees of 8 members was 20. | The number of ways to form a committee of X people out of 10 is the same as the number of ways to form a committee of 10 - X people. If you select one set of people for the committee you are simultaneously selecting the others for the "reject committee". For example, the number of 2 person committees is the same as the number of 8 person committees and is 45. However, as one tries to imagine committees it seems easier to imagine forming lots of small committees than lots of big committees, and so it seems there must be more of them. |
| You are presented with information about some imaginary mental patients. For each patient you are given a diagnosis (e.g. paranoia, suspiciousness) and a drawing made by the person. Afterwards, you have to estimate the frequency with which each diagnosis is accompanied by various features of the drawings, such as peculiar eyes. | Most people recall correlations between diagnoses and pictures that match natural associations of ideas rather than actual correspondences in the examples given. For example, peculiar eyes and suspiciousness seem to go together and frequency of occurrence is judged high. | The illusory correlations are resistant to contrary data. They occur even when there is a negative correlation between diagnosis and feature, and prevent people from seeing correlations which are actually present. Similar experiments have shown that we are surprisingly bad at learning from examples. We do not always learn by experience! Other experiments have shown how hindsight colours our interpretation of experience. We think that we should have seen things coming when in fact we could not have done. Another common and powerful bias is known as the "Fundamental Attribution Error". This is the tendency to explain another person's behaviour as mainly driven by their personality/habits and our own behaviour as mainly driven by circumstances. Both judgements are usually wrong. |
| Imagine you are asked a series of very difficult general knowledge questions whose answers are all percentages between 0 and 100% (e.g. the percentage of African countries in the United Nations). After each question is given a sort of roulette wheel is spun to give a number. You are asked if the answer to the question is higher or lower than the random number chosen by the wheel. Then you have to estimate the answer. | Most people's estimates are affected by the number given by the roulette wheel. For example, in one study the median estimates of percentage of African countries were 25 for people given 10 as the starting number, and 45 for groups given 65 to start with. | This is called an "anchoring" effect. Our judgements are anchored by the given number, even if it is known to be random, but especially if it is given by someone else we think might know something. We then fail to adjust sufficiently from the anchor. |
| Which do you think is most likely? (1) drawing a red marble from a bag containing 50% red and 50% white marbles, or (2) drawing a red marble 7 times in succession from a bag containing 90% red marbles and just 10% white (assuming you put the marble back each time), or (3) drawing at least one red marble in 7 tries from a bag containing 10% red marbles and 90% white (assuming you put them back each time). | Most people think drawing 7 reds in succession is most likely, and at least 1 red in 7 tries the least likely. | In fact the probabilities are very similar, but the reverse of what most people think: (1) = 0.50, (2) = 0.48, (3) = 0.52. This illustrates our general tendency to underestimate the likelihood of something happening at least once in many tries, and to overestimate the likelihood of something likely happening many times in succession. For example, we might look at the many risks affecting a project and see that each alone is unlikely. Based on this we tend to think there's little to worry about. In fact, because there are lots of unlikely risks, the risk of at least one thing going wrong is much higher than we imagine. |
| You are asked to make some predictions about the future value of the FTSE100 index. For a given date you are asked to say a value that is high enough that you are 90% sure the actual index will not be higher, and another number so low that you are 90% sure the actual index will not be lower. | Most people are too confident of their judgement for a difficult estimate like this and their ceiling is too low while their floor is too high. | Actually there are various ways to find out what a person thinks the probability distribution of some future value is. Different procedures give different answers and an anchoring effect from the first numbers mentioned is quite common. |
| Here's another question with bags of coloured lego. This time I have two big bags of lego. One has 700 blue bricks and 300 yellow bricks in it. The other has 300 blue bricks and 700 yellow bricks. I select one by tossing a fair coin and offer it to you. At this point there is nothing to tell you if it is the mainly blue bag or the mainly yellow bag. Your estimate of the probability that it is the mainly blue bag is 0.5 (i.e. 50% likely). Now you take 12 bricks out of the bag (without looking inside!) and write down the colour of each before putting it back in the bag and drawing another. 8 are blue and 4 are yellow. Now what do you think is the probability that the bag is the predominantly blue one? | Most people give an answer between 0.7 and 0.8. | The correct answer is 0.97. Hard to believe, isn't it? Most people are very conservative when trying to combine evidence of this nature. This contrasts with some of the other situations described above where we seem to be overconfident. Those were situations where we had little information to go on. Here we have all the information we need to come up with a correct probability but fail to do so. The more observations we have to combine to make our judgement the more over-conservative we are. Another odd finding is that where there are a number of alternative hypotheses and we have to put probabilities on all of them we tend to assign probabilities across the set that add up to more than 1, unless we are forced to give probabilities that sum to 1 as they should. Our ability to combine evidence is so poor that even crude mathematical models do better than judgement. |
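
Several of the answers in the table can be checked with a little arithmetic. What follows is only a sketch in Python, not anything taken from the original experiments. The function names are invented for illustration, and the last calculation depends on an assumption (stated in the comment) about what the "70%" impression from the model cars means.

```python
from fractions import Fraction
from math import comb

def prob_more_than_percent(n, percent, p=0.5):
    """P(strictly more than percent% of n births are boys), binomial model."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if 100 * k > percent * n  # integer comparison avoids rounding trouble
    )

def posterior_from_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form, converted back to a probability."""
    odds = prior_odds * likelihood_ratio
    return odds / (1 + odds)

# Hospitals: chance that a given day has more than 60% boys.
small_hospital = prob_more_than_percent(15, 60)
large_hospital = prob_more_than_percent(45, 60)

# Lego handfuls: each blue brick doubles the odds in favour of "2/3 blue" and
# each yellow brick halves them, so the odds are 2 ** (blue - yellow).
odds_small_handful = Fraction(2) ** (4 - 1)   # 8 to 1
odds_large_handful = Fraction(2) ** (12 - 8)  # 16 to 1

# Committees: choosing 2 of 10 also chooses the 8 who are left out.
two_person = comb(10, 2)    # 45
eight_person = comb(10, 8)  # 45

# Marbles: one draw at 50%, seven reds in a row at 90%, at least one red in
# seven draws at 10%.
p_single_red = 0.5
p_seven_reds = 0.9 ** 7
p_at_least_one_red = 1 - 0.9 ** 7

# Two bags of lego: even prior odds, 8 blue and 4 yellow drawn with replacement.
p_mainly_blue = posterior_from_odds(Fraction(1), Fraction(7, 3) ** (8 - 4))

# Engineers and lawyers. ASSUMPTION: the "70%" impression from the model cars
# is what you would conclude if engineers and lawyers were equally common,
# i.e. the clue is worth a likelihood ratio of 70:30.
p_engineer = posterior_from_odds(Fraction(30, 70), Fraction(70, 30))

print(f"small hospital {small_hospital:.3f}, large hospital {large_hospital:.3f}")
print(f"marbles {p_single_red:.2f} {p_seven_reds:.2f} {p_at_least_one_red:.2f}")
print(f"mainly blue bag {float(p_mainly_blue):.2f}, engineer {float(p_engineer):.2f}")
```

Under that last assumption the base rate of 30% and the clue exactly cancel, giving an even chance that the man is an engineer, which is lower than the 70% suggested by the model cars alone.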
Events that might happen more than once: Another mistake is to ask people to give a single risk rating for risks that might happen more than once. For example, "What is the risk of billing error?". For a large company issuing millions of bills a year it is virtually certain that they will make at least one error at some point. It is necessary to consider the probabilities in a different way. In theory, we need a probability for every number of billing errors they might make. Ratings can be simplified but, essentially, we need a distribution of probability against frequency.
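
To show what "a distribution of probability against frequency" might look like, here is a rough sketch in Python. It assumes that errors on different bills are rare and independent, so that a Poisson distribution is a reasonable approximation. The bill volume and error rate are invented figures, not data from any real company.

```python
from math import exp, factorial

def poisson_pmf(k, expected_count):
    """Probability of exactly k events when expected_count are expected."""
    return exp(-expected_count) * expected_count**k / factorial(k)

# Hypothetical figures for illustration only.
bills_per_year = 2_000_000
error_rate_per_bill = 1 / 500_000
expected_errors = bills_per_year * error_rate_per_bill  # about 4 per year

# Probability of 0, 1, 2, ... 10 billing errors in the year.
distribution = {k: poisson_pmf(k, expected_errors) for k in range(11)}
p_at_least_one = 1 - distribution[0]

for k, p in distribution.items():
    print(f"{k:2d} errors: {p:.3f}")
print(f"at least one error: {p_at_least_one:.3f}")
```

With these made-up numbers the chance of at least one error is above 98%, so a single "risk of billing error" rating says almost nothing. The useful information is in how the probability is spread across the possible counts.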
Events whose impact varies: Another common error when asking people to rate risks is to ask for a rating of "impact" when the impact is variable. For example, "What would be the impact of a fire at our main data centre?" is impossible to answer sensibly with a single number. How big is the fire? What does it destroy? Clearly different fires can occur, and the probability of fires with different impacts is also going to vary.
Risks that are not independent: One of the most difficult analyses is to work out the impact of risks that are not independent i.e. the occurrence of one affects the likelihood that others will occur. They may be mutually incompatible, or perhaps part of a syndrome of related risks that tend to occur together. This is a big problem, as so many of the risks we face are far from independent. For example, you might list the risks of a project and write down the likelihood of each occurring. But what assumptions do you make about the occurrence of the risks in combination? Were those probabilities on the assumption that nothing else has gone wrong? In practice, once a project starts to fall apart it's in danger of a downward spiral in which the distraction of fighting one fire sets another alight.
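
A small worked example may make the point concrete. The sketch below, in Python, uses two hypothetical project risks A and B with invented probabilities. B is much more likely once A has happened. Treating them as independent gives the right probability for each risk alone but badly understates the chance of both happening.

```python
# Hypothetical probabilities for two project risks.
p_a = 0.2              # risk A occurs
p_b_given_a = 0.6      # risk B occurs, given that A has occurred
p_b_given_not_a = 0.1  # risk B occurs, given that A has not occurred

# Overall probability of B, averaging over whether A happens.
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

# Probability that both occur, with and without the dependence.
p_both_actual = p_a * p_b_given_a
p_both_if_independent = p_a * p_b

print(f"P(B) = {p_b:.2f}")
print(f"P(A and B), allowing for dependence = {p_both_actual:.2f}")
print(f"P(A and B), wrongly assuming independence = {p_both_if_independent:.2f}")
```

Here each risk on its own has a probability of 0.2, yet the chance of both is 0.12 rather than the 0.04 that independence would suggest.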
Not combining evidence: There are two related errors: (1) forgetting about previous evidence when drawing conclusions from new evidence, and (2) not recognising that initial beliefs exist, even before the first evidence is received. When trying to judge the likelihood of something from inconclusive evidence we need to combine whatever evidence we have. Bayesian probability theory explains how this should be done. First, we need a set of hypotheses about what the truth could be and, for each one, a view as to its likelihood of being true. There is no stage at which we have no idea of the likelihoods. We always need to have some belief, even if it is a complete guess and we start by assuming that all hypotheses are equally likely. (In non-Bayesian statistics this is hidden because the initial view is not explicitly stated.) When new evidence arrives we need to modify our beliefs by combining what we believed before with the new evidence. (Bayes even found a formula for doing this mathematically which is extremely useful and can be found in any good textbook of probabilities.)
Example: Appraising performance of colleagues at work. Imagine you are a manager in a large company and every year have to go through the painful process of trying to evaluate the performance of people who have worked for you and for others from various sources of evidence. (Perhaps you don't have to imagine it.) Suppose that this year Personnel have said everyone will be given a performance rating of 1, 2, 3, or 4 with 4 being the best rating. The idea is that 25% of the staff in the company will end up with each rating. As each staff member is considered various written appraisals and performance statistics are read out and the committee of which you are a member has to arrive at a judgement of what rating to give. Here's how to analyse this task in Bayesian terms. For a given staff member, Anne say, the hypotheses are (a) Anne is truly a 1, (b) Anne is truly a 2, (c) Anne is truly a 3, and (d) Anne is truly a 4. The initial probabilities of each hypothesis are each 0.25 because that is really how Personnel have defined the scale. Imagine that the first evidence read out is a punctuality statistic. How likely is it that the statistic read out for Anne is the punctuality of a 1 person, a 2 person, and so on? Imagine it is more like the punctuality of a 1 person. Now your views should update to show that hypothesis (a) is the most likely, probably followed by (b), then (c), then (d). The next evidence is a customer feedback survey where the results are more those of a 2 rated person. Now your probabilities for each rating being the one truly deserved shift again, perhaps making 2 the favourite, or perhaps staying on 1 depending on the precise probabilities involved. This process continues until all evidence has been heard.
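
The updating described for Anne can be sketched in a few lines of Python. The prior of 0.25 for each rating comes from the example. The likelihoods (how probable each piece of evidence would be for a true 1, 2, 3, or 4) are invented purely for illustration, and the function name is my own.

```python
def bayes_update(prior, likelihood):
    """Combine prior beliefs with P(evidence | hypothesis) for each hypothesis."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}

# Prior: Personnel's scale puts 25% of staff in each rating.
prior = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

# Hypothetical likelihoods: how probable Anne's punctuality figure and her
# customer feedback would be if she were truly a 1, 2, 3, or 4.
punctuality_likelihood = {1: 0.40, 2: 0.30, 3: 0.20, 4: 0.10}
feedback_likelihood = {1: 0.25, 2: 0.40, 3: 0.25, 4: 0.10}

after_punctuality = bayes_update(prior, punctuality_likelihood)
after_feedback = bayes_update(after_punctuality, feedback_likelihood)

for rating in prior:
    print(f"rating {rating}: {prior[rating]:.2f} -> "
          f"{after_punctuality[rating]:.2f} -> {after_feedback[rating]:.2f}")
```

With these particular numbers rating 1 is the favourite after the punctuality statistic and rating 2 becomes the favourite after the customer feedback. Different likelihoods could leave 1 in the lead, which is why the precise probabilities matter.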
Incidentally, research shows that humans are very bad at doing this kind of evidence combination by judgement alone. A better way would be to use a simple mathematical formula to combine the individual ratings and judgements using Bayes' formula. Another alternative is to use a simple linear function something like this: overall score = c1 x s1 + c2 x s2 + c3 x s3, where c1, c2, c3 are constants and s1, s2, s3 are scores from three separate sources of evidence, such as punctuality, customer feedback, and average rating from peers. The s1, s2, and s3 scores should be "normalised" so that they have the same distribution as each other. For example, it would be distorting if punctuality scores varied from 1 to 3, but customer ratings varied from 1 to 300. We need to get them all onto the same sort of scale so their relative importance is determined only by the constants. Provided the individual scores always move in the same direction as the overall score (e.g. better punctuality always means a better overall score) then this sort of formula consistently outperforms human judgemental combination of evidence even if the constants are chosen at random! This is because we are bad at this and linear combinations are very "robust".
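
As a rough illustration of the linear approach, the Python sketch below normalises three sources of evidence to the same scale (mean 0, spread 1) and then adds them up with constants c1, c2, c3. The staff scores and the equal weights are made up, and standardising to z-scores is just one reasonable way to do the normalising.

```python
from statistics import mean, pstdev

def standardise(scores):
    """Rescale scores to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def overall_scores(sources, weights):
    """Weighted sum of standardised scores, one total per person."""
    standardised = [standardise(source) for source in sources]
    people = range(len(sources[0]))
    return [
        sum(w * column[i] for w, column in zip(weights, standardised))
        for i in people
    ]

# Hypothetical scores for four staff members from three sources.
punctuality = [1, 2, 3, 2]                # scale of 1 to 3
customer_feedback = [120, 300, 210, 90]   # scale of 1 to 300
peer_rating = [3.0, 4.5, 4.0, 2.5]        # scale of 1 to 5

weights = [1.0, 1.0, 1.0]  # c1, c2, c3: equal importance, chosen arbitrarily
scores = overall_scores([punctuality, customer_feedback, peer_rating], weights)

for person, score in enumerate(scores, start=1):
    print(f"person {person}: {score:+.2f}")
```

Without the standardising step the customer feedback figures, which run into the hundreds, would swamp the other two sources regardless of the constants chosen.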
Failing to partition TRUE: Another key point from Bayesian theory is that the set of hypotheses which we are trying to assess as each piece of evidence arrives must (1) include all possibilities (i.e. no gaps), and (2) include each possibility only once (i.e. no overlaps). For example, if we are trying to decide which of the horses in a race is going to win the set of hypotheses could be one hypothesis for each horse plus a hypothesis that says no horse wins (e.g. the race is called off, or abandoned), and hypotheses for dead heats between various combinations of runners. It would be a mistake to miss out one of the horses by saying something like "The question is whether Horsey Lad or Sir Gawain will win." or to forget that there might be no winner or a tie. It would also be a mistake to select hypotheses that overlap such as A: Horsey Lad wins, B: Horsey Lad or Sir Gawain wins, etc.
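
The "no gaps, no overlaps" rule can be checked mechanically if each hypothesis is written as the set of outcomes it covers. The Python sketch below does this for a deliberately simplified race with only two horses. The outcome labels and the function name are invented for illustration.

```python
from itertools import combinations

def is_partition(hypotheses, all_outcomes):
    """True if the hypotheses cover every outcome exactly once."""
    covered = set().union(*hypotheses.values())
    no_gaps = covered == set(all_outcomes)
    no_overlaps = all(
        a.isdisjoint(b) for a, b in combinations(hypotheses.values(), 2)
    )
    return no_gaps and no_overlaps

# Simplified two-horse race: every way the race could turn out.
all_outcomes = {"Horsey Lad wins", "Sir Gawain wins", "dead heat", "no winner"}

good = {
    "A": {"Horsey Lad wins"},
    "B": {"Sir Gawain wins"},
    "C": {"dead heat"},
    "D": {"no winner"},
}
with_gaps = {  # forgets the dead heat and the abandoned race
    "A": {"Horsey Lad wins"},
    "B": {"Sir Gawain wins"},
}
with_overlaps = {  # "Horsey Lad wins" is counted twice
    "A": {"Horsey Lad wins"},
    "B": {"Horsey Lad wins", "Sir Gawain wins"},
    "C": {"dead heat", "no winner"},
}

print(is_partition(good, all_outcomes))           # True
print(is_partition(with_gaps, all_outcomes))      # False
print(is_partition(with_overlaps, all_outcomes))  # False
```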
Confusing areas of uncertainty or risk with specific uncertainties or risks: When people are asked to identify risks or uncertainties of a venture they often come up with areas of risk or uncertainty rather than specific risks or uncertainties, without realising what has happened. (Sometimes it is actually areas that they need to think about rather than specifics, and it is the instructions that are wrong.) For example, "regulatory risk" is not a risk, but an area where there are many risks, whereas "Risk that price controls will be introduced this year" is much more specific and contains a proposition that is either true or false, i.e. "Price controls will be introduced this year".
Group effects: Studies and personal experience show that when we are unsure of something we are more likely to be influenced by what others seem to think. In a group, the first person to say their view about something uncertain, or the most confident looking person, may be very influential on the eventual group "consensus". Perhaps it is best for someone to speak first saying (confidently) "This looks like a difficult judgement, with at least some significant uncertainty, and probably not much hard data to go on. I suggest we all start by just writing down what we each think, why, and what we are most uncertain about, then compare notes."
Concealed over-generalisations: Another error common in group discussions is concealed over-generalisation. Here's an example to show how it works. Imagine a group of, say, five people in a business meeting discussing the possible need for a change to a product sold by their company. Everyone there has a different role in the company and an unknown number of people actually have no idea if a change is needed or not.
CHAIRPERSON: The next item we need to discuss is whether the System A product needs an improved user interface. Alex, perhaps you could start us off.
[Alex, of course, is the person who wanted this discussed and has already lobbied the chairperson. Alex is involved in sales and the previous week two customers made negative comments to him about the user interface of System A. One customer was particularly aggressive about it, putting Alex under a lot of pressure. That same week, Alex had 17 other meetings with customers at which the user interface was not criticised. Other sales people had a further 126 meetings with customers and nobody in the current discussion knows if the user interface was criticised. What Alex actually says is . . .]
ALEX (SALES): Recently I've been getting a lot of feedback from our customer base about the user interface of System A. It's not liked at all. I think it could be costing us repeat business.
BOB (ENGINEERING): What's the problem with it?
[The actual complaints were that response time was too long on a particular program, some numbers important to that particular customer could not be analysed in a particular way, and the customer wanted some money back over a training session that never took place. The customer summed these up by referring to the "user interface" although in fact none of these would be solved by better usability. The customer's generalisation of the problem is the first example of concealed over-generalisation in this sequence. The next was Alex's overstatement of the customer feedback, referring to "a lot" of complaints instead of stating the number, from the "customer base" but without naming the customers, and repeating the customer's mis-naming and over-generalisation of the problems. What Alex actually says is . . . ]
ALEX: It's basically clunky and old fashioned. We haven't really done anything with it for three years!
[The discussion then moves on to an argument about whose fault it is, effectively assuming the existence of the problem without further thought. Eventually the chairperson intervenes with . . . ]
CHAIRPERSON: OK, OK, why don't you and Alex take that off line. I think we're agreed that we need to improve the System A user interface so I'll feed that back to the product board.
[The chairperson's summing up completes the sequence of delusion by generalising from the heated argument of two people to a general agreement that the user interface needs an overhaul.]
None of these generalisations are made deliberately or with any consideration of uncertainty. Hence, I call them "concealed".
It's common for someone to suggest that there are lots of something and give an "example" which in fact is the only known instance.
This kind of thing is so common it's difficult to imagine escaping from it. What should the chairperson say? Obviously, it is wrong to generalise from a single example, but it is also wrong to dismiss such evidence as inconclusive and therefore worthless or meaningless. In practice, one reason people tend to read so much into insignificant data is that they have reasons to believe the examples are part of, perhaps the first sign of, a significant trend. The reasons may or may not be good ones, but they need to be considered alongside other evidence. This requires evidence to be integrated, which we don't often do well. Here, at least, is a better approach to Alex's customer problem.
CHAIRPERSON: The next item we need to discuss is whether the System A product needs an improved user interface. Alex, perhaps you could start us off.
ALEX (SALES): Recently I've been getting a lot of feedback from our customer base about the user interface of System A. It's not liked at all. I think it could be costing us repeat business.
CHAIRPERSON: Can you be specific about the feedback? Who gave it and what did they say?
ALEX: Errr. The companies were Delta Ltd and Beta Ltd. Delta gave me a really hard time over the reporting.
CHAIRPERSON: What specifically?
ALEX: They said it wasn't user friendly.
CHAIRPERSON: Do you know what that view was based on? Did they have any specific complaints?
ALEX: Well, er. The main thing was report 101, which they wanted analysed by the day of the week. I said they could run it for individual days and put the results together on a spreadsheet, then the guy just got stroppy.
[More questions from the chairperson are needed to get all the specifics, then . . . ]
CHAIRPERSON: [summing up] So, from those experiences we have a suggestion of problems with report 101, enquiry 33, and the script in module C. We don't know what other complaints there may have been about System A in its current version. On that basis alone we only have reason to consider modifications to those three elements, but it's interesting that one customer described these as user interface flaws, which may indicate other issues not made clear in Alex's meeting. Do we have any other reasons for suspecting that System A's user interface is a problem?
JOHN (ENGINEER): We never tested it. We haven't done that for any of our products.
JANE (FINANCE): What do you mean we never tested the user interface? Surely we test all our systems don't we?
ALEX: No wonder they're complaining.
JOHN: No, no. That's not what I mean. Of course we tested it to make sure it didn't have bugs in it and met the functional requirements, but we never did usability testing specifically to find out if the thing was user friendly. Our larger competitors do it, but we just didn't have time.
ALEX: But surely we had users involved. How much difference would this usability testing have made?
JOHN: A big difference. It's another level of design.
CHAIRPERSON: So are you saying that System A's user interface is likely to be significantly less user friendly than the competition?
JOHN: Yes.
Fine quantitive judgements: Some judgements about the future depend on fine judgements of quantities and combinations of them. Decisions about personal finance and about the environmental impact of an activity are good examples. Typically, faced with a decision about pension contributions, or insurance, or mortgages, most people go through some futile reasoning about whether something is taxed or not and perhaps remember to ask about charges and penalties. However, in the end there is no way to know the value of a financial deal without calculation. Similarly, environmental impacts are so diverse and the indirect impacts so numerous and hard to discount safely that only rigorous modelling and calculation has any chance of being reliable.
Difficulty separating likelihood and impact: Many attempts to analyse risk rely on asking people to give separate judgements about the likelihood of something happening, and the impact if it happened. In addition to the other common problems already described it is my impression that people often cannot entirely separate the two.
This may be just the result of all the other problems. Perhaps people simply fudge the ratings to give an overall impression of the importance of a risk. For example, rating the impact of "a fire at our data centre" is impossible to think through clearly. What is the extent of the hypothetical fire? What is damaged by it? What assumptions should be made about fire fighting and standby arrangements? Most people would feel this was an important risk and use whatever system of judgements they were asked for as a way of saying so.