If I were writing a list of the things that annoy me most about marketing, high on that list would be our discipline’s penchant for measuring things that aren’t really important, or for measuring things that do matter but in hugely flawed ways.
Whether it’s judging email success based only on opens and clicks, using ‘engagement’ measures as a proxy for success within social marketing, or the flawed measurement around digital advertising described in this article here, it all drives me nuts.
I wrote an article recently where I laid out what I believe are the crucial skills of a CRM / Data-driven marketing practitioner, which included the ability to understand and apply some key statistical principles. However, given the deficiencies in statistical knowledge and the poor measurement outlined above, I think a basic understanding of those principles is crucial for any marketer.
Knowledge is Power
But with all the automation and technology available within modern marketing technology, why would you need to understand the principles on which it is built when it can just give you an answer?
Forgive me for being a cynic, but I don’t like vendors marking their own homework. I like to be able to question and understand their methodology even if it is at a superficial level. Scepticism is crucial — most vendors will understandably want to spin a positive story.
Numbers can be daunting
I know that a lot of marketers are a little scared of numbers. So the intention of this article is to try and simplify some concepts that are useful in measuring the effectiveness of your activity. More specifically the question:
How do I know with confidence that my marketing activity is getting more people to do the thing we want them to vs not doing the activity?
Now that activity could be something new you’ve never done before or a new variation of existing activity. Either way, in order to answer this we need to break it down into three further questions:
- What are control groups and why are they necessary?
- What is statistical significance and why you should care?
- What is statistical power and why you should care?
What are control groups and why are they necessary?
Let’s say we’re running an offer to buy a product through email to a targeted group of our existing customers. Let’s also say that 5% of those customers respond, i.e. they use the offer, and we make a bunch of revenue off the back of that.
Now you might be thinking well I’ll claim all that revenue as though it were a direct result of the email. Here comes a massive ROI. You should hold your horses on that one, however.
How do you know whether those same people wouldn’t have come back and bought that product anyway? You don’t. Maybe they’d have come through another marketing channel for example. You can’t claim all those responses as truly incremental — i.e. responses that wouldn’t have happened if we hadn’t run this email campaign.
That’s where a control group comes in. When created properly a control group represents the “What if we hadn’t run the campaign?” case. Or more broadly the “What if we hadn’t changed something compared to what we normally do?” case.
Only with a Control Group can we know if the campaign generated an incremental response we otherwise would not have had.
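To make that concrete, here is a minimal Python sketch of the arithmetic. The group sizes and responder counts are made-up numbers for illustration, and the function name is my own rather than anything from a particular tool.

```python
def incremental_responses(contacted_n, contacted_responders, control_n, control_responders):
    """Estimate how many responses in the contacted group the campaign actually caused."""
    contacted_rate = contacted_responders / contacted_n
    control_rate = control_responders / control_n
    # Responses we would have expected from the contacted group with no campaign at all.
    expected_anyway = control_rate * contacted_n
    return {
        "contacted_rate": contacted_rate,
        "control_rate": control_rate,
        "uplift": contacted_rate - control_rate,
        "incremental": contacted_responders - expected_anyway,
    }


# Hypothetical campaign: 9,000 customers emailed, 1,000 held back as a control.
result = incremental_responses(9000, 450, 1000, 35)
print(result)
```

In this invented example 450 people responded to the email, but the control group suggests most of them would have bought anyway, so only the remainder can fairly be credited to the campaign.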
Setting up a control group
There are a few things you need to know when setting up a control group. The first two relate to limiting variables that might skew your results.
- It must be representative of the group you’re targeting. So if you’re targeting everyone who bought say a copy of Call of Duty in the last 6 months, you can’t compare against a group who have not bought Call of Duty in the last 6 months. You’re comparing apples to oranges. Your control group must come from the group you’re targeting with an offer.
- It must be a random sample. Even if you pull your control from your target group, you must do so randomly. So for example, you can’t select everyone in the target group who lives in a certain town or everyone who bought Call of Duty in the last month specifically. Again they won’t represent the target group and your results will be skewed.
- It must be the right size but more on that when we discuss statistical power.
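As a rough illustration of the first two rules, a control group can be carved out of the target list with a random shuffle. This is a sketch only: the customer IDs, the 20% control fraction and the function name are all assumptions for the example, and your own platform may well do this for you.

```python
import random


def split_control(target_ids, control_fraction, seed=None):
    """Randomly split the target audience into a contacted group and a control group."""
    rng = random.Random(seed)   # a seed makes the split reproducible for auditing
    shuffled = list(target_ids)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_control = round(len(shuffled) * control_fraction)
    control = shuffled[:n_control]
    contacted = shuffled[n_control:]
    return contacted, control


# Hypothetical target audience: everyone who met the campaign's selection criteria.
target = [f"cust_{i}" for i in range(1000)]
contacted, control = split_control(target, control_fraction=0.2, seed=42)
print(len(contacted), len(control))
```

Because the control is drawn from the target list itself, and drawn at random, it satisfies both rules: it comes from the group you are targeting and no characteristic such as town or purchase date decides who ends up in it.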
What is statistical significance and why you should care?
So let’s say we follow those rules and create a control group. We send our email to the contacted group and exclude the control group entirely. We find that the group we contacted got a 5% response rate and the control group got a 3.5% response rate. So we claim that 1.5% response ‘up-lift’ as down to the campaign right?
Well maybe. It depends.
When we take a control group, even if we try and take it randomly and representatively, it cannot possibly be exactly identical to the total group we are targeting. That means they may respond differently to the contacted group. So any difference we see between the two groups could just be down to that variance in the audience. It’s called Sampling Error.
So to stop us claiming incremental responses that are down to sampling error we apply a test of statistical significance.
Know your hypothesis
In experimental design, we use something called a null hypothesis and only when we have enough evidence against this do we accept it as wrong and go with the alternative hypothesis.
Stay with me. I can see you glazing over.
In our case, the null hypothesis would be that the campaign didn’t have any effect and the alternative hypothesis is that it did. In essence, we are looking to prove ourselves wrong. That’s really important. Humans so desperately want what they have done and invested their time in to matter so are prone to try and jump through statistical hurdles to prove themselves right. Starting from a sceptical position helps protect against that.
So what we want to know in our example is whether the difference in response rates is big enough that we can be confident it is not down to sampling error.
You may have heard the phrase 95% confidence level before. If a result is significant at that confidence level, there is a less than 5% probability that an up-lift at least this large would have shown up through sampling error alone, that is, if the campaign really had no effect. Since our null hypothesis is that the campaign creates no up-lift, we have enough evidence to reject it.
You might see that probability shown as something called a p-value, with p standing for probability. The p-value is the chance of seeing a difference at least as big as the one you observed if the null hypothesis were true. It is not the chance that your hypothesis is wrong. A p-value less than 0.05 is what you’re looking for at a 95% confidence level.
If that’s a bit much, all it means is that you can say the campaign did generate at least some of the up-lift you have seen in the results. There are always margins for error which mean that you could be looking at less, more or even no significance if you ran the same test again which is why replication and magnitude of results are key. More on that later.
A 95% confidence level is pretty standard. However, we’re not testing drugs in a clinical trial here, where the margins have to be tight. We’re looking for better business results, so you may be happy going with a 90% confidence level.
There are plenty of on-line calculators that can help you do the analysis after your campaign has run — here’s one from Survey Monkey for example.
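If you’d rather see roughly what such a calculator does under the hood, one common approach is a two-proportion z-test. I can’t say that is what any specific tool uses. The sketch below needs only Python’s standard library. The response rates are the 5% and 3.5% from the example above, but the group sizes are assumptions I’ve made up, as are the function and field names.

```python
import math
from statistics import NormalDist


def two_proportion_test(resp_contacted, n_contacted, resp_control, n_control, confidence=0.95):
    """Two-sided z-test for a difference in response rates, plus a margin of error."""
    rate_contacted = resp_contacted / n_contacted
    rate_control = resp_control / n_control
    uplift = rate_contacted - rate_control

    # Under the null hypothesis both groups share one response rate, so pool them.
    pooled = (resp_contacted + resp_control) / (n_contacted + n_control)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_contacted + 1 / n_control))
    z = uplift / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval around the observed up-lift (unpooled standard error).
    se_diff = math.sqrt(rate_contacted * (1 - rate_contacted) / n_contacted
                        + rate_control * (1 - rate_control) / n_control)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "uplift": uplift,
        "z": z,
        "p_value": p_value,
        "ci": (uplift - z_crit * se_diff, uplift + z_crit * se_diff),
        "significant": p_value < 1 - confidence,
    }


# Same 5% vs 3.5% response rates at two assumed audience sizes.
large = two_proportion_test(450, 9000, 35, 1000)
small = two_proportion_test(90, 1800, 7, 200)
print(large)
print(small)
```

With the larger assumed audience the p-value comes in below 0.05, so the up-lift is significant at a 95% confidence level. With the smaller one the response rates are identical but the result is nowhere near significant, which is exactly why group size matters.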
What is statistical power and why you should care?
Too often I have seen a set rule for control groups: 5 or 10% of your target audience, for example. But in reality, the size of your control group should be based on a few factors: your expected response rate, the level of confidence you want to have in your results and the statistical power behind your test. Power is the probability that your test will detect an up-lift that really exists. It depends on the volumes you have and how small a difference in response rates you want to detect.
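To show how those factors interact, here is a sketch of a textbook sample size formula for comparing two response rates. It assumes the two groups are the same size, which is a simplification, and the 80% power default is a common convention rather than anything prescribed here. The function name and the example rates are mine.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, confidence=0.95, power=0.80):
    """Approximate people needed in EACH group to detect the given difference in rates."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    average_rate = (baseline_rate + expected_rate) / 2
    numerator = (
        z_alpha * math.sqrt(2 * average_rate * (1 - average_rate))
        + z_power * math.sqrt(baseline_rate * (1 - baseline_rate)
                              + expected_rate * (1 - expected_rate))
    ) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)


# Detecting 3.5% vs 5% compared with the much subtler 3.5% vs 4%.
n_big_uplift = sample_size_per_group(0.035, 0.05)
n_small_uplift = sample_size_per_group(0.035, 0.04)
print(n_big_uplift, n_small_uplift)
```

Under these assumptions the bigger up-lift needs a little under 3,000 people per group, while the smaller one needs several times that. A flat 5 or 10% rule takes none of this into account.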
If you have an underpowered test you can have too little volume to detect what could be a significant result. You miss something positive. Here’s an example from Analytics Toolkit:
In an infamous use of underpowered tests in the 1970s car drivers were allowed to turn right on red light after the several pilot tests showed “no increase in accidents”. The problem was that they had sample sizes too small to detect any meaningful effect. Later, better powered studies with much larger sample sizes found effects and estimated that this decision in fact claimed many lives, not to mention injuries and property damage.
The opposite can also be true: you find a result significant that actually isn’t. This is especially true when you’re looking at a test in real time, for example creative A vs B within a programmatic digital display ad campaign. You may see a significant result prematurely based on small volumes and call the test complete. That’s a mistake. Let the experiment run its course, because statistical significance can change as more volume is added.
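A small simulation can show why calling a test early is risky. In the sketch below both creatives have exactly the same true response rate, so every "significant" result is a false alarm. It compares checking only once at the end against checking at ten points along the way and stopping at the first significant reading. All the numbers (5% rate, 1,000 people per creative, ten looks, 300 simulated tests) are assumptions chosen to keep it quick to run.

```python
import math
import random
from statistics import NormalDist


def p_value(resp_a, n_a, resp_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (resp_a + resp_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # nobody (or everybody) responded, so no difference to test
        return 1.0
    z = (resp_a / n_a - resp_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_tests=300, n_per_group=1000, looks=10, rate=0.05, alpha=0.05, seed=1):
    """Return (false alarm rate when peeking, false alarm rate when waiting for the end)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    peeking_alarms = 0
    final_alarms = 0
    for _test in range(n_tests):
        resp_a = resp_b = 0
        flagged_at_any_look = False
        significant = False
        for look in range(1, looks + 1):
            for _person in range(step):
                resp_a += rng.random() < rate  # creative A, no real difference
                resp_b += rng.random() < rate  # creative B, same true rate
            seen = look * step
            significant = p_value(resp_a, seen, resp_b, seen) < alpha
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_alarms += flagged_at_any_look
        final_alarms += significant  # result at the last look only
    return peeking_alarms / n_tests, final_alarms / n_tests


peeking_rate, final_rate = simulate()
print(f"False alarms when peeking: {peeking_rate:.1%}, when waiting: {final_rate:.1%}")
```

Waiting until the end should give a false alarm rate somewhere around the 5% you signed up for. Peeking can only ever match or exceed it, since each extra look is another chance for sampling error to cross the line.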
Practical Vs Statistical significance
Power is crucial to the idea of statistically versus practically significant results. With a huge sample size, a tiny up-lift can be statistically significant, but is it really practically significant?
Practical can mean many things but from a marketing perspective you might want to think about what up-lift you’d need to get from your campaign to at least break even as a floor for your practical significance ‘test’. If your result is below that but still statistically significant it's not practical as you’re losing money.
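The break-even floor is simple arithmetic. Here is a minimal sketch, assuming you know the campaign cost and the margin you make on each response. The figures and function names are hypothetical.

```python
def break_even_uplift(campaign_cost, audience_size, margin_per_response):
    """Smallest up-lift in response rate at which the campaign pays for itself."""
    return campaign_cost / (audience_size * margin_per_response)


def incremental_profit(observed_uplift, campaign_cost, audience_size, margin_per_response):
    """Profit from incremental responses only, after the campaign cost."""
    return observed_uplift * audience_size * margin_per_response - campaign_cost


# Hypothetical campaign: 2,000 to run, 9,000 people contacted, 20 margin per response.
floor = break_even_uplift(2000, 9000, 20)
profit = incremental_profit(0.015, 2000, 9000, 20)
is_practical = 0.015 >= floor
print(f"Break-even up-lift: {floor:.2%}, incremental profit: {profit:.0f}")
```

In this made-up case a 1.5 point up-lift clears the floor. Had the campaign cost 3,000 instead, the same statistically significant up-lift would be losing money.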
Tips for Control group size
If you’re testing whether a campaign is worth doing vs doing nothing and it’s the first time you’ve run it, increase the size of your control group. I’ve gone with 50% before, which is overkill, but when you don’t know what the response rate is going to be I always think better safe than sorry.
If you’re testing a new thing against what you currently do (e.g. a creative test) and you have an idea of expected response rates, then you can reduce the % of your audience you take for a control group. There’s a bit more certainty, but more volume is usually better. After all, the only real issue with an overpowered test is the cost, especially if you’re producing something like Direct Mail.
I’d recommend using one of the many online tools, such as this one, to calculate it, or asking your analytics / Insight team for help.
Replication of results is crucial
It is good practice to run a test twice to see if the results replicate themselves. Even with a statistically significant test the first time around we can’t be 100% sure it was down to the campaign and we know there are margins for error around the results. Rerunning it and seeing similar results means that you can be more certain of the effect of the campaign.
To quote Mark Razzell, Director of Data & Insights at Clemenger Melbourne:
It is good practice to run a test twice to see if the results replicate themselves. This is why scientists publicly publish results, and why they will seemingly answer the same question 30,000 times. If it can’t be replicated, it isn’t valid.
Don’t despair of statistically insignificant results
You may have put your heart and soul into a campaign, slaved over it, worked late into the evening because you believed in it, only to see no statistically significant result. You wanted a positive effect from all that effort and, let’s face it, it would look good to your boss. It could be easy to see this as time wasted.
But it’s not. You’ve learned something: what doesn’t work. That’s hugely valuable in streamlining your efforts in the future and working through your test frameworks. I’d suggest creating a playbook that you can build over time, giving details of the tests you have run, the campaigns that have worked out and those that haven’t.
The other thing is to avoid the sunk cost fallacy: because you feel you have already spent time and money on something, you double down on it in the face of evidence that you should stop. It’s hard emotionally, but go with the results you are seeing.
Summary
This is, without question, the hardest article I have written and will probably be one of the least read. It’s tough to try and explain these concepts in a simple, digestible way but also do it in enough detail to ensure I don’t get shouted at too much by friends who know more about this than I do!
There’s so much more depth to this sphere that could be added to this article, such as dealing with control groups for triggered daily campaigns versus an ad-hoc one-off campaign, but it’s long enough as it is.
I hope this has at least in part demystified some of these concepts. As I said before, you don’t have to know all the statistical techniques but knowing the basic concepts is really helpful. It’s worth putting in the time to understand them.