A couple of months ago we decided it was time to improve our marketing game for TimeTune. The app is growing step by step, but some marketing elements have remained unchanged since the beginning. Like, for example, our icon:
The icon is the first thing a visitor sees of your app when browsing Google Play. It’s the perfect opportunity to catch their eye, send a message and plant yourself in their mind. If it’s not attractive enough, you’ll miss a good chance to reach a lot of people.
We knew our original “tt” icon wasn’t doing us any favors in this regard. We needed a new image that could give TimeTune an attractive and recognizable identity. That’s where we turned to Google’s Store Listing Experiments to test new ideas.
While these icon experiments are still running, we’ve already learned some important lessons we want to share with you. Some of them were learned the hard way, so we hope this post will help you avoid the same mistakes we made.
By the way, if you’re interested in our icon experiments, you can see our tested ideas on our social media channels here and here. When we choose a winner, we’ll write another post detailing the full process and how each experiment performed.
But let’s dive into our findings:
WHEN COMPLETE DOESN’T MEAN COMPLETE
Every time you run an experiment, Google’s algorithm will tell you the experiment is complete after a few days. The exact amount of time depends on how many daily downloads your app gets. In our case (around 2,000 daily downloads on average), three to four days were usually enough for an experiment to be marked as complete.
“Wow! Three days is fast”, we thought. “We can test a lot of ideas in a single month and come to a decision very quickly!”. We were really impatient to complete all the tests.
And so we tested 7 icons one by one against our “tt” icon to see if they performed better or worse than our original icon.
One of the new icon ideas was clearly marked as underperforming:
But we thought: “Just in case, let’s run the exact same test again to confirm the result. It shouldn’t be different, right?”. And to our surprise, this was the new result:
How could that be? A completely different result for the exact same experiment?
It turns out there are two factors that can hugely affect your test results:
- User download behavior may vary substantially for different days of the week (for example, on weekends people may be more willing to download a new app without caring so much about its icon or screenshots).
- Store Listing Experiments are automated tests run at a 90% confidence level, meaning there’s roughly a 10% chance the result is wrong. Unfortunately, measuring something as unpredictable as user download patterns is no easy feat. The algorithm can falsely conclude that a test is complete when in fact it has been measuring a “bad streak”. That’s why many statisticians recommend using a 95% confidence level or higher.
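To get a feel for how much noise this leaves in, here’s a minimal simulation (the 30% conversion rate and the 500-visitor sample size are made-up illustration numbers, not TimeTune’s real data): even when two icons are truly identical, the measured “lift” of one over the other swings noticeably from run to run.

```python
import random

def simulate_ab_test(true_rate_a, true_rate_b, visitors, seed):
    """Simulate one A/B test: each visitor converts (installs)
    with the variant's true probability; return observed rates."""
    rng = random.Random(seed)
    conv_a = sum(rng.random() < true_rate_a for _ in range(visitors))
    conv_b = sum(rng.random() < true_rate_b for _ in range(visitors))
    return conv_a / visitors, conv_b / visitors

# Two IDENTICAL variants (same true 30% conversion rate).
# The observed relative lift of B over A should be 0%, but
# with only 500 visitors per variant it wanders around it.
lifts = []
for seed in range(10):
    rate_a, rate_b = simulate_ab_test(0.30, 0.30, 500, seed)
    lifts.append((rate_b - rate_a) / rate_a * 100)

print([round(lift, 1) for lift in lifts])
```

Running this prints ten different lift percentages for what is, by construction, a tie. A test that stops early during one of the extreme runs will happily report a winner (or a loser) that doesn’t exist.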
So we decided to let the second test run a few more days, despite the algorithm saying it was complete. Here’s the evolution for each day:
Although the final result is positive too, there’s notable variation across the days. And this confirmed our first two “hard” lessons:
- Forget about when Google Play marks your Store Listing Experiments as complete. Always let your experiments run after that.
- Measure your experiments in weeks, not days. It’s the only way to gather consistent data and smooth out daily variations.
In fact, Google gives similar advice in their experiment best practices. But we developers are impatient by nature!
So sadly, we realized we had almost one month of data in our hands that now could be thrown away 😥
YOU NEED THOUSANDS
From that initial set of “complete-but-not-complete” experiments, we focused on another strange result:
Without being mathematicians, we suspected that 477 installs were not enough to provide an accurate result. So we decided to run the same test again, but this time we let the experiment run for two full weeks. And this was the result:
Although the result was favorable in both cases, the final performance was certainly different. If we only looked at the first round, we might think this icon was a stellar performer. But looking at the second one, we see this icon is just… meh.
We had learned two new lessons:
- You need thousands of installs per tested variation in order to have a reliable result. Some articles claim 1,000 or 2,000 is enough, but we’ve seen considerable variations after that. For us, results start to stabilize after 5,000 or 6,000 installs per variation (current installs, not scaled installs).
- In any case, run your experiments until you see stable results. If you still see variations from day to day, keep the test running. Each app is different and has a different set of users with different behaviors.
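The intuition behind needing thousands of installs can be sketched with the standard normal-approximation margin of error for a conversion rate. This is a back-of-the-envelope estimate, not Google’s actual algorithm; the 30% baseline rate is a hypothetical number, and z = 1.645 matches the 90% confidence level mentioned above:

```python
import math

def margin_of_error(conversion_rate, n, z=1.645):
    """Half-width of a normal-approximation confidence interval
    for a conversion rate estimated from n installs.
    z=1.645 corresponds to a 90% confidence level."""
    return z * math.sqrt(conversion_rate * (1 - conversion_rate) / n)

# How the uncertainty shrinks as installs accumulate,
# assuming a hypothetical 30% conversion rate:
for n in (477, 1000, 2000, 5000):
    moe_pp = margin_of_error(0.30, n) * 100
    print(f"{n:>5} installs: about ±{moe_pp:.1f} percentage points")
```

At 477 installs the uncertainty is several times larger than at 5,000, which lines up with what we saw: early results that look great (or terrible) and then drift toward something more modest.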
DON’T OBSESS OVER PERCENTAGES
If we look at any test result, like this one for example:
It means that the average performance of the tested version relative to the current version falls in the middle of the reported performance range. In this example, that would be 9.9% (the midpoint between 4.5% and 15.3%).
Does this mean that another idea that performs at 10% on average is better than this one?
First, since the confidence level for Store Listing Experiments is 90% (when most experts recommend 95% or more), there’s a fair chance your results are a bit skewed. And since results will keep shifting if you let the experiment run longer, you cannot trust every number blindly.
So here’s the lesson we learned:
- Don’t obsess over exact performance percentages. Remember there’s a margin of error in automated tests. When two ideas obtain similar results, apply your common sense.
THE TOUGHEST LESSON
This may be the most obvious lesson but the toughest to swallow:
- Be patient. What you think will take weeks to test will take months.
As developers, we’re all eager to find quick results and apply them as soon as possible so we can get more installs and conversions immediately. But rushing things here can easily lead you to the wrong conclusions.
That’s why you should always be testing one feature or another with Store Listing Experiments. Test new ideas (even the crazy ones) continuously. When the time comes to change something, you’ll have data in advance.
TO SUM UP
Store Listing Experiments are an awesome tool for testing new ideas on Google Play. They just come with a few caveats. Take our findings into account and you’ll be able to avoid the mistakes we made!
Thanks a lot for reading! 😉