Like many folks who are trying to fight badware, we often find ourselves trying to quantify the problem. How many badware websites are there? How many are hosted by a particular IP address, or a specific AS, or a given hosting provider? And once we have these numbers, how do we judge how big a problem they represent?
Counting may sound simple, but there are many challenges:
- It's hard to know which units to count. Individual URLs? Fully qualified domain names? Base domain names? None of these consistently corresponds to what a human would consider "a website." And counting any one of them results in skewed data.
- What's the denominator? Suppose hosting provider A has more infected sites than hosting provider B. Is this a result of negligence on the part of A, or is A simply much larger than B? It's difficult, if not impossible, to find accurate data about the number of unique domain names or websites hosted by a particular provider.
- How do you count a URL or domain name over time if its IP address changes? Suppose you're reporting weekly stats; if a URL moves from hosting provider A to B within that week, how do you account for that in reporting the numbers? What if the site was "bad" when it was at provider A but is now clean at B? Do you still report A as hosting a bad site for the week?
- How do you count a URL that resolves to more than one IP address? Do you double count it?
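To make the unit and double-counting questions above concrete, here is a rough Python sketch. It is an illustration only, not a description of how StopBadware actually counts. The sample URLs, the `RESOLVED_IPS` table, and the function names are all hypothetical, and `naive_base_domain` simply keeps the last two labels of a hostname, which is wrong for suffixes like `co.uk` (a real tool would need the Public Suffix List).

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical sample data: flagged URLs and the IP addresses
# each hostname currently resolves to (documentation-only ranges).
BAD_URLS = [
    "http://blog.example.com/a.html",
    "http://blog.example.com/b.html",
    "http://shop.example.com/cart",
    "http://example.org/index.html",
]
RESOLVED_IPS = {
    "blog.example.com": ["192.0.2.1"],
    "shop.example.com": ["192.0.2.1", "198.51.100.7"],
    "example.org": ["198.51.100.7"],
}


def hostname(url):
    """Return the fully qualified domain name part of a URL."""
    return urlsplit(url).hostname


def naive_base_domain(host):
    """Assumption: the base domain is the last two labels of the hostname."""
    return ".".join(host.split(".")[-2:])


def count_units(urls):
    """Count the same set of URLs using three different units."""
    hosts = {hostname(u) for u in urls}
    return {
        "urls": len(set(urls)),
        "fqdns": len(hosts),
        "base_domains": len({naive_base_domain(h) for h in hosts}),
    }


def hosts_per_ip(urls, resolved, fractional):
    """Attribute each bad hostname to the IPs it resolves to.

    fractional=False counts a hostname once per IP (double counting);
    fractional=True splits one unit evenly across its IPs.
    """
    totals = defaultdict(float)
    for host in {hostname(u) for u in urls}:
        ips = resolved.get(host, [])
        for ip in ips:
            totals[ip] += 1 / len(ips) if fractional else 1.0
    return dict(totals)


print(count_units(BAD_URLS))
print(hosts_per_ip(BAD_URLS, RESOLVED_IPS, fractional=False))
print(hosts_per_ip(BAD_URLS, RESOLVED_IPS, fractional=True))
```

With this sample data, the same four URLs count as three fully qualified domain names or two base domains. Counting a hostname once per IP attributes four bad hosts across the two addresses even though only three exist, while splitting each hostname evenly across its IPs keeps the total at three. Neither choice is obviously right, which is the point of the questions above.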
These are questions with no easy answers. Yet, as Brian Krebs and various government officials have pointed out, it's difficult to know what action to take, and against whom, if you don't have a good way of measuring the problem. At StopBadware, we're gradually trying to work on answers to these questions. We've learned a few lessons from our successes and our mistakes, but we can also use more input. If you have ideas, or would like to talk with us further about the measurement challenge, please let us know in the comments or at contact <at> stopbadware <dot> org.