Saturday, February 11, 2006

"white list" vs. "black list"

There are two basic approaches to filtering or moderating content: "white list" and "black list". Each is, in a sense, the opposite of the other. A "black list" is a list of things (bad words, IP addresses, photos, etc.) that you want to filter out; anything not on the list gets through. A "white list" is a list of things (people, IP addresses, etc.) that you define ahead of time; only things on that list get through.


Consider an example from the world of spam email. A black-list approach is to, say, scan all incoming messages for "Viagra" or "Home Loan" and act on any matches, perhaps moving them to a special mailbox for suspected spam. A white-list approach might be to accept email only from people who are in your address book. Period. If they're not on the white list, the mail doesn't get through. "Challenge/response" email filters are a white-list approach in which people can put themselves on your white list by proving they are a human being (not a robot), for example by typing in some characters.
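To make the contrast concrete, here is a minimal Python sketch of the two email filters. The phrase list, the address book, the mailbox names, and the function names are all made up for illustration; a real mail filter would be considerably more involved.

```python
# Hypothetical data, for illustration only.
BLACKLIST_PHRASES = {"viagra", "home loan"}
ADDRESS_BOOK = {"alice@example.com", "bob@example.com"}


def blacklist_filter(subject: str, body: str) -> str:
    """Black list: deliver everything except messages containing a listed phrase."""
    text = f"{subject} {body}".lower()
    if any(phrase in text for phrase in BLACKLIST_PHRASES):
        return "suspected-spam"
    return "inbox"


def whitelist_filter(sender: str) -> str:
    """White list: deliver nothing except mail from senders already on the list."""
    if sender.strip().lower() in ADDRESS_BOOK:
        return "inbox"
    return "rejected"


print(blacklist_filter("Cheap Viagra", "click here"))  # matches a listed phrase
print(whitelist_filter("stranger@example.net"))        # sender is not in the address book
```

Notice the defaults: the black-list filter lets through anything it does not recognize, while the white-list filter rejects anything it does not recognize.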


The white-list model is much safer (better for children, for example), but much more restrictive.


With user-contributed content on web pages, the problem is trickier because there is no single recipient to make these decisions. Something posted to the web goes to "everybody". So this becomes an issue for the site administrator.


A white-list model requires human intervention of some sort: content can only go live if someone has approved it. A black-list model can be easily automated, but it is also easily circumvented (alternate spellings of "bad words", or substituting the digit '0' for the letter 'O' for example).
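The circumvention problem is easy to demonstrate. The sketch below uses a made-up one-word list and a small, hypothetical table of look-alike characters; it shows a naive word filter missing "l0an" and a slightly smarter one catching it.

```python
# Hypothetical word list and look-alike table, for illustration only.
BAD_WORDS = {"loan"}
LOOKALIKES = str.maketrans({"0": "o", "3": "e", "@": "a", "$": "s"})


def naive_is_blocked(text: str) -> bool:
    """Block text only if a word matches the list exactly (ignoring case)."""
    return any(word in BAD_WORDS for word in text.lower().split())


def normalized_is_blocked(text: str) -> bool:
    """Undo common character substitutions before checking the list."""
    normalized = text.lower().translate(LOOKALIKES)
    return any(word in BAD_WORDS for word in normalized.split())


print(naive_is_blocked("cheap loan"))       # plain spelling is caught
print(naive_is_blocked("cheap l0an"))       # digit 0 slips past the naive filter
print(normalized_is_blocked("cheap l0an"))  # normalization catches this variant
```

Even the normalized version is only one step ahead: spacing the letters out ("l o a n") or inventing a new spelling defeats it again, and aggressive normalization can start matching legitimate text.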


A third model that is gaining popularity in the world of online communities is the idea of site visitors reporting objectionable content. This is, on the one hand, a lazy way to do black-listing: wait until somebody is offended by it. It is also prone to abuse, in cases where the content is not truly objectionable, but someone wants to be a nuisance.
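As a rough sketch of the reporting model, the Python below hides an item only after several different visitors have reported it. The threshold, the class name, and the idea of counting distinct reporters are my own assumptions about one way to blunt nuisance reports, not a description of how any particular site does it.

```python
from collections import defaultdict

REPORT_THRESHOLD = 3  # hypothetical: distinct reporters needed before hiding


class ReportQueue:
    """Track which visitors have reported which piece of content."""

    def __init__(self, threshold: int = REPORT_THRESHOLD) -> None:
        self.threshold = threshold
        self._reporters = defaultdict(set)  # content id -> set of reporter ids

    def report(self, content_id: str, reporter_id: str) -> bool:
        """Record a report; return True once the content should be hidden for review."""
        self._reporters[content_id].add(reporter_id)
        return len(self._reporters[content_id]) >= self.threshold


queue = ReportQueue(threshold=2)
print(queue.report("post-1", "visitor-a"))  # first report
print(queue.report("post-1", "visitor-a"))  # same visitor again, counted once
print(queue.report("post-1", "visitor-b"))  # second distinct visitor
```

Counting each visitor once means a single person cannot take content down by reporting it repeatedly, though a group acting together still could.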


There is no clear winner in terms of approach. If what you want is to block "most" spam product advertisements posted as comments to a blog, maybe a black-list model works okay, because it greatly reduces the volume, though some legitimate content may be filtered out, and some bad content may be missed. Since this type of content is usually posted by automated scripts, having a human moderator go through all of it is time-consuming.


On the other hand, if the problem you're addressing is a malicious user uploading extremely objectionable content to an audience that does not want to see it, a black-list approach may not work at all, because of the repercussions if even one piece of bad content is missed by the automatic filtering.
