Webpage Classification using Visual Content

Simulator of Web Page Classification - Try it yourself! Link to the presentation at the 10th WEBIST - April, 2014

There is a constantly increasing demand for automatic classification techniques with higher accuracy.

To automatically classify and process web pages, current systems use the text content of those pages. However, little work has been done on using the visual content of a web page.

Our work therefore focuses on classifying web pages using only their visual content. First, a descriptor is constructed by extracting several features from each page: simple color and edge histograms, Gabor features and Tamura features. Two feature selection methods, one based on the Chi-Square criterion and the other on Principal Component Analysis, are then applied to that descriptor to select the most discriminative attributes. An alternative approach uses the Bag of Words (BoW) model, which treats the SIFT local features extracted from each image as words and so allows a dictionary to be constructed.
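As a rough illustration of the descriptor and selection steps, the sketch below builds a simple color histogram for each page and ranks the histogram bins with a Chi-Square score. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the bin count, the toy pixel lists and the "new"/"old" labels are all hypothetical, and only the color histogram is shown (edge, Gabor, Tamura and SIFT features are omitted).

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized joint RGB histogram of (r, g, b) tuples with values in 0..255."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def chi2_scores(X, y):
    """Chi-Square statistic of each non-negative feature against the class labels."""
    classes = sorted(set(y))
    class_prob = {c: y.count(c) / len(y) for c in classes}
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Keep the k features with the highest Chi-Square score."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Hypothetical toy "screenshots": mostly red pages vs. mostly blue pages.
RED, BLUE = (250, 10, 10), (10, 10, 250)
pages = [
    [RED] * 8 + [BLUE] * 2,
    [RED] * 9 + [BLUE] * 1,
    [RED] * 2 + [BLUE] * 8,
    [RED] * 1 + [BLUE] * 9,
]
labels = ["new", "new", "old", "old"]  # hypothetical labels

X = [color_histogram(p) for p in pages]
keep, X_selected = select_top_k(X, labels, k=2)
print("selected histogram bins:", keep)
```

On this toy data only the red and blue bins vary between the two classes, so those are the bins that survive the selection. A real descriptor would be computed from page screenshots and would be much larger.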

We then classify web pages by their aesthetic value, their design recency and their type of content. The machine learning methods used in this work are Naïve Bayes, Support Vector Machine, Decision Tree and AdaBoost.
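To make the classification step concrete, here is a minimal sketch of one of the four methods, a Gaussian Naïve Bayes classifier, written with the standard library only. The class name, the variance smoothing constant and the two-dimensional toy descriptors are assumptions made for illustration; they are not taken from this work, which may configure its classifiers differently.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one independent Gaussian per feature and class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            # A small constant keeps variances strictly positive (assumed value).
            variances = [
                sum((v - m) ** 2 for v in col) / n + 1e-6
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (math.log(n / len(X)), means, variances)
        return self

    def _log_posterior(self, row, label):
        log_prior, means, variances = self.stats[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    def predict(self, X):
        return [
            max(self.stats, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]


# Hypothetical two-feature descriptors for pages labeled by aesthetic value.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["beautiful", "beautiful", "ugly", "ugly"]

model = GaussianNaiveBayes().fit(X_train, y_train)
predictions = model.predict([[0.85, 0.15], [0.15, 0.85]])
print(predictions)
```

The same fit and predict interface could wrap the other three methods, which makes it straightforward to compare classifiers on the same selected features.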

Different tests are performed to evaluate the performance of each classifier. The results show that the visual appearance of a web page carries rich information that is not exploited by current web crawlers, which rely only on text content.

We perform classification on some subjective variables (beautiful/ugly and old-fashioned/new-fashioned) and also on the page topic.

Is this page ugly or beautiful?

What about this one?