Popis: |
During the European Cantata project (ITEA project, 2006-2009), a Multi-Content Analysis (MCA) framework for classifying compound images into several categories (text, graphical user interface, medical images, other complex images) was developed at Barco. The framework consists of six parts: a dataset, a feature selection method, a machine-learning-based MCA algorithm, a ground truth, a metrics-based evaluation module and a presentation module. The methodology is built on a cascade of decision-tree-based classifiers combined and trained with the AdaBoost meta-algorithm. To train these classifiers on large datasets without an excessive increase in training time, optimizations were implemented at two levels: the methodology itself (feature selection/elimination, dataset pre-computation) and the decision-tree training algorithm (binary threshold search, dataset presorting and an alternate splitting algorithm). These optimizations have little or no negative impact on the classification performance of the resulting classifiers. Training time was significantly reduced, mainly because the optimized decision-tree training algorithm has a lower algorithmic complexity. The time saved was used to compare results across a greater number of training parameter settings. |