Popis: |
Text classification via supervised learning involves several steps, from processing raw data and extracting features to training and validating classifiers. Within these steps, implementation decisions are critical to the accuracy of the resulting classifier. This paper reports on a study performed to determine the optimum parameter setup for reaching the highest possible accuracy when classifying multilingual (Dutch and English) user profiles, collected from social media, with job titles. The goal is to improve the matching of job vacancies to user profiles in an HR recruitment case. The study includes experiments with eleven labels (job titles), a shifting pivot between test and training datasets, the use of combined n-grams, three feature extraction methods (bag of words (BOW), word frequency or count (WC), and word importance via TF-IDF), the use of word corpora tagged with part-of-speech (POS) tags, and the use of seven well-known classification algorithms: two Support Vector Machine (SVM) systems, two Naive Bayes (NB) approaches, two Maximum Entropy classifiers, and one Decision Tree (DT). Seven experiments were performed, with a combined total of about 1900 training and test runs. The dataset consists of 95,000 profiles that were annotated with eleven job title labels, using a tool developed specially for this purpose. We conclude that the SVM-based classifiers achieved the highest classification accuracy (up to 93% with seven labels). Feature extraction with combined (1,2,3)-grams and with word frequency/importance showed the highest accuracy gain across all classifiers. The most pronounced accuracy gain was achieved by excluding labels that contained overly generic features. The SVM classifiers had already reached their accuracy ceiling in 2/3 of the experiments. We believe that further study into annotating and removing non-specific information can increase this accuracy even more. |