Popis: |
Computing systems are becoming increasingly complex and are assuming more responsibilities in all sectors of human activity. Scientific and technological information is a rich resource, essential for managing research and development programs. Many of today's applications are built as distributed systems. The Internet is one of the best-known distributed systems and is used by nearly everyone today. With a great deal of data available on the net in different languages, efficient methods are essential for extracting useful information from it. Fortunately, the parallel growth of information and of analytical tools offers the promise of advanced decision aids that support research and development more effectively. Data mining, information retrieval and other information-based technologies are receiving increased attention. The importance of English is well established in every field. Arabic is likewise a major natural language, spoken as a first language by over 250 million people in 21 Arab countries and used as a second language in Islamic countries. It belongs to the Semitic family and thus preserves the complexity of this group. Arabic is highly derivational as well as inflected, so it requires good stemming for effective text mining, yet no standard approach to stemming has emerged. This work investigates some of the issues involved in achieving bilingual text mining from large bodies of electronic Arabic-English datasets. The main aim of this thesis is to address these issues and provide an effective framework for doing so. To this end, the thesis evaluates currently proposed preprocessing and self-organizing map (SOM) clustering algorithms. Our proposed MLTextMAES approach can perform the four main stages of standard text mining, including pre-processing, clustering (via SOM) and quality testing.
We have therefore employed SOM as a tool for clustering documents into groups of similar categories. To the author's knowledge, no significant literature is available on the SOM technique applied to Arabic-English text mining. The model is found to be useful in strategic decision-making settings. The results indicate that SOM is a feasible tool for multilingual text mining and presents several advantages over current methods. Our experimental results show improved clustering performance when Arabic-English documents are used as datasets. |