Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/513187
Title: | Penjanaan ringkasan isi utama berdasarkan ciri kata bagi dokumen berita Bahasa Melayu [Headline generation based on word features for Malay news documents] |
Authors: | Mohd Sabri Hasan (P58749) |
Supervisor: | Shahrul Azman Mohd Noah, Prof. Dr. |
Keywords: | News documents; Computer science |
Issue Date: | 28-May-2015 |
Description: | Headline generation is the process of distilling the important information in a text, extractively or abstractively, into a single sentence that represents its main idea. It matters particularly for news articles and for snippet generation in search engines. Current research focuses mainly on English documents. Work on Malay documents is still scarce and concentrates on machine-translation-based methods. Such methods are feasible but do not produce satisfactory headlines, because headline generation depends on the natural-language structure, the genre, and the statistical information of the document. This study therefore aims to develop an extractive headline generation technique for Malay news documents that combines statistical and linguistic methods.
The statistical method identifies significant words and recognises the most important sentence using term weighting. The linguistic method, namely word features, improves the accuracy of significant-word identification and of important-sentence recognition, as well as the linguistic quality of the headlines. The corpus consists of 140 Malay news articles, each with a single reference headline. The articles were selected from the BERNAMA news archive and represent four genres: economy, crime, education, and sports. The reference headlines were produced by three Malay linguistics experts. Analysis of word features in the collection shows that the main idea of a news article can be identified from four features: word position in the sentence, sentence position in the text, acronyms, and words that represent personal names. Words significant to the main idea are determined by their weight, which combines the frequency of the word in the document with its position in the sentence. The first two sentences of a Malay news document are identified as the best candidates for the most important sentence. Bigram clustering of significant words in the most important sentence was extended to acronyms and personal names, and this extension produces headlines of better linguistic quality.
The proposed technique was evaluated using precision (P), recall (R), F-measure, and a set of ROUGE metrics, and compared with the term-frequency and snippet-generation methods. It outperforms both benchmarks, with P = 0.3194, R = 0.5656, F-measure = 0.4012, ROUGE-N = 0.5656, ROUGE-L = 0.3392, ROUGE-W = 0.1186, and ROUGE-S = 0.1232. In conclusion, combining the statistical method with word features yields headlines of good linguistic quality and higher accuracy than the benchmarks. Degree: Ph.D. |
Pages: | 207 |
Call Number: | QA76.9.T48M84674 2015 3 tesis |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
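The description states that word weights combine word frequency in the document with word position in the sentence, and that the first two sentences are the candidates for the most important sentence. The record does not give the actual formula, so the Python sketch below is only an illustration under assumptions: the weighting (frequency scaled by a bonus for early position), the function names `tokenize`, `word_weights` and `pick_key_sentence`, and the toy input are all hypothetical. It omits the acronym and personal-name features and the bigram clustering step.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercased word tokens. A simplification: no Malay stemming or stop-word removal."""
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())


def word_weights(sentences):
    """Assumed weighting, not the thesis formula: frequency * (1 + position bonus)."""
    tokens_per_sentence = [tokenize(s) for s in sentences]
    freq = Counter(tok for toks in tokens_per_sentence for tok in toks)
    weights = {}
    for toks in tokens_per_sentence:
        n = len(toks)
        for i, tok in enumerate(toks):
            # Bonus is 1.0 for the first word of a sentence and shrinks towards 0 for the last.
            position_bonus = 1.0 - i / n
            score = freq[tok] * (1.0 + position_bonus)
            # Keep the best score a word achieves anywhere in the document.
            weights[tok] = max(weights.get(tok, 0.0), score)
    return weights


def pick_key_sentence(sentences, candidates=2):
    """Return the candidate sentence (by default one of the first two) with the highest mean word weight."""
    weights = word_weights(sentences)

    def mean_weight(sentence):
        toks = tokenize(sentence)
        return sum(weights[t] for t in toks) / len(toks) if toks else 0.0

    return max(sentences[:candidates], key=mean_weight)


if __name__ == "__main__":
    # Toy input, not taken from the BERNAMA corpus.
    toy = ["alpha beta", "alpha", "gamma gamma gamma"]
    print(pick_key_sentence(toy))
```

Restricting the search to `sentences[:candidates]` reflects the finding that the first two sentences are the best candidates; a later sentence is never selected, however high its words score.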
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
ukmvital_81755+SOURCE1+SOURCE1.0.PDF (Restricted Access) | | 4.12 MB | Adobe PDF |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
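The evaluation measures named in the description (precision, recall, F-measure) can be illustrated with a short Python sketch. It assumes whitespace tokenisation and unigram overlap between a generated headline and a single reference headline; the function names and the example strings are hypothetical, and the record does not say how the thesis tokenised or averaged its scores. Unigram recall computed this way corresponds to the usual recall-oriented definition of ROUGE-1, which is consistent with the record reporting the same value (0.5656) for R and ROUGE-N.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlap_scores(candidate, reference):
    """Unigram precision, recall and F-measure of a candidate headline against one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Toy strings, not taken from the thesis corpus.
    print(overlap_scores("bank raises rate", "central bank raises key rate today"))
    # Harmonic mean of the averages reported in the record.
    print(round(f_measure(0.3194, 0.5656), 4))
```

Applying `f_measure` to the reported averages gives about 0.408, slightly above the reported F-measure of 0.4012. That gap would be expected if the F-measure was computed per document and then averaged, but the record does not state which averaging was used.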