Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/476531
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Ahmed Moosajee Patel, Prof. | |
dc.contributor.author | Ali Mohammad Seyfi (P47909) | |
dc.date.accessioned | 2023-10-06T09:20:28Z | - |
dc.date.available | 2023-10-06T09:20:28Z | - |
dc.date.issued | 2012-09-03 | |
dc.identifier.other | ukmvital:117405 | |
dc.identifier.uri | https://ptsldigital.ukm.my/jspui/handle/123456789/476531 | - |
dc.description | A Web crawler systematically collects information about documents on the Web to build an index of the data it searches, and keeps that index up to date through subsequent crawls. As an automatic indexer, the crawler lists the documents relevant to a subject or topic of the kind one would expect in a typical user search query. Traditional general-purpose Web crawlers do not scale easily, since the Web is not under the control of a single operator or proprietor. They may also not be set up to target specific topics for accurate indexing, and they struggle to keep indexes of the whole Web current because of the distribution, subject range and volume involved. To overcome these shortcomings, focused crawlers, also known as topic-specific crawlers, follow the hyperlinked structure of the Web to identify and harvest topically relevant pages, improving their performance in terms of accuracy, currency and speed. Many crawler solutions exist, but they rely on the whole content of a Web page to evaluate the relevance of an unvisited link, that is, a hyperlink pointing to a Web page that has not yet been downloaded. This thesis shows experimentally, however, that the relevance of an unvisited Web page depends mainly on both the anchor text of the link to that page and the text surrounding that link. A novel design of a crawler search system based on the Treasure Graph, also called T-Graph, is studied and implemented. The T-Graph can be constructed either top-down or bottom-up, whereas the typical Context Graph is built bottom-up. In addition, the Context Graph is built by continually searching the Web for the parent nodes of each child node, whereas in the T-Graph each child node can optionally be inserted under its parent node after a local search in the repository or a sample data set, rather than exclusively after a search on the Web. 
This study aims to determine the novel crawler’s effectiveness and viability in fetching topic-specific documents. One of the main objectives of this thesis is to improve the accuracy of Web page classification, which is made possible by defining a flexible interpretation of the surrounding text. The Dewey Decimal Classification (DDC) system is applied as a basis for classifying the text into appropriate topic/subject boundaries. The other significant objective is to reach target documents in the shortest possible time. This is accomplished by finding the best-matching node(s) for the document in the T-Graph before calculating the distance in terms of documents to be downloaded. As a result, owing to better determination of the topic boundary and a significant decrease in the volume of downloaded documents in text format, this strategy helps the crawler update its indexes more pragmatically, accurately and rapidly. The developed Web search system based on T-Graph is implemented as a prototype, and its performance is analyzed and compared with the functionality and performance of the Context Focused Crawler to observe its characteristics in terms of accuracy. The prototype additionally yields improved values for several system parameters that can be applied to later version(s). The results show precision and recall values of approximately 0.5 each. They indicate that the proposed algorithm outperforms the Context Focused Crawler in terms of accuracy, with the number of retrieved on-topic pages improved by 14% for crawls with generic seed URLs and by 22% for crawls with on-topic seed URLs. The "Certification of Master's / Doctoral Thesis" is not available. | |
dc.language.iso | eng | |
dc.publisher | UKM, Bangi | |
dc.relation | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat | |
dc.rights | UKM | |
dc.subject | Web search engines | |
dc.subject | Internet searching | |
dc.subject | Universiti Kebangsaan Malaysia -- Dissertations | |
dc.subject | Dissertations, Academic -- Malaysia | |
dc.title | Development of an updating topic specific web search crawler system based on T-Graph | |
dc.type | theses | |
dc.format.pages | 168 | |
dc.identifier.callno | ZA4235.S459 2012 3 tesis | |
dc.identifier.barcode | 002566 | |
dc.identifier.barcode | 004077(2019)(PL2) | |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
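The abstract's central idea, that an unvisited link can be judged from its anchor text and the text around it rather than from the full target page, can be illustrated with a minimal sketch. This is not the thesis's T-Graph algorithm: the function names, the term-overlap scoring and the `anchor_weight` parameter are all hypothetical choices made for illustration. The second helper shows how precision and recall, the measures reported in the abstract, are conventionally computed.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_relevance(anchor_text, surrounding_text, topic_terms, anchor_weight=0.6):
    """Score an unvisited link using only its anchor text and nearby text.

    Each part is scored as the fraction of its tokens that are topic terms.
    The two fractions are combined with a hypothetical weight (an assumption,
    not a value taken from the thesis).
    """
    def overlap(text):
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        return sum(t in topic_terms for t in tokens) / len(tokens)

    return (anchor_weight * overlap(anchor_text)
            + (1 - anchor_weight) * overlap(surrounding_text))


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant page ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

A crawler built along these lines would rank its frontier by `link_relevance` and download the highest-scoring links first; how the actual system combines these signals with T-Graph node matching and DDC classification is described in the thesis itself, not here.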
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ukmvital_117405+SOURCE1+SOURCE1.0.PDF (Restricted Access) | | 438.37 kB | Adobe PDF | |