Development of an updating topic specific web search crawler system based on T-Graph

Ali Mohammad Seyfi

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476531

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Ahmed Moosajee Patel, Prof.
dc.contributor.author	Ali Mohammad Seyfi
dc.contributor.other	P47909	-
dc.date.accessioned	2023-10-06T09:20:28Z	-
dc.date.available	2023-10-06T09:20:28Z	-
dc.date.issued	2012-09-03
dc.identifier.other	ukmvital:117405
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/476531	-
dc.description	A Web crawler systematically collects information about documents from the Web to build an index of the data it searches, and it keeps that index up to date through subsequent crawls. As an automatic indexer, the crawler lists the documents relevant to a subject or topic that one would expect in a typical user search query. Traditional general-purpose Web crawlers do not scale easily because the Web is not under the control of a single operator or proprietor. They also may not be set to target specific topics for accurate indexing, and they lag behind in keeping indexes of the whole Web current because of the distribution, subject and volume involved. To overcome these shortcomings, focused crawlers, also known as topic-specific crawlers, follow the hyperlinked structure of the Web to identify and harvest topically relevant pages, improving their accuracy, currency and speed. Many crawler solutions exist, but they rely on the whole content of the Web page to evaluate the relevance of an unvisited link, that is, an existing hyperlink pointing to a Web page that has not yet been downloaded. The experiments in this thesis, however, indicate that the relevance of an unvisited Web page depends mainly on both the anchor text of the link to that page and the text surrounding that link. In this thesis, a novel design of a crawler search system based on the Treasure Graph, also called T-Graph, is studied and implemented. The T-Graph can be constructed by either top-down or bottom-up traversal, whereas the typical Context Graph is built bottom-up. Also, the Context Graph is built by performing a constant search on the Web for the parent nodes of each child node, while in the T-Graph each child node can optionally be inserted under its parent node after a local search in the repository or a sample data set, and not exclusively a search on the Web. 
This study aims to determine the novel crawler’s effectiveness and viability in crawling to fetch topic-specific documents. One main objective of this thesis is to enhance the accuracy of Web page classification, which is made possible by defining a flexible interpretation of the surrounding text. The Dewey Decimal Classification (DDC) system is applied as a basis for classifying the text into appropriate topic/subject boundaries. The other significant objective is to reach target documents in the shortest possible time. This is accomplished by reaching the best-matching node(s) for the document in the T-Graph before calculating the distance in terms of documents to be downloaded. As a result, owing to better determination of the topic boundary and a significant decrease in the volume of documents downloaded in text format, this strategy helps the crawler update its indexes more pragmatically, accurately and rapidly. The Web search system based on T-Graph is implemented as a prototype, and its performance is analyzed and compared with the functionality and performance of the Context Focused Crawler to observe its characteristics in terms of accuracy. The prototype additionally yields improved values for several system parameters that can be applied to its later version(s). The results show precision and recall both at approximately 0.5. They also indicate that the proposed algorithm outperforms the Context Focused Crawler in accuracy, retrieving 14% more on-topic pages for crawls with generic seed URLs and 22% more for crawls with on-topic seed URLs.
dc.description	"Certification of Master's / Doctoral Thesis" is not available
dc.language.iso	eng
dc.publisher	UKM, Bangi
dc.relation	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
dc.rights	UKM
dc.subject	Web search engines
dc.subject	Internet searching
dc.subject	Universiti Kebangsaan Malaysia -- Dissertations
dc.subject	Dissertations, Academic -- Malaysia
dc.title	Development of an updating topic specific web search crawler system based on T-Graph
dc.type	theses
dc.format.pages	168
dc.identifier.callno	ZA4235.S459 2012 3 tesis
dc.identifier.barcode	002566
dc.identifier.barcode	004077(2019)(PL2)
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
ukmvital_117405+SOURCE1+SOURCE1.0.PDF Restricted Access		438.37 kB	Adobe PDF
