IBM Israel Research Seminars

 

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules for transforming a given URL into others that are likely to have similar content. DustBuster is able to detect DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from this information to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

This is joint work with Dr. Idit Keidar and Dr. Ziv Bar-Yossef.

About the speaker
Uri Schonfeld worked at IBM for seven years on various projects. He took part in the "Reef" project, in which a storage controller running Linux was developed. He also worked on two file systems: GPFS, a well-known parallel file system, and zFS, a research-oriented distributed file system. He was the lead architect of the iBoot project, which was the first to successfully boot the Windows operating system on a diskless computer using iSCSI. Finally, SearchMe, a project he initiated, led to the filing of a patent in the field of web search engines.
Since 2004 he has been a full-time graduate student in the Electrical Engineering department of the Technion, working with his supervisors, Dr. Ziv Bar-Yossef and Dr. Idit Keidar, on the detection of duplicate documents on web sites. He has now completed most of the requirements for his master's degree.
His research interests include web search engines, data mining, machine learning and distributed systems.