Today the Web is a huge, heterogeneous data source. The rapid growth in data volume and the dynamic nature of the Web make it difficult for users to find relevant information for a specific domain. In many cases people are interested in information about specific objects, not in web pages.
Object-level vertical search engines, such as Libra (http://libra.msra.cn) and Windows Live Product Search (http://product.live.com), are designed to solve this problem. Users can directly retrieve the objects they want through well-designed interfaces or complex query languages. Although object-level search engines are very helpful, they are expensive to build and run. The well-known ones are all operated by large organizations or companies. What can you find in their repositories? Their data, of course!
Could a small company with a good idea have one for its own business? No way. Could a small community with special interests have one for its own research? No way. Well, welcome to SESQ. A fully functioning object-level vertical search engine can be built in WEEKS at LITTLE cost.
What is SESQ? It is a pipeline for building object-level vertical search engines. What must I do to set up a search engine? First, you specify the data schema of the domain and provide seeds for the data of that schema. Then you write extraction rules that indicate how to obtain instance data of the schema from relevant web pages. Finally, you collect some training examples for our classifiers if you want your objects categorized. That's all. SESQ will do everything else to make you a fresh, sparkling search engine! It finds new web sites and web pages relevant to the schema by crawling. It extracts the instance data for the schema from those web pages. It provides a highly efficient data storage and index structure for the collected data. It offers an interactive query interface that lets end users express structured queries on the data. In addition, the data can be further analyzed with analytical tools (such as OLAP).
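To make the three user-supplied inputs concrete, here is a minimal sketch in Python of what a schema, a seed list, and a set of extraction rules might look like, and how rules could be applied to a page. It is not SESQ's actual script language, which is not shown here: the `DomainConfig` class, the `extract` function, the regex-based rule format, and the example URL and page are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DomainConfig:
    """Hypothetical stand-in for a domain configuration (not SESQ's real format)."""
    schema: dict   # attribute name -> Python type used to coerce extracted text
    seeds: list    # starting URLs a crawler would begin from
    rules: dict = field(default_factory=dict)  # attribute -> regex with one capture group


def extract(config: DomainConfig, page: str) -> dict:
    """Apply each extraction rule to a page and coerce matches to the schema types."""
    record = {}
    for attr, pattern in config.rules.items():
        match = re.search(pattern, page)
        if match:
            # Convert the captured text to the type declared in the schema.
            record[attr] = config.schema[attr](match.group(1).strip())
    return record


# Assumed example domain: research papers with a title and a publication year.
config = DomainConfig(
    schema={"title": str, "year": int},
    seeds=["http://example.org/papers/"],
    rules={
        "title": r"<h1>(.*?)</h1>",
        "year": r"Published:\s*(\d{4})",
    },
)

page = "<html><h1> Object-Level Search </h1><p>Published: 2007</p></html>"
print(extract(config, page))  # {'title': 'Object-Level Search', 'year': 2007}
```

Attributes whose rule does not match are simply left out of the record, so a page that is irrelevant to the schema yields an empty dictionary. Crawling from the seeds, classification, storage, and querying are left out of this sketch.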
A number of approaches in the literature have similar objectives, for example Lixto, Araneus, Object-Web wrapper, and Xyleme. Compared with these systems, SESQ is a complete solution. SESQ stands for Search, Extract, Store and Query, the four major steps in building a domain-specific search engine, and it can be customized to different domains by specifying a schema, seed web sites, and extraction rules. To reduce the complexity of constructing a domain-specific search engine, the tasks of searching, extracting, and querying can all be configured with one uniform script language. In the remainder of this proposal we first briefly outline the system, then give the methods to model, find, and extract data, followed by a description of the architecture and components of the system. Finally, we conclude the proposal with our plan for the demonstration.
[Figure panels: Query Interface; Statistic Demo]