Intelligent Search on the Internet - the Vševěd System

P. Berka, M. Sochorová
Laboratory of Intelligent Systems, University of Economics,
Ekonomická 957, 142 01 Prague 4, Czech Republic
e-mail: {berka|sochorov}@vse.cz  

Abstract:

During the last few years, the World-Wide Web has become one of the most widespread technologies of information presentation. Because of its enormous growth, the task of finding information about a specific topic can be very hard. Search engines were developed to automate this process. Nevertheless, “classical” search engines can have significant drawbacks for a common inexperienced user (different locations, different interfaces etc.).

The aim of the Vševěd metasearch system is to simplify the search for information (Web pages) on the Internet. The system was inspired by some existing metasearch systems, mainly by AskJeeves. Our system differs mainly in its focus on the Czech environment (Czech language, Czech search engines).

 

Search and metasearch

During the last few years, the World-Wide Web has become one of the most widespread technologies of information presentation. Because of its enormous growth, the task of finding information about a specific topic can be very hard. Search engines help to automate this task by looking for Web pages according to words/phrases given by the user. The search is based on indexes created off-line (manually, e.g. Yahoo, or automatically, e.g. AltaVista). The standard way of using a search engine consists of a sequence of actions:

The use of search engines can have significant drawbacks for a common (inexperienced) user. There is a number of search engines at different locations, each having its own way of interacting with users; the user must thus remember their URLs and how to work with each of them. Hence the idea of “metasearch” emerged, to help users find relevant information in a more convenient way. The basis of all metasearch engines is to give access to more than one search engine. While sharing this idea, the abilities of different systems can vary:

Advantages of metasearch are:

 

The Aim of the Vševěd System

The aim of the Vševěd metasearch system is to simplify the search for information (Web pages) on the Czech Internet. The system was inspired by AskJeeves, a metasearch engine that uses both a local database and “standard” search engines (e.g. AltaVista, Lycos, Infoseek, Excite) to find information relevant to the user's query. Unlike AskJeeves, our system is oriented towards the Czech environment (Czech language, Czech search engines).

Our work is focused on pre- and postprocessing. With the exception of the CBR module, we do not address the problem of searching the Internet itself.

Fig. 1 Overall scheme of the system

The main scheme of Vševěd is shown in Fig. 1. The core modules are already working; the intelligent modules are being added gradually. A full line denotes already realized modules, a dashed line denotes modules under development, and a dotted line denotes planned modules.

Vševěd has the following components:

A consultation with Vševěd proceeds as follows:

  1. The user types the query as a Czech sentence.
  2. The system
    • understands the core of the query,
    • finds possible answers in its own knowledge base,
    • sends the query to other search engines,
    • sorts and corrects the answers from the engines based on some conceptual knowledge.
  3. The user chooses some answer.
  4. Vševěd interprets the user's actions as relevance feedback (and possibly updates the case base).

 

The components of the system

Following the rapid prototyping paradigm of software development, we deployed a rather simple first version, which will be gradually extended. This section describes both the current state of the modules and their further development.

The minimal realised core of Vševěd consists of:

  1. a simple preprocessing of the query
  2. querying other search engines
  3. a simple postprocessing of the found links
  4. a simple presentation of the found links

The “intelligent” part of the system is spread across the following modules:

 

Preprocessing of the query

Preprocessing of the query seems crucial for bridging the gap between unskilled users and search systems. The usual way of querying, i.e. entering a sequence of keywords, cannot benefit from the smart capabilities of different search engines. At present, the preprocessing consists of

Further extensions will be

Language preprocessing

The processing of a query given in plain (Czech) language necessarily includes the solution of these tasks:

Including a thesaurus for expanding the query with equivalent, broader/narrower, or otherwise related terms might certainly be of use. We can benefit from previous work on the automatic translation of queries into database queries [Strossa, 1994].

The result of such preprocessing will be a formula using both Boolean operators (and, or, not) and proximity operators (near).
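As an illustration, this step might be sketched in Perl as follows. This is only a hypothetical example, not the actual Vševěd code: the stop-word list is a tiny sample and only conjunctive formulas are produced.

    # A hypothetical sketch: drop stop words and connect the remaining
    # terms with 'and'. The stop-word list is a small illustrative sample.
    use strict;
    use warnings;

    my %stopwords = map { $_ => 1 } qw(a i je jsou na o v co jak kde);

    sub query_to_formula {
        my ($query) = @_;
        my @terms = grep { length && !$stopwords{lc $_} } split /\W+/, $query;
        return join ' and ', @terms;
    }

    print query_to_formula('Co je data mining'), "\n";   # prints: data and mining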

 

Querying other search engines

Standard search engines are the main source of the information the user is looking for. In the current implementation, the user must select the engines he wants to use from the given list (see Fig. 2). The user may also choose to test the accessibility of pages, but this option takes some additional time.

A possible extension of this module is to select (according to the query and Vševěd's experience) only the most relevant search engines. Such an approach is used e.g. in the SavvySearch system [Howe, 1997].

The result of the search is a list of relevant links. We use the information about URLs and ‘names’ (the content of the HTML TITLE tag) for further postprocessing.
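A minimal sketch of the accessibility test and of reading the TITLE tag, using the standard Perl module LWP::Simple; the regular expression is a simplification, not the actual Vševěd code:

    use strict;
    use warnings;
    use LWP::Simple qw(head get);

    # Returns the title, '' for an empty TITLE, or undef if inaccessible.
    sub check_link {
        my ($url) = @_;
        return undef unless head($url);          # the accessibility test (HTTP HEAD)
        my $html = get($url);
        return undef unless defined $html;
        my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}is;
        return defined $title ? $title : '';
    }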

 

Fig. 2 Vševěd input form

Postprocessing

The postprocessing module should better organize the results obtained from different search engines. This can be achieved by

Some of these postprocessing steps have already been implemented; others are under development. At present, the postprocessing is based only on the syntax of the links (URLs). We decompose each URL returned from the search engines into server, path, owner, file, first and extension parts.

E.g. if the URL is http://lisp.vse.cz/~sochorov/Vseved/main.new.cgi, then
          server  ... lisp.vse.cz
          path  ... /~sochorov/Vseved
          owner  ... ~sochorov
          file  ... main.new
          first  ... main
          ext  ... cgi
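In Perl, this decomposition can be sketched as follows; the regular expressions are illustrative assumptions, not the actual Vševěd code:

    use strict;
    use warnings;

    sub decompose_url {
        my ($url) = @_;
        my ($server, $rest) = $url =~ m{^\w+://([^/]+)(/.*)?$};
        $rest = '/' unless defined $rest;
        my ($path, $file_part) = $rest =~ m{^(.*)/([^/]*)$};
        my ($owner)            = $path =~ m{(~[^/]+)};        # '~name', if present
        my ($file, $ext)       = $file_part =~ m{^(.*)\.([^.]+)$};
        $file = $file_part unless defined $file;
        my ($first)            = split /\./, $file;
        return { server => $server, path => $path, owner => $owner,
                 file => $file, first => $first, ext => $ext };
    }

    my $p = decompose_url('http://lisp.vse.cz/~sochorov/Vseved/main.new.cgi');
    # now $p->{server} eq 'lisp.vse.cz', $p->{first} eq 'main', etc.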

We defined some heuristics to:

  • remove links to some services of the used search systems (e.g. links to the altavista server),
  • remove multiple links to the same page (files in the same path with the same first part of the filename that differ in the rest of the filename),
  • group several retrieved pages from one directory into a link to this directory.

We also make a URL-based analysis of the type of the page (a Perl sketch of this analysis and of the duplicate removal follows the list below), e.g.:

  • a personal homepage is indicated by the symbol ‘~’ in the owner part,
  • a homepage of an organization is indicated if the URL consists only of the server part.
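A minimal sketch of these heuristics, building on the decompose_url() sketch above; the grouping key and the type labels are illustrative assumptions:

    use strict;
    use warnings;

    sub page_type {
        my ($p) = @_;                      # a hashref from decompose_url()
        return 'personal homepage'     if defined $p->{owner};
        return 'organization homepage' if !defined $p->{path} || $p->{path} eq '';
        return 'other';
    }

    sub remove_duplicates {
        my (@pages) = @_;
        my (%seen, @unique);
        for my $p (@pages) {
            # same server, path and first part of the filename => same page
            my $key = join '|', $p->{server}, $p->{path},
                                defined $p->{first} ? $p->{first} : '';
            push @unique, $p unless $seen{$key}++;
        }
        return @unique;
    }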

For each retrieved link (page) we compute the relevance rel as:

rel = relmax                          if the link points to a directory with more relevant pages,
rel = relmin                          if the corresponding page has an empty TITLE tag,
rel = w1 * a/(a+b) + w2 * a/(a+c)     otherwise,

where a   is the number of words occurring both in the query and in the page description,
      a+b is the number of words in the query,
      a+c is the number of words in the page description (TITLE).

The relevance is used to create a sorted output.

Instead of a symmetric relation between the words in the query and in the document description (e.g. a/(a+b+c)), we give more weight to the first term a/(a+b). So we prefer documents that are described using all the words in the query. The idea behind this is that a common, “lazy” user gives only the necessary terms in his query.

In the current implementation, w1 = 100 and w2 = 1 (this setting corresponds to sorting with a/(a+b) as the primary key and a/(a+c) as the secondary key, rel lying in the interval (0, 101]), relmax = 102 (links to a directory obtain the highest relevance) and relmin = 0 (links with an empty TITLE obtain the lowest relevance).
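With these weights, the computation can be sketched as follows; the tokenization of the query and the TITLE is a simplifying assumption:

    use strict;
    use warnings;

    my ($W1, $W2) = (100, 1);                     # w1 and w2

    sub relevance {
        my ($query, $title) = @_;
        return 0 if !defined $title || $title eq '';     # relmin: empty TITLE
        my @q = grep { length } split /\W+/, lc $query;
        my %t = map { $_ => 1 } grep { length } split /\W+/, lc $title;
        my $a        = grep { $t{$_} } @q;        # words in both query and TITLE
        my $a_plus_b = scalar @q;                 # words in the query
        my $a_plus_c = scalar keys %t;            # words in the TITLE
        return 0 unless $a_plus_b && $a_plus_c;
        return $W1 * $a / $a_plus_b + $W2 * $a / $a_plus_c;
    }

    # relevance('data mining', 'data mining') yields 101, the maximum for an
    # ordinary link; directory links get relmax = 102 before sorting.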

Further extensions of the postprocessing will include

  • more syntactical rules
  • usage of the WWW ontology (to recognize types of pages, or similar pages)
  • grouping links according to the contents of the documents

 

WWW Ontology based knowledge

For knowledge-based tasks related to the Internet (and the WWW in particular), knowledge modeling techniques seem to suit well for capturing static domain knowledge describing e.g. the typical internal structure of and mutual links among WWW pages. Heuristics based on such reusable conceptual knowledge bases (ontologies) will be used for postprocessing the results of search (recognizing the type of a page, recommending similar pages, etc.).

Creating WWW ontologies is a subject of related research [Šimek, Svátek 1998].

 

The CBR Module

The CBR module should give direct answers to some common questions. Instead of (or in parallel with) the search on the Internet, the CBR module will search its local case base of links. The case base will be organized as a tree, in a way similar to Yahoo-like indexes. The is-a hierarchy will be used to retrieve cases that not only correspond to the user query but are also similar in the sense of the general/specific relation.

The first version of the CBR module is based on the idea that both the query and the case description consist of a set of words. On this word level (when looking just for word matches), we can obtain

  1. exact answer (terms in the query are the same as terms in the case description)

    {Query.terms} = {Case.description}
    e.g. {data, mining} = {data, mining}

    in this situation, the result will be the matching case,

  2. more general answer (terms in the case description are more general)

    {Query.terms} ⊃ {Case.description}
    e.g. {data, mining, and, knowledge, discovery} ⊃ {data, mining}

    in this situation, the results will be the general case and its “children”,

  3. more specific answer (terms in the case description are more specific)

    {Query.terms} ⊂ {Case.description}
    e.g. {data, mining} ⊂ {data, mining, and, knowledge, discovery}

    in this situation, the results will be the specific case, its “parent” and its “siblings”

  4. partial answer (terms in the case description have a nonempty intersection with the terms in the query)

    e.g. {data, mining} ∩ {OKD, mining, company} = {mining}

  5. no answer (empty intersection of terms)

In the case of a partial answer or no answer, no case is retrieved from the case base.

 

For each retrieved case, we compute the relevance as

a/(a+b+c)

where a   is the number of words occurring both in the query and the case description,
      a+b is the number of words in the query,
      a+c is the number of words in the case description.
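Both the match classification and this relevance measure (which equals the number of common words divided by the number of all distinct words) can be sketched at the word level as follows; tokenization is omitted and the words are assumed to be already normalized:

    use strict;
    use warnings;

    sub match_type {
        my ($query, $case) = @_;                 # arrayrefs of words
        my %q = map { $_ => 1 } @$query;
        my %c = map { $_ => 1 } @$case;
        my $common = grep { $c{$_} } keys %q;    # size of the intersection
        return 'none'     if $common == 0;
        return 'exact'    if $common == keys(%q) && $common == keys(%c);
        return 'general'  if $common == keys(%c);   # query terms are a superset
        return 'specific' if $common == keys(%q);   # query terms are a subset
        return 'partial';
    }

    sub case_relevance {                         # a/(a+b+c): common / all distinct
        my ($query, $case) = @_;
        my %q = map { $_ => 1 } @$query;
        my %c = map { $_ => 1 } @$case;
        my $a     = grep { $c{$_} } keys %q;
        my %union = (%q, %c);
        my $all   = scalar keys %union;
        return $all ? $a / $all : 0;
    }

    # match_type([qw(data mining)], [qw(data mining and knowledge discovery)])
    # returns 'specific'; the corresponding relevance is 2/5.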

The prototype consists of two parts:

  1. a MySQL database of direct answers,
  2. a CBR system realised in CLIPS.

The module retrieves from the MySQL database all records that contain some term from the query, together with their ‘parents’, ‘siblings’ and ‘children’ in the hierarchy. These records are then processed by the CLIPS rules. Since we do not have a taxonomy of Czech words yet, synonyms are kept among the related terms.
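A sketch of this retrieval step, written with the Perl DBI module under an assumed schema (a table cases with columns id, parent_id and description); neither the schema nor the SQL reflects the actual Vševěd database:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=vseved', 'user', 'password',
                           { RaiseError => 1 });

    # Records containing a query term ...
    my $hits = $dbh->selectall_arrayref(
        'SELECT id, parent_id FROM cases WHERE description LIKE ?',
        { Slice => {} }, '%mining%');

    # ... plus their parents, children and siblings in the hierarchy;
    # the retrieved records are then handed over to the CLIPS rules.
    for my $hit (@$hits) {
        my $family = $dbh->selectall_arrayref(
            'SELECT * FROM cases WHERE id = ? OR parent_id = ? OR parent_id = ?',
            { Slice => {} },
            $hit->{parent_id},    # the parent
            $hit->{id},           # the children
            $hit->{parent_id});   # the siblings
    }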

The further development of the CBR module will go towards a better organization of cases (DAGs instead of trees), a better representation of case descriptions and better similarity measures.

The case base is created manually for now; as a further step, we plan to use machine learning methods to update this base automatically.

 

The presentation of results

The answers from all engines are shown on one HTML page as an (index) list of links, together with some additional information obtained during postprocessing (relevance, type of page). The links are sorted according to the estimated relevance. The links followed by the user are stored for further analysis in the (not yet realized) ML module.

Fig. 3 Vševěd results

 

Possible development

  • grouping related links according to the context

 

Implementation notes

Vševěd is written in Perl and uses the MySQL database and CLIPS (C Language Integrated Production System). Because of its rule-based programming paradigm, CLIPS is suitable for encoding the postprocessing knowledge. Its drawback is that the computation is rather slow.

The current version of the system can be found at http://lisp.vse.cz/~sochorov/Vseved/ded-in.htm. Multiple users may use it concurrently without interference, as temporary files are generated dynamically.

 

References

[Aha, 1991] Aha,D. - Kibler,D. - Albert,M.: Instance-based learning algorithms. Machine Learning, 6, 1991.

[Etzioni, 1997] Etzioni,O.: Moving Up the Information Food Chain. AI Magazine Vol. 18, No. 2, 1997.

[Howe, 1997] Howe,A. - Dreilinger,D.: SavvySearch - A Metasearch Engine That Learns Which Search Engines to Query. AI Magazine Vol. 18, No. 2, 1997.

[Luke, 1997] Luke, S. - Spector, L. - Rager, D.: Ontology-Based Knowledge Discovery on the World-Wide Web. In: AAAI96 Workshop on Internet-based Information Systems.

[Mladenic, 1997] Mladenic,D.: Machine Learning on non-homogeneous, distributed text data. PhD thesis proposal, 1997.

[Popovič, 1990] Popovič, M. - Willett, P.: Processing of Documents and Queries in a Slovene Language Free Text Retrieval System. Literary and Linguistic Computing, Vol. 5, No. 2.

[Popovič, 1992] Popovič, M. - Willett, P.: The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data. Journal of the American Society for Informational Science, Vol. 43, No. 5.

[Strossa, 1994] Strossa, P.: Zpracování informačních fondů; — sešit č. 2: Algoritmizace a automatizace zpracování textových informací. Praha, VŠE 1994.

[Šimek, Svátek 1998] Šimek, P. - Svátek, V.: An Integrated Ontology for the WWW. Submitted to EKAW’99.