Self-organising web portal with evolutionary links panel
Taras Filatov and Viktor Popov
Wessex Institute of Technology,
Abstract
Web portals are often created
to unite and classify information from different web sources in order to satisfy
the needs of a particular group of internet/intranet users. The main difficulty
with such portals is that maintaining them takes considerable time and
labour. A further task is to provide users with adaptive
interfaces that recognize their interests and help them find and filter the
information they need.
We propose an approach that
combines search engine elements with an evolutionary
algorithm driving the navigational interface of the links panel. The system is
implemented as a web-based application.
Keywords:
internet, knowledge, portal, self-organising, evolutionary algorithm, keyword, navigation,
links panel, search spider
1 Introduction
The modern Internet and the pace
of its development demand new technologies to organise and represent
data. The problem addressed in the current research is that of web portals. It
is common for a number of websites to be devoted to some field and for
a certain group of visitors to be interested in this field. In these cases it
is convenient to have a web portal as a central place to visit and check for
updates. However, maintenance becomes a problem. Such a portal must be
maintained on a regular basis: the related websites must be checked for new
materials, and the data collected and prepared in a convenient form for the users
of the portal. This is very time-consuming and requires diligent work by content
managers.
The main aim of the current
research is to find ways to automate these processes of data
collection and management, and to create a method for maintaining a self-organising web
portal.
2 Related work
The idea of automated and
user-oriented data pre-processing for the web is not new. A number of
systems have been developed to organise the information requested by users into semantic
data collections [1, 2], which makes it more convenient to browse and search for
related documents. In these cases, however, users need to know what they are
looking for in order to specify a query for the search engine. In
other systems, users are faced with a knowledge base containing a host of
categories, which may make the search confusing. We believe it is better to
propose a random set of documents initially, comparable in size to
the navigational panels most web surfers are used to, and then let the
system gradually recognize the user’s interests and modify the set
accordingly. A similar approach was
presented in [3], where a method for dynamic link generation is proposed. The
online module, however, was not developed. The main problem, we assume, is
that no mechanism for data collection was implemented. In addition, no
mechanism is mentioned for remembering users, so every user is new to the
system on each visit, which reduces the possibilities for analysis and adaptation. The
mechanism for fetching links is not presented in detail, and no specific
algorithm is proposed. In the current work we address these problems and
improve on the approach.
3 Problem description
We define all the pages of
the related websites as the source data for the portal. By related
websites we mean a predefined set of websites which are frequently visited by
a certain group of internet surfers and are therefore selected to form a self-organising
web portal for their needs.
The following tasks should be
solved to create a self-organising portal:
The task, therefore, is to
develop a first instance of a self-organising web portal with the
abovementioned characteristics. At the current stage of the project, the data
collection and navigation interface elements have been developed; these form
the core of the self-organising web portal solution.
4 Method
As a comprehensive solution
to these tasks we propose a combination of search engine mechanisms
and a special links panel based on an evolutionary algorithm. A search spider
crawls through the web pages of the selected (predefined) websites and indexes
their contents. A special algorithm selects pages from the database and
proposes them to the visitor by placing them in the links panel (navigation
menu) of the portal. The algorithm takes into account the
similarity of the pages (stored in the database) and the user’s response to the proposed
links.
4.1 Data collection and processing
A data collection mechanism
is essential to collect and update the information automatically. One
available solution is RSS, in which a small script is
installed on each of the selected websites, enabling it to export all its
documents in a unified XML format. However, this solution requires some
intervention in all the selected websites. We have therefore decided to implement
a search spider mechanism of the kind commonly used in search engines. Such spiders
retrieve pages in the same way that visitors’ browsers do, namely over the HTTP
protocol. The only difference from search engines in our case is that
the search spider is limited to certain websites and is not allowed to index external
links.
Predefined data:
Algorithm:
The scheme of the data
collection and processing part is presented in Fig. 1.

Fig. 1. Structure of the data collection part
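The restricted spider described in this section can be illustrated with a short sketch. It is written in Python for brevity and is not the PHP code of the actual system. The names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are hypothetical, and page retrieval is passed in as a function so that the traversal logic can be read and tested on its own. The sketch shows only the two properties stated above: pages are visited breadth-first, and links leading outside the predefined websites are never followed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, max_pages=1000):
    """Breadth-first crawl restricted to the hosts of start_urls.

    fetch(url) is a hypothetical callable returning the HTML of a page
    as a string, or None if the page cannot be retrieved.
    Returns a dict mapping each visited URL to its HTML.
    """
    allowed_hosts = {urlparse(u).netloc for u in start_urls}
    queue = deque(start_urls)
    seen = set(start_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc not in allowed_hosts:
                continue  # external link: never followed or indexed
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In a real deployment `fetch` would wrap an HTTP client and the returned pages would be passed on to the keyword indexer; both are outside the scope of this sketch.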
4.2 Evolutionary links panel
A web portal equipped with the
abovementioned automated data collection mechanisms will fill the database with
a large number of records, including page URLs, indexed keywords, titles and accompanying
data. The next problem is how to create an appropriate navigational
interface to filter this information and propose it to visitors. The system
should not overload visitors with information. A user-oriented approach should
be taken: the preferences of each visitor should be remembered by the
system. To this end, an evolutionary algorithm was developed to form a
special links panel which provides a navigational interface for a self-organising
web portal. The algorithm is described below by 4 preparatory stages and by a
scheme of the main part, which is presented in Fig. 2.
Predefined data:
Algorithm:

Fig. 2. Flow chart of the evolutionary algorithm for the links panel
The proposed algorithm provides a
navigational interface which adapts to visitors and their current needs.
The key point is that the links panel reacts to the user’s activity: when
interest is indicated, the system adds related pages to the link set as
replacements for the pages with the weakest ‘fit’. The fit of a page is increased when the page is
clicked and decreases with time. The panel, therefore, adapts to the
user’s current interests and proposes documents on their subject of interest
from different websites. When no interest is indicated, the system
concludes that the visitor is not interested in the current topic. It then
gradually returns to a random set, which is the initial state of the set. After that
the system is ready to work with new subjects.
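The behaviour described above can be illustrated with a small sketch. It is not the algorithm of Fig. 2, whose exact steps are not reproduced here, but one possible reading of the prose. The class name `LinksPanel`, the parameters `boost`, `decay` and `replace`, and the `similar` callback that returns related pages are all assumptions made for the example.

```python
import random


class LinksPanel:
    """Fixed-size set of links whose 'fit' rises on clicks and decays over time."""

    def __init__(self, all_pages, similar, size=5, boost=1.0, decay=0.25,
                 replace=2, rng=None):
        self.all_pages = list(all_pages)
        self.similar = similar      # hypothetical: page -> list of related pages
        self.boost, self.decay, self.replace = boost, decay, replace
        self.rng = rng or random.Random()
        # Initial state: a random set of pages, all with zero fit
        self.fit = {p: 0.0 for p in self.rng.sample(self.all_pages, size)}

    def _swap(self, old, new):
        del self.fit[old]
        self.fit[new] = 0.0

    def click(self, page):
        """Interest shown in a page of the panel: reward it, pull in related pages."""
        self.fit[page] += self.boost
        candidates = [p for p in self.similar(page) if p not in self.fit]
        # Weakest links first; the clicked page itself is never evicted
        weakest = sorted((p for p in self.fit if p != page), key=self.fit.get)
        for old, new in zip(weakest[:self.replace], candidates):
            self._swap(old, new)

    def tick(self):
        """Time passes without a click: fit decays, stale links drift to random."""
        for p in self.fit:
            self.fit[p] = max(0.0, self.fit[p] - self.decay)
        stale = [p for p in self.fit if self.fit[p] == 0.0]
        outside = [p for p in self.all_pages if p not in self.fit]
        if stale and outside:
            self._swap(self.rng.choice(stale), self.rng.choice(outside))
```

Calling `click` after each user click and `tick` at regular intervals reproduces the two opposing forces in the text: clicks pull related documents into the panel, while decay lets unclicked links be replaced one at a time until the set is random again. The concrete values of the parameters would need tuning in practice.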
The proposed approach allows
an unlimited number of documents from different websites to be represented in a compact
and usable navigational panel. By combining a user-oriented approach with random
factors, it provides visitors with an intuitive evolutionary
interface for browsing and filtering the available documents.
5 Appearance
Following the approach
described in the current work, we propose the following layout
for a self-organising portal:

Fig. 3. Portal layout
The header is reserved for
design purposes and interface elements. Most of the space is taken by
the content of the selected page. The evolving links panel is positioned at the side. When
one of the links in the panel is clicked, the content area is replaced with the
content of the corresponding page and the panel itself is renewed according to the
algorithm. This allows comfortable browsing of websites devoted to a
certain topic.
6 Application so far
The system described in the
current work was developed and implemented at Wessex Institute of
Technology. All scripts were written in PHP. MySQL was
selected as the data store, and a database structure was designed
for the project. The adjustable predefined parameters, such as the start URLs of
the websites and the size of the links set, are placed in a config file. The page crawler
and keyword indexer scripts are launched automatically by
server cron jobs. The simple search spider in the current implementation may be
classified as a breadth-first crawler [5]. For the stemming of keywords,
an implementation of the Porter Stemming Algorithm by Jon Abernathy, distributed
under a GNU license, was used. The system
runs on an Apache web server under FreeBSD 6.0.
The current
implementation is an alpha version of the planned self-organising web portal
solution. Research and development will continue in order to produce
improved versions of the system.
The current version of the
project can be reached at: www.dl.wessex.ac.uk/sowp/
We have tested the system with
three websites indexed. The index comprised 575 pages, 755 unique keywords and 5678
records of keyword-page relations with their ranks. The list of stop words
consists of 190 entries. The experiments have shown that, in addition to common
stop words, this list should be extended with words specific to each case,
depending on the topic and language of the selected websites. It
remains to be determined whether stop word lists can be formed
automatically with the help of a semantic lexicon and statistical methods, as some
researchers suggest [6].
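The role of the stop word list in producing the keyword-page relation records can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the indexer of the system: the rank of a keyword is taken to be its number of occurrences in the page, the stop word list is a tiny illustrative one, and `stem` is a placeholder for a stemmer such as Porter’s. The function names `index_page` and `build_index` are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; the list used in the paper has 190 entries
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def index_page(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Return {keyword: rank} for one page, with rank = number of occurrences.

    stem is a placeholder for a stemmer such as Porter's; by default
    words are left unchanged.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in stop_words)


def build_index(pages, **kwargs):
    """Return a list of (keyword, url, rank) relation records for all pages."""
    return [(kw, url, rank)
            for url, text in pages.items()
            for kw, rank in index_page(text, **kwargs).items()]
```

Extending the list with case-specific words, as the experiments suggest, then amounts to passing a larger set, for example `index_page(text, stop_words=STOP_WORDS | {"home", "contact"})`.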
The preliminary tests have shown
that the approach is usable in practice. The evolutionary panel
recognizes the interests of users and, with the help of the page
similarity algorithm, proposes related materials. It is convenient for
users that the documents they are interested in stay in the same place for
some time, while new links pop up periodically to replace those not in demand.
7 Conclusions and future work
An approach to building
a self-organising web portal is proposed. A working system was
developed based on the described concept and applied to a group
of real websites.
The preliminary tests of the system
indicate that the self-organising portal approach is worthwhile and
offers a usable method for the automated organisation of web portals for
particular groups of users.
Although a working model
has been implemented, the method can be greatly improved, and further
research and development work is required.
There are many directions for
further improvement. The current implementation of the system uses a simple
algorithm based on keyword matches to detect similar pages. More
accurate results may be obtained with a hybrid algorithm that considers
hyperlinks and the way words are written and located, and that uses SVM and ontology models [6].
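For illustration, a keyword-match similarity of the simple kind mentioned here could look like the sketch below. The exact measure used by the system is not specified in this paper, so the Jaccard overlap of keyword sets is an assumption, as are the function names `keyword_similarity` and `most_similar` and the `top` parameter.

```python
def keyword_similarity(keywords_a, keywords_b):
    """Jaccard overlap of two keyword collections, between 0.0 and 1.0."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(url, index, top=3):
    """Rank the other pages in index ({url: keywords}) by similarity to url."""
    scores = [(keyword_similarity(index[url], kws), other)
              for other, kws in index.items() if other != url]
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by URL
    return [other for score, other in scores[:top] if score > 0]
```

A measure of this kind ignores keyword ranks and page structure, which is where the hybrid methods cited above could improve accuracy.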
The method may be extended to
consider more input parameters, such as the time spent on a page or the links clicked within
the content area. Moreover, when forming a set of
links for a user, the current implementation does not use information about the popularity of
documents with other users. The method may be improved by clustering users
into groups, thereby providing coherence between the evolutionary links
panels of users with similar interests.
It is also possible, as some
recent studies show [7], to consider and dynamically update such parameters as
the location of elements, their size and time of appearance in the web
interface.
The developed solution may be
applied on the WWW and on intranets wherever there is a need to
collect and display information from different web sources. It can be used
either to create special informational portals for certain groups of users or
to create self-organising homepages for individuals. The range of possible
applications is wide and remains to be explored in practice.
In addition, a number of
incidental improvements can be made to the functionality of the websites united by
a self-organising web portal. Thanks to the database of indexed pages and
keywords, it is possible to implement a ‘show similar pages’ interface which
retrieves related documents from the other websites. Improved search mechanisms can
be implemented using the same database. Search engine optimisation may also be
improved through the automated building of site maps and dynamic linking
between related documents.
Deep statistical analysis
is not covered in this work. However, uniting different
websites under a unified navigational interface clearly provides new possibilities for the
analysis of access patterns. Provided that ontologies are implemented, it may be
possible to develop ‘intelligent advisor’ agents for webmasters. These agents
would analyze the access logs, the popularity of documents and keywords, and
other information collected by the search spider and the evolutionary links panel, and
give human-like advice regarding the content of and navigation within the
websites. Importantly, it is now technically possible to implement
human-like ‘characters’ within web interfaces [8], so we are on
the path to the creation of a new kind of WWW, more user-friendly and intelligent.
References
[1] M. Sahami, S. Yusufali, M. Baldonado (1998) SONIA: a service for organising
networked information autonomously. Proceedings of the third ACM conference on
Digital libraries.
[2] N. Stojanovic, A. Maedche, S. Staab, R. Studer, Y. Sure (2001) SEAL – A
Framework for Developing SEmantic PortALs. Proceedings of the 1st international
conference on Knowledge capture, pp. 155-162.
[3] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal (1996) From user access
patterns to dynamic hypertext linking. Computer Networks and ISDN,
pp. 1007-1014.
[4] C.J. van Rijsbergen, S.E. Robertson, M.F. Porter (1980) New models in
probabilistic information retrieval.
[5] F. Menczer, G. Pant, P. Srinivasan (2004) Topical web crawlers: Evaluating
adaptive algorithms. ACM Transactions on Internet Technology (TOIT), Volume 4,
Issue 4 (November 2004), pp. 378-419.
[6] M. Khordad, M. Shamsfard, F. Kazemeyni (2005) A hybrid method to categorize
HTML documents. Data Mining VI, pp. 331-340.
[7] S. Kumar, V. Jacob, C. Sriskandarajah (2005) Scheduling advertisements on a
web page to maximize revenue. European Journal of Operational Research, 2005.
[8] F. Abbatista, A. Paradiso, G. Semerano, F. Zambetta (2004) An agent that
learns to support users of a Web site. Applied Soft Computing 4, pp. 1-12.