This research project looks at problems in Web site design from the perspective of network analysis. In view of the similarity between the hypertext structure of Web pages and a generic network, the authors find in network analysis several concepts and theories that can provide insight for Web site design. In this paper, the problems of locating the homepage and of controlling the number of Web pages and their links are described, and solutions to these problems are discussed.
Hakman A. Wan: Department of Computer Studies, Lingnan College, Hong Kong. E-mail: [email protected]; [email protected]
Chi-wai Chung: Department of Computer Studies, Lingnan College, Hong Kong. E-mail: [email protected]; [email protected]
ACKNOWLEDGMENT: The authors sincerely thank Professor Ming-te Lu for his kind support of our research.
Introduction
While researchers are exploring business opportunities related to the Internet, the World Wide Web (WWW) checks in thousands of new users every day. Many of them are individuals intrigued by the abundance of information reachable on the Web, but others are institutions that wish to gain competitive advantage. The Web is now filled with electronic shops, cybermalls, and many Web sites designed and maintained by business organizations. In view of the relatively small investment in Web site development and the many less-than-successful stories of Internet business, organizations may find the Internet unattractive and refrain from taking the Internet marketplace seriously. To promote electronic commerce and to make the WWW a better place for business, researchers may need to expand their fields of study and look into the problems arising with the Internet from a variety of perspectives.
The electronic marketplace was a new concept when Bakos (1991) analyzed the rationale of developing such markets. Their business opportunities have been publicized by researchers in many areas, but the success factors of electronic markets have not been fully identified. In a special issue of the Journal of Advertising Research on the WWW, Maroney (1997) remarks that hypertext is the basis of the success of the WWW. Alba et al. (1997) compare retailing Web sites with three traditional retail formats - convenience-goods store (supermarket), specialty-goods store (department store) and shopping-goods store (category specialist) - in dimensions such as consideration set, transaction costs, and other benefits. These researchers find that the content and design of commercial Web sites are among the success factors of electronic commerce, and they analyze the context of these Web sites.
This paper takes a different approach. Observing that most business Web sites are organized in a network structure, the authors study Web page design from a geometric angle. By using techniques of network analysis, Web sites can be classified according to the complexity of their designs, which is believed to be strongly related to the effectiveness and efficiency of the sites. The main body of this paper is divided into several sections to illustrate the relationship between Web site design and two particular concepts in network analysis. Following this introductory section, the next section describes the importance of the network analogy. A simple mathematical background of network analysis is then presented, leading to the two network concepts, the location problem and the cyclomatic number, which are shown to give a quantitative measure of Web site structure. The paper ends with a concluding section, which summarizes the ideas discussed in the previous sections and outlines possible future research.
Hypertext and network
One of the major attractions of the "soft" side of the Web technology is the hypertext architecture of data, whether the latter are in the form of text, graphics, sound or animation. The hypertext architecture treats data as collections of chunks and links. These chunks of data, physically stored as computer files, are associated with one another via pre-established linkages. Thus a Web site can be seen as a network of chunks; one chunk or a number of them together make up a Web page which is displayed on-screen as viewers reach the corresponding address. For the sake of simplicity, all data chunks are regarded as Web pages in this paper. A Web site is thus made up of one or several Web pages interconnected with hyperlinks. Figure 1 shows a simple Web site in which Web pages are arranged in a hierarchical structure.
A special chunk named "home page" is usually accessible by the public through a Web site address. Web surfers may find passages to all other Web pages in the site via the home page. However, a more complicated Web site may have multiple entries; and the hypertext does not restrict the number of linkages attached to any chunks of data, as long as the syntax of, say, HTML is observed. If linkages are added to each Web page arbitrarily, the Web site is easily turned into a messy network structure.
Disordered and unstructured Web sites are usually the products of amateurs (e.g. students); a business firm would find the problem devastating and intolerable. The Web site of a business organization is a window and/or a service to its business partners. A messy Web site design, with Web pages connected to one another without control, may have adverse effects on the site. In summary, a well-designed Web site may bring the following advantages:
- more efficient use of multimedia files;
- quicker access to important Web pages; and
- easier maintenance of Web pages.
A Web page nowadays is basically a text file in which links to other graphics, audio, or video files may be found. To view a Web page, the visitor must spend some time locating the Web site address and transferring the files, perhaps together with relevant driver programs, from the owner's computer to the visitor's computer. Even with today's technology, the process can be very slow, especially when the Web site is a very popular one or the files are transmitted across transcontinental routes. A business Web site should avoid keeping potential customers waiting too long, by controlling the size of its Web pages.
In some cases, the Web site designer may have prepared certain "hot" pages where frequent visits are expected. (For instance, after showing details of commodities, a retailer's Web site may like its visitors to place an order through the Internet; the "hot" page could be an order form.) Unless most visitors are able to get direct access to these hot pages (for example, through bookmarking), an average customer may have to go through a series of Web pages in order to find the hot one. It is thus desirable to limit the search time for the hot page by keeping it close to the homepage.
Web sites have to be updated very frequently, and the maintenance work is perhaps done as a routine without plans. However, a Web page constitutes a part of the routes to those pages to which it is linked. If a Web page is replaced or its content is modified carelessly, the maintenance work may disconnect some critical routes and change the accessibility of some Web pages. It may also damage the integrity of the information carried by the Web site, so that the site no longer functions as it was supposed to. By controlling the number of outlets in each Web page, a designer gains more flexibility in maintenance work and, at the same time, facilitates documentation.
On the other hand, some Web sites may wish their visitors to roam about the pages by whatever route they choose, and the designer may like to provide visitors with as many routes as possible. A Web site so designed should have its pages interconnected with many others. However, the complexity should be controlled carefully, especially when the pages are being modified. It is thus desirable to quantify the complexity of a Web site so that it can be monitored during maintenance.
A network and its cost matrix
A network is defined by the number of vertices and arcs it contains. For example, a network of the form in Figure 2a could be defined by the cost matrix in Figure 2b.
A natural way to see a Web site as a network is to represent its Web pages by vertices. An arc between two vertices, indicating the existence of a link from one Web page to another, is usually characterized by the cost of traveling along the link. The symbol ∞ denotes the non-existence of a link between two pages. A definition of a network is given below:
- Definition 1. A network is defined as a couple G = (V, E) where V is the set of vertices and E is the cost matrix. The entries of the cost matrix are denoted by d(x, y), for every pair x, y in V; they are either infinity (indicating no direct route) or positive real numbers.
A path (or route) is a sequence of arcs leading from a starting vertex to a destination vertex. In this paper, it is assumed that every network is strongly connected - i.e., for every x in V, there exists a path joining x to each of the other vertices.
If compared with a network in general, a Web site is distinguished in two aspects:
- (1) Arc lengths. If the length of an arc is used to represent the time spent in seeking the destination page, and if this seek time is taken as equivalent to the size of the Web page, one way to assign lengths to arcs is to let the size of the destination Web page be the length.
- (2) Return path. If an arc (path) exists between two pages, there must be at least one return path. Where there is no return path explicitly written on a page, the "Back" icon on the browser can also provide a return exit from the current page.
These two features demand that a Web site is a digraph with an asymmetric cost matrix. For example, the cost matrix of the Web site in Figure 3a[1] is shown in Figure 3b.
Note that entries in the cost matrix represent the sizes of the destination Web pages. Since the size of a Web page is proportional to the time spent in transferring the necessary files to the visitor's computer, these numbers can also represent the time spent waiting for the transmission. However, when the visitor clicks the "Back" icon (or returns to the previously viewed page in any other way), the necessary files are normally retrieved from the local hard disk or cache memory. The time spent on a return path is very much shorter; it is thus represented by λ (λ < 1) in the cost matrix of Figure 3b.
The Web site is a connected network (i.e. there exists a path from any one page to any other page); all pages are accessible within finite time. Seek times can be inserted into the cost matrix to give the matrix shown in Figure 3c.
Note that the cost matrix above is valid only when the return paths are known. If the direction of traversing (browsing) is unknown - as happens when visitors do not start from the homepage - whether a path is a return path cannot be determined in advance.
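The construction of such an asymmetric cost matrix can be sketched in a few lines of code. The page names, page sizes, and the value of λ below are hypothetical, chosen only for illustration: a forward arc costs the size of the destination page, and every link is paired with a cheap return arc.

```python
INF = float("inf")
LAMBDA = 0.5  # hypothetical cost of a "Back" traversal (< 1)

# Hypothetical site: page -> size, plus the directed links between pages.
sizes = {"a": 3, "b": 5, "c": 6, "d": 7, "e": 4, "f": 3, "g": 2}
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("d", "f"), ("d", "g")]

def cost_matrix(sizes, links, back_cost=LAMBDA):
    """Asymmetric cost matrix: d[x][y] is the size of page y when a link
    leads from x to y, back_cost on the return arc (served from the local
    cache), and infinity when no direct arc exists."""
    pages = sorted(sizes)
    d = {x: {y: INF for y in pages} for x in pages}
    for x, y in links:
        d[x][y] = sizes[y]   # forward: wait for the destination page
        d[y][x] = back_cost  # return: retrieved locally
    return d

d = cost_matrix(sizes, links)
```

The dict-of-dicts form keeps the matrix readable; a designer could equally use a two-dimensional array indexed by page number.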
Location problems
In operations research, location theory is concerned with networks of demands and service facilities and is used to find the optimal location for a facility in a specified region. Some of the applications of location theory include:
- distribution system (warehousing) design;
- emergency facility location; and
- lockbox problems.
For instance, sources of products, warehouses and retail stores in the context of distribution system design are connected by routes of transportation. From the geographical location of a warehouse, ease of access to sources of products and retail stores is a function of the physical length of the route in between. Using the location theory, researchers can find an optimal location of a warehouse in the network - whether it is a vertex or along an arc - so as to secure the shortest delivery time.
The concept of location problem is relevant to Web site design. Its applications are discussed in the following sections.
Center problems
We first denote the distance of the furthest vertex from a vertex x by mv(x) = max {dist(x, y) : y in V}, where dist(x, y) is the length of a shortest path from x to y.
The definition of the center of a network is as follows:
- Definition 2. A center of a network G = (V, E) is a vertex whose furthest vertex is the closest among all vertices; i.e., if c is a center of G, then mv(c) ≤ mv(x) for every x in V.
Within a network of Web pages, a center (page) may be a page which the designer wishes to be the main entrance to the Web site. Its Web address would be publicized, perhaps by submitting it to a search engine. When a visitor is browsing the center page, all other pages can be reached within a relatively short seek time. Practical uses of the center concept include the following.
Location of the homepage
To improve accessibility to all other pages, the homepage of a Web site should keep the other Web pages within close distance. The definition of a center of a network thus provides a measure of the suitability of a homepage. If, initially, one of the Web pages in the site is to be selected as the homepage, the direction of traversal is not known. It is thus assumed that every page is to be retrieved from the remote site. For instance, the cost matrix of the Web site of Figure 3a should not assume the form of Figure 3c but the following one (Figure 3d).
Calculating the mv(x) for each vertex x,
- mv(a) = 10
- mv(b) = 13
- mv(c) = 13
- mv(d) = 9
- mv(e) = 13
- mv(f) = 16
- mv(g) = 16
The smallest mv is found with vertex d. This suggests that one option to improve accessibility - assuming all pages bear the same contextual importance - is to make vertex d the homepage. The resulting arrangement of the Web site is shown in Figure 4, and it guarantees that all Web pages could be reached within nine units of time.
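The center computation can be sketched as follows: all-pairs shortest paths are found with the Floyd-Warshall algorithm, mv(x) is the largest shortest-path distance out of x, and the center minimizes mv. The page sizes below are hypothetical (the figures' actual sizes are not reproduced in the text); every arc is assumed to cost the size of its destination page, as in Figure 3d.

```python
INF = float("inf")

def shortest_paths(d):
    """Floyd-Warshall all-pairs shortest paths over a dict-of-dicts cost matrix."""
    pages = list(d)
    dist = {x: {y: (0 if x == y else d[x][y]) for y in pages} for x in pages}
    for k in pages:
        for x in pages:
            for y in pages:
                if dist[x][k] + dist[k][y] < dist[x][y]:
                    dist[x][y] = dist[x][k] + dist[k][y]
    return dist

def center(d):
    """Vertex whose furthest page is nearest: the x minimizing mv(x)."""
    dist = shortest_paths(d)
    mv = {x: max(dist[x][y] for y in dist if y != x) for x in dist}
    return min(mv, key=mv.get), mv

# Hypothetical page sizes for the hierarchy of Figure 3a.
sizes = {"a": 3, "b": 5, "c": 6, "d": 7, "e": 4, "f": 3, "g": 2}
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("d", "f"), ("d", "g")]
d = {x: {y: INF for y in sizes} for x in sizes}
for x, y in links:
    d[x][y], d[y][x] = sizes[y], sizes[x]  # each arc costs the destination size

home, mv = center(d)  # with these sizes, page d is the center, mv(d) = 9
```

With this particular assignment of sizes the computation also picks page d, mirroring the choice discussed above; a different assignment could of course move the center elsewhere.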
Insertion of a new page
Suppose that the original Web site is given as in Figure 4 and a new page (page h) is attached to page g. The resulting Web site and the new cost matrix are shown in Figures 5a and 5b respectively.
The insertion may disrupt the optimal location of the homepage. The idea of network center can again be applied to find a new location. In this example, the distance of the furthest vertex from each vertex is found:
- mv(a) = 18
- mv(b) = 21
- mv(c) = 21
- mv(d) = 18
- mv(e) = 21
- mv(f) = 18
- mv(g) = 16
- mv(h) = 17
These distances suggest a new optimal location for the homepage: the smallest value, mv(g) = 16, makes page g the new center.
Median problems
Starting from a homepage which is the center of a Web site will guarantee that even the furthest Web page is reachable within a pre-set time. If Web pages other than the furthest one are to be considered, the center problem turns into a median problem.
We denote the total vertex distance of a vertex x by tv(x) = Σ dist(x, y), summed over all y in V, where dist(x, y) is the length of a shortest path from x to y.
Then, the median of a network can be defined:
- Definition 3. A median of a network G = (V, E) is a vertex with the smallest total distance to all other vertices; that is, a median x satisfies tv(x) ≤ tv(y) for every y in V.
To find the median for the Web site in Figure 3a, the total vertex distances are calculated for each vertex:
- tv(a) = 41
- tv(b) = 54
- tv(c) = 53
- tv(d) = 31
- tv(e) = 52
- tv(f) = 71
- tv(g) = 73
The result suggests that the median is page d. Having the median page as the homepage guarantees that, on average, other Web pages are reachable within the shortest possible time.
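The median computation differs from the center only in the aggregate: the total of the shortest-path distances replaces the maximum. A minimal sketch, again using hypothetical page sizes (the figures' actual sizes are not reproduced in the text) and arcs that cost the size of the destination page:

```python
INF = float("inf")

def shortest_paths(d):
    """Floyd-Warshall all-pairs shortest paths over a dict-of-dicts cost matrix."""
    pages = list(d)
    dist = {x: {y: (0 if x == y else d[x][y]) for y in pages} for x in pages}
    for k in pages:
        for x in pages:
            for y in pages:
                if dist[x][k] + dist[k][y] < dist[x][y]:
                    dist[x][y] = dist[x][k] + dist[k][y]
    return dist

def median(d):
    """Vertex with the smallest total distance to all other vertices."""
    dist = shortest_paths(d)
    tv = {x: sum(dist[x][y] for y in dist if y != x) for x in dist}
    return min(tv, key=tv.get), tv

# Hypothetical page sizes for the hierarchy of Figure 3a.
sizes = {"a": 3, "b": 5, "c": 6, "d": 7, "e": 4, "f": 3, "g": 2}
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("d", "f"), ("d", "g")]
d = {x: {y: INF for y in sizes} for x in sizes}
for x, y in links:
    d[x][y], d[y][x] = sizes[y], sizes[x]  # each arc costs the destination size

med, tv = median(d)  # with these sizes, page d is the median
```

As with the center, whether the same vertex is both center and median depends entirely on the cost matrix; the two criteria can disagree.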
Linear structure
Finding a proper homepage by using the center or median concept is suitable only for Web sites in which the contents of Web pages do not demand a certain order of browsing. In cases where contents of Web pages are presented in a deductive manner, often they are arranged in a linear way as the one in Figure 6. The homepage in such a structure is usually the beginning of the deduction and its position in the Web site is not replaceable.
Although it is meaningless to make another page the beginning page, the problem of search time still exists. It is desirable to keep Web pages within a close distance of the homepage. With a fixed amount of information carried by the Web pages, the location problem may mean reducing the number of pages or splitting the Web site into two smaller networks.
Random tours
Many Web sites contain a wide range of information, which is partitioned into Web pages linked as a complex network. Since these Web pages are interconnected directly or indirectly, a visitor is free to go to any page in any sequence. It is the designer's decision to control the number of possible routes in the network. For instance, a retailer may like to have many pages in his/her Web site, but still want to prevent visitors from being led astray by all these pages. It is thus desirable to control the number of possible routes in the Web site. In this section, the number of possible routes is considered, and the sizes of Web pages are not as crucial as in the previous sections.
McCabe's (1976) definition of "cyclomatic number" is helpful in providing a measure of the complexity of a network:
- Definition 4. The cyclomatic number of a connected network G of n vertices and e arcs is v(G) = e - n + 1.
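Definition 4 is a one-line computation. In the sketch below, the link list mirrors the hierarchy of Figure 3a, with each link counted as a forward arc plus a return arc (7 pages and 6 links, hence 12 arcs):

```python
def cyclomatic_number(n_vertices, arcs):
    """v(G) = e - n + 1 for a connected network (Definition 4)."""
    return len(arcs) - n_vertices + 1

# The hierarchy of Figure 3a: page a links to b, c, d, e; page d links to f, g.
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("d", "f"), ("d", "g")]
arcs = links + [(y, x) for x, y in links]  # forward plus return arcs

v = cyclomatic_number(7, arcs)  # 12 - 7 + 1 = 6
```

The value 6 agrees with the six linearly independent circuits listed for this network later in the section.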
Using this formula, the values of v for the Web sites of Figures 4 and 5a are found to be 6 and 7 respectively. According to theorem 1, stated below, the cyclomatic number is related to the number of possible routes in a network: the larger the number, the more possible routes there are. However, theorem 1 refers to circuits, which are paths that return to their starting vertices:
- Theorem 1. In a strongly connected network G, the cyclomatic number is equal to the maximum number of linearly independent circuits.
(Its proof is given in the Appendix.) In the context of a Web site structure, any Web page is always reachable from any other page. A circuit is thus equivalent to a path with the return path eliminated. The concept of linear independence, however, needs some clarification:
- Definition 5. A set of circuits is linearly independent if no circuit in the set is a linear combination of the other circuits in the set.
In other words, a set of circuits μ1, ..., μk is linearly independent if, whenever r1μ1 + ... + rkμk = 0 for real numbers r1, ..., rk, then r1 = ... = rk = 0.
For example, a set of six linearly independent circuits in Figure 3a is as follows:
- (aba), (aca), (ada), (aea), (dfd), (dgd)
If a circuit is chosen arbitrarily, say (cadfdgdac), it can be expressed as a linear combination of the independent circuits above. McCabe (1976) provides a simple way to demonstrate the linear combination.
He expresses paths as vectors of n entries; for example:
- (aba) = (2100000)
- (aca) = (2010000)
- (ada) = (2001000)
- (aea) = (2000100)
- (dfd) = (0002010)
- (dgd) = (0002001)
The value of entry i represents the number of occurrences of vertex i in the path. Hence,
(cadfdgdac) = (2023011) = 2(aca) - (ada) + (dfd) + (dgd).
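This arithmetic can be checked directly by treating each circuit as a list of vertex-occurrence counts over (a, b, c, d, e, f, g) and forming the linear combination:

```python
# Circuit vectors: entry i counts the occurrences of vertex i in the circuit.
aba = [2, 1, 0, 0, 0, 0, 0]
aca = [2, 0, 1, 0, 0, 0, 0]
ada = [2, 0, 0, 1, 0, 0, 0]
aea = [2, 0, 0, 0, 1, 0, 0]
dfd = [0, 0, 0, 2, 0, 1, 0]
dgd = [0, 0, 0, 2, 0, 0, 1]

def combine(terms):
    """Linear combination of circuit vectors; terms is [(coefficient, vector), ...]."""
    length = len(terms[0][1])
    return [sum(r * vec[i] for r, vec in terms) for i in range(length)]

# (cadfdgdac) = 2(aca) - (ada) + (dfd) + (dgd)
cadfdgdac = combine([(2, aca), (-1, ada), (1, dfd), (1, dgd)])
# -> [2, 0, 2, 3, 0, 1, 1], i.e. the vector (2023011)
```

The computed vector matches the count of each vertex along c-a-d-f-d-g-d-a-c, confirming the combination given in the text.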
McCabe (McCabe and Butler, 1989) uses the cyclomatic number to measure the complexity of a program and suggests an acceptable limit of v for the network of modules of a program. If the cyclomatic number of a modular program exceeds this limit, the program is more likely to have bugs and maintenance work is more difficult.
The theory finds a similar application in Web page design. However, the limit value of v varies among different structures of Web sites. For example, the limit value of v for a retailer's Web site is expected to be smaller than those for other Web sites containing a similar number of pages. In general, a commercial Web site may have tens or hundreds of pages to display mountains of information. Instead of building links to these pages arbitrarily, the designer of a retailer's Web site can control the number of pages and links so as to reduce the complexity of the Web site. The limit value of v may be determined by trial and error. Once determined, the limit v suggests the maximum number of links when the number of pages is fixed, or the minimum number of pages when the number of links is fixed.
Conclusion
The paper suggests that certain principles of network analysis are applicable to Web site design. Care should be taken in designing the homepage, which is the entrance to the Web site, so that visitors need not click many times to reach other pages or wait indefinitely for sufficient chunks of data to arrive. By varying the positions of links and the contents of the pages, a designer can control the location of the homepage and/or newly added pages so that the homepage is kept as a center or median of the Web site.
McCabe's measure of program complexity, the cyclomatic number v, can also be applied to Web page design. Unless it is the designer's intention to let visitors have absolute freedom to go anywhere in the Web site in any sequence, the number of pages and links in the site should be controlled to fall within a limit value of v. In this way, visitors will not stray too far and become lost wherever the links may lead them.
There have seldom been attempts to study Web site design using network theories. This paper suggests that some of these theories can be useful in determining the size of a Web page and the number of links in it. However, we have not treated different files in a Web site differently. There are text files of the Web pages to which other files of graphics, sound, etc. are attached. The implications of this diversity of file and link types must be considered if the seek time to Web pages is to be estimated more accurately.
Note
1 The lower number indicates the size of the Web page.
References
1. Alba, J., Lynch, J., Weitz, B., Janiszewski, C., Lutz, R., Sawyer, A. and Wood, S. (1997), "Interactive home shopping: consumer, retailer, and manufacturer incentives to participate in electronic marketplaces", Journal of Marketing, Vol. 61, pp. 38-53.
2. Bakos, J.Y. (1991), "A strategic analysis of electronic marketplaces", MIS Quarterly, September, pp. 295-310.
3. Berge, C. (1973), Graphs and Hypergraphs, North-Holland, Amsterdam.
4. McCabe, T.J. (1976), "A complexity measure", IEEE Transactions on Software Engineering, Vol. SE-2 No. 4, December, pp. 308-20.
5. McCabe, T.J. and Butler, C.W. (1989), "Design complexity measurement and testing", Communications of the ACM, Vol. 32 No. 12, pp. 1415-25.
6. Maroney, D. (1997), "In praise of hypertext", Journal of Advertising Research, Vol. 37 No. 2, March/April, pp. 7-9.
Appendix
The proof of theorem 1 follows (Berge, 1973). First, a definition of "partial network" is necessary:
- Definition 6. For a network G = (V, E), a partial network G′ is the network G with some of the arcs removed.
In particular, we define a series of partial networks G0, G1, ..., Ge, where e is the number of arcs in G, by
- (1) G0 is made up of all the vertices but has no arcs.
- (2) Gi is made up of Gi-1 by adding one more arc to Gi-1.
Since there is no circuit in G0,
- v(G0) = 0.
When an arc is added to Gi-1, it creates either no new circuit or exactly one new circuit. In other words:
- Case 1: no new circuit is made; then v(Gi) = v(Gi-1).
- Case 2: one new circuit is made; then v(Gi) = v(Gi-1) + 1.
The generation of partial networks halts when all arcs have been added; thus the two cases together occur e times. The number of circuits formed is equal to the number of times that case 2 occurs or, equivalently, to e minus the number of times that case 1 occurs.
The process joins the initial n vertices (trivial networks) into one single network. Note that when the addition of an arc gives birth to a new circuit, the number of separate networks is not reduced. The number of networks is reduced only when no new circuit is made, i.e., when case 1 occurs. Hence case 1 occurs n - 1 times; the number of circuits in G is thus
- v(G) = e - n + 1
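The counting argument above can be checked mechanically: arcs are added one at a time, and an arc creates a new circuit exactly when its two endpoints are already connected. The sketch below uses a union-find structure and a hypothetical link set (each link counted as a forward arc plus a return arc):

```python
def circuits_by_construction(vertices, arcs):
    """Add arcs to the arc-less partial network G0 one at a time and count
    how often the new arc's endpoints are already connected (case 2 in the
    proof). For a connected G the count equals v(G) = e - n + 1."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    circuits = 0
    for x, y in arcs:
        rx, ry = find(x), find(y)
        if rx == ry:
            circuits += 1    # case 2: one new circuit is formed
        else:
            parent[rx] = ry  # case 1: two components merge, no circuit
    return circuits

# Hypothetical hierarchy: a links to b, c, d, e; d links to f, g.
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("d", "f"), ("d", "g")]
arcs = links + [(y, x) for x, y in links]  # forward plus return arcs

v = circuits_by_construction("abcdefg", arcs)  # 12 - 7 + 1 = 6
```

The six forward arcs each merge two components (case 1 occurring n - 1 = 6 times), and the six return arcs each close a circuit, so the count matches e - n + 1 as the proof requires.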
Suppose that μ1, ..., μv(G) are the circuits generated during the above process. These circuits must be linearly independent: each μj contains the arc whose addition created it, and that arc belongs to none of μ1, ..., μj-1. Hence, in any vanishing linear combination r1μ1 + ... + rv(G)μv(G) = 0, the highest-indexed circuit with a non-zero coefficient would contribute an arc that no other term could cancel, so all the coefficients must be zero.
Caption: Figure 1. A simple Web site; Figure 2. A simple network (a) and its cost matrix (b); Figure 3. Hierarchy of a simple Web site (a), its cost matrix showing lengths of direct paths (b), modified cost matrix of the network in Figure 3(a) (c) and cost matrix of the network in Figure 3(a) (d); Figure 4. Making page d a homepage improves accessibility; Figure 5. Page h attached (a) and its cost matrix (b); Figure 6. A linearly structured Web site
Copyright MCB UP Limited (MCB) 1998
