  • a Faculty of Information and Media Studies, University of Western Ontario, London, ON, Canada N6A 5B7
  • b School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, WI (USA)

Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. The observed distributions were fitted to different theoretical distributions to determine the best-fitting model for representing outlink frequency across Web pages. The theoretical models tested include the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits to the data sets for top-level pages across the high-level domains tested, with the GIGP performing slightly better. The lumpiness and bimodal nature of two of the observed outlink distributions from Web pages within a given website resulted in poor fits of the theoretical models. The GIGP provided better fits to these data sets after the top components were truncated. The ability to model Web page attributes effectively, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content, and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
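The general workflow described above (fit a theoretical distribution to observed outlink frequencies, then judge the fit) can be illustrated with a minimal sketch. It is not the authors' procedure: the shifted power law used here is a simplified stand-in for the models named in the abstract, the grid-search estimator and the function names are illustrative choices, and the `observed` counts are invented for the example.

```python
import math


def shifted_power_law_pmf(alpha, k_max):
    """P(k) proportional to (k + 1) ** -alpha for k = 0..k_max (assumed stand-in model)."""
    weights = [(k + 1) ** -alpha for k in range(k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(alpha, freq):
    """freq[k] is the number of pages observed with exactly k outlinks."""
    pmf = shifted_power_law_pmf(alpha, len(freq) - 1)
    return sum(n * math.log(p) for n, p in zip(freq, pmf) if n > 0)


def fit_alpha(freq, lo=0.01, hi=5.0, steps=500):
    """Maximum-likelihood estimate of alpha by a simple grid search."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(a, freq))


def chi_square(freq, pmf):
    """Pearson chi-square statistic comparing observed and expected counts."""
    total = sum(freq)
    return sum((n - total * p) ** 2 / (total * p) for n, p in zip(freq, pmf))


# Hypothetical data: pages with 0, 1, 2, ... 9 outlinks (not from the study).
observed = [120, 60, 40, 30, 24, 20, 17, 15, 13, 12]

alpha_hat = fit_alpha(observed)
fitted = shifted_power_law_pmf(alpha_hat, len(observed) - 1)
stat = chi_square(observed, fitted)

print(f"estimated alpha: {alpha_hat:.2f}")
print(f"chi-square statistic: {stat:.3f}")
```

A smaller chi-square statistic indicates closer agreement between observed and expected counts. Comparing several candidate models would mean repeating the fit with each model's probability function and accounting for the number of estimated parameters in the degrees of freedom.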
