Authors:
Cristian Colliander Department of Sociology, Inforsk, Umeå University, 901 87 Umeå, Sweden
and
Per Ahlgren Department of Sociology, Inforsk, Umeå University, 901 87 Umeå, Sweden
Department of e-Resources, University Library, Stockholm University, 106 91 Stockholm, Sweden

Abstract

The measurement of similarity between objects plays a role in several scientific areas. In this article, we deal with document–document similarity in a scientometric context. Using a large dataset, we experimentally compare first-order and second-order similarities with respect to the overall quality of partitions of the dataset, where the partitions are obtained by optimizing weighted modularity. The quality of a partition is defined in terms of textual coherence. The results show that the second-order approach consistently outperforms the first-order approach. Each difference between the two approaches in overall partition quality values is significant at the 0.01 level.
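The distinction between first-order and second-order similarity can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the document representation, so the sketch assumes the cosine measure and small term-count vectors, and it assumes that the second-order similarity of two documents is the cosine between their rows in the first-order similarity matrix. The function names and the toy data are hypothetical and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors (an assumption)
    return dot / (norm_u * norm_v)


def first_order(doc_vectors):
    """First-order similarity: compare the documents' own feature vectors."""
    n = len(doc_vectors)
    return [[cosine(doc_vectors[i], doc_vectors[j]) for j in range(n)]
            for i in range(n)]


def second_order(first):
    """Second-order similarity: compare the documents' similarity profiles,
    i.e. their rows in the first-order similarity matrix."""
    n = len(first)
    return [[cosine(first[i], first[j]) for j in range(n)]
            for i in range(n)]


# Toy term-count vectors (hypothetical): documents 0 and 1 share no terms,
# but both share one term with document 2.
docs = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]

first = first_order(docs)
second = second_order(first)

print("first-order  sim(0, 1):", first[0][1])
print("second-order sim(0, 1):", second[0][1])
```

In this toy case, documents 0 and 1 have a first-order similarity of zero because they have no terms in common, yet their second-order similarity is positive because both are similar to document 2. Whether the diagonal of the first-order matrix should be included in the profiles is a design choice; the sketch keeps it for simplicity.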


Scientometrics
Language English
Size B5
Year of Foundation 1978
Volumes per Year 1
Issues per Year 12
Founder Akadémiai Kiadó
Founder's Address H-1117 Budapest, Hungary 1516 Budapest, PO Box 245.
Publisher Akadémiai Kiadó
Springer Nature Switzerland AG
Publisher's Address H-1117 Budapest, Hungary 1516 Budapest, PO Box 245.
CH-6330 Cham, Switzerland Gewerbestrasse 11.
Responsible Publisher Chief Executive Officer, Akadémiai Kiadó
ISSN 0138-9130 (Print)
ISSN 1588-2861 (Online)