Measuring the similarity between objects plays a role in several scientific areas. In this article, we deal with document–document similarity in a scientometric context. Using a large dataset, we experimentally compare first-order and second-order similarities with respect to the overall quality of the partitions of the dataset that they yield, where the partitions are obtained by optimizing weighted modularity. The quality of a partition is defined in terms of textual coherence. The results show that the second-order approach consistently outperforms the first-order approach. Every difference in overall partition quality between the two approaches is significant at the 0.01 level.