  • 1 School of Public Policy, Georgia Institute of Technology, 685 Cherry Street, Atlanta, GA, 30332-0345, USA; kberzins@gatech.edu; dhicks@gatech.edu; jmelkers@gatech.edu; fxiao3@gatech.edu; diogo.pinheiro@pubpolicy.gatech.edu

Abstract

This paper proposes a method for separating the true papers of a set of focal scientists from the false papers of homonymous authors in bibliometric research. It directly addresses the problem of identifying papers that are not associated with a given author (“false” papers). The proposed method has four steps: name and affiliation filtering, similarity score construction, author screening, and boosted trees classification. Because this is a methodological paper, we calculate error rates for our technique, which required ascertaining the correct attribution of each paper. To do this, we constructed a small dataset of 4,253 papers allegedly belonging to a random sample of 100 authors. We apply the boosted trees algorithm to classify the papers of authors whose total false rate is no higher than 30% (i.e., 3,862 papers of 91 authors). A one-run experiment achieves a testing misclassification error of 0.55%, a testing recall of 99.84%, and a testing precision of 99.60%. A 50-run experiment yields a median testing classification error of 0.78% and a mean of 0.75%. Among the 90 authors in the testing set (one author appeared only in the training set), the algorithm reduces the false rate to zero for 86 authors and misclassifies just one or two papers for each of the remaining four authors.
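
As a rough illustration of two quantities named in the abstract, the sketch below shows one way the author screening threshold (false rate no higher than 30%) and the reported test metrics (misclassification error, recall, precision) could be computed. It is not the authors' implementation: the function names, the `(author_id, is_true)` record layout, and the choice of "true paper" as the positive class are all assumptions made for this example, and the toy data is invented.

```python
from collections import defaultdict


def screen_authors(papers, max_false_rate=0.30):
    """Keep only papers of authors whose false rate is <= max_false_rate.

    `papers` is a list of (author_id, is_true) pairs. This layout is a
    hypothetical one chosen for the sketch, not taken from the paper.
    """
    totals = defaultdict(int)
    falses = defaultdict(int)
    for author, is_true in papers:
        totals[author] += 1
        if not is_true:
            falses[author] += 1
    kept = {a for a in totals if falses[a] / totals[a] <= max_false_rate}
    return [(a, t) for a, t in papers if a in kept]


def evaluate(y_true, y_pred):
    """Return (misclassification error, recall, precision).

    Assumption: a "true" paper is the positive class. The function expects
    at least one actual positive and one predicted positive; otherwise the
    recall or precision denominator would be zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    error = (fp + fn) / len(pairs)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return error, recall, precision


# Invented toy data: author "A" has 1 false paper out of 4 (25%, kept),
# author "B" has 1 false paper out of 2 (50%, screened out).
toy = [("A", True), ("A", True), ("A", True), ("A", False),
       ("B", True), ("B", False)]
screened = screen_authors(toy)

# Invented labels and predictions for four test papers.
err, rec, prec = evaluate([True, True, True, False],
                          [True, True, False, False])
```

In this reading, screening happens before classification, so the classifier only sees authors whose candidate sets are mostly correct; the metrics are then computed per paper on the held-out test set.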



Scientometrics
ISSN 0138-9130 (Print)
ISSN 1588-2861 (Online)