Probabilistic Identification of Biomedical Author-Inventors Across PubMed and USPTO

Wednesdays@NICO Seminar, Noon, October 12, 2011, Chambers Hall, Lower Level

Prof. Vetle Torvik, University of Illinois at Urbana-Champaign


Large-scale studies of named entities like people, organizations, genes, or drugs can suffer from severe bias introduced by name ambiguity. The assumption that a name uniquely identifies an entity is often made because disambiguation is time-consuming and error-prone when done manually, and simple computational approaches fail to capture the complexity of an identity that can also change over time. In an effort to enable unambiguous studies of biomedical scientists, their collaborative networks, and the flow of knowledge at the intersection of science and technology, this talk will focus on a newly initiated project aimed at identifying the individuals who both publish and patent across two large bibliographic databases: PubMed and USPTO. At the heart of our approach is a multi-dimensional model that combines many explicit and implicit dimensions of similarity between the publication profile of an author and the patenting profile of an inventor in order to estimate the probability that the two profiles refer to the same individual. Our preliminary results show that, even though the overlap among authors and inventors is relatively low, this approach can capture the great majority of the real author-inventors with high precision.


Torvik is an Assistant Professor in the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign, where he teaches courses on text/data mining, informatics, information processing, literature-based discovery, and bioinformatics. His current research addresses problems related to the practice of science and innovation, often using large-scale bibliographic databases as a source for text/data-mining models. More information can be found at