Better Together: Text + Context

Full proposal

1 Introduction

We will build several different embeddings of documents from Semantic Scholar,¹ a collection of N ≈ 200 million documents from seven sources: (1) Microsoft Academic Graph (MAG),² (2) DOI,³ (3) PubMed,⁴ (4) PubMedCentral,⁵ (5) DBLP,⁶ (6) arXiv⁷ and (7) the ACL Anthology.⁸ Since Semantic Scholar plays such a central role in this project, we are fortunate to have Maria Antoniak and Sergey Feldman on our team. They are members of the Semantic Scholar group at the Allen Institute for Artificial Intelligence. Semantic Scholar is a significant effort, involving more than 50 people working over 7 years, and has more than 8M monthly active users. Another team member, Hui Guan, provides expertise in Systems, which will be useful since scale is a challenge for many of the methods discussed below.

2 Embeddings

An embedding is a dense matrix, M ∈ ℝ^{N×K}, where N is the number of documents and K is the number of hidden dimensions. Each row of M represents a document, and the cosine of two rows estimates the similarity of the corresponding two documents. Some embeddings, Mt, are text-based, and other embeddings, Mc, are context-based:

Examples of text-based embeddings:
1. BERT encoding of titles, abstracts, body
Examples of context-based embeddings:
1. node2vec encodings of citation graph
2. BERT encoding of citing sentences
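
To make the role of cosines concrete, here is a minimal sketch of comparing two rows of an embedding. The toy matrix M and its values are invented for illustration; real embeddings would have N ≈ 200M rows and a few hundred columns.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embedding M with N = 3 documents (rows) and K = 2 hidden dimensions.
# The numbers are made up; they are not taken from any real embedding.
M = [
    [1.0, 0.0],  # doc 0
    [2.0, 0.0],  # doc 1: same direction as doc 0, different length
    [0.0, 3.0],  # doc 2: orthogonal to doc 0
]

sim_01 = cosine(M[0], M[1])  # rows pointing the same way
sim_02 = cosine(M[0], M[2])  # orthogonal rows
```

Because the cosine ignores vector length, docs 0 and 1 are judged maximally similar even though their rows differ in magnitude, while docs 0 and 2 are judged unrelated.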

We can construct examples of Mt and Mc from resources from Semantic Scholar:

Specter Embeddings (Cohan et al., 2020), based on SciBERT (Beltagy et al., 2019)
Citation Graph: G = (V, E); |V| = N ≈ 200M nodes (docs) & |E| ≈ 2B edges (citations).
Citing sentences for doc i: Sentences, s, in other documents j, where s cites doc i.
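
The definition of citing sentences can be sketched as a small index from a cited document to the sentences that cite it. The record layout (citing document id, sentence text, list of cited document ids) and all ids below are hypothetical; they are not the Semantic Scholar schema.

```python
from collections import defaultdict

# Hypothetical records: (citing_doc_id, sentence_text, cited_doc_ids).
citing_records = [
    ("j1", "A Turing Machine can simulate any algorithm [i1].", ["i1"]),
    ("j2", "We build on SciBERT [i2] and BERT [i3].", ["i2", "i3"]),
    ("j3", "Turing Machines were introduced in [i1].", ["i1"]),
]

def index_citing_sentences(records):
    """Group sentences by the document they cite.

    For each cited doc i, collect every sentence s that appears in some
    other document j and cites i.
    """
    index = defaultdict(list)
    for citing_doc, sentence, cited_docs in records:
        for cited in cited_docs:
            if cited != citing_doc:  # keep only sentences from other documents
                index[cited].append(sentence)
    return dict(index)

sentences_for = index_citing_sentences(citing_records)
```

The sentences collected for a document could then be encoded (for example with BERT) and pooled into one row of a context-based embedding; how best to pool them is left open here.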

We will start with Specter embeddings as an example of Mt and node2vec (Grover and Leskovec, 2016) encodings of the citation graph as an example of Mc. Node2vec maps a graph (a sparse N × N Boolean matrix) to a dense matrix in ℝ^{N×K}. Specter embeddings and citation graphs will be downloaded from Semantic Scholar. A number of node2vec methods are supported under nodevectors.⁹ ProNE (Zhang et al., 2019) is particularly promising. Many node2vec methods are based on deep nets, but ProNE is based on SVD. We will also experiment with GNNs¹⁰ for combining text embeddings with citation graphs.
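
As a rough illustration of how SVD can turn an N × N graph matrix into a dense N × K matrix, the sketch below applies plain truncated SVD to a tiny adjacency matrix. This is not ProNE itself and not the nodevectors API; the toy graph, the symmetrization of citations into undirected edges, and the scaling by the square root of the singular values are all assumptions made for illustration. It assumes NumPy is installed.

```python
import numpy as np

def svd_embedding(adjacency, k):
    """Map an N x N adjacency matrix to a dense N x k matrix via truncated SVD."""
    u, s, _ = np.linalg.svd(adjacency)  # singular values come back in descending order
    return u[:, :k] * np.sqrt(s[:k])    # keep the top-k directions, scaled

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy graph (assumption): docs 0-3 all cite one another, docs 4-6 all cite
# one another, and there are no citations between the two groups.
groups = [[0, 1, 2, 3], [4, 5, 6]]
N = 7
A = np.zeros((N, N))
for group in groups:
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0  # undirected edge, so A is symmetric

M_c = svd_embedding(A, k=2)          # dense N x K matrix with K = 2
within = cosine(M_c[0], M_c[1])      # two docs in the same group
across = cosine(M_c[0], M_c[4])      # docs in different groups
```

On this toy graph, docs in the same group end up with a cosine close to 1 and docs in different groups with a cosine close to 0. A dense SVD like this would not scale to 200M nodes; sparse, randomized methods would be needed at that size.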

Much of the recent excitement over embeddings started with BERT (Devlin et al., 2018), though embeddings have been important in Information Retrieval for at least thirty years (Deerwester et al., 1990). The older bag-of-words methods have some advantages over more recent methods that are limited to the first 512 subword units; for example, bag-of-words methods can use the full text of a long document. Node2vec can be viewed in terms of spectral clustering; cosines of node2vec embeddings are related to random walks on graphs.
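
The connection to random walks can be sketched as follows. Node2vec samples biased second-order walks controlled by parameters p and q and then trains a skip-gram model on the walks; the sketch below covers only the sampling step and uses uniform walks (the special case p = q = 1). The toy graph and the helper name random_walk are invented for illustration.

```python
import random

def random_walk(neighbors, start, length, rng):
    """Sample one uniform random walk with up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        options = neighbors[walk[-1]]
        if not options:  # dead end: stop the walk early
            break
        walk.append(rng.choice(options))
    return walk

# Toy undirected citation graph as adjacency lists (assumption).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

rng = random.Random(0)  # fixed seed so the sampled walks are reproducible
walks = [
    random_walk(neighbors, start, 5, rng)
    for start in neighbors
    for _ in range(10)  # 10 walks per starting node
]
```

Nodes that co-occur often in such walks are pushed toward similar rows during training, which is why cosines of the resulting embedding reflect proximity under random walks.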

More sophisticated versions of Mc will take advantage of citing sentences. We believe that citing sentences will help with terms such as “Turing Machine.” That term is common in sentences that cite Turing et al. (1936), even though the term does not appear in Turing’s paper, since he did not name anything after himself.

Some embeddings evolve over time, and some do not. Since the text does not change after a paper is published, Mt also does not change after publication. However, Mc evolves as more and more papers are published over time. We like to think of the literature as a conversation like social media. The value of a paper combines (time invariant) contributions from authors with (monotonically increasing) contributions from the audience.

3 Deliverables

This project will distribute resources (embeddings) and tools (programs and/or APIs) for ranked retrieval, routing and recommending papers to read and/or cite. In addition, we will enhance our theoretical understanding of deep nets by making connections with SVD. The routing application is of particular interest since many conferences assign papers to reviewers with software that may not work well, and may benefit from additional testing.

Evaluations will show that some embeddings are better for capturing text (Figure 1a) and other embeddings are better for capturing context (Figure 1b). Combinations of these embeddings (Figure 1d) are better than either by itself.
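
One simple way to combine a text embedding with a context embedding is to concatenate the length-normalized rows. This is only an illustrative assumption, not necessarily the combination method that will be used in this project; the toy matrices and the helper names are invented.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combine(text_row, context_row):
    """Concatenate the unit-length text row with the unit-length context row."""
    return normalize(text_row) + normalize(context_row)

# Toy rows for two documents (made-up values, different K for text and context).
M_t = [[1.0, 0.0], [1.0, 1.0]]            # text-based, K = 2
M_c = [[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]  # context-based, K = 3
M_tc = [combine(t, c) for t, c in zip(M_t, M_c)]

cos_t = cosine(M_t[0], M_t[1])
cos_c = cosine(M_c[0], M_c[1])
cos_tc = cosine(M_tc[0], M_tc[1])
```

With this construction, the cosine between two combined rows is the average of the text cosine and the context cosine, so both views contribute equally. Weighted concatenation or a learned combination (for example, a GNN) would be alternatives.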

Evaluations will be based on a number of benchmarks, including MAG240M¹¹ and SciRepEval.¹² Evaluation of routing systems will use materials from Mimno and McCallum (2007).

References

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.
David Mimno and Andrew McCallum. 2007. Expertise modeling for matching papers with reviewers. In KDD.
Alan Mathison Turing et al. 1936. On computable numbers, with an application to the entscheidungsproblem. J. of Math, 58(345-363):5.
Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and scalable network representation learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4278–4284.

¹ https://www.semanticscholar.org/product/api

² https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

³ https://www.doi.org/

⁴ https://pubmed.ncbi.nlm.nih.gov/

⁵ https://www.ncbi.nlm.nih.gov/pmc/

⁶ https://dblp.org/

⁷ https://arxiv.org/

⁸ https://aclanthology.org/

⁹ https://github.com/VHRanger/nodevectors

¹⁰ https://snap-stanford.github.io/cs224w-notes/machine-learning-with-networks/graph-neural-networks/

¹¹ https://ogb.stanford.edu/docs/lsc/

¹² https://github.com/allenai/scirepeval