JSALT2023

Better Together: Text + Context

Abstract

 

As seen in Figure 1, it is standard practice to represent documents, (a), as embeddings, (d). We will do this in multiple ways: embeddings based on deep nets (BERT) capture text, while embeddings based on node2vec and graph neural nets (GNNs), (c), capture citation graphs, (b). Embeddings encode each of N ≈ 200M documents as a vector of K ≈ 768 hidden dimensions, and the cosine of two vectors estimates the similarity of the two documents. We will evaluate these embeddings and show that combinations of text and citations are better than either by itself on standard benchmarks of downstream tasks.

As deliverables, we will make embeddings available to the community so they can use them in a range of applications: ranked retrieval, recommender systems and routing papers to reviewers. Our interdisciplinary team will have expertise in machine learning, artificial intelligence, information retrieval, bibliometrics, NLP and systems. Standard embeddings are time invariant. The representation of a document does not change after it is published. But citation graphs evolve over time. The representation of a document should combine time invariant contributions from the authors with constantly evolving responses from the audience, like social media.

 

 

Full proposal

1 Introduction

We will build several different embeddings of documents from Semantic Scholar,1 a collection of N ≈ 200 million documents from seven sources: (1) Microsoft Academic Graph (MAG),2 (2) DOI,3 (3) PubMed,4 (4) PubMedCentral,5 (5) DBLP,6 (6) ArXiv7 and (7) the ACL Anthology.8 Since Semantic Scholar plays such a central role in this project, we are fortunate to have Maria Antoniak and Sergey Feldman on our team. They are members of the Semantic Scholar group at the Allen Institute for Artificial Intelligence. Semantic Scholar is a significant effort involving more than 50 people working over 7 years. Semantic Scholar has more than 8M monthly active users. Another team member, Hui Guan, provides expertise in Systems, which will be useful since scale is a challenge for many of the methods discussed below.

 

2 Embeddings

An embedding is a dense matrix, M ∈ ℝ^{N×K}, where N is the number of documents and K is the number of hidden dimensions. The rows of M represent documents, and cosines of two rows estimate the similarity of the two documents (see the sketch after the examples below). Some embeddings, Mt, are text-based, and other embeddings, Mc, are context-based:

  1. Examples of text-based embeddings:
    1. BERT encoding of titles, abstracts, body
  2. Examples of context-based embeddings:
    1. node2vec encodings of citation graph
    2. BERT encoding of citing sentences
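
As a minimal sketch of how such a matrix is used (toy data; the names, sizes and random values below are illustrative, not the project's code), the similarity of two documents is the cosine of the corresponding rows of M:

```python
# Minimal sketch: document similarity as the cosine of two embedding rows.
# M is a toy stand-in for an N x K embedding matrix (here N = 5, K = 768).
import numpy as np

K = 768                              # hidden dimensions, as in the proposal
rng = np.random.default_rng(0)
M = rng.normal(size=(5, K))          # one row per document

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding rows."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(M[0], M[1]))            # similarity of documents 0 and 1
```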

We can construct examples of Mt and Mc from Semantic Scholar resources:

  1. Specter Embeddings (Cohan et al., 2020), based on SciBERT (Beltagy et al., 2019)
  2. Citation Graph: G = (N, E); N ≈ 200M nodes (docs) & E ≈ 2B edges (citations).
  3. Citing sentences for doc i: Sentences, s, in other documents j, where s cites doc i.

We will start with Specter embeddings as an example of Mt and a node2vec (Grover and Leskovec, 2016) encoding of the citation graph as an example of Mc. Node2vec maps a graph (a sparse N × N Boolean matrix) to a dense matrix in ℝ^{N×K}. Specter embeddings and citation graphs will be downloaded from Semantic Scholar. A number of node2vec methods are supported under nodevectors.9 ProNE (Zhang et al., 2019) is particularly promising: many node2vec methods are based on deep nets, but ProNE is based on SVD. We will also experiment with GNNs10 for combining text embeddings with citation graphs.
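
To make the SVD connection concrete, the following is a much-simplified sketch of an SVD-based node embedding over a toy citation graph. It is not ProNE itself (ProNE adds normalization and spectral propagation), and in practice we would call the nodevectors implementation rather than code like this:

```python
# Simplified sketch of an SVD-based node embedding (in the spirit of ProNE's
# first stage). The toy graph and K below are illustrative, not Semantic
# Scholar data.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Toy citation graph: edge (i, j) means document i cites document j.
edges = [(0, 1), (0, 2), (1, 2), (3, 2), (4, 3)]
n = 5
rows, cols = zip(*edges)
A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

K = 3                                  # hidden dimensions (toy value)
U, S, Vt = svds(A, k=K)                # truncated SVD of the adjacency matrix
Mc = U * np.sqrt(S)                    # one K-dimensional row per document

print(Mc.shape)                        # (n, K)
```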

Much of the recent excitement over embeddings started with BERT (Devlin et al., 2018), though embeddings have been important in Information Retrieval for at least thirty years (Deerwester et al., 1990). The older bag-of-words methods have some advantages over more recent methods, which are limited to the first 512 subword units. Node2vec can be viewed in terms of spectral clustering; cosines of node2vec embeddings are related to random walks on graphs.

More sophisticated versions of Mc will take advantage of citing sentences. We believe that citing sentences will help with terms such as “Turing Machine.” That term is common in sentences that cite Turing et al. (1936), even though the term does not appear in Turing’s paper, since he did not name anything after himself.
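
One simple way to turn citing sentences into an Mc (a sketch only; the sentence encoder and the example sentences below are illustrative assumptions, not the project's chosen recipe) is to represent each document by the mean encoding of the sentences that cite it:

```python
# Sketch: represent document i by the mean encoding of sentences in other
# papers that cite i. The model name and sentences are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # any sentence encoder

citing_sentences = {                               # doc id -> citing sentences
    "turing1936": [
        "A Turing Machine can simulate any algorithm.",
        "We reduce this problem to the halting problem for Turing Machines.",
    ],
}

mc_rows = {doc: model.encode(sents).mean(axis=0)   # one vector per document
           for doc, sents in citing_sentences.items()}
print(mc_rows["turing1936"].shape)
```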

Some embeddings evolve over time, and some do not. Since the text does not change after a paper is published, Mt also does not change after publication. However, Mc evolves as more and more papers are published over time. We like to think of the literature as a conversation, like social media: the value of a paper combines (time-invariant) contributions from the authors with (monotonically increasing) contributions from the audience.
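
As a minimal sketch of this view (the concatenation rule below is just one of several combination rules we could try, and the vectors are random toy data), the fixed text row can be paired with a context row that is recomputed as new citations arrive:

```python
# Sketch: combine a fixed text embedding with a context embedding that is
# recomputed over time, by concatenating the two after L2-normalization.
import numpy as np

def combine(m_t: np.ndarray, m_c: np.ndarray) -> np.ndarray:
    """Concatenate unit-normalized text and context embeddings."""
    m_t = m_t / np.linalg.norm(m_t)
    m_c = m_c / np.linalg.norm(m_c)
    return np.concatenate([m_t, m_c])

rng = np.random.default_rng(1)
m_t = rng.normal(size=768)        # fixed at publication time
m_c_2020 = rng.normal(size=768)   # context embedding as of 2020
m_c_2023 = rng.normal(size=768)   # recomputed after three more years of citations

print(combine(m_t, m_c_2020).shape, combine(m_t, m_c_2023).shape)
```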

 

3 Deliverables

This project will distribute resources (embeddings) and tools (programs and/or APIs) for ranked retrieval, routing and recommending papers to read and/or cite. In addition, we will enhance our theoretical understanding of deep nets by making connections with SVD. The routing application is of particular interest since many conferences assign papers to reviewers with software that may not work well, and may benefit from additional testing.
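
As a baseline sketch of the routing application (illustrative data only; real reviewer-assignment systems, including Mimno and McCallum's, use richer models), each reviewer can be scored by the cosine between the submission's embedding and the mean embedding of the reviewer's own papers:

```python
# Sketch: route a submission to the reviewer whose profile (mean embedding
# of their own papers) is closest in cosine similarity. Toy data only.
import numpy as np

rng = np.random.default_rng(2)
K = 768
reviewer_papers = {                    # reviewer -> embeddings of their papers
    "reviewer_a": rng.normal(size=(4, K)),
    "reviewer_b": rng.normal(size=(6, K)),
}
submission = rng.normal(size=K)        # embedding of the submitted paper

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

profiles = {r: papers.mean(axis=0) for r, papers in reviewer_papers.items()}
best = max(profiles, key=lambda r: cosine(submission, profiles[r]))
print("route to:", best)
```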

Evaluations will show that some embeddings are better for capturing text (Figure 1a) and other embeddings are better for capturing context (Figure 1b). Combinations of these embeddings (Figure 1d) are better than either by itself.

Evaluations will be based on a number of benchmarks including MAG240M11 and SciRepEval.12 Evaluation of routing systems will use materials from (Mimno and McCallum, 2007).

 

References

  • Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.
  • Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
  • Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864.
  • David Mimno and Andrew McCallum. 2007. Expertise modeling for matching papers with reviewers. In KDD.
  • Alan Mathison Turing et al. 1936. On computable numbers, with an application to the Entscheidungsproblem. J. of Math, 58(345-363):5.
  • Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and scalable network representation learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4278–4284.

 

1 https://www.semanticscholar.org/product/api

2 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

3 https://www.doi.org/

4 https://pubmed.ncbi.nlm.nih.gov/

5 https://www.ncbi.nlm.nih.gov/pmc/

6 https://dblp.org/

7 https://arxiv.org/

8 https://aclanthology.org/

9 https://github.com/VHRanger/nodevectors

10  https://snap-stanford.github.io/cs224w-notes/machine-learning-with-networks/graph-neural-networks/

11 https://ogb.stanford.edu/docs/lsc/

12 https://github.com/allenai/scirepeval

Group Members

Full-time members

Senior researchers

  • Kenneth W. Church (Northeastern University) team leader
  • John E. Ortega (Northeastern University)
  • Shabnam Tafreshi (University of Maryland)
  • Hui Guan (University of Massachusetts, Amherst)

Junior researchers

  • Abteen Ebrahimi (University of Colorado Boulder)
  • Sandeep Polisety (University of Massachusetts Amherst)
  • Peter Vickers (University of Sheffield)
  • Rodolfo Zevallos (Universitat Pompeu Fabra, Barcelona)

Undergraduate students

  • Benjamin Irving (Northeastern University)
  • Melissa Mitchell (University of Washington)

Part-time members

  • Sergey Feldman (Allen Institute for Artificial Intelligence)
  • Martin Dočekal (Brno University of Technology)

  • Jiameng Sun (Northeastern University)
