Entity Salience Patent Review

Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, the authors introduce the task of entity salience: assigning a relevance score to each entity in a document.

The authors demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. They then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, they outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

INTRODUCTION:

Information retrieval, summarization, and online advertising rely on identifying the most important words and phrases in web documents. While traditional techniques treat documents as collections of keywords, many NLP systems are shifting toward understanding documents in terms of entities.

Google is a semantic search engine, meaning it tries to understand what content means rather than only matching keywords. Google uses Natural Language Processing (NLP) algorithms to understand content.

Google needed new algorithms to determine the prominence (that is, the salience) of each entity in a document.

Toward this end, these scientists describe three primary contributions:

  1. First, they show how a labeled corpus for this task can be automatically constructed from a corpus of documents with accompanying abstracts. They also demonstrate the validity of the corpus with a manual annotation study.
  2. Second, they trained an entity salience model using features derived from a coreference resolution system. This model significantly outperforms a baseline model based on sentence position.
  3. Third, they suggested how their model can be improved by leveraging background information about the entities and their relationships – information not specifically provided in the document in question.

Google generated a salience corpus for testing:

According to the authors, testing was based on the annotated (labeled) New York Times corpus, which supplies the document/abstract pairs. It includes 1.8 million articles published between January 1987 and June 2007; some 650,000 include a summary written by one of the newspaper’s library scientists.

We automatically generate salience labels for an existing corpus of document/abstract pairs. We derive the labels using the assumption that the salient entities will be mentioned in the abstract, so we identify and align the entities in each text.

Given a document and abstract, we run a standard NLP pipeline on both. This includes a POS tagger and dependency parser, comparable in accuracy to the current Stanford dependency parser (Klein and Manning, 2003); an NP extractor that uses POS tags and dependency edges to identify a set of entity mentions; a coreference resolver, comparable to that of Haghighi and Klein (2009), for clustering mentions; and an entity resolver that links entities to Freebase profiles. The entity resolver is described in detail by Lao et al. (2012). We then apply a simple heuristic to align the entities in the abstract and document.
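The labeling assumption above can be sketched in a few lines. This is only an illustration: it assumes the pipeline has already resolved every mention to an entity ID, and it uses an exact ID match as a stand-in for the "simple heuristic" that aligns entities, which the text does not spell out. The function name and the example IDs are hypothetical.

```python
from typing import Dict, Set


def salience_labels(doc_entities: Set[str], abstract_entities: Set[str]) -> Dict[str, bool]:
    """Label a document entity salient if it also appears in the abstract.

    Assumption: both sets hold resolved entity IDs, and alignment is an
    exact ID match (a stand-in for the unspecified alignment heuristic).
    """
    return {entity: entity in abstract_entities for entity in doc_entities}


# Hypothetical entity IDs for one document/abstract pair.
labels = salience_labels(
    doc_entities={"john_boehner", "fema", "congress"},
    abstract_entities={"john_boehner", "congress"},
)
```

Entities that appear only in the abstract are ignored here, since the task scores the entities of the document itself.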

Salience classification:

The authors built a regularized binary logistic regression model to predict the probability that an entity is salient. To simplify feature selection and to add some further regularization, they used feature hashing to randomly map each feature string to an integer in [1, 100000]; larger alphabet sizes yielded no improvement.
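To make the hashing step concrete, here is a minimal sketch. The hash function used in the original work is not named, so MD5 is an assumption chosen only because it is stable across runs; the function names, the feature strings, and the weight dictionary are likewise hypothetical.

```python
import hashlib
import math
from typing import Dict, Iterable

NUM_BUCKETS = 100_000  # alphabet size quoted in the text


def hash_feature(feature: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a feature string to a stable integer in [1, num_buckets].

    Assumption: MD5 stands in for whatever hash the authors used.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets + 1


def salience_probability(features: Iterable[str],
                         weights: Dict[int, float],
                         bias: float = 0.0) -> float:
    """Logistic regression over binary indicator features.

    Each active feature contributes the weight stored at its hashed index;
    unseen indices contribute nothing.
    """
    score = bias + sum(weights.get(hash_feature(f), 0.0) for f in set(features))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical feature strings for one entity.
p = salience_probability(["first_mention_sentence=0", "mention_count_bucket=3"], weights={})
```

Two distinct feature strings can collide in the same bucket. That is the source of the extra regularization mentioned above, and it also means no feature dictionary has to be stored.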

Positional baseline:

For news documents, it is well known that sentence position is a very strong indicator for relevance. Thus, our baseline is a system that identifies an entity as salient if it is mentioned in the first sentence of the document. (Including the next few sentences did not significantly change the score.)
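The baseline is simple enough to write out directly. The sketch below assumes each entity comes with the 0-based indices of the sentences that mention it; that input layout and the function name are illustrative choices, not something the text specifies.

```python
from typing import Dict, List


def first_sentence_baseline(mentions: Dict[str, List[int]]) -> Dict[str, bool]:
    """Mark an entity salient if any of its mentions is in the first sentence.

    Assumption: `mentions` maps an entity to the 0-based sentence indices
    where it is mentioned.
    """
    return {entity: 0 in sentences for entity, sentences in mentions.items()}


# Hypothetical mention positions for two entities.
predictions = first_sentence_baseline({"Barack Obama": [0, 4], "FEMA": [2, 3, 7]})
```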

Model features

Table 2 describes our feature classes; each individual feature in the model is a binary indicator. Count features are bucketed by applying the function f(x) = round(log(k(x + 1))), where k can be used to control the number of buckets. We simply set k = 10 in all cases.
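The bucketing function can be tried out as follows. The base of the logarithm is not stated, so the natural log is an assumption here. Python's built-in round also rounds exact halves to the nearest even integer, which may differ from the rounding rule in the original work.

```python
import math


def bucket(count: int, k: float = 10.0) -> int:
    """Bucket a non-negative count with f(x) = round(log(k(x + 1))).

    Assumption: natural logarithm; the text does not give the base.
    """
    return round(math.log(k * (count + 1)))


# Bucket ids for a few example counts; larger counts share coarser buckets.
buckets = {count: bucket(count) for count in (0, 1, 2, 5, 10, 50)}
```

Because each model feature is a binary indicator, a bucketed count would be emitted as a string naming the feature class and the bucket, rather than as a numeric value.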

Table 3 shows the experimental results on our test set. Each experiment uses a classification threshold of 0.3 to determine salience, which in each case is very close to the threshold that maximizes F1. For comparison, a classifier that always predicts the majority class, non-salient, has F1 = 23.9 (for the salient class).
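As a reminder of how the threshold and the F1 score interact, here is a small sketch that scores the salient class only. Treating a probability equal to the threshold as salient is an assumption, and the probabilities and gold labels in the example are made up for illustration.

```python
from typing import Sequence


def f1_at_threshold(probs: Sequence[float],
                    gold: Sequence[bool],
                    threshold: float = 0.3) -> float:
    """F1 for the salient class when predicting salient at prob >= threshold."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up model outputs and gold labels for four entities.
score = f1_at_threshold([0.9, 0.4, 0.2, 0.1], [True, False, True, False])
```

Sweeping the threshold argument over a grid and keeping the value with the highest score is one way to check how close 0.3 is to the F1-maximizing threshold on a given data set.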

Entity centrality

All the features described above use only information available within the document. However, articles are written with the assumption that the reader knows something about at least some of the entities involved. We experimented with a simple method for including background knowledge about each entity and an adaptation of PageRank to a graph of connected entities.

Consider, for example, an article about a recent congressional budget debate. Although House Speaker John Boehner may be mentioned just once, we know he is likely salient because he is closely related to other entities in the article, such as Congress, the Republican Party, and Barack Obama. On the other hand, the Federal Emergency Management Agency may be mentioned repeatedly because it happened to host a major presidential speech, but it is less related to the story’s key figures and less central to the article’s point.

Our intuition about these relationships, mostly not explicit in the document, can be formalized in a local PageRank computation on the entity graph.

PageRank for computing centrality

In the weighted version of the PageRank algorithm, a web link is considered a weighted vote by the containing page for the landing page – a directed edge in a graph where each node is a webpage. In place of the web graph, we consider the graph of Freebase entities that appear in the document. The nodes are the entities, and a directed edge from E1 to E2 represents P(E2|E1), the probability of observing E2 in a document given that we have observed E1.

We estimate P(E2|E1) by counting the number of training documents in which E1 and E2 co-occur and normalizing by the number of training documents in which E1 occurs.
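The estimate is a ratio of document counts, which the following sketch computes for every ordered pair of entities. It assumes each training document has already been reduced to the set of entity IDs it contains; the function name and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_probs(docs: Iterable[Set[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate P(E2|E1) = #docs containing both E1 and E2 / #docs containing E1."""
    occurs = Counter()
    cooccurs = Counter()
    for entities in docs:
        unique = set(entities)
        occurs.update(unique)                    # one count per document
        cooccurs.update(permutations(unique, 2)) # ordered pairs (E1, E2)
    return {(e1, e2): n / occurs[e1] for (e1, e2), n in cooccurs.items()}


# Toy training set of three documents.
probs = cooccurrence_probs([{"A", "B"}, {"A"}, {"A", "B", "C"}])
```

The estimate is asymmetric: in the toy data, B always appears together with A, but A appears with B in only two of its three documents.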

The nodes’ initial PageRank values act as a prior, where the uniform distribution, used in the classic PageRank algorithm, indicates a lack of prior knowledge. Since we have some prior signal about salience, we initialize the node values to the normalized mention counts of the entities in the document. We use a damping factor d, allowing random jumps between nodes with probability 1 – d, with the standard value d = 0.85.

We implemented the iterative version of weighted PageRank, which tends to converge in under 10 iterations. The centrality features in Table 3 are indicators for the rank orders of the converged entity scores. The improvement from adding centrality features is small but statistically significant at p ≤ 0.001.
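A minimal sketch of this computation is shown below. Several details are assumptions rather than facts from the text: each node's outgoing edge weights are normalized to sum to 1, random jumps land uniformly on all nodes, and nodes without outgoing edges spread their score uniformly. The function name, the stopping tolerance, and the example graph are also illustrative.

```python
from typing import Dict, Tuple


def weighted_pagerank(mention_counts: Dict[str, int],
                      edge_weights: Dict[Tuple[str, str], float],
                      d: float = 0.85,
                      tol: float = 1e-6,
                      max_iter: int = 100) -> Dict[str, float]:
    """Iterative weighted PageRank seeded with normalized mention counts."""
    nodes = list(mention_counts)
    n = len(nodes)
    total = sum(mention_counts.values())
    rank = {v: mention_counts[v] / total for v in nodes}  # prior from mentions

    # Keep only edges between known nodes and total each node's outgoing weight.
    edges = {(s, t): w for (s, t), w in edge_weights.items()
             if s in rank and t in rank}
    out_sum = {v: 0.0 for v in nodes}
    for (s, _), w in edges.items():
        out_sum[s] += w

    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if out_sum[v] == 0.0)
        # Assumed uniform jump, plus uniform spread of dangling-node mass.
        new_rank = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for (s, t), w in edges.items():
            if out_sum[s] > 0.0:
                new_rank[t] += d * rank[s] * w / out_sum[s]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-entity document graph.
scores = weighted_pagerank(
    {"A": 3, "B": 1, "C": 1},
    {("A", "B"): 1.0, ("B", "A"): 1.0, ("C", "A"): 1.0},
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the converged scores gives the rank order of each entity, which is the quantity the centrality features are said to indicate.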

Sources:

https://code.google.com/p/nyt-salience
Jesse Dunietz, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA, jdunietz@cs.cmu.edu
Dan Gillick, Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA, dgillick@google.com

Ameneh

Ameneh Saeednia is a Co-Founder and Trust Strategist at Re-Imagine That Digital. She focuses on the human and strategic dimensions of digital authority — building the trust architecture that makes brands believable, credible, and citation-worthy in the eyes of both search engines and real audiences. Ami translates E-E-A-T principles into actionable brand trust systems. ORCID: https://orcid.org/0000-0002-8812-2306 Podcast: https://www.youtube.com/@ReimaginingDigital