# Summarizing Spanish with Stanford CoreNLP

After a summer replete with feature engineering and corpus processing, the Stanford NLP Group has just released CoreNLP 3.4.1, which includes support for Spanish-language text. In this post I’ll show how to use these tools to build a dead-simple document summarizer.[1]

Our end goal will be to take a news article of significant length and reduce it to its two or three most important points. We’ll run through each sentence and assign it a score based on two factors:

1. tf–idf weights. The tf–idf metric measures how important a particular word is in the context of its containing document. We’ll calculate the sum of tf–idf scores for all nouns in each sentence, and consider those sentences with the greatest sums to be the most important.

The tf–idf metric is the product of two factors:

The first is a term frequency (TF) factor, a scaled version of the number of times the word appears in its containing document. We’ll use a logarithmic form here:
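One common logarithmic scaling, which I'll assume for the rest of this walkthrough, adds one to the log of the raw count. The symbol $c_{t,d}$ is my own notation for the number of times term $t$ occurs in document $d$:

```latex
\mathrm{tf}(t, d) = 1 + \log c_{t,d}, \qquad c_{t,d} \ge 1
```

Under this form a word that appears once gets a weight of 1, while a word that appears ten times gets roughly 3.3 (with the natural log) rather than 10, so repeated words help but do not dominate.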

The second is an inverse document frequency (IDF) factor. This measures the informativeness of the word based on how many documents across an entire corpus contain it. The inverse document frequency factor is a logarithm as well:
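A standard form, assumed here, takes the log of the ratio between the corpus size and the number of documents containing the word. $N$ (total documents in the corpus) and $\mathrm{df}_t$ (documents containing term $t$) are my own notation:

```latex
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}_t},
\qquad
\text{tf–idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With this definition, a term found in every document has $\mathrm{df}_t = N$ and therefore an IDF of $\log 1 = 0$.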

Note that IDF values will be at or near 0 for common words like “the,” which are likely to appear in nearly every document in the corpus. Meaningful and less common words like “transmogrify” and “incinerate” will yield higher IDF values.

2. Positional weight. For news articles, another easy measure of the importance of a sentence is its position in the document: important sentences tend to appear before less crucial ones. We can model this by scaling our original tf–idf score down as the index of the sentence within the document increases.

With theory over, let’s get to the code. I’m going to walk through a Java class Summarizer, the full source code of which is available in a GitHub repo. Our only dependency here is Stanford CoreNLP 3.4.1. We begin by instantiating the CoreNLP pipeline statically.

As we discussed earlier, the summarizer depends upon document frequency data, which must be precalculated from a corpus of Spanish text. In the constructor of the Summarizer, we receive a prebuilt dfCounter and determine the total number of documents in the training corpus.[2]
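To make the constructor's job concrete, here is a minimal sketch in plain Java. It is not the actual class from the repo: I'm standing in a `Map<String, Double>` for the counter, and I'm assuming the corpus size is stored under a reserved key (the hypothetical `__all__`). The class and method names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

class SummarizerSketch {
    // Hypothetical reserved key under which the corpus size is stored (an assumption).
    static final String TOTAL_DOCS_KEY = "__all__";

    private final Map<String, Double> dfCounter;
    private final double numDocuments;

    SummarizerSketch(Map<String, Double> dfCounter) {
        Double total = dfCounter.get(TOTAL_DOCS_KEY);
        if (total == null || total <= 0) {
            throw new IllegalArgumentException("dfCounter is missing a positive document total");
        }
        this.dfCounter = dfCounter;
        this.numDocuments = total;
    }

    /** Total number of documents in the training corpus. */
    double numDocuments() {
        return numDocuments;
    }

    /** Number of training documents containing the word; 0 if it was never seen. */
    double documentFrequency(String word) {
        return dfCounter.getOrDefault(word, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> counts = new HashMap<>();
        counts.put(TOTAL_DOCS_KEY, 1000.0);
        counts.put("galaxia", 12.0);
        SummarizerSketch s = new SummarizerSketch(counts);
        System.out.println(s.numDocuments() + " documents; df(galaxia) = "
                + s.documentFrequency("galaxia"));
    }
}
```

Failing fast on a missing total seems safer than silently computing IDF values against a corpus size of zero.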

Our main routine, summarize, accepts a document string and a number of sentences to return.
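The overall shape of such a routine can be sketched without CoreNLP at all. In the version below, the JDK's `BreakIterator` stands in for CoreNLP's sentence splitter, and the ranking is passed in as a comparator. Both choices are assumptions made for illustration, as are the names `SummarizeSketch` and `splitSentences`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SummarizeSketch {
    /** Splits a document into trimmed sentences (a stand-in for CoreNLP's splitter). */
    static List<String> splitSentences(String document) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.forLanguageTag("es"));
        it.setText(document);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = document.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    /** Returns the top-ranked sentences, joined in ranked order. */
    static String summarize(String document, int numSentences,
                            Comparator<String> mostImportantFirst) {
        List<String> ranked = new ArrayList<>(splitSentences(document));
        ranked.sort(mostImportantFirst);
        int n = Math.min(Math.max(numSentences, 0), ranked.size());
        return String.join(" ", ranked.subList(0, n));
    }

    public static void main(String[] args) {
        // Toy ranking: longer sentences first. A real ranking would use tf-idf scores.
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        System.out.println(summarize("Uno dos. Tres cuatro cinco. Seis.", 2, byLength.reversed()));
    }
}
```

Note that this sketch returns sentences in ranked order rather than document order. Re-sorting the selected sentences by their original position would be an easy variation.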

The method rankSentences sorts the provided sentence collection using a custom comparator SentenceComparator, which contains the bulk of our actual logic for sentence importance. Here’s the framework:
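As a rough illustration of that framework, the sketch below ranks plain strings with a comparator that delegates to a scoring function. Using `String` for sentences and a `ToDoubleFunction` for the score are simplifying assumptions of mine. The real class works on CoreNLP sentence objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

class RankSketch {
    /** Orders sentences so that higher-scoring ones come first. */
    static class SentenceComparator implements Comparator<String> {
        private final ToDoubleFunction<String> score;

        SentenceComparator(ToDoubleFunction<String> score) {
            this.score = score;
        }

        @Override
        public int compare(String a, String b) {
            // Arguments are swapped so that sorting yields descending scores.
            return Double.compare(score.applyAsDouble(b), score.applyAsDouble(a));
        }
    }

    /** Returns a new list sorted by descending score; the input is left untouched. */
    static List<String> rankSentences(List<String> sentences, ToDoubleFunction<String> score) {
        List<String> ranked = new ArrayList<>(sentences);
        ranked.sort(new SentenceComparator(score));
        return ranked;
    }

    public static void main(String[] args) {
        // Toy score: sentence length in characters.
        List<String> ranked = rankSentences(Arrays.asList("a", "bbb", "bb"), s -> (double) s.length());
        System.out.println(ranked);
    }
}
```

Because the comparator recomputes scores on every comparison, caching them per sentence would be worthwhile if scoring is expensive.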

The score method and the ones that follow are the meat of the code. score accepts a sentence and returns a floating-point value indicating the sentence’s importance.
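One simple way to combine the two factors is to divide a sentence's tf–idf sum by its one-based position. This exact weighting is an assumption for illustration, not necessarily what the repo does, and `ScoreSketch` is a hypothetical name.

```java
class ScoreSketch {
    /**
     * Combines a sentence's summed tf-idf weight with its position.
     * sentenceIndex is zero-based; adding 1 avoids dividing by zero
     * for the first sentence.
     */
    static double score(double tfIdfSum, int sentenceIndex) {
        if (sentenceIndex < 0) {
            throw new IllegalArgumentException("sentenceIndex must be non-negative");
        }
        return tfIdfSum / (sentenceIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println(score(6.0, 0));
        System.out.println(score(6.0, 2));
    }
}
```

With a tf–idf sum of 6.0, a sentence at index 0 scores 6.0 and one at index 2 scores 2.0, so an equally informative sentence loses out the later it appears.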

score calls a method tfIDFWeights, which determines the total tf–idf scores for all the nouns in the given sentence:
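Here is a self-contained sketch of that computation using plain Java collections. The `Token` class is a stand-in for a POS-tagged token. The check that noun tags start with "n" assumes an AnCora-style tagset, and the tf and idf forms (1 + log count, log of N over df) are the common variants rather than something confirmed by the repo.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfSketch {
    /** A POS-tagged token; a stand-in for what the tagger would produce. */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** tf-idf weight of one word, or 0 if it is absent from the document or the corpus. */
    static double tfIdfWeight(String word, Map<String, Integer> termCounts,
                              Map<String, Integer> docFreqs, int numDocuments) {
        int count = termCounts.getOrDefault(word, 0);
        int df = docFreqs.getOrDefault(word, 0);
        if (count == 0 || df == 0) {
            return 0.0;
        }
        double tf = 1 + Math.log(count);
        double idf = Math.log((double) numDocuments / df);
        return tf * idf;
    }

    /** Sums tf-idf weights over the nouns of a sentence (tags assumed to start with "n"). */
    static double tfIdfWeights(List<Token> sentence, Map<String, Integer> termCounts,
                               Map<String, Integer> docFreqs, int numDocuments) {
        double total = 0;
        for (Token token : sentence) {
            if (token.tag.startsWith("n")) {
                total += tfIdfWeight(token.word, termCounts, docFreqs, numDocuments);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> termCounts = new HashMap<>();
        termCounts.put("galaxia", 1);
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("galaxia", 10);
        List<Token> sentence = Arrays.asList(
                new Token("la", "da0000"),
                new Token("galaxia", "nc0s000"),
                new Token("brilla", "vmip000"));
        System.out.println(tfIdfWeights(sentence, termCounts, docFreqs, 1000));
    }
}
```

Only the noun contributes in this example. The determiner and verb are skipped by the tag check before any lookup happens.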

That’s it for the code. You can see the entire class in this public GitHub repo.

I’ll end with a quick unscientific test of the code. I built document-frequency counts (using a helper DocumentFrequencyCounter class) from the Spanish Gigaword, which contains about 1.5 billion words of Spanish. It took several days (running on a 16-core machine) to POS-tag each sentence and collect the nouns in a global counter.[3]
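The counting step itself is simple once the nouns have been extracted. The sketch below is not the DocumentFrequencyCounter class from the repo. It assumes each document has already been reduced to a list of its nouns, and that a noun is counted once per document however often it occurs there.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class DocumentFrequencySketch {
    /** Maps each noun to the number of documents in which it appears at least once. */
    static Map<String, Integer> countDocumentFrequencies(List<List<String>> nounsPerDocument) {
        Map<String, Integer> documentFrequencies = new HashMap<>();
        for (List<String> nouns : nounsPerDocument) {
            // De-duplicate so that each document contributes at most 1 per noun.
            for (String noun : new HashSet<>(nouns)) {
                documentFrequencies.merge(noun, 1, Integer::sum);
            }
        }
        return documentFrequencies;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("galaxia", "galaxia", "universo"),
                Arrays.asList("galaxia", "gobierno"));
        System.out.println(countDocumentFrequencies(docs));
    }
}
```

Since each document is processed independently, per-document noun sets could be computed in parallel and merged afterwards, which matters when tagging dominates the running time.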

I next tested with a few recent Spanish news articles, requesting a two-sentence summary of each. Here’s the output summary of an article on the Laniakea supercluster:

Las galaxias no están distribuidas al azar en todo el universo, sino que se encuentran en grupos, al igual que nuestro propio Grupo Local, que contiene docenas de galaxias, y en cúmulos masivos, que poseen cientos de galaxias, todas interconectadas en una red de filamentos en la que se ensartan como perlas. Estos expertos han bautizado al supercúmulo con el nombre de ‘Laniakea’, que significa “cielo inmenso” en hawaiano, como informan en un artículo de la edición de este jueves de Nature. Una galaxia entre dos estructuras de este tipo puede quedar atrapada en un tira y afloja gravitacional en el que el equilibrio de las fuerzas gravitacionales que rodean las estructuras a gran escala determina el movimiento de la galaxia.

La inclusión de la capital de Francia como nueva jurisdicción para hacer efectivos los desembolsos a los acreedores ha sido una iniciativa del bloque ‘cristinista’ para ganar los votos de algunos legisladores opositores. Por ejemplo, los legisladores del Frente Renovador, también peronista pero no ‘cristinista’, según la prensa, acordarían con la inclusión de París, por considerar que allí los pagos estarían a salvo de los fondos especulativos o ‘buitre’. Con esta iniciativa el gobierno de la presidenta Cristina Fernández, viuda de Kirchner, pretende esquivar a la justicia de los Estados Unidos y a los fondos especulativos o ‘buitre’ que ganaron a Argentina un juicio y colocaron al país en ‘default’ parcial.

I hope this code serves as a useful example for using basic CoreNLP tools in Spanish. Feel free to follow up below in the comments or by email!

[1] I won’t claim this will always give fantastic summarizations, but it’s definitely a quick and easy-to-grasp algorithm.

[2] If you are interested in how this helper data is constructed, see the DocumentFrequencyCounter class in the GitHub repo.

[3] This could probably have been optimized down to a matter of hours, but when you’ve got the time…