Author: Larissa Goulart, Northern Arizona University
Tested with CoreNLP 4.1.0


Before we start

The goal of this workshop is to teach you how to annotate your corpus with the Stanford Dependency Parser. In order to follow the workshop, you will need to download:

     1. CoreNLP

     2. Java

     3. Corpus Text Processor from CROW


Overview

Here is a general step by step of this workshop:

      1. Introduction to the Stanford Dependency Parser

      2. Preparing the files

      3. Running the Stanford Dependency Parser


1. Introduction to the Stanford Dependency Parser

The Stanford CoreNLP suite offers Java-based tools for a number of basic NLP tasks. CoreNLP is a toolkit updated frequently by the NLP research group at Stanford University. CoreNLP integrates the following tools:

  1. The part-of-speech tagger
  2. The named entity recognizer
  3. The parser
  4. The coreference resolution system
  5. The sentiment analysis system
    among others.

These tools allow us to extract linguistic annotations from our corpus, similar to what is seen below.

For corpus linguists, these annotations can be useful for conducting different types of analyses, such as key-feature analysis, multi-dimensional analysis, and machine-learning applications.

The two main annotators used for grammatical analysis are the part-of-speech tagger and the dependency parser.

1.1. Part-of-Speech Annotation

Part-of-speech annotation gives you information about the word class of each word in a sentence.

1.2. Dependencies

Dependencies indicate the syntactic relationships between the words in a sentence.
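As a hedged illustration (the sentence and labels below are mine, following Universal Dependencies conventions; your CoreNLP version may print slightly different relation names, such as dobj instead of obj), dependencies are usually written as relation(head, dependent):

```shell
# Illustrative only: dependency relations for "The students wrote abstracts",
# written as relation(head-index, dependent-index).
cat > deps_example.txt <<'EOF'
det(students-2, The-1)
nsubj(wrote-3, students-2)
root(ROOT-0, wrote-3)
obj(wrote-3, abstracts-4)
EOF
cat deps_example.txt
```

Here "students" is the subject (nsubj) of the head verb "wrote", and "The" is the determiner (det) of "students".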

Jurafsky & Martin, 2019


Universal Stanford Dependencies are part of the Universal Dependencies project, which aims at cross-linguistically consistent annotation. You can read more about them here

2. Preparing the files

2.1. Files: LILE Corpus

For this workshop we will use ten files from the LILE Corpus (Sarmento, Scortegagna & Goulart, 2011). This corpus contains abstracts of theses and dissertations written in the fields of Literature and Applied Linguistics. You can download the files we will use here or here.

2.2. Corpus Text Processor

We will use CROW’s Corpus Text Processor to prepare the files. Before running the tool, I recommend that you create three folders on your computer that mirror the menus in the Corpus Text Processor: 01_converted, 02_encoded, and 03_standardized. It should look like this.
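If you prefer the command line, the three folders can also be created in one step (a sketch; run it from wherever you keep your workshop files, in a Mac/Linux terminal or Windows PowerShell):

```shell
# Create the three folders that mirror the Corpus Text Processor menus
mkdir -p 01_converted 02_encoded 03_standardized
```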

And this is what you will see when you open the Corpus Text Processor.

To convert the files, we will follow the order of the menus in the Corpus Text Processor. You can also watch the video demonstration on how to run it.

Now, we have all the files in the correct format to run the Stanford Dependency Parser.


3. Running the Stanford Dependency Parser

3.1. Checking the Java settings

In order to run CoreNLP, you will need to allocate more RAM to Java than your computer’s default setting. If you have a lot of RAM, this will not be a problem for you. You can check how much RAM you have by going to Windows/Start > Settings > System > About.

When explaining the memory needed to run CoreNLP, the Stanford NLP group said that:

CoreNLP needs about 2GB to run the entire pipeline.

Let’s allocate more RAM to Java in your machine.

      1. Open your Control Panel

      2. Select Programs

      3. Select Java

      4. Click on the tab Java

      5. Change the Runtime Parameters

      6. Decide on the most suitable parameter

-Xmx512m assigns 512MB memory for Java.
-Xmx1024m assigns 1GB memory for Java.
-Xmx2048m assigns 2GB memory for Java.
-Xmx3072m assigns 3GB memory for Java.
And so on…

You can start with -Xmx3072m; if Java does not accept that setting, try -Xmx2048m, and so on. Keep in mind that assigning more memory to Java generally allows CoreNLP to run faster.

      7. Click OK on the “Java Runtime” window and then Apply in the “Java Control Panel”

To understand why CoreNLP needs more memory than other applications, you can check the developers’ explanation here.

3.3. Launching CoreNLP (optional)

After navigating to the CoreNLP folder, we are going to run the following command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port] [timeout]

Where -mx4g indicates the amount of RAM that CoreNLP can use. So, if you allocated 3GB to Java, you can use -mx3g, and so on.

It should look like this:

Then, open http://localhost:9000 on your machine and check that an interface like this opens.

3.4. Running CoreNLP on one file

The previous command opens the CoreNLP interface. With the interface, you can analyze one sentence at a time; if you want to annotate a whole text, this is the command we will use:

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file filename_here -outputFormat conll

Where filename_here is the name of your file and -mx3g is the amount of RAM you assigned to Java. In addition, make sure to write the path between quotation marks.

Windows

This is an example of the command on Windows:

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat conll

Mac

This is an example of the command on a Mac:

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat conll

Where ..\ (or ../ on a Mac) means going up one folder level.

It will take a couple of seconds to annotate the text.

After running CoreNLP, your file should look like this.
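The screenshot is not reproduced here, but as a sketch (the sentence below is made up), the conll output has seven tab-separated columns: token index, word, lemma, part-of-speech tag, named entity label, head index, and dependency relation:

```shell
# A made-up example of CoreNLP's conll output format (tab-separated).
# Columns: index, word, lemma, POS, NER, head, dependency relation.
printf '1\tThe\tthe\tDT\tO\t2\tdet\n'             >  conll_example.txt
printf '2\tstudy\tstudy\tNN\tO\t3\tnsubj\n'       >> conll_example.txt
printf '3\tanalyzes\tanalyze\tVBZ\tO\t0\troot\n'  >> conll_example.txt
printf '4\tabstracts\tabstract\tNNS\tO\t3\tobj\n' >> conll_example.txt
printf '5\t.\t.\t.\tO\t3\tpunct\n'                >> conll_example.txt
cat conll_example.txt
```

Each token is one row, sentences are separated by a blank line, and the head index 0 marks the root of the sentence.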

3.5. Running CoreNLP on a directory

While annotating one file may be all you need, for corpus linguistics analyses you will usually have to annotate several texts in a directory.

Windows

To annotate a whole directory, this is the command you will use on Windows (in PowerShell):

Get-ChildItem "input_Directory" -Filter *.txt | Foreach-Object {java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file $_.FullName -outputDirectory "output_Directory" -outputFormat conll}

Make sure to write the input and the output paths between quotation marks " ".

This is an example of the command on Windows:

Get-ChildItem "..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized" -Filter *.txt |Foreach-Object {java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file $_.FullName -outputDirectory "..\LAEL_Research_Bazaar\tagged" -outputFormat conll}

Here is what it should look like in your PowerShell ISE:

Mac

This is the command you will use on a Mac:

for i in input_directory/*.txt; do (java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "$i" -outputDirectory "output_Directory" -outputFormat conll); done

Make sure to write the input and the output paths between quotation marks " ".

This is an example of the code on a Mac:

for i in "../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/"*.txt; do (java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "$i" -outputDirectory "../LAEL_Research_Bazaar/tagged" -outputFormat conll); done

3.6. Annotators available

In the previous step, we ran CoreNLP annotating our texts for lemmas, part-of-speech tags, named entities, and dependencies, but CoreNLP offers many other tools. You can see the full list of annotators here: https://stanfordnlp.github.io/CoreNLP/annotators.html

To change the annotators we are using, we add the flag -annotators followed by the annotators we would like to run. For example, the command

Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat conll

Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat conll

will run only the tokenizer, the sentence splitter, and the part-of-speech tagger.

Now, the output will look like this:

      Example: sentiment analysis

Let’s say you want to run sentiment analysis: all you have to do is include the sentiment annotator in the -annotators flag and change the outputFormat to xml.

Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat xml

Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat xml

Note that there are dependencies between annotators. Thus, to run the sentiment annotator, you also need to run the tokenizer, the sentence splitter, the part-of-speech tagger, and the parser.

Just remember: the more annotators you include, the longer it will take to annotate each text.

3.7. Flags key

We can think of the flags as items in a menu. Here is a key for the different flags we used.

-annotators: we choose the annotators we want to use.
-file: we indicate the file that will be annotated.
-outputFormat: we select the format of the file that is output with the tags.
-props: we point to a properties file, for example to indicate the language we are annotating.
-outputDirectory: we choose the folder where we are going to save the texts.

3.8. Reading the output

The output with lemmas, part-of-speech tags, recognized entities and dependencies will look like this:

      a. Part-of-Speech Tags (column 4/E)

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

      b. Recognized entities (column 5/F)

In English, CoreNLP recognizes
PERSON
LOCATION
ORGANIZATION
MISC
MONEY
NUMBER
ORDINAL
PERCENT
DATE
TIME
DURATION
SET

      c. Dependencies (column 6 and 7/G and H)

Column 6 indicates the head of each word, while column 7 annotates the dependency relation. For the complete list of dependencies, you can check it here - Universal Dependency Relations
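Knowing the column layout, you can inspect these columns with standard command-line tools. A minimal sketch (the two sample lines and the file name are made up for illustration):

```shell
# Write a made-up two-token sample in the conll layout; column 6 is the
# head index and column 7 the dependency relation.
printf '1\tThe\tthe\tDT\tO\t2\tdet\n'   >  sample.conll
printf '2\tdog\tdog\tNN\tO\t3\tnsubj\n' >> sample.conll

# Print each word next to its dependency relation
awk -F'\t' '{print $2, $7}' sample.conll
# prints:
# The det
# dog nsubj
```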

3.9. Annotating other languages

To annotate texts in other languages, first you need to download the properties for the language you want to use. Find the correct file here.

Let’s practice with one file in Spanish. Here is the code you will need to run:

java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file filename_here -outputFormat conll

Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file '..\spanish.txt' -outputFormat conll

Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file '../spanish.txt' -outputFormat conll

Ideas on how to count the tags

      AntConc

We can use AntConc to look at the texts and count the tags.

You can do this by opening AntConc > File > Open File > changing “Files of type” to “All Files” > adding all texts.

      RStudio

We can also use RStudio. As an example, you could use this code to count the tags:

library(tidyverse)

# Path to one annotated file
file_path <- "tagged/LIRNLGI101.txt.conll"

# Read the conll file as tab-separated values, without a header row
file <- read_tsv(file_path,
                 col_names = FALSE,
                 skip_empty_rows = FALSE)

# Count the part-of-speech tags (column 4, named X4 by read_tsv)
tags <- file %>%
  count(X4)

View(tags)
write_csv(tags, "tags.csv")
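If you prefer the command line, a similar count can be sketched with cut, sort, and uniq. The sample file below is made up so the commands can be tried anywhere; point cut at one of your own annotated files instead:

```shell
# Make up a tiny conll file (in practice, use a file from your tagged folder)
printf '1\tThe\tthe\tDT\tO\t2\tdet\n'       >  sample_tagged.conll
printf '2\tdog\tdog\tNN\tO\t3\tnsubj\n'     >> sample_tagged.conll
printf '3\tbarked\tbark\tVBD\tO\t0\troot\n' >> sample_tagged.conll

# Count the part-of-speech tags in column 4, most frequent first
cut -f4 sample_tagged.conll | sort | uniq -c | sort -rn > tag_counts.txt
cat tag_counts.txt
```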

List of Resources

NLP with Stanford CoreNLP
Stanford CoreNLP Tutorial
Stanford CoreNLP
Processing multiple files in a directory
Taylor et al.