Author: Larissa Goulart, Northern Arizona University (lg845@nau.edu)
Tested with CoreNLP 4.1.0
The goal of this workshop is to teach you how to annotate your corpus with the Stanford Dependency Parser. In order to follow the workshop, you will need to download:
1. CoreNLP
2. Java
Stanford CoreNLP is written in Java, so you will need to have Java installed on your computer (at least Java 8). Select the file to download according to your operating system: https://java.com/en/download/manual.jsp. You can verify your installation with the command shown after this list.
For Windows users: if you don’t know your computer’s operating system, you can go to Windows/Start > Settings > System > About to check.
3. Corpus Text Processor from CROW
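To check whether Java is already installed and which version you have, you can run this command in a terminal or PowerShell window (Java 8 is reported as version 1.8):
java -version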
Here is a general step by step of this workshop:
1. Introduction to the Stanford Dependency Parser
2. Preparing the files
3. Running the Stanford Dependency Parser
The Stanford CoreNLP suite offers Java-based tools for a number of basic NLP tasks. CoreNLP is a toolkit updated frequently by the NLP research group at Stanford University. CoreNLP integrates the following tools:
These tools allow us to extract linguistic annotations from our corpus, similar to what is seen below.
For corpus linguists, these annotations can be useful for different types of analyses, such as key-feature analysis, multi-dimensional analysis, and machine learning applications.
The two main annotators used for grammatical analysis are the part-of-speech tagger and the dependency parser.
Part-of-speech annotation gives you information about the word class of each word in a sentence.
Dependencies indicate the syntactic relationships between the words in a sentence.
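For example, for the sentence “The corpus contains abstracts”, the annotations would look roughly like this (the tags and relations below are illustrative, using the Penn Treebank and Universal Dependencies labels that CoreNLP outputs):
The/DT corpus/NN contains/VBZ abstracts/NNS
det(corpus-2, The-1)
nsubj(contains-3, corpus-2)
obj(contains-3, abstracts-4)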
For this workshop we will use ten files from the LILE Corpus (Sarmento, Scortegagna & Goulart, 2011). This corpus contains abstracts of theses and dissertations written in the fields of Literature and Applied Linguistics. You can download the files we will use here or here.
We will use CROW’s Corpus Text Processor to prepare the files. I recommend that you create three folders on your computer that mirror the menus in the Corpus Text Processor (01_converted, 02_encoded, and 03_standardized) before running the tool. It should look like this.
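A plain-text sketch of the folder layout (the parent folder name is just an example):
Corpus Text Processor/
    01_converted/
    02_encoded/
    03_standardized/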
And this is what you will see when you open the Corpus Text Processor
To convert the files, we will follow the order of the menus in the Corpus Text Processor. You can also watch the video demonstration on how to run it.
Now, we have all the files in the correct format to run the Stanford Dependency Parser.
In order to run CoreNLP, you will need to allocate more RAM to Java than your computer’s default setting. If your computer has plenty of RAM, this will not be a problem; you can check how much you have by going to Windows/Start > Settings > System > About again.
When explaining the memory needed to run CoreNLP, the Stanford NLP group said that:
CoreNLP needs about 2GB to run the entire pipeline.
Let’s allocate more RAM to Java in your machine.
1. Open your Control Panel
2. Select Programs
3. Select Java
4. Click on the tab Java
5. Change the Runtime Parameter
6. Decide on the most suitable parameter
-Xmx512m assigns 512MB memory for Java.
-Xmx1024m assigns 1GB memory for Java.
-Xmx2048m assigns 2GB memory for Java.
-Xmx3072m assigns 3GB memory for Java.
And so on…
You can start with -Xmx3072m, and if CoreNLP does not work, try -Xmx2048m, and so on. Keep in mind that assigning more memory to Java generally lets CoreNLP run faster.
7. Click OK on the “Java Runtime” window and then Apply in the “Java Control Panel”
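If you want to check that Java accepts the new maximum heap size, you can run the following in a terminal; if Java cannot reserve that much heap (for example, on a 32-bit installation), it will print an error instead of the version information:
java -Xmx3072m -version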
After navigating to the folder where you extracted CoreNLP, we are going to run this command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port] [timeout]
Where -mx4g indicates the amount of RAM that CoreNLP can use. So, if you assigned 3GB to Java, you can use -mx3g, and so on.
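The [port] and [timeout] arguments are optional. For example, using the default port 9000 and a timeout of 15000 milliseconds, the command looks like this:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000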
It should look like this:
Then, open http://localhost:9000 on your machine and check that an interface like this opens.
The previous code opens the CoreNLP interface. With the interface, you can analyze one sentence at a time, but if you want to annotate the whole text, this is the code we will use:
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file filename_here -outputFormat conll
Where filename_here is the name (or path) of your file and -mx3g is the amount of RAM you assigned to Java. In addition, make sure to write the path between inverted commas.
Windows
This is an example of the code on Windows:
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat conll
Mac
This is an example of the code on a Mac:
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat conll
Where ..\ (or ../ on a Mac) means going up one folder from the current directory.
It will take a couple of seconds to annotate the text.
After running CoreNLP, your file should look like this:
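Here is a simplified, hypothetical sketch of the CoNLL output: one token per line, with tab-separated columns for index, word, lemma, part-of-speech tag, named entity, head index, and dependency relation (the sentence and tags are only illustrative):
1	The	the	DT	O	2	det
2	study	study	NN	O	3	nsubj
3	analyzes	analyze	VBZ	O	0	root
4	abstracts	abstract	NNS	O	3	obj
5	.	.	.	O	3	punct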
While annotating one file may be all you need, for corpus linguistics analyses you will usually have to annotate several texts in a directory.
Windows
To annotate a whole directory, this is the code you will use for Windows:
Get-ChildItem "
input_Directory " -Filter *.txt |Foreach-Object {java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file$_.FullName -outputDirectory “
output_Directory ” -outputFormat conll}
Make sure to write the input and the output paths between inverted commas. Note that this command uses PowerShell cmdlets (Get-ChildItem, Foreach-Object), so run it in PowerShell or the PowerShell ISE rather than the Command Prompt.
This is an example of the code on Windows:
Get-ChildItem "..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized" -Filter *.txt
|Foreach-Object {java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file $_.FullName
-outputDirectory "..\LAEL_Research_Bazaar\tagged" -outputFormat conll}
Here is what it should look like in your PowerShell ISE:
Mac
This is the code you will use for a Mac:
for i in input_directory/*.txt; do (java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "$i" -outputDirectory "output_Directory" -outputFormat conll); done
Make sure to write the input and the output paths between inverted commas, especially if they contain spaces.
This is an example of the code on a Mac:
for i in "../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized"/*.txt; do (java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "$i" -outputDirectory "../LAEL_Research_Bazaar/tagged" -outputFormat conll); done
In the previous step, we ran CoreNLP annotating our texts for lemmas, part of speech, named entities, and dependencies, but CoreNLP offers many other tools. You can see the full list of annotators here: https://stanfordnlp.github.io/CoreNLP/annotators.html
To change the annotators we are using, we will add the flag -annotators plus the annotators we would like to run. For example, the code
Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat conll
Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat conll
will run only the tokenizer, the sentence splitter, and the part-of-speech tagger.
Now, the output will look like this:
Example: sentiment analysis
Let’s say you want to run sentiment analysis. All you have to do is include the sentiment annotator in the -annotators flag and change the outputFormat to xml.
Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file '..\LAEL_Research_Bazaar\Corpus Text Processor\03_Standardized\LIRNLGI101.txt' -outputFormat xml
Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse,sentiment -file '../LAEL_Research_Bazaar/Corpus Text Processor/03_Standardized/LIRNLGI101.txt' -outputFormat xml
Note that there are dependencies between annotators. Thus, to run the sentiment annotator, you also need to run the tokenizer, the sentence splitter, the part-of-speech tagger, and the parser.
Just remember that the more annotators you include, the longer it will take to annotate each text.
We can think about the flags as items in a menu. Here is a dictionary for the different flags we used.
-annotators: we choose the annotators we want to use.
-file: we indicate the file that will be annotated.
-outputFormat: we select the format of the output file that contains the tags.
-props: we can use props to indicate the language we are annotating.
-outputDirectory: we choose the folder where we are going to save the annotated texts.
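Putting these flags together, a single command might look like this (the file and folder names are placeholders):
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse -file 'your_file.txt' -outputDirectory 'tagged' -outputFormat conll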
The output with lemmas, part-of-speech tags, recognized entities and dependencies will look like this:
a. Part-of-Speech Tags (column 4/E)
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
b. Recognized entities (column 5/F)
In English, CoreNLP recognizes the following entity types:
PERSON
LOCATION
ORGANIZATION
MISC
MONEY
NUMBER
ORDINAL
PERCENT
DATE
TIME
DURATION
SET
c. Dependencies (columns 6 and 7/G and H)
Column 6 indicates the index of the head (governor) word, while column 7 gives the dependency relation. For the complete list of dependencies, you can check it here - Universal Dependency Relations
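For instance, a row like "2  study  study  NN  O  3  nsubj" means that token 2 ("study") depends on token 3 and that the relation is nsubj (nominal subject); this example is only illustrative.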
To annotate texts in other languages, first you need to download the properties for the language you want to use. Find the correct file here.
Let’s practice with one file in Spanish. Here is the code you will need to run:
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file filename_here -outputFormat conll
Windows
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file '..\spanish.txt' -outputFormat conll
Mac
java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file '../spanish.txt' -outputFormat conll
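To annotate a whole folder of Spanish texts, you can combine the -props flag with the directory loop used earlier. On a Mac, it might look like this (the folder names are placeholders):
for i in "../spanish_corpus"/*.txt; do (java -mx3g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-spanish.properties -file "$i" -outputDirectory "../spanish_tagged" -outputFormat conll); done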