Overview
In this workshop we will talk about tagging your texts and checking your tags. To check the tags of our texts we will use two tools developed by the CROW team. This is a step-by-step overview of the workshop:
1. Introduction to tagging
2. Precision and recall
3. Tool to check precision
4. Tool to check precision and recall
5. Practice
To annotate, or tag, a corpus means to add information to each individual token. This information can be grammatical, semantic, pragmatic, etc. These tags (or labels) can be used to compare the frequencies of a given linguistic feature across texts or groups of texts. In addition, adding part-of-speech information to our texts allows us to differentiate between words that have the same spelling (e.g., claim as a noun vs. claim as a verb).
In this workshop, we will be talking about part-of-speech annotation, or the first level of this tree-diagram.
There are different software packages you can use to tag your text (TreeTagger, Stanford Parser, CLAWS, etc.). Each of these taggers has a different tagset and will give you a different output. A tagset is the set of labels a tagger can identify; for instance, this is the tagset for CLAWS. The figure below compares the output of three taggers (Stanford Parser, CLAWS, and the Biber Tagger) for the same text. What differences can you notice between the outputs?
In this workshop, we will be checking the tags produced by the Biber Tagger. The Biber Tagger outputs the text with one word per line, each word followed by its tag. The word and the tag are separated by a space, and every tag has seven tag fields separated by + signs, followed by another space and the original CLAWS tag. For example:

women N+CM+PLUR++++ NN2

Here, women is the word in the text, + is used to separate the tag fields, and NN2 is the original CLAWS tag.
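As a sketch of how a line in this format could be split apart programmatically (the function name here is my own, not part of the CROW tools), assuming the word, Biber tag, and CLAWS tag are space-separated as described above:

```python
def parse_biber_line(line):
    """Split one line of Biber Tagger output into its three parts:
    the word, the seven +-separated Biber tag fields, and the CLAWS tag."""
    word, biber_tag, claws_tag = line.split()
    fields = biber_tag.split("+")  # empty strings mark unused fields
    return word, fields, claws_tag

word, fields, claws = parse_biber_line("women N+CM+PLUR++++ NN2")
print(word)    # women
print(fields)  # ['N', 'CM', 'PLUR', '', '', '', '']
print(claws)   # NN2
```

Note that splitting on + keeps the empty fields, so every tag comes back as exactly seven fields.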
If you would like to better understand the differences between the Biber Tagger and other taggers, I recommend Gray (2019).
After we tag our texts, it is important to check the precision and recall of our tags to make sure that we are counting what we think we are counting. In the following text, for example, how many noun (NN) tags do we have? And how many of them are not accurate?
Word | Tag | Lemma |
---|---|---|
This | DT | this |
study | NN | study |
uses | NN | use |
F-MRI | NP | MRI |
( | ( | ( |
functional | NN | functional |
magnetic | NN | magnetic |
resonance | NN | resonance |
imaging | NN | imaging |
) | ) | ) |
to | TO | to |
investigate | VV | investigate |
the | DT | the |
brain | NN | brain |
activity | NN | activity |
in | IN | in |
a | DT | a |
set | NN | set |
of | IN | of |
cortical | NN | cortical |
areas | NNS | area |
in | IN | in |
the | DT | the |
task | NN | task |
of | IN | of |
main | JJ | main |
idea | NN | idea |
identification | NN | identification |
, | , | , |
when | WRB | when |
the | DT | the |
topic | NN | topic |
sentence | NN | sentence |
was | VBD | be |
presented | VVN | present |
in | IN | in |
first | JJ | first |
versus | CC | versus |
in | IN | in |
last | JJ | last |
position | NN | position |
in | IN | in |
a | DT | a |
three-sentence | NN | three-sentence |
paragraph | NN | paragraph |
. | SENT | . |
The | DT | the |
participants | NNS | participant |
were | VBD | be |
eight | CD | eight |
right-handed | JJ | right-handed |
undergraduate | JJ | undergraduate |
students | NNS | student |
from | IN | from |
Carnegie | NP | Carnegie |
Mellon | NP | Mellon |
University | NP | University |
, | , | , |
six | CD | six |
male | NN | male |
and | CC | and |
2 | CD | @card@ |
female | NN | female |
, | , | , |
all | DT | all |
native | JJ | native |
speakers | NNS | speaker |
of | IN | of |
English | NP | English |
. | SENT | . |
Each | DT | each |
participant | NN | participant |
read | VVD | read |
twelve | NN | twelve |
paragraphs | NNS | paragraph |
, | , | , |
six | CD | six |
in | IN | in |
which | WDT | which |
the | DT | the |
topic | NN | topic |
sentence | NN | sentence |
was | VBD | be |
paragraph | NN | paragraph |
initial | JJ | initial |
and | CC | and |
six | CD | six |
in | IN | in |
which | WDT | which |
it | PP | it |
was | VBD | be |
paragraph | NN | paragraph |
final | JJ | final |
, | , | , |
and | CC | and |
each | DT | each |
paragraph | NN | paragraph |
was | VBD | be |
presented | VVN | present |
word | NN | word |
by | IN | by |
word | NN | word |
in | IN | in |
the | DT | the |
center | NN | center |
of | IN | of |
a | DT | a |
screen | NN | screen |
, | , | , |
inside | IN | inside |
the | DT | the |
scanner | NN | scanner |
. | SENT | . |
The | DT | the |
major | JJ | major |
finding | NN | finding |
of | IN | of |
the | DT | the |
current | JJ | current |
study | NN | study |
is | VBZ | be |
the | DT | the |
differential | JJ | differential |
response | NN | response |
observed | VVN | observe |
in | IN | in |
the | DT | the |
left | JJ | left |
and | CC | and |
right | JJ | right |
hemispheres | NNS | hemisphere |
as | RB | as |
to | TO | to |
the | DT | the |
location | NN | location |
of | IN | of |
the | DT | the |
topic | NN | topic |
sentence | NN | sentence |
within | IN | within |
the | DT | the |
paragraph | NN | paragraph |
. | SENT | . |
The | DT | the |
left | JJ | left |
temporal | JJ | temporal |
region | NN | region |
showed | VVD | show |
greater | JJR | great |
activation | NN | activation |
when | WRB | when |
the | DT | the |
topic | NN | topic |
sentence | NN | sentence |
was | VBD | be |
in | IN | in |
final | JJ | final |
position | NN | position |
than | IN | than |
in | IN | in |
initial | JJ | initial |
position | NN | position |
. | SENT | . |
The | DT | the |
right | JJ | right |
temporal | JJ | temporal |
region | NN | region |
, | , | , |
on | IN | on |
the | DT | the |
other | JJ | other |
hand | NN | hand |
, | , | , |
was | VBD | be |
affected | VVN | affect |
only | RB | only |
by | IN | by |
sentence | NN | sentence |
type | NN | type |
, | , | , |
showing | VVG | show |
a | DT | a |
greater | JJR | great |
response | NN | response |
to | TO | to |
topic | NN | topic |
sentences | NNS | sentence |
than | IN | than |
support | NN | support |
sentences | NNS | sentence |
, | , | , |
regardless | RB | regardless |
of | IN | of |
their | PP$ | their |
location | NN | location |
within | IN | within |
the | DT | the |
paragraph | NN | paragraph |
. | SENT | . |
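To answer the counting half of that question, a minimal Python sketch can tally the tags in a "word | tag | lemma" table like the one above (here using just the first few rows as a sample; `count_tag` is my own illustrative helper, not one of the workshop tools):

```python
# Sample rows copied from the top of the tagged table above.
rows = """This | DT | this
study | NN | study
uses | NN | use
F-MRI | NP | MRI
functional | NN | functional
magnetic | NN | magnetic"""

def count_tag(table, target="NN"):
    """Count how many tokens in a pipe-separated word|tag|lemma table carry the target tag."""
    count = 0
    for line in table.splitlines():
        word, tag, lemma = [part.strip() for part in line.split("|")]
        if tag == target:
            count += 1
    return count

print(count_tag(rows))  # 4
```

Counting tells us how many NN tags the tagger produced; deciding how many are accurate (e.g., whether "uses" is really a noun here) still takes a human reader.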
Precision is the percentage of the tokens we tagged with a label that actually carry that label (of everything tagged NN, how much is really a noun?). Recall is the percentage of the tokens that actually carry a label that we managed to tag with it (of all the real nouns, how many did we tag NN?). Confusing? I find the following image helpful when trying to understand the idea of precision and recall.
We calculate precision and recall with the following formulas: Precision = True Positives / (True Positives + False Positives), and Recall = True Positives / (True Positives + False Negatives).
For example, if we search for NN, here is how we would count true/false positives and true/false negatives.
Original.Tag | Correct.Tag | Count |
---|---|---|
NN | RB | False Positive |
NN | NN | True Positive |
RB | NN | False Negative |
NN | NN | True Positive |
RB | RB | True Negative |
RB | RB | True Negative |
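The classification in the table above can be sketched as a small Python function that compares each original tag against its corrected tag for a target tag (the function and variable names are my own, for illustration only):

```python
from collections import Counter

def classify(original, correct, target="NN"):
    """Classify one (original tag, corrected tag) pair with respect to a target tag."""
    if original == target and correct == target:
        return "True Positive"   # tagged NN, and it really is NN
    if original == target:
        return "False Positive"  # tagged NN, but it is something else
    if correct == target:
        return "False Negative"  # is NN, but the tagger missed it
    return "True Negative"       # not NN, and not tagged NN

# The six rows of the example table above.
pairs = [("NN", "RB"), ("NN", "NN"), ("RB", "NN"),
         ("NN", "NN"), ("RB", "RB"), ("RB", "RB")]
counts = Counter(classify(original, correct) for original, correct in pairs)
print(counts)
```

Running this on the example rows gives 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, matching the table.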
The New Biber Tag Checking Tool simply helps us search for one tag at a time. When we find errors in the tags and fix them, it gives us the precision for that specific tag.
The search is set to PMOD right now, but you can change it to any tag in the tagset. All the words that are highlighted contain the tag you input in your search.
In this example, I deleted the PMOD.
Once you start making changes in the tags, this will appear on the top of your screen.
If you want to continue searching for the same tag, just input another file; if you want to search for another tag, click on Clear Tool.
In this workshop I am showing the tool online, but you can also download the HTML file from GitHub and run it offline.
Bonus: if you are taking INF502 or want to practice your Git skills, you can clone the repo to your computer (link here).
The Biber Tag Checking Tool looks the same as the previous tool, but the difference is that you can check the tag of every token in a text; this way, you can check both precision and recall.
Use the << and >> menus to move to the next token. When you make changes to a tag, the new Biber tag will appear.
You can check the percentage of the text that you have tag checked by looking at the token # at the top of the page.
To calculate precision and recall after tag checking, you can open your file with Excel and compare the column with the original tags to the column with the edited ones. It will look like this:
Token | Original_Biber_Tags | Original_Claws_Tag | New_Biber_Tags | Comments |
---|---|---|---|---|
The | DET+AT+SING+DEF+++ | AT | ++++++ | |
publication | N+CM+SING+NOM+++ | NN1 | P+COM+PLUR+INDEF++B+MULTI1 | |
of | I++++++ | IO | ++++++ | |
The | DET+AT+SING+DEF+++ | AT | ++++++ | |
Curve | N+PR+SING++++ | NN1 | ++++++ | |
concept | N+CM+SING++++ | NN1 | ++++++ | |
of | I++++++ | IO | ++++++ | |
intelligence | N+CM+SING+NOM+++ | NN1 | ++++++ | |
has | VL+Z+++++ | VHZ | DET+CM+++++ | |
been | VL+EN+PRF+COP+++ | VBN | ++++++ | |
the | DET+AT+SING+DEF+++ | AT | ++++++ | |
pariah | N+CM+SING++++ | NN1 | ++++++ | |
expressing | VL+ING++COMP+PREP++ | VVG | ++++++ | |
a | D+AT+SING+INDEF+++ | AT1 | ++++++ | |
person | N+CM+SING++++ | NN1 | DET+COL+RIGHT++++ | |
’s | GE++++++ | GE | ++++++ | |
intelligence | N+CM+SING+NOM+++ | NN1 | ++++++ | |
intelligence | ++++++ | ++++++ | ||
Especially | R++++++ | RR | ++++++ | |
right’ | J++ATRB++++ | JJ | ++++++ | |
( | Y+PAR+LEFT++++ | ( | ++++++ | |
Miele | N+PR+SING++++ | NP1 | EX+DOL+PPVB++++ | |
1995 | M+CARD+SING+MORE+++ | MC | ++++++ | |
) | Y+PAR+RIGHT++++ | ) | ++++++ |
This is just an example: the tags are not correct, and I cut part of the text.
Looking at the results in the table below, we can calculate precision and recall by marking which tokens are true positives, false positives, and false negatives.
Original_Biber_Tags | New_Biber_Tags | Comments |
---|---|---|
DET+AT+SING+DEF+++ | ||
N+CM+SING+NOM+++ | P+COM+PLUR+INDEF++B+MULTI1 | False Positive |
I++++++ | ++++++ | |
DET+AT+SING+DEF+++ | ++++++ | |
N+PR+SING+++PMOD+ | JJ+COM+++++ | False Positive |
N+PR+SING++++ | JJ+COM+++++ | False Positive |
I++++++ | ++++++ | |
M+CARD+SING+MORE+++ | ++++++ | |
Y+COM+++++ | ++++++ | |
DET+AT+SING+DEF+++ | ++++++ | |
N+CM+PLUR++++ | JJ+COM+++++ | False Positive |
P+IM++3+++ | ++++++ | |
VL+ED+++++ | ++++++ | |
C+CRD+++++ | ++++++ | |
N+CM+PLUR++++ | ++++++ | True Positive |
P+IM++3+++ | ++++++ | |
VL+ED+++++ | ++++++ | |
VL+BF+++++ | ++++++ | |
VL+EN+PRF++++ | ++++++ | |
D+AT+SING+INDEF+++ | ++++++ | |
J++ATRB++++ | N+CM+++++ | False Negative |
N+CM+SING++++ | ++++++ | True Positive |
I++++++ | ++++++ | |
DET+AT+SING+DEF+++ | ++++++ | |
N+CM+PLUR+++TIME+ | ++++++ | True Positive |
I++++++ | ++++++ | |
D+IMPR++3+V++ | ++++++ | |
N+CM+SING+NOM+++ | ++++++ | True Positive |
Y+PAR+LEFT++++ | ++++++ | |
N+PR+SING++++ | ++++++ | True Positive |
M+CARD+SING+MORE+++ | ++++++ | |
Y+PAR+RIGHT++++ | ++++++ | |
++++++ | ++++++ | |
D++SING++++ | ++++++ | |
VL+Z+++++ | ++++++ | |
R++++++ | ++++++ | |
I++++++ | ++++++ | |
D+IMPR++3+V++ | ++++++ | |
J+EN+ATRB++++ | N+CM+++++ | False Negative |
J++ATRB++++ | ++++++ | |
N+CM+PLUR+NOM+++ | JJ+COM+++++ | False Positive |
Y+COM+++++ | ++++++ | |
VL+ED+++++ | ++++++ | |
N+CM+PLUR+NOM+++ | ++++++ | True Positive |
I++++++ | ++++++ | |
J++ATRB++++ | ++++++ | |
N+CM+SING++++ | ++++++ | |
Y+COM+++++ | ++++++ | |
C+CRD+++++ | ++++++ |
If I am interested in the precision and recall of the tag for nouns, I have the following results:
Precision | N | Recall | N |
---|---|---|---|
True positives | 6 | True Positive | 6 |
False positives | 5 | False Negative | 2 |
All nouns identified | 11 | Real N of nouns | 8 |
Precision | True Positives/All nouns id | Recall | True Positives/Real N of nouns |
Precision | 0.545454545 | Recall | 0.75 |
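The arithmetic in the table above can be reproduced with a few lines of Python (the variable names are my own; only the counts come from the table):

```python
# Noun counts taken from the marked-up table above.
true_positives = 6
false_positives = 5
false_negatives = 2

# Precision denominator: everything the tagger called a noun.
all_nouns_identified = true_positives + false_positives   # 11
# Recall denominator: everything that really is a noun.
real_number_of_nouns = true_positives + false_negatives   # 8

precision = true_positives / all_nouns_identified
recall = true_positives / real_number_of_nouns
print(round(precision, 4))  # 0.5455
print(round(recall, 2))     # 0.75
```

So roughly 55% of the tokens tagged as nouns really are nouns (precision), while 75% of the real nouns were found by the tagger (recall).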
Tag | Key |
---|---|
C+SRD+THT+COMP+VERB | Verb complement clause |
CND+ | Conditionals |
WH+CLS | Wh-Clause |
I+STR+++ | Stranded preposition |
ATRB | Attributive adjectives |
C+SRD+THT+REL+ | That relative clauses |
WH+ | Wh-Word |
EX+ | Existential There |
VM+ | Modal verbs |
N+ | Nouns |
Group 1 - Precision
Group 2 - Precision and Recall
Report the results here
Tip: the tool does not work if you add extra + signs; therefore, if you are looking for nouns, search for just N+ and not N+++++.
If you find issues with the tool or the tagset while doing your homework, document them here.