Natural language processing (NLP): part-of-speech (POS) tagging and named entity recognition (NER)

Part-of-speech (POS) tagging and named entity recognition (NER)


POS tagging and NER are two important tasks that involve analyzing and understanding the structure and meaning of text.

Part-of-Speech (POS) Tagging:

Part-of-speech tagging is the process of assigning grammatical tags to each word in a given sentence, based on its syntactic category and role in the sentence. The tags represent the part of speech of the word, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc. POS tagging helps in understanding the grammatical structure of a sentence, enabling further analysis and interpretation.

For example, consider the sentence: "The cat is sitting on the mat." A POS tagger using the Penn Treebank tag set would assign "DT" (determiner) to "The," "NN" (singular noun) to "cat," "VBZ" (verb, third-person singular present) to "is," "VBG" (verb, present participle) to "sitting," "IN" (preposition) to "on," "DT" to "the," and "NN" to "mat."
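The tagging of this sentence can be sketched with a toy lookup tagger. This is only an illustration: the lexicon and the function name `toy_pos_tag` are hypothetical and hand-written for this one sentence, whereas real taggers learn their tags from annotated corpora.

```python
# Toy lexicon mapping lowercase words to Penn Treebank tags.
# Hand-written for illustration only; not taken from any real tagger.
TOY_LEXICON = {
    "the": "DT",
    "cat": "NN",
    "is": "VBZ",
    "sitting": "VBG",
    "on": "IN",
    "mat": "NN",
}


def toy_pos_tag(sentence):
    """Return (word, tag) pairs using a plain dictionary lookup."""
    # Drop the final period and split on whitespace (a very naive tokenizer).
    tokens = sentence.rstrip(".").split()
    # Words missing from the lexicon fall back to "NN".
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]


for word, tag in toy_pos_tag("The cat is sitting on the mat."):
    print(f"{word}\t{tag}")
```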


Word Classes in POS tagging

Words are classified into different word classes, or parts of speech, based on their syntactic and grammatical properties. Word classes fall into two main sets: open word classes and closed word classes. This classification rests on the observation that the ability to form new words, known as lexical productivity, is largely limited to specific word classes.

Open word classes

Open word classes, also referred to as content words, are characterized by their potential for new word formation. The creation of new conjunctions, pronouns, prepositions, and determiners is relatively rare in any language. By contrast, new verbs, adjectives, and particularly nouns are coined all the time in daily language use.

Here are some commonly recognized open word classes in POS tagging:
  • Noun (NN): Represents a person, place, thing, or idea. Examples: "cat," "city," "book."
  • Verb (VB): Represents an action, occurrence, or state. Examples: "run," "eat," "is."
  • Adjective (JJ): Describes or modifies a noun. Examples: "big," "red," "happy."
  • Adverb (RB): Describes or modifies a verb, adjective, or other adverb. Examples: "quickly," "very," "well."

Closed word classes

Closed word classes, on the other hand, are more stable and rarely admit new words. Adverbs and numbers occupy an intermediate position in this classification, as their productivity lies between that of open and closed classes. Interjections, often seen as expressions of strong emotions or exclamations, are typically viewed as a relatively closed word class.

Here are some commonly recognized closed word classes in POS tagging:
  • Pronoun (PRP): Represents a word used in place of a noun to avoid repetition. Examples: "he," "she," "it."
  • Determiner (DT): Indicates specificity or introduces a noun. Examples: "the," "this," "each."
  • Preposition (IN): Shows a relationship between a noun/pronoun and another word in a sentence. Examples: "on," "in," "at."
  • Conjunction (CC): Connects words, phrases, or clauses. Examples: "and," "but," "or."
  • Interjection (UH): Expresses exclamations, greetings, or yes/no responses. Examples: "oh," "wow," "oops," "hello." (The Universal Dependencies tag set labels this class INTJ and treats it as open.)
  • Numeral (CD): Represents a number or numerical value. Examples: "one," "two," "3."

While POS tagging doesn't explicitly differentiate between open and closed word classes, understanding this distinction helps in analyzing the syntactic and semantic characteristics of different word classes and their behavior in various linguistic contexts.

Problems or challenges during POS tagging

Ambiguity: 

Words can have multiple meanings and can be used in different parts of speech depending on the context. Determining the correct POS tag in such cases can be challenging.
Example: "The bank can provide a loan." (bank as a noun) vs. "I need to bank the money." (bank as a verb)
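How context can resolve this kind of ambiguity is sketched below with a single hand-written rule. The cue-word list and the function name `guess_noun_or_verb` are assumptions made for illustration. This is not how a trained tagger works, and the rule is easy to break ("to" can also be a preposition).

```python
# Words that often precede a base-form verb (an assumed, incomplete list).
VERB_CUES = {"to", "will", "can", "must", "should"}


def guess_noun_or_verb(tokens, i):
    """Guess 'NN' or 'VB' for tokens[i] by looking only at the previous token."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    # After infinitival "to" or a modal, expect a verb; otherwise assume a noun.
    return "VB" if prev in VERB_CUES else "NN"


noun_sentence = "The bank can provide a loan".split()
verb_sentence = "I need to bank the money".split()

print(guess_noun_or_verb(noun_sentence, noun_sentence.index("bank")))
print(guess_noun_or_verb(verb_sentence, verb_sentence.index("bank")))
```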


Taggers do get this wrong: when these two sentences were run through a tagger, the verb "bank" in the second sentence was classified as a noun.

Out-of-vocabulary (OOV) words: 

POS taggers are trained on a specific vocabulary, and if they encounter words that are not present in their training data, they may struggle to assign the correct POS tag.
Example: Proper nouns, rare technical terms, or newly coined words might not be recognized by the POS tagger.
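A minimal sketch of the usual fallback behaviour, assuming a lookup-based tagger with a tiny hypothetical vocabulary: any word missing from the vocabulary receives a default tag and is flagged as OOV. The names `KNOWN_WORDS` and `tag_with_fallback` and the sample sentence are invented for illustration.

```python
# Toy vocabulary; a real tagger's vocabulary comes from its training corpus.
KNOWN_WORDS = {"the": "DT", "are": "VBP", "very": "RB", "shiny": "JJ"}


def tag_with_fallback(tokens, lexicon=KNOWN_WORDS, default="NN"):
    """Return (token, tag, is_oov) triples, using `default` for unknown words."""
    tagged = []
    for tok in tokens:
        tag = lexicon.get(tok.lower())
        is_oov = tag is None
        tagged.append((tok, default if is_oov else tag, is_oov))
    return tagged


for token, tag, is_oov in tag_with_fallback("The frozzles are shiny".split()):
    print(token, tag, "(OOV)" if is_oov else "")
```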


For instance, the newly coined word "frozzles" is not in the tagger's vocabulary. The tagger assigns it the default tag 'NN' (noun), treating the unknown word as a noun.

Tagging errors: 

POS taggers are not perfect and may occasionally assign incorrect POS tags. This can happen due to training data limitations, ambiguities, or variations in language usage.
Example: Tagging a noun as a verb or vice versa due to contextual ambiguity or syntactic structure.


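The earlier "bank" example shows such an error: the verb in "I need to bank the money" was tagged as a noun. Note that tagging "and" as a coordinating conjunction (CC) is not an error; CC is the correct tag for "and," and VB marks a base-form verb rather than any kind of conjunction.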

Lack of Context:

When POS tagging, the lack of context can sometimes lead to ambiguous or incorrect tags. POS taggers typically consider the current word and its immediate neighbors when assigning tags. However, some words require a broader understanding of the sentence or document to determine their accurate POS tags.

Consider the word "bat" with little surrounding context. It can be a noun (a flying mammal, or a piece of sports equipment) or a verb (to hit a ball). Without sufficient context, the POS tagger assigns the tag 'NN' (noun) by default.

Homographs and Homonyms:

Homographs are words that are spelled the same but have different meanings and sometimes different POS tags. Homonyms, in the broader sense, are words that share a spelling or a pronunciation but have different meanings and sometimes different POS tags.


For example, "band" is a homograph that can mean a musical group (noun) or a strip of material (noun). A POS tagger assigns the tag 'NN' (noun) to both occurrences of "band," so the tag alone does not distinguish the two meanings.

Named Entity Recognition (NER):

Named entity recognition is the task of identifying and classifying named entities in text into predefined categories such as person names, organization names, locations, dates, time expressions, monetary values, percentages, etc. NER helps in extracting specific information from text and is crucial for applications such as information retrieval, question answering, and knowledge extraction. For example, in the sentence "Apple Inc. is planning to open a new store in New York City," the named entities would be "Apple Inc." (organization) and "New York City" (location).
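The example sentence can be processed with a very small gazetteer (name list) lookup, sketched below. The gazetteer contents and the function name `find_entities` are hypothetical. Production NER systems are usually statistical or neural rather than pure lookups, so treat this only as an illustration of the input and output.

```python
# Hypothetical gazetteer: token sequences mapped to entity types.
GAZETTEER = {
    ("Apple", "Inc."): "ORG",
    ("New", "York", "City"): "LOC",
}


def find_entities(tokens, gazetteer=GAZETTEER):
    """Return (start, end, label) spans (end exclusive), preferring longer matches."""
    max_len = max(len(name) for name in gazetteer)
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York City" beats "New York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return spans


tokens = "Apple Inc. is planning to open a new store in New York City".split()
for start, end, label in find_entities(tokens):
    print(" ".join(tokens[start:end]), label)
```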

While some named entity recognition systems use a more extensive set of tags, most NER taggers include basic tags such as PER (person), LOC (location), ORG (organization), and GPE (geo-political entity). These tags represent commonly recognized entity types in NER systems.

Unlike part-of-speech (POS) tagging, where every token in a sentence is assigned a part-of-speech tag, NER does not give every token a named entity tag. Named entities are specific, distinct items within the text, such as names of people, locations, and organizations, and NER focuses on identifying and classifying these rather than labeling every token.

Named Entity Tagging:

Treating the identification of named entities in a document as a form of tagging involves two main steps: identifying the boundaries of the entity expressions and then classifying these expressions. To accomplish the first task, it is common to use a set of tags that provide information about whether each token is inside or outside a named entity.

Here are three commonly used tagging schemes for identifying entity boundaries:

IO Tagging:

In this scheme, each token is marked as either Inside (I) or Outside (O) a named entity. Tokens that are part of an entity are labeled as I, while tokens that are not part of an entity are labeled as O. For example:
Sentence: Apple Inc. is headquartered in Cupertino.
Tagging: [I-ORG] [I-ORG] [O] [O] [O] [I-LOC]
Tags: Apple (I-ORG) Inc. (I-ORG) is (O) headquartered (O) in (O) Cupertino (I-LOC)

Sentence: Barack Obama was the 44th President of the United States.
Tagging: [I-PER] [I-PER] [O] [O] [O] [O] [O] [O] [I-LOC] [I-LOC]
Tags: Barack (I-PER) Obama (I-PER) was (O) the (O) 44th (O) President (O) of (O) the (O) United (I-LOC) States (I-LOC)
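IO tagging uses only I and O labels, which the short sketch below makes concrete. It assumes entities are given as (start, end, type) token spans with an exclusive end; that representation and the name `spans_to_io` are choices made for this illustration.

```python
def spans_to_io(num_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into IO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = f"I-{label}"  # every token of the entity gets I-
    return tags


tokens = ["Apple", "Inc.", "is", "headquartered", "in", "Cupertino"]
spans = [(0, 2, "ORG"), (5, 6, "LOC")]
for token, tag in zip(tokens, spans_to_io(len(tokens), spans)):
    print(f"{token} ({tag})")
```

Because there is no begin marker, two adjacent entities of the same type become indistinguishable under IO tagging, which is the gap BIO tagging closes.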

BIO Tagging: 

Similar to IO tagging, in BIO tagging, each token is labeled as Inside (I), Outside (O), or the beginning of a named entity (B). The B tag is used to mark the first token of an entity. Tokens that are not part of an entity are labeled as O. For example:
Sentence: The Statue of Liberty is located in New York City.
Tagging: [O] [B-LOC] [I-LOC] [I-LOC] [O] [O] [O] [B-LOC] [I-LOC] [I-LOC]
Tags: The (O) Statue (B-LOC) of (I-LOC) Liberty (I-LOC) is (O) located (O) in (O) New (B-LOC) York (I-LOC) City (I-LOC)

Sentence: Microsoft was founded by Bill Gates and Paul Allen.
Tagging: [B-ORG] [O] [O] [O] [B-PER] [I-PER] [O] [B-PER] [I-PER]
Tags: Microsoft (B-ORG) was (O) founded (O) by (O) Bill (B-PER) Gates (I-PER) and (O) Paul (B-PER) Allen (I-PER)
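The BIO labels for the Microsoft sentence can be produced mechanically from entity spans, as in this sketch. As before, the (start, end, type) span format with an exclusive end and the function name `spans_to_bio` are assumptions made for the example.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "Microsoft was founded by Bill Gates and Paul Allen".split()
spans = [(0, 1, "ORG"), (4, 6, "PER"), (7, 9, "PER")]
for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(f"{token} ({tag})")
```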

BIOES Tagging: 

This scheme extends BIO tagging by additionally marking the last token of an entity as End (E). It also introduces a Single (S) tag for single-word entities. This scheme provides more granular information about entity boundaries. For example:
Sentence: I bought a book from Amazon.
Tagging: [O] [O] [O] [O] [O] [S-ORG]
Tags: I (O) bought (O) a (O) book (O) from (O) Amazon (S-ORG)

Sentence: London is the capital of the United Kingdom.
Tagging: [S-LOC] [O] [O] [O] [O] [O] [B-LOC] [E-LOC]
Tags: London (S-LOC) is (O) the (O) capital (O) of (O) the (O) United (B-LOC) Kingdom (E-LOC)

In these examples, each tag combines a position prefix with an entity type (e.g., ORG for organizations, LOC for locations, PER for persons). The B-, I-, E-, and S- prefixes mark the beginning, inside, end, and single-token positions of a named entity, respectively, and O marks tokens outside any entity.
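Since BIOES only refines BIO, one scheme can be derived from the other. The sketch below converts a well-formed BIO sequence to BIOES by checking whether the next tag continues the same entity; the function name `bio_to_bioes` is hypothetical, and malformed input (for example an I- tag with no preceding B-) is not handled.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


bio = ["B-LOC", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_bioes(bio))
```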

Each of these tagging schemes provides information about the boundaries of named entities. The subsequent classification step involves assigning specific entity types to the identified expressions. By combining the boundary information from these tagging schemes with entity classification, named entity recognition systems can extract and categorize named entities in a document.

Named Entity Tagging Features:

Named entity tagging typically involves utilizing a combination of linguistic and contextual features to identify and classify named entities. Here are some commonly used features in named entity recognition (NER) systems:

Numeric patterns:

Numeric patterns are used to identify specific types of named entities that are associated with numbers or numerical expressions. For example, dates can be recognized based on patterns like "MM/DD/YYYY" or "YYYY-MM-DD". Intervals or ranges, such as "10-15 years" or "200-300 meters", can also be identified using numeric patterns. Additionally, percentage values like "20%" or "5.5%" can be recognized based on their numeric format.
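The patterns mentioned above translate fairly directly into regular expressions. The following is a rough sketch: the labels DATE, PERCENT, and RANGE and the function name `find_numeric_entities` are illustrative, and the expressions check only the format, not whether a date is valid.

```python
import re

# Patterns are tried in order; earlier ones win when matches overlap.
NUMERIC_PATTERNS = [
    ("DATE", re.compile(r"\b(?:\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("RANGE", re.compile(r"\b\d+-\d+\b")),
]


def find_numeric_entities(text):
    """Return (label, matched_text) pairs for non-overlapping matches."""
    found = []
    taken = []  # character spans already claimed by an earlier pattern
    for label, pattern in NUMERIC_PATTERNS:
        for match in pattern.finditer(text):
            if any(match.start() < end and start < match.end() for start, end in taken):
                continue  # e.g. "2021-03" inside an already matched date
            taken.append(match.span())
            found.append((label, match.group()))
    return found


text = "Sales rose 5.5% between 2021-03-01 and 04/15/2021, after 10-15 years of decline."
print(find_numeric_entities(text))
```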

Typical word endings: 

Certain types of named entities in English often exhibit common word endings. For example, many organization names end with suffixes like "ex" (e.g., FedEx, Codex), "tech" (e.g., Logitech, Genentech), or "soft" (e.g., Microsoft, Ubisoft). Job titles frequently end with "ist" (e.g., journalist, typist). Recognizing these common word endings can aid in identifying potential named entities.

Functional characteristics: 

Functional characteristics of tokens can provide additional cues for named entity recognition. These characteristics may include the length of the token, measured in characters or syllables, as certain types of entities may have specific length patterns. Special characters or symbols within a token, such as hyphens or ampersands, can also be indicative of named entities.
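Surface cues like these are typically collected into a feature dictionary per token and passed to a classifier. A minimal sketch, with hypothetical feature names and a three-character suffix chosen arbitrarily for illustration:

```python
def token_features(token):
    """Extract simple surface features from a single token."""
    return {
        "length": len(token),             # length in characters
        "suffix3": token[-3:].lower(),    # word ending (last three characters)
        "has_hyphen": "-" in token,
        "has_ampersand": "&" in token,
    }


for token in ["AT&T", "Coca-Cola", "journalist"]:
    print(token, token_features(token))
```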

Generalizing patterns: 

Generalizing patterns involve mapping tokens to an underlying form or pattern. This can be useful for recognizing named entities that exhibit similar patterns but have slight variations. For example, mapping tokens to a common form or pattern can help identify variations of a named entity, such as recognizing "USA", "U.S.", and "United States" as referring to the same entity.
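One common way to map tokens to an underlying pattern is a "word shape," where uppercase letters become X, lowercase letters x, and digits d. Whether this is exactly the mapping intended above is an assumption, and the function name `word_shape` is illustrative. Note that a shape groups tokens by form only; linking "USA" to "United States" would additionally need an alias list.

```python
def word_shape(token, collapse=False):
    """Map a token to its shape; optionally collapse runs of the same symbol."""
    shape = []
    for ch in token:
        if ch.isupper():
            mapped = "X"
        elif ch.islower():
            mapped = "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # keep punctuation and symbols as they are
        if collapse and shape and shape[-1] == mapped:
            continue  # skip repeats so "XXX" becomes "X"
        shape.append(mapped)
    return "".join(shape)


for token in ["USA", "U.S.", "DC10-30", "Microsoft"]:
    print(token, word_shape(token), word_shape(token, collapse=True))
```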

MUC Evaluation for Named Entity Tagging:

The Message Understanding Conferences (MUC) were a series of evaluations for information extraction systems, and the scoring approach they introduced is still used for named entity recognition. The MUC evaluation method assesses a named entity tagging system by the precision and recall of its correctly identified named entities.

The MUC evaluation process involves the following steps:

Annotation: 

Annotated data is created with named entity tags (e.g., person, organization, location) marked on the text. This annotated data serves as the gold standard or reference.

Entity Matching: 

The system being evaluated produces its own named entity tags for the same text. The MUC evaluation method uses entity matching to compare the system's predicted tags with the reference tags.

Scoring: 

The matching process involves comparing spans of predicted and reference entities to determine if they match. The evaluation considers the following types of matches:
  • Exact Match: The predicted entity span matches the reference entity span exactly in terms of the start and end positions.
  • Overlapping Match: The predicted entity span partially overlaps with the reference entity span.
  • False Positive: The predicted entity span does not match any reference entity.
  • False Negative: The reference entity does not have a corresponding predicted entity.

Precision, Recall, and F1 Score: 

The evaluation measures precision, recall, and F1 score are calculated based on the matching results:
  • Precision: Precision measures the proportion of correctly identified named entities among all entities predicted by the system.
  • Recall: Recall measures the proportion of correctly identified named entities among all the reference entities present in the data.
  • F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure.

Note that the MUC evaluation method does not consider nested or overlapping entities, meaning that only one entity can be assigned to a specific token span. It focuses on the identification of non-overlapping named entities.

These steps can be implemented in a muc_evaluation() function that takes two lists: reference and predicted. Each list contains sublists representing the named entities of individual sentences in the dataset.

The function iterates over the sentences, compares the reference and predicted entity sets, and updates the true positives, false positives, and false negatives accordingly. Finally, it calculates the precision, recall, and F1 score.
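A function matching this description might look like the sketch below. It is a reconstruction under assumptions, not the original code: each entity is taken to be a hashable (start, end, type) tuple, and only exact matches count as correct, so an overlapping prediction is scored as one false positive plus one false negative.

```python
def muc_evaluation(reference, predicted):
    """Return (precision, recall, f1) using exact entity matches per sentence."""
    true_pos = false_pos = false_neg = 0
    for ref_entities, pred_entities in zip(reference, predicted):
        ref_set, pred_set = set(ref_entities), set(pred_entities)
        true_pos += len(ref_set & pred_set)    # predicted and in the reference
        false_pos += len(pred_set - ref_set)   # predicted but not in the reference
        false_neg += len(ref_set - pred_set)   # in the reference but missed

    predicted_total = true_pos + false_pos
    reference_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / reference_total if reference_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two sentences; the second predicted span in sentence 1 has a wrong boundary.
reference = [[(0, 2, "ORG"), (10, 13, "LOC")], [(0, 1, "LOC")]]
predicted = [[(0, 2, "ORG"), (11, 13, "LOC")], [(0, 1, "LOC")]]
precision, recall, f1 = muc_evaluation(reference, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```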

