Abstract
Named Entity Recognition (NER) is a Natural Language Processing task that aims to extract and classify named entities such as "Queen of England". Depending on the objective of the extraction, the entities can be classified with different labels. These labels are usually Person, Organization, and Location, but they can be extended to include sub-entities like cars or countries, or entirely different ones, such as Genes or Viruses when the domain is biological. These entities are extracted from raw text, which may be a well-structured scientific document or an internet post, written in any language. These constraints make it considerably challenging to build a domain-independent model. Consequently, most authors have focused on English documents, since English is the most explored language and has the most labeled data, data whose production requires a significant amount of human resources. More recently, approaches have focused on Transformer-based architectures, which may take days to train and consume millions of labeled entities.
My approach is a statistical one, which means it is language-independent while still requiring considerable computational power. This model combines multiple techniques, such as Bag of Words, stemming, and Word2Vec, to compute its features. It is then compared with two transformer-based models that, although similar in architecture, differ in important respects. The three models are tested on multiple datasets, each with its own challenges, in order to study each model's strengths and weaknesses in depth.
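To make the feature pipeline concrete, the sketch below combines the three named techniques in miniature: a deliberately naive suffix-stripping stemmer (standing in for a real stemmer such as Porter's), a bag-of-words count vector over a fixed vocabulary, and a toy embedding table standing in for trained Word2Vec vectors. All names, the tiny vocabulary, and the hand-picked vectors are illustrative assumptions, not the thesis's actual implementation.

```python
from collections import Counter

def stem(word):
    # Naive suffix-stripping stemmer; a real pipeline would use e.g. Porter stemming
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(tokens, vocabulary):
    # Count stemmed, lowercased tokens against a fixed vocabulary
    counts = Counter(stem(t.lower()) for t in tokens)
    return [counts[v] for v in vocabulary]

# Toy embedding table; in practice these vectors come from a trained Word2Vec model
EMBEDDINGS = {"queen": [0.9, 0.1], "england": [0.7, 0.3]}

def features(tokens, vocabulary):
    # Concatenate the bag-of-words vector with the average of known word vectors
    vecs = [EMBEDDINGS[t.lower()] for t in tokens if t.lower() in EMBEDDINGS]
    avg = [sum(dim) / len(vecs) for dim in zip(*vecs)] if vecs else [0.0, 0.0]
    return bag_of_words(tokens, vocabulary) + avg

vocab = ["queen", "england", "run"]
print(features(["Queen", "of", "England"], vocab))
```

The concatenated vector (counts plus averaged embedding) is one token-agnostic representation a statistical classifier could consume, which is what keeps such a pipeline language-independent.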
After a thorough evaluation process, the three models achieved performances of over 90% on datasets with a high number of samples. The biggest challenge was the datasets with less data, where the pipeline achieved better performance than the transformer-based models.





