LX-NER
LX-NER is a freely available online service for the recognition of expressions for named entities in Portuguese. It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
You may also be interested in our LX-Suite online service for the shallow processing of Portuguese.
Features
LX-NER takes a segment of Portuguese text and identifies, circumscribes and classifies the expressions for named entities it contains. Furthermore, each named entity receives a standard representation. It handles the following types of expressions:
- Number-based expressions
  - Numbers:
    Expressions denoting numbers are marked as NUMEX. The following subtypes allow a more refined classification of these expressions:
    - Arabic:
      Entities expressed by a sequence of digits, optionally with a period separating each group of 3 digits, counting from the right.
    - Decimal:
      Entities expressed by an Arabic number followed by a decimal part, with a comma separating the two parts.
    - Non-compliant:
      Entities expressed by digits, periods and commas arranged in any possible way. All entities not covered by the previous 2 subtypes are included here.
    - Roman:
      Entities expressed by the Roman letters [IVXLCDM], in either uppercase or lowercase, with the string of letters obeying the well-formedness rules for Roman numerals.
    - Cardinal:
      Entities expressed by a full or partial word description of an Arabic or decimal number. A full cardinal numeral is composed of words only, while a partial cardinal numeral is a hybrid composed of words and Arabic or decimal numbers.
    - Fraction:
      Entities expressed by Arabic, decimal or cardinal numbers together with specific symbols or expressions representing division.
    - Magnitude class:
      Entities expressed by Arabic, decimal or cardinal numbers together with expressions representing numerical magnitude.
  - Measures:
    Terms expressing measure values are marked as MEASEX. The following subtypes allow a more refined classification of these expressions:
    - Currency:
      Expressions composed of an Arabic, decimal or cardinal number followed by a word or expression representing a currency (e.g. libras).
    - Time:
      Expressions composed of an Arabic, decimal or cardinal number followed by a word or expression representing a time measure (e.g. segundos).
    - Scientific units:
      Expressions composed of an Arabic, decimal or cardinal number followed by a word or expression representing a scientific unit (e.g. toneladas).
  - Time:
    Terms expressing time are marked as TIMEX. The following subtypes allow a more refined classification of these expressions:
    - Date:
      Expressions representing a date, whose components can be a day of the week (e.g. Segunda-Feira), a day of the month (e.g. 27), a month (e.g. Novembro) or a year (e.g. 2006).
    - Time periods:
      Expressions made up of Arabic, Roman or cardinal numbers and an explicit indication of a period of time concerning a specific year, decade or century.
    - Time of the day:
      Expressions with different formats, indicating a specific time of the day.
  - Addresses:
    Expressions conveying addresses are marked as ADDREX. The following subparts allow a more refined classification of these expressions:
    - Global section:
      Expressions referring to the global position of a certain location (e.g. Rua Almeida Garrett). This address part is mandatory for an address to be recognized.
    - Local section:
      Expressions referring to a specific position within the global position (e.g. Nº 17 - 7º Dto).
    - Zip code:
      Expressions referring to the zip code component of an address (e.g. 3654-548 Lisboa).
- Name-based expressions
  - Names:
    Expressions conveying names are marked as NAMEX. The following subtypes allow a more refined classification of these expressions:
    - Persons:
      Expressions conveying names of people, optionally including the job or social status of the person if present (e.g. Presidente Cavaco Silva).
    - Organizations:
      Expressions conveying names of companies (e.g. LG Electronics) and political organizations (e.g. ONU).
    - Locations:
      Expressions referring to specific geographical locations (e.g. Portugal).
    - Events:
      Expressions referring to competitions, conferences, workshops and similar events (e.g. 2ª Conferência Sobre o Acesso Livre ao Conhecimento).
    - Works:
      Expressions referring to movies, books, paintings and similar works (e.g. O Retrato de Dorian Gray).
    - Miscellaneous:
      Expressions referring to entities that can't be classified according to any of the previous subtypes (e.g. Boeing 747).
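To make the digit-based NUMEX subtypes more concrete, the Arabic, Decimal, Non-compliant and Roman descriptions above can be approximated with a few regular expressions. This is only a minimal sketch written from those descriptions; it is not the actual LX-NER rule set. The function name `classify_number` and the patterns are assumptions made for illustration, and the sketch reads "uppercase or lowercase" as all-uppercase or all-lowercase.

```python
import re
from typing import Optional

# Illustrative patterns only; these are NOT the LX-NER rules.
# Arabic: plain digits, or groups of 3 digits separated by periods from the right.
ARABIC = r"(?:\d{1,3}(?:\.\d{3})+|\d+)"
ARABIC_RE = re.compile(ARABIC)
# Decimal: an Arabic number, a comma, then the decimal part.
DECIMAL_RE = re.compile(ARABIC + r",\d+")
# Non-compliant: any other mix of digits, periods and commas with at least one digit.
NON_COMPLIANT_RE = re.compile(r"(?=.*\d)[\d.,]+")
# Roman: well-formed numerals up to 3999 (the lookahead rejects the empty string).
ROMAN_RE = re.compile(
    r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def classify_number(token: str) -> Optional[str]:
    """Return a NUMEX subtype name for a single token, or None (hypothetical helper)."""
    if ARABIC_RE.fullmatch(token):
        return "Arabic"
    if DECIMAL_RE.fullmatch(token):
        return "Decimal"
    if NON_COMPLIANT_RE.fullmatch(token):
        return "Non-compliant"
    # Accept all-uppercase or all-lowercase letters, but not mixed case.
    if (token.isupper() or token.islower()) and ROMAN_RE.fullmatch(token.upper()):
        return "Roman"
    return None


for sample in ["1.234.567", "12,5", "12.34", "XIV", "iiii"]:
    print(sample, classify_number(sample))
```

Note that a token-level check like this cannot resolve ambiguity: an ordinary lowercase word that happens to be a well-formed Roman numeral would also be labelled Roman, so a real recognizer would need surrounding context.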
Evaluation
- Number-based expressions
  The number-based component is built upon handcrafted regular expressions. It was developed and evaluated against a manually constructed test suite of over 300 examples. It scored 85.19% precision and 85.91% recall.
- Name-based expressions
  The name-based component is built upon stochastic procedures. It was trained on a manually annotated corpus of approximately 208,000 words and evaluated against an unseen portion of approximately 52,000 words. It scored 86.53% precision and 84.94% recall.
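For readers unfamiliar with these metrics, the sketch below shows the usual definitions of precision and recall, assuming the scores above follow them. The helper names and the counts in the usage lines are invented for illustration. The F1 helper merely combines two scores into their harmonic mean; no F1 value is reported for LX-NER.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of the expressions returned by the system that are correct."""
    returned = true_pos + false_pos
    return true_pos / returned if returned else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of the expressions in the gold standard that the system found."""
    expected = true_pos + false_neg
    return true_pos / expected if expected else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, not taken from the LX-NER evaluation.
print(precision(true_pos=8, false_pos=2))
print(recall(true_pos=8, false_neg=8))

# Derived from the reported number-based scores; not a figure stated above.
print(round(f1(85.19, 85.91), 2))
```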
Authorship
LX-NER is being developed by João Balsa, António Branco, Eduardo Ferreira and Sara Silveira, with the help of João Silva, of the NLX-Natural Language and Speech Group, at the University of Lisbon, Department of Informatics.
Acknowledgments
The work leading to LX-NER was partly supported by FCT-Fundação para a Ciência e Tecnologia under contract POSI/PLP/47058/2002 for the project TagShare and contract POSI/PLP/61490/2004 for the project QueXting, and by the European Commission under contract FP6/STREP/27391 for the project LT4eL.
White Papers
Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Bacelar do Nascimento, Filipe Nunes and João Silva, 2006. Linguistic Resources and Software for Shallow Processing. In Actas do XXI Encontro da Associação Portuguesa de Linguística (APL'05).
Contact Us
Contact us using the following e-mail address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
Why LX-NER?
LX because LX is the "code" name Lisboners like to use to refer to their hometown.