Domain-specific Evaluation Dataset Generator for Multilingual Text Analysis

İnan, Emrah; Mostafapour, Vahab; Tekbacak, Fatih

doi:10.54856/jiswa.201912084

AKILLI SİSTEMLER VE UYGULAMALARI DERGİSİ
JOURNAL OF INTELLIGENT SYSTEMS WITH APPLICATIONS
J. Intell. Syst. Appl.

E-ISSN: 2667-6893

This work is licensed under a Creative Commons Attribution 4.0 International License.

Çok Dilli Metin Analizinde Alan Bağımlı Değerlendirme Verisinin Oluşturulması

How to cite: İnan E, Mostafapour V, Tekbacak F. Domain-specific evaluation dataset generator for multilingual text analysis. Akıllı Sistemler ve Uygulamaları Dergisi (Journal of Intelligent Systems with Applications) 2019; 2(2): 140-147.

Full Text: PDF, in Turkish.

Title: Domain-specific Evaluation Dataset Generator for Multilingual Text Analysis

Abstract: The Web makes it possible to retrieve concise information about specific entities such as people, organizations, and movies, along with their features. However, a large share of Web resources is unstructured, which makes it difficult to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to identify entities and link them to the relevant entries in a given knowledge base. Many general-purpose benchmark datasets exist for evaluating these approaches, but domain-specific approaches are difficult to evaluate because evaluation datasets for specific domains are lacking. This study presents WeDGeM, a multilingual evaluation set generator for specific domains that exploits Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated texts. As a use case, well-known Entity Linking systems that support Turkish text are evaluated in the movie domain on the generated test data.

Keywords: Entity linking; named entity recognition; evaluation dataset; DBpedia; Wikipedia
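
The evaluation use case mentioned in the abstract can be illustrated with a minimal sketch of how an Entity Linking system's output might be scored against a generated gold set. The abstract does not state which metrics or data format the authors used; the sketch assumes micro-averaged precision, recall, and F1 with exact matching on span and entity, which is a common choice. The `Annotation` schema, the function name `score_entity_linking`, and the toy movie-domain data are all hypothetical.

```python
from typing import NamedTuple, Set


class Annotation(NamedTuple):
    """One linked mention (hypothetical schema): character span plus KB entity."""
    doc_id: str
    start: int
    end: int
    entity: str


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_entity_linking(gold: Set[Annotation], predicted: Set[Annotation]) -> Scores:
    """Micro-averaged scores; a prediction counts only if span and entity both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Toy movie-domain data; offsets and entity names are made up for illustration.
gold = {
    Annotation("doc1", 0, 7, "dbr:Titanic_(1997_film)"),
    Annotation("doc1", 20, 33, "dbr:James_Cameron"),
}
predicted = {
    Annotation("doc1", 0, 7, "dbr:RMS_Titanic"),      # right span, wrong entity
    Annotation("doc1", 20, 33, "dbr:James_Cameron"),  # correct link
}

scores = score_entity_linking(gold, predicted)
print(scores)
```

In this toy case the ambiguous mention "Titanic" is linked to the ship rather than the film, so only one of the two predictions is counted as correct. Ambiguous mentions of this kind are what an adjustable ambiguity level is meant to stress.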

Turkish title: Çok Dilli Metin Analizinde Alan Bağımlı Değerlendirme Verisinin Oluşturulması (Generating Domain-Dependent Evaluation Data for Multilingual Text Analysis)

Turkish abstract (English translation): The Web provides access to the information needed about specific entities such as people, organizations, and movies, along with their features. However, many Web resources are stored in unstructured form, which makes it difficult to find critical information about specific entities. Text analysis approaches such as Named Entity Recognition and Entity Linking aim to label entities and link them to the relevant entities in a given knowledge base. Many general-purpose test sets exist for testing such approaches. Testing domain-dependent approaches, however, is difficult because domain-specific datasets are lacking. This study presents the WeDGeM tool, which generates multilingual test data for specific domains using Wikipedia category pages and the DBpedia hierarchy. Wikipedia disambiguation pages are also used to adjust the ambiguity level of the generated test texts. On this generated test data, well-known Entity Linking tools that support Turkish text were tested in the movie domain.

Turkish keywords (English translation): Entity linking; named entity recognition; test set; DBpedia; Wikipedia
