
Semantic Annotation: The Authoritative System!

Category: Linguistics
2008-07-20 17:12
 

UCREL Semantic Analysis System (USAS)

The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects and this page collects together various pointers to those projects and publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (shown here on the right), subdivided, and with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.
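The multi-tier structure described above can be illustrated with a small sketch. It assumes that a tag code is written as one uppercase letter for the major discourse field, a number for the first subdivision, and optional dot-separated numbers for finer subdivisions (for example `A1.1.1`). That shape, and the helper names `split_tag` and `ancestors`, are assumptions made for illustration and are not taken from the USAS software; suffixes or other markers that real tags may carry are not handled.

```python
import re
from typing import NamedTuple

# Assumed tag shape (not taken from the USAS tools): one uppercase letter for
# the major discourse field, a number for the first subdivision, then optional
# ".number" parts for finer subdivisions.
TAG_PATTERN = re.compile(r"^([A-Z])(\d+)((?:\.\d+)*)$")


class TagLevels(NamedTuple):
    field: str           # major discourse field letter
    subdivisions: tuple  # subdivision numbers, coarsest first


def split_tag(tag: str) -> TagLevels:
    """Split a tag code into its field letter and subdivision numbers."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not in the assumed tag shape: {tag!r}")
    field, first, rest = match.groups()
    finer = tuple(int(part) for part in rest.split(".") if part)
    return TagLevels(field, (int(first),) + finer)


def ancestors(tag: str) -> list:
    """List the codes from the major field down to the tag itself."""
    levels = split_tag(tag)
    codes = [levels.field]
    current = levels.field
    for depth, number in enumerate(levels.subdivisions):
        # The first number attaches directly to the letter; later ones use a dot.
        current += str(number) if depth == 0 else f".{number}"
        codes.append(current)
    return codes
```

Calling `ancestors("A1.1.1")` walks from the major field down through each subdivision, which is one way to group fine-grained tags under their parent categories when counting.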

The full tagset is available on-line in plain text form and formatted on one page in PDF. The tagset has been translated to Finnish by Laura Löfberg (University of Tampere, Finland). You can also see the USAS semantic tagset in Russian as a two page PDF and text file.

A visual representation showing the USAS tagset hierarchy is now on-line, along with those for the Louw-Nida model and the Hallig/Von Wartburg/Schmidt/Wilson Model.

Funded projects

The software and linguistic resources underpinning the semantic analysis have been designed and produced during five projects: The ACASD and ACAMRIT projects led to the initial design and implementation of the tools and applied them in the area of interview transcripts. The REVERE project applied the tools in the domain of software engineering documentation using a web front end called Wmatrix. In Benedict, we have re-implemented the English semantic tagging (EST) tool in Java, and improved the linguistic resources in the tool. In addition we developed a Finnish Semantic Tagging (FST) tool. In the ASSIST project, we extended the existing USAS framework to construct a Russian Semantic Tagger (RST).

People

Andrew Wilson was the RA in Linguistics on the first two projects and Paul Rayson was the RA in Computing on all five projects. Scott Piao and Dawn Archer were the RAs on the Benedict project. Olga Mudraya was the RA in Linguistics and Scott Piao was the Computing RA on the ASSIST project. The grant holders and supervisors were Roger Garside (Computing), Geoff Leech (Linguistics) and Jenny Thomas (Linguistics, now at Bangor). Tony McEnery was the principal investigator for Benedict. For ASSIST, Roger Garside, Tony McEnery, Andrew Wilson and Paul Rayson are the grant holders.

Availability

  • The English version of the USAS framework can be accessed in the web-based corpus tool Wmatrix.
  • The Russian lexicon and MWE list are available for download here. Future updates will also be released here. We provide a zip archive containing three files: a single word lexicon, a proper name lexicon and a MWE lexicon. The current versions (14th December 2006) contain 13,153 single words, 4,444 proper names, and 713 MWE templates. For each word in the files, we include a part-of-speech tag based on the mystem tagset for Russian and a list of possible semantic tags, all of which have been manually checked. For more details about the Russian resources, see Mudraya et al. (2006). These Russian lexical resources are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
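As a rough illustration of the kind of entry described above (a word, a part-of-speech tag, and a list of possible semantic tags), the following sketch reads such entries into a lookup table. The on-disk layout is not specified on this page, so the whitespace-separated layout assumed here, the names `LexiconEntry`, `parse_entry` and `build_index`, and the sample lines in the usage note are all hypothetical.

```python
from typing import Iterable, NamedTuple


class LexiconEntry(NamedTuple):
    word: str
    pos: str             # part-of-speech tag
    semantic_tags: list  # candidate semantic tags, in file order


def parse_entry(line: str) -> LexiconEntry:
    """Parse one line, ASSUMING the layout: word, POS tag, then semantic tags,
    separated by whitespace. The real files may be laid out differently."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected word, POS and at least one tag: {line!r}")
    word, pos, *tags = fields
    return LexiconEntry(word, pos, tags)


def build_index(lines: Iterable[str]) -> dict:
    """Map (word, POS) to its candidate semantic tags, skipping blank lines."""
    index = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_entry(line)
        index.setdefault((entry.word, entry.pos), []).extend(entry.semantic_tags)
    return index
```

With made-up lines such as `"alpha S A1 B2"` and `"beta V C3"`, `build_index` returns a dictionary keyed by `("alpha", "S")` and `("beta", "V")`. Keying on the word together with its part of speech keeps the candidate tag lists of homographs separate.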

Publications describing the system (or extensions of the system)

  1. Wilson, A. and Rayson, P. (1993). Automatic Content Analysis of Spoken Discourse. In: C. Souter and E. Atwell (eds), Corpus Based Computational Linguistics. Amsterdam: Rodopi. pp215-226 (text)
  2. Wilson, A. (1993). Towards an Integration of Content Analysis and Discourse Analysis: The Automatic Linkage of Key Relations in Text. UCREL Technical Paper 3, Linguistics Department, Lancaster University. PDF version
  3. Rayson, P., and Wilson, A. (1996). The ACAMRIT semantic tagging system: progress report. In L. J. Evett, and T. G. Rose (eds) Language Engineering for Document Analysis and Recognition, LEDAR, AISB96 Workshop proceedings, pp 13-20. Brighton, England. Faculty of Engineering and Computing, Nottingham Trent University, UK. ISBN 0 905 488628 PDF version
  4. Wilson, A. and Thomas, J.A. (1997) Semantic annotation, in Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 53-65.
  5. Garside, R., and Rayson, P. (1997). Higher-level annotation tools. In R. Garside, G. Leech, and A. McEnery (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London. pp 179 - 193.
  6. Paul Rayson (2002). USAS: UCREL semantic analysis system. Invited talk at Daito Bunka University, Tokyo, Japan. February 2002. (HTML slides)
  7. Dawn Archer, Andrew Wilson, Paul Rayson (2002). Introduction to the USAS category system. Benedict project report, October 2002. (PDF version)
  8. Dawn Archer, Tony McEnery, Paul Rayson, Andrew Hardie (2003). Developing an automated semantic analysis system for Early Modern English. In Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.) Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 22 - 31. PDF version
  9. Laura Löfberg, Dawn Archer, Scott Piao, Paul Rayson, Tony McEnery, Krista Varantola, Jukka-Pekka Juntunen (2003). Porting an English semantic tagger to the Finnish language. In Dawn Archer, Paul Rayson, Andrew Wilson and Tony McEnery (eds.) Proceedings of the Corpus Linguistics 2003 conference. UCREL technical paper number 16. UCREL, Lancaster University, pp. 457 - 464. PDF version
  10. Scott S. L. Piao, Paul Rayson, Dawn Archer, Andrew Wilson and Tony McEnery (2003). Extracting Multiword Expressions with a Semantic Tagger. In proceedings of the Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, at ACL 2003, 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 12, 2003, pp. 49-56. PDF version
  11. Piao, Scott S. L., Paul Rayson, Dawn Archer, Tony McEnery (2004). Evaluating Lexical Resources for A Semantic Tagger. In proceedings of 4th International Conference on Language Resources and Evaluation (LREC 2004), May 2004, Lisbon, Portugal, Volume II, pp. 499-502. ISBN 2-9517408-1-6. PDF version
  12. Rayson, P., Archer, D., Piao, S. L., McEnery, T. (2004). The UCREL semantic analysis system. In proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with 4th International Conference on Language Resources and Evaluation (LREC 2004), 25th May 2004, Lisbon, Portugal, pp. 7-12. PDF version
  13. Archer, D., Rayson, P., Piao, S., McEnery, T. (2004). Comparing the UCREL Semantic Annotation Scheme with Lexicographical Taxonomies. In Williams G. and Vessier S. (eds.) Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume III, pp. 817-827. ISBN 2-9522-4570-3. PDF version
  14. Paul Rayson, Scott Piao, Dawn Archer (2004). Modern and Historical Aspects of the UCREL Semantic Analysis System. Invited talk at the University of Sheffield, UK, 16th November 2004. (PDF version, slides)
  15. Rayson, P. (2005) Right from the word go: identifying multi-word-expressions for semantic tagging. Invited talk at BAAL Corpus Linguistics SIG / OTA Workshop: Identifying and Researching Multi-Word Units. Thursday 21st April 2005, Oxford University Computing Services. (PDF version, slides)
  16. Scott S.L. Piao, Dawn Archer, Olga Mudraya, Paul Rayson, Roger Garside, Tony McEnery, Andrew Wilson (2005) A Large Semantic Lexicon for Corpus Annotation. In proceedings of the Corpus Linguistics 2005 conference, July 14-17, Birmingham, UK. Proceedings from the Corpus Linguistics Conference Series on-line e-journal, Vol. 1, no. 1, ISSN 1747-9398. PDF version
  17. Piao, S., Rayson, P., Archer, D., McEnery, T. (2005) Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, (Special issue on Multiword expressions), Volume 19, issue 4, pp. 378 - 397, Elsevier. doi:10.1016/j.csl.2004.11.002
  18. Mudraya, O., Babych, B., Piao, S., Rayson, P., Wilson, A. (2006). Developing a Russian semantic tagger for automatic semantic annotation. In proceedings of Corpus Linguistics 2006, St. Petersburg, from 10-14 October 2006. English PDF version Russian PDF version (slides)

Publications describing applications of the system

  1. Wilson, A. and Leech, G.N. (1993). Automatic Content Analysis and the Stylistic Analysis of Prose Literature. Revue: Informatique et Statistique dans les Sciences Humaines 29: 219-234.
  2. Thomas, J., and Wilson, A. (1996). Methodologies for studying a corpus of doctor-patient interaction. In J. Thomas and M. Short (eds) Using corpora for language research. Longman, London, pp 92-109.
  3. Rayson, P., Garside, R., and Sawyer, P. (1999). Recovering Legacy Requirements. In Proceedings of REFSQ'99 Fifth International Workshop on Requirements Engineering: Foundations of Software Quality, June 14-15 1999, Heidelberg, Germany. Published by University of Namur, pp. 49-54. ISBN 2 87037 307 4. PDF version
  4. Rayson, P., Garside, R., and Sawyer, P. (2000). Assisting requirements engineering with semantic document analysis. In Proceedings of Content-based multimedia information access RIAO 2000 (Recherche d'Informations Assistie par Ordinateur, Computer-Assisted Information Retrieval) International Conference, College de France, Paris, France, April 12-14, 2000. C.I.D., Paris, pp. 1363 - 1371. ISBN 2-905450-07-X PDF version
  5. Rayson, P., Emmet, L., Garside, R., and Sawyer, P. (2000). The REVERE Project: Experiments with the application of probabilistic NLP to Systems Engineering. In proceedings of 5th International Conference on Applications of Natural Language to Information Systems (NLDB'2000). Versailles, France, June 28-30th, 2000. PDF version
  6. Rayson, P., Garside, R., and Sawyer, P. (2000). Assisting Requirements Recovery from Legacy Documents. In Henderson, P. (ed.) Systems Engineering for Business Process Change: collected papers from the EPSRC research programme. Springer-Verlag, London, pp. 251 - 263. ISBN 1-85233-2220 PDF version
  7. Barbara Lewandowska-Tomaszczyk, Michael Oakes & Paul Rayson (2001). Annotated Corpora for Assistance with English-Polish Translation. Paper presented at Corpus Linguistics 2001, Lancaster University, UK, March 30-April 2, 2001. PDF version
  8. S. Sharoff, P. Rayson, O. Mudraya, A. Wilson and T. McEnery (2004). A tool for assisting translators using automatic semantic annotation. Presented at Corpus Use and Learning to Translate (CULT-BCN) Barcelona, January 22nd-24th 2004.
  9. Marilyn Deegan, Harold Short, Dawn Archer, Paul Baker, Tony McEnery, Paul Rayson (2004) Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content. RLG Diginews, Vol. 8, No. 2. ISSN 1093-5371.
  10. Jones, M., Rayson, P. and Leech, G. (2004) Key category analysis of a spoken corpus for EAP. Presented at The 2nd Inter-Varietal Applied Corpus Studies (IVACS) International Conference on "Analyzing Discourse in Context" The Graduate School of Education, Queen’s University, Belfast, Northern Ireland, 25 - 26 June, 2004. PDF version
  11. Löfberg L, Juntunen J-P, Nykanen A, Varantola K, Rayson P, Archer D. (2004). Using a semantic tagger as dictionary search tool. In Williams G. and Vessier S. (eds.) Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume I, pp. 127-134. ISBN 2-9522-4570-3.
  12. Archer, D. and Rayson, P. (2004) Using an historical semantic tagger as a diagnostic tool for variation in spelling. Presented at Thirteenth International Conference on English Historical Linguistics (ICEHL 13) University of Vienna, Austria 23-29 August, 2004.
  13. Sharoff, S., Babych, B., Rayson, P., Mudraya, O. and Piao, S. (2006) ASSIST: Automated Semantic Assistance for Translators. In companion proceedings to the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, Italy, April 3-7, 2006, pp. 139 - 142. ISBN 1-932432-60-4. PDF version
  14. Piao, S. L., Rayson, P., Mudraya, O., Wilson, A. and Garside, R. (2006) Measuring MWE compositionality using semantic annotation. In proceedings of COLING/ACL workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, July 23, 2006, Sydney, Australia. PDF version (Download data for human ratings)
  15. Andrew Wilson, Olga Moudraia (2006) Quantitative or Qualitative Content Analysis? Experiences from a cross-cultural comparison of female students' attitudes to shoe fashions in Germany, Poland and Russia. In Andrew Wilson, Paul Rayson and Dawn Archer (eds.) Corpus Linguistics around the world. Rodopi, Amsterdam.
