Posted by: fennicienta | June 4, 2008

Translation Examples by MT System (Q3)

Now I am going to machine-translate a short Spanish text into five languages:

“Es una verdad universalmente aceptada que un soltero con posibles ha de buscar esposa.

Por muy poco que se sepa de los gustos u opiniones de tal varón cuando se incorpora a una comunidad, esa verdad tiene tanto arraigo en en la mente de las familias circundantes que se le considera, por derecho, propiedad de una u otra de sus hijas.

-Mi querido señor Bennet -le dijo un día su esposa a este caballero-, ¿te has enterado de que por fin se ha alquilado Netherfield Park?”

Catalan: “És una veritat universalment acceptada que un solter amb possibles ha de buscar dona.

Per molt poc que se sàpiga dels gustos o opinions de tal home quan s’incorpora a una comunitat, aquesta veritat té tant arrelament en en la ment de les famílies circumdants que se li considera, per dret, propietat d’una o una altra de les seves filles.

-El meu estimat senyor Bennet -li va dir un dia la seva dona a aquest cavaller-, t’has assabentat que per fi s’ha llogat Netherfield Park?”

Galician: “É unha verdade universalmente aceptada que un solteiro con posibles ten que buscar esposa.

Por moi pouco que se saiba dos gustos ou opinións de tal home cando se incorpora a unha comunidade, esa verdade ten tanto arraigamento en na mente das familias circundantes que se lle considera, por dereito, propiedade dunha ou outra das súas fillas.

-O meu querido señor Bennet -díxolle un día a súa esposa a este cabaleiro-, decatáchesche de que por fin se alugou Netherfield Park?”

English: ”It is a universally accepted truth that a bachelor with possible ones has to look for wife.

For very little that is known about the tastes or opinions of such a male when it|he|she is incorporated in a community, that truth has so much rooting in in the mind of the surrounding families that he is considered, for law|right, estate|property of one or another of its|his|her|their daughters.

-My dear Mr. Bennet -said it|him a day its|his|her|their wife to this gentleman-, you have found out about at last Netherfield Park having been rented?”

Portuguese: “É uma verdade universalmente aceitada que um solteiro com possíveis há de buscar esposa.

Por muito pouco que se saiba dos gostos ou opiniões de tal varão quando se incorpora a uma comunidade, essa verdade tem tanto arraigo em na mente das famílias circundantes que se lhe considera, por direito, propriedade de uma ou outra de suas filhas.

-Meu querido senhor Bennet -lhe disse um dia sua esposa a este cavalheiro-, te inteiraste que por último se alugou Netherfield Park?”

German: ”Es ist eine universalmente akzeptierte Wahrheit, dass ein Junggeselle mit möglichen Gattin suchen muss.

Durch sehr wenig der sich von den Geschmäcken oder Meinungen so eines Mannes, wenn er in eine Gemeinschaft eingegliedert wird, weiß, hat diese Wahrheit so viel Einwurzelung in im Verstand der umgebenden Familien, die ihn durch|für Recht hält, Eigenschaft|Eigentum von einer oder einer anderen seiner|ihrer Töchter.

-Mein beliebter|lieber Herr Bennet -er sagte einen Tag seine|ihre Gattin zu diesem Herrn|Ritter-, du hast davon erfahren, dass man endlich Netherfield Park gemietet|vermietet hat?”

Sources:

Text: Chapter I of “Orgullo y Prejuicio” (Pride and Prejudice) by Jane Austen

Translation Machines:

Comprendium Translator

Instituto Cervantes translation service

Posted by: fennicienta | May 5, 2008

Definition of some concepts of the Translation World (Q3)

Here I will explain some of the most frequently used terms, to make this subject easier to study.

  • Machine Translation, also referred to as MT, is, according to the Free Encyclopedia, a “sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its basic level, MT performs simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies.”
  • Machine aided Translation, or computer-assisted translation (CAT), is a form of translation wherein a human translator translates texts using computer software designed to support and facilitate the translation process. Some advanced computer-assisted translation solutions include controlled machine translation (MT). Integration of MT into computer-assisted translation has been implemented in various ways by various parties. Although this type of technology is neither widely known nor available to individual translators, carefully customized user dictionaries based on correct terminology significantly improve the accuracy of MT and, as a result, the efficiency of the translation process.
  • Multilingual content management: a multilingual website is usually a mixture of global and local content. Local content presents no particular content management issues; global content – which has to be translated across all language locales – does. Deciding where multiple language versions of content are going to be required and where content can be maintained separately for different locales is a critical decision that will affect how a site should be maintained and what it will cost.
  • Translation Technology: translation is the action of interpreting the meaning of a text and subsequently producing an equivalent text, also called a translation, that communicates the same message in another language. The text to be translated is called the “source text,” and the language it is to be translated into is called the “target language”; the final product is sometimes called the “target text.”
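
The "simple substitution of words" mentioned in the Machine Translation definition can be illustrated with a minimal Python sketch. The tiny Spanish-English lexicon and the function name `substitute_words` are assumptions made up for this example; they do not come from any of the MT systems used on this blog.

```python
# Hypothetical toy lexicon (Spanish -> English), invented for illustration only.
TOY_LEXICON = {
    "es": "is",
    "una": "a",
    "verdad": "truth",
    "universalmente": "universally",
    "aceptada": "accepted",
}


def substitute_words(sentence: str, lexicon: dict[str, str]) -> str:
    """Translate word by word; words missing from the lexicon are left unchanged."""
    words = sentence.lower().split()
    return " ".join(lexicon.get(word, word) for word in words)


print(substitute_words("Es una verdad universalmente aceptada", TOY_LEXICON))
# prints "is a truth universally accepted"
```

The output keeps the Spanish word order ("truth universally accepted"), which hints at why plain substitution is only the most basic level of MT.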

Sources:

Machine Translation. Wikipedia. Retrieved May 5, 2008, 11:58. From http://en.wikipedia.org/wiki/Machine_translation

Machine aided Translation. Wikipedia. Retrieved May 5, 2008, 12:03. From http://en.wikipedia.org/wiki/Computer-assisted_translation

Multilingual content management. Kitsite. Retrieved May 5, 2008, 12:09. From http://www.kitsite.com/articles/multilingual-content-management.html

Translation Technology. Wikipedia Translation. Retrieved May 5, 2008, 12:12. From

http://en.wikipedia.org/w/index.php?title=Translation_technology&redirect=no

http://en.wikipedia.org/wiki/Translation

Posted by: fennicienta | May 5, 2008

Main Characteristics of a Translation Task by FEMTI (Q3)

To start with, let's explain what FEMTI is. FEMTI, the Framework for Machine Translation Evaluation in ISLE, is a resource that helps MT evaluators define contextual evaluation plans. It consists of two interrelated classifications:

  • It lists possible characteristics of the contexts of use that are applicable to MT systems.
  • It lists the possible characteristics of an MT system, along with the metrics that were proposed to measure them.

Evaluators first specify the intended context of use by browsing the first classification. FEMTI then proposes a set of quality characteristics that are relevant to that context, using its embedded knowledge base. Evaluators can modify this set of quality characteristics and select evaluation metrics for each of them by browsing the second classification. Evaluators can then print the evaluation plan and execute the evaluation.

According to FEMTI the main characteristics of a translation task are the following:

  • Assimilation: “The ultimate purpose of the assimilation task (of which translation forms a part) is to monitor a (relatively) large volume of texts produced by people outside the organization, in (usually) several languages.”
  • Dissemination: “The ultimate purpose of dissemination is to deliver to others a translation of documents produced inside the organization.”
  • Communication: “The ultimate purpose of the communication task is to support multi-turn dialogues between people who speak different languages. The translation quality must be high enough for painless conversation, despite possible syntactically ill-formed input and idiosyncratic word and format usage.”

Sources:

Posted by: fennicienta | April 14, 2008

Three topics development (Q2)

The first project I have chosen to explain is VSDS: Viennese Sociolect and Dialect Synthesis, which is being developed by Friedrich Neubarth, who belongs to the OFAI Language Technology Group. One important means of natural human-computer interaction is spoken language, so for a variety of applications it is essential to have high quality speech synthesis for different languages. The outcome of this project will be high quality synthetic voices, which allow a computer to “speak” in different Viennese dialects/sociolects. Since the sources of these voices are pieces taken from actual human speech, the synthetic voices will sound very natural, close to human speech. With this technology it is possible to realize many applications, from the domains of education and tourism to art. A mobile sample application, a Viennese district guide capable of various dialects or variants, is also being developed within the project. In the research part of the project, efficient methods are investigated for developing synthetic voices for languages that are variants of other languages. Furthermore, it is necessary to employ methods for switching or shifting between the standard language and dialectal variants, since this mixing of standards corresponds to the everyday language use of many speakers. User tests are conducted to evaluate the quality of the synthetic voices and of the relevant sample applications.

The second research project is from the Edinburgh Language Technology Group. Ewan Klein and Claire Grover, as principal investigators from the University of Edinburgh, and Chris Manning from Stanford University have developed EASIE, which builds on existing techniques for information extraction (IE) in order to develop and implement improved methods for extracting semantic content from text. The results of the research are being used to significantly extend the functionality of Edinburgh’s existing XML-based LT-TTT software, in part by incorporating machine learning approaches developed at Stanford.

The last project I will focus on is K-Space, developed by Thierry Declerck from the Language Technology Lab. It is a network of leading research teams from academia and industry conducting integrative research and dissemination activities in semantic inference for automatic and semi-automatic annotation and retrieval of multimedia content. The aim of K-Space research is to narrow the gap between low-level content descriptions that can be computed automatically by a machine and the richness and subjectivity of semantics in high-level human interpretations of audiovisual media: the Semantic Gap. The Network of Excellence K-Space exploits the complementary expertise of project partners, enables resource optimization and fosters innovative research in the field. Specifically, K-Space integrative research focuses on three areas:

  • Content-based multimedia analysis:
    Tools and methodologies for low-level signal processing, object segmentation, audio/speech processing and text analysis, and audiovisual content structuring and description.
  • Knowledge extraction:
    Building of a multimedia ontology infrastructure, knowledge acquisition from multimedia content, knowledge-assisted multimedia analysis, context based multimedia mining and intelligent exploitation of user relevance feedback.
  • Semantic multimedia:
    Knowledge representation for multimedia, distributed semantic management of multimedia data, semantics-based interaction with multimedia and multimodal media analysis.

Sources:

Posted by: fennicienta | April 8, 2008

Recent Research Topics on Human Language Technology (Q2)

The following lines deal with the most recent research topics listed on some important Human Language Technology sites from different research centres in Europe:

The German Research Center for Artificial Intelligence is currently working on the following research projects:

  • CoSy-Cognitive Systems for Cognitive Assistants
  • HyLaP-Hybrid Language Processing Technologies for a personal associative information access and management application
  • K-Space-Knowledge Space of semantic inference for automatic annotation and retrieval of multimedia content
  • MESH-Multimedia Semantic Syndication for Enhanced News Services
  • MUSING-MUlti-Industry, Semantic-based Next Generation Business INtelliGence
  • PAVOQUE-PArametrisation of prosody and VOice QUality for concatenative speech synthesis in view of Emotion expression
  • QALL-ME-Question Answering Learning technologies in a multilingual and Multimodal Environment
  • RASCALLI-Responsive Artificial Situated Cognitive Agents Living and Learning on the Internet

In Ireland, the National Centre for Language Technology is working on:

  • CALL Computer Assisted Language Learning-Integrating CL/NLP/HLT Technology into CALL, CALL for Endangered Languages, CALL for Primary School Environments, CALL for Remedial Learners
  • Corpus Linguistics- Statistical and Rule-Based MT (SMT, RBMT), Example-Based MT (EBMT), Translation Memories (TMs), Boosting Existing MT Systems, Machine-Aided Translation (MAT), Computer-Aided Translation (CAT), Controlled Languages
  • Treebank-Based Unification Grammar Acquisition-Automatic Feature-Structure Annotation Algorithms, Subcategorisation Frame Extraction, Wide-Coverage Robust Probabilistic Unification Grammar Acquisition, PCFG-Based LFG Approximation, HPSG Acquisition, Multilingual Treebank-Based Grammar Acquisition
  • Semantics-Discourse Representation Theory, Linear-Logic Based Semantics, Computation of Logical Forms from Treebanks, Open-Domain Question Answering Systems
  • Speech Technology- Speaker Characterisation, Audio Classification, Retrieval and Coding, Human Computer Interfaces (HCIs)
  • Multilingual Information Retrieval/Extraction
  • Language Evolution

The OFAI Language Technology Group is currently involved in four projects, some of them in cooperation with Austrian university departments and companies.

  • VSDS: Viennese Sociolect and Dialect Synthesis (2007 – 2009)
  • SEMPRE: Semantically Aware Profiling for Recommenders (2007 – 2008)
  • INSPIRATION (2006 – 2010)
  • RASCALLI: Responsive Artificial Situated Cognitive Agents Living and Learning on the Internet (2006 – 2008)

In Edinburgh, the Language Technology Group is carrying out research and development on the following topics:

  • EASIE-Combining Shallow Semantics and Domain Knowledge
  • TXM-Text Mining for Biomedical Content Curation
  • CROSSMARC-Cross-lingual Multi-agent Retail Comparison
  • SQUAD-Smart Qualitative Data: Methods and Community Tools for Data Mark-Up
  • SEER-Machine Learning for Named Entity Recognition
  • BOPCRIS-Named entity tagging of historical parliamentary proceedings
  • Synthesis-Integrated Models and Tools for Fine-Grained Prosody in Discourse
  • JAST-Joint Action Science and Technology
  • AMI and AMIDA-AMI consortium projects that are developing technologies for meeting browsing and to assist people participating in meetings from a remote location
  • Collaborating Using Diagrams-Study of how pairs collaborate when planning a route on a map

Sources:

Posted by: fennicienta | March 24, 2008

European research centres for Human Language Technologies (Q1)

The following are some of the European research centers for Human Language Technologies:

  • The Edinburgh Language Technology Group (LTG) is a research and development group that has been working in the area of natural language engineering since the early 1990s. The LTG was originally established as part of the Human Communication Research Centre, and is now based in the Institute for Communicating and Collaborative Systems of the Division of Informatics, University of Edinburgh, one of the largest communities of natural language processing specialists in Europe.
  • The National Centre for Language Technology (NCLT), directed by Professor Josef van Genabith, conducts research into the processing of human language by computers, such as speech recognition and synthesis, machine translation, human-computer interfaces, information retrieval and extraction, the teaching and learning of languages using computers, and software localisation and globalisation. Research in Human Language Technology (HLT) is interdisciplinary and includes Natural Language Processing (NLP) and Computational Linguistics (CL). HLT has substantial economic implications and potential. The centre carries out basic research and develops applications.
  • The Language Technology Lab has as its mission the improvement of language technology through novel computational techniques for processing text, speech and knowledge, a deeper understanding of human language and thought, and the study of the true needs of the end user and the demands of the market. It develops novel and improved applications in three areas: information and knowledge management, document production, and natural communication. One of its commercial activities is the indexing of German and English texts using the IDX software package.
  • Language Technology (LT) has formed a major research area at the Austrian Research Institute for Artificial Intelligence (OFAI) since its inception in 1984. The group conducts research in modelling and processing human languages, especially German. This includes constructing linguistic resources (such as lexicons, grammars, discourse models), processing algorithms (such as morphological components, parsers, generators, speech synthesizers, discourse processing components), and application prototypes (such as natural language interfaces, advisory systems and concept-to-speech systems).

Resources:

Posted by: fennicienta | March 24, 2008

Definition of Human Language Technology (Q1)

Definitions of this topic on the Net are numerous and varied. Here are two of them:

Wikipedia, under the name of Natural Language Processing, defines our object of study as

“a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.”

According to the Language Technology Lab, in a text written by Hans Uszkoreit, Human Language Technology

“comprises computational methods, computer programs and electronic devices that are specialized for analyzing, producing or modifying texts and speech. These systems must be based on some knowledge of human language. Therefore language technology defines the engineering branch of computational linguistics”.

Searching for Hans Uszkoreit, we can find his curriculum vitae: Uszkoreit studied Linguistics and Computer Science at the Technical University of Berlin from 1973 to 1977 and at the University of Texas at Austin from 1977 to 1981. During this time he also worked as a research associate in a large machine translation project at the Linguistics Research Center. He received his Ph.D. (Doctor of Philosophy) in linguistics from the University of Texas in 1984. From 1982 until 1986, he worked as a computer scientist at the Artificial Intelligence Center of SRI International in Menlo Park, Ca. During this time he was also affiliated with the Center for the Study of Language and Information at Stanford University as a senior researcher and later as a project leader. In 1986 he spent six months in Stuttgart on an IBM (International Business Machines Corporation) Research Fellowship at the Science Division of IBM Germany. In December 1986 he returned to Stuttgart to work for IBM Germany as a project leader in the project LILOG (Linguistic and Logical Methods for the Understanding of German Texts). During this time, he also taught at the University of Stuttgart.

Among his relevant publications, we can cite the following:

  • Uszkoreit, H. (2007) Methods and Applications for Relation Detection. In: Proceedings of the Third IEEE International Conference on Natural Language Processing and Knowledge Engineering, Beijing, 2007.
  • Uszkoreit, H., F. Xu, J. Steffen and I. Aslan (2006) The pragmatic combination of different cross-lingual resources for multilingual information services. In: Proceedings of LREC 2006, Genova, Italy, May, 2006.
  • Uszkoreit, H. (2000): Sprache und Sprachtechnologie bei der Strukturierung digitalen Wissens. In: W. Kallmeyer (Ed.) Sprache in neuen Medien, Institut für Deutsche Sprache, Jahrbuch 1999, De Gruyter, Berlin.
  • Uszkoreit, H. (1999): Sprachtechnologie für die Wissensgesellschaft: Herausforderungen und Chancen für die Computerlinguistik und die theoretische Sprachwissenschaft. In: F. Meyer-Krahmer und S. Lange (Eds.), Geisteswissenschaften und Innovationen, Physica Verlag.
  • Uszkoreit, H. (1998): Cross-Lingual Information Retrieval: From Naive Concepts to Realistic Applications. In: Language Technology in Multimedia Information Retrieval, Proceedings of the 14th Twente Workshop on Language Technology.

Sources:

Posted by: fennicienta | February 9, 2008

The Outstanding Web Browsers

Windows Internet Explorer (formerly Microsoft Internet Explorer, abbreviated MSIE), commonly abbreviated to IE, is a series of graphical web browsers developed by Microsoft and included as part of the Microsoft Windows line of operating systems starting in 1995. It has been the most widely used web browser since 1999. The project was started by Thomas Reardon and subsequently led by Benjamin Slivka. “It has been designed to view the broadest range of web pages and to provide certain features within the operating system, including Microsoft Update. During the heyday of the historic browser wars, Internet Explorer superseded Netscape by supporting many of the progressive features of the time.” The adoption rate of Internet Explorer seems to be closely related to that of Microsoft Windows, as it is the default web browser that comes with Windows. Since the integration of Internet Explorer 2.0 with Windows 95 OSR 1 in 1996, and especially after version 4.0’s release, adoption accelerated greatly: from below 20% in 1996 to about 40% in 1998 and over 80% in 2000.

Mozilla Firefox is a web browser, gopher client and FTP client project descended from the Mozilla Application Suite, managed by the Mozilla Corporation. Firefox had 16.80% of the recorded market share in Web browsers as of December 2007, making it the second-most-popular browser in current use worldwide after Internet Explorer. It uses the open-source Gecko layout engine, which implements some current Web standards plus a few features which are intended to anticipate likely additions to the standards. Firefox includes tabbed browsing, a spell checker, incremental find, live bookmarking, a download manager, and a search system that uses Google. Functions can be added through more than 2,000 add-ons created by third party developers. It runs on various versions of Microsoft Windows, Mac OS X, Linux, and many other Unix-like operating systems. Its current stable release is version 2.0.0.12, released on February 7, 2008. Firefox’s source code is under the terms of the Mozilla tri-license as free and open source software.

Safari is a web browser developed by Apple Inc. and included in Mac OS X. It was first released as a public beta on January 7, 2003, and is the default browser in Mac OS X v10.3 and later. A beta version for Microsoft Windows was released for the first time on June 11, 2007 with support for Windows XP and Windows Vista, although it was also functional, albeit unofficially, on Windows 2000. Safari has also been run unofficially on Linux under Wine, but the graphical user interface (GUI) and web graphics do not render properly. It has a bookmark management scheme that functions like the iTunes jukebox software, integrates Apple’s QuickTime multimedia technology, and features a tabbed-browsing interface. A web search box is a standard component of the Safari interface, as are software services that automatically fill out web forms, manage passwords via Keychain and spell check entries into web page text fields. The browser also includes an integrated pop-up ad blocker. Also from Apple is the Web Inspector — a DOM Inspector-like utility that lets users and developers browse the Document Object Model of a web page.

Opera is a web browser and Internet suite developed by the Opera Software company. Opera handles common Internet-related tasks such as displaying web sites, sending and receiving e-mail messages, managing contacts, IRC online chatting, downloading files via BitTorrent, and reading web feeds. Opera is offered free of charge for personal computers and mobile phones, but for other devices it must be paid for. Features of Opera include high performance, tabbed browsing, page zooming, mouse gestures, and an integrated download manager. Its security features include built-in phishing protection, strong encryption when browsing secure web sites, and the ability to delete private data such as cookies and browsing history by clicking a button. It is currently the fourth most widely used web browser for personal computers. Opera has a stronger market share, however, on mobile devices such as mobile phones, smartphones, and personal digital assistants.

Sources:

Posted by: fennicienta | February 6, 2008

Web Browser

According to the Free Encyclopedia, a web browser is “a software application that enables a user to display and interact with text, images, videos, music and other information typically located on a Web page at a website on the World Wide Web or a local area network.” Text and images on a Web page can contain hyperlinks to other Web pages, and browsers allow users to quickly and easily access information provided on many Web pages at many websites by traversing these links. Web browsers format HTML information for display, so the appearance of a Web page may differ between browsers.

Browsers communicate with Web servers primarily using HTTP to fetch Web pages. HTTP allows Web browsers to submit information to Web servers as well as fetch Web pages from them. “Pages are located by means of a URL (uniform resource locator), which is treated as an address, beginning with http: for HTTP access. Many browsers also support a variety of other URL types and their corresponding protocols, such as gopher: for Gopher, ftp: for FTP, rtsp: for RTSP, and https: for HTTPS (an SSL encrypted version of HTTP).” The file format for a Web page is usually HTML (hyper-text markup language) and is identified in the HTTP protocol using a MIME content type. Most browsers natively support a variety of formats in addition to HTML, such as the JPEG, PNG and GIF image formats, and can be extended to support more through the use of plugins. The combination of HTTP content type and URL protocol specification allows Web page designers to embed images, animations, video, sound, and streaming media into a Web page, or to make them accessible through the Web page.

In 1992, Tony Johnson released the MidasWWW browser. Based on Motif/X, MidasWWW allowed viewing of PostScript files on the Web from Unix and VMS, and even handled compressed PostScript. Another early popular Web browser was ViolaWWW, which was modeled after HyperCard. However, the explosion in popularity of the Web was triggered by NCSA Mosaic which was a graphical browser running originally on Unix but soon ported to the Amiga platform, and later the Apple Macintosh and Microsoft Windows platforms. Version 1.0 was released in September 1993.

The browser wars put the Web in the hands of millions of ordinary PC users, but showed how commercialization of the Web could stymie standards efforts. Both Microsoft and Netscape liberally incorporated proprietary extensions to HTML in their products, and tried to gain an edge by product differentiation, leading to the acceptance by the W3C of the Cascading Style Sheets proposed by Håkon Wium Lie over Netscape’s JavaScript Style Sheets (JSSS).

Sources:

Posted by: fennicienta | February 2, 2008

Minority languages on the Internet

It is estimated that there are about 6000 spoken languages in the world. Of these, about 50% can reasonably be classified as “moribund”, 40% as endangered, and the remaining 10% as safe. The World Wide Web offers minority languages the opportunity to reach a wide audience at a relatively low cost compared with traditional media. The presence of minority languages in this new medium is as important as their presence in traditional media.
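
To make these proportions concrete, the short Python sketch below turns the percentages into approximate language counts. The total of 6000 and the three shares are the estimates quoted above; the function name `languages_per_category` is only an illustrative choice.

```python
TOTAL_LANGUAGES = 6000  # estimate quoted in the text
SHARES = {"moribund": 50, "endangered": 40, "safe": 10}  # percentages from the text


def languages_per_category(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into approximate numbers of languages."""
    return {name: total * percent // 100 for name, percent in shares.items()}


print(languages_per_category(TOTAL_LANGUAGES, SHARES))
# prints {'moribund': 3000, 'endangered': 2400, 'safe': 600}
```

By this estimate, only about 600 languages would count as safe.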

Nowadays, it is clear that English leads the Internet. But will this last forever? According to Bobby Johnson, “the traditional grip that the English language has had on the web is sliding away, as China increasingly begins to assert itself on the internet.” This means that, day by day, other languages are gaining ground on English. Statistics show the most important languages: as noted, English holds first position with 30.1%; Chinese comes second, growing by the day, with around 14.1%; Spanish is third with 9.0%; then come Japanese, French, German… From these numbers we can deduce that the most important languages online are the ones most widely spoken in the world. But while Spanish, French and German hold their place, Chinese keeps growing.

Minority languages such as Basque, which has a reduced number of speakers, are now increasing their presence on different sites, which helps to preserve them in a way. Languages in danger of extinction have in the Web a good way to raise their profile, reach as many people as possible and make them aware that the language is still alive. Also important here are websites developed in a bilingual mode: Spanish-Basque, Spanish-Catalan, etc.

Sources:
