English Original Text
A Multilingual Database of Idioms
Aline Villavicencio, Timothy Baldwin, Benjamin Waldron
Abstract
This paper presents a possible architecture for a multilingual database of idioms. We discuss the challenges that idioms present to the creation of such a database and propose a possible encoding that maximises the amount of information that can be stored for different languages. Such a resource provides important information for linguistic, computational linguistic and psycholinguistic use, and allows for the comparison of different phenomena in different languages. This can provide the basis for a better understanding of regularities in idioms across languages.
1.Introduction
This work is concerned with enabling the creation of a multilingual database of idioms. Idioms are often defined as a group of words which have a different meaning when used together from the one it would have if the meaning of each word were taken individually (Collins, 2000). They comprise expressions like spill the beans, kick the bucket and pull strings, that are usually employed in everyday language to precisely express ideas and concepts that cannot be compressed into a single word. Even though some idioms are fixed, and do not present internal variation, such as ad hoc, there is also a large proportion of idioms that allow different degrees of internal variability, and with a variable number of elements. For example, the idiom spill the beans allows internal modification (spill mountains of beans), passivization (The beans were spilled on the latest edition of the report), topicalization (The beans, the opposition spilled), and so on.
As we can see, idioms are a highly heterogeneous kind of multiword expression, ranging from (semi-)fixed cases (e.g. kick the bucket) which only allow morphological inflection, to more flexible ones (e.g. spill the beans) which can undergo different types of syntactic variation and modification (Nunberg et al., 1994).
Moreover, for the latter case, the type of syntactic variation that these idioms allow is highly unpredictable (Riehemann, 2001). Even though these works focus their discussion on idioms in English, the same phenomena can also be found in idioms in other languages. Such variation tends to be a challenge for their successful (computational) linguistic treatment (Sag et al., 2002). In linguistics, for example, idioms have often been used as evidence for or against the properties of grammatical theories (e.g. must “syntactic theory” include transformational operations?, from Nunberg et al. (1994)). In computational linguistics, for applications such as machine translation, appropriate understanding/treatment of idioms is necessary for these systems to be able to deal with natural languages and to avoid generating unnatural or nonsensical sentences in the target language. There are even cases where a pair of corresponding idioms in two different languages may share the same properties (e.g. the other side of the coin in English and its literal translation in Portuguese, o outro lado da moeda, which is also a noun phrase idiom). But exactly how much variation do these idioms have? What is the proportion of idioms that are fixed in a given language? And what proportion have equivalents in other languages?
Having access to a multilingual database of cases and being able to analyse them can give us some insight into the nature of idioms, and into what is required of a proper treatment of idioms crosslingually. In this work we propose an encoding that supports the collection of idioms in several languages, and the mapping of equivalent parts.
2.Idioms across Languages
Idioms are commonly thought of as metaphors that have become fixed or fossilized over time. While in some cases the metaphor is transparent and can be easily understood even by non-native speakers (e.g. kill two birds with one stone as achieve two things at the same time), in other cases the metaphor is opaque and, if the idiom is not known by the hearer, it can lead to misinterpretations (e.g. kick the bucket as die).
Some of these metaphors can be found in idioms across languages, and in some cases, in very similar idioms. For instance, one idiom that can be found in both English and Portuguese and that shows full lexical, syntactic and semantic correspondence is in the red, which is no vermelho in Portuguese, where no is the contraction of in + the and vermelho means red; both idioms are prepositional phrases (PPs) and have the same meaning. However, there is a large range of variation to be found in idiom pairs across languages: some idioms do not have such a direct mapping, and may differ in one or more ways and/or may allow different forms of modification/variation. For example, some idiom pairs are syntactically and semantically but not lexically equivalent. One example is in the black and its Portuguese counterpart no azul (in the blue), where both are PP idioms and the only difference is in the choice of colour (blue instead of black); another is bring the curtain down on and its counterpart botar um ponto final em (put the final dot in), which are both verbal constructions. There are also idioms that are semantically equivalent, but realised using different constructions across languages. For example, in a corner and encurralado (meaning cornered) are semantically equivalent but realised by different constructions (a PP in English and an adjective in Portuguese). Finally, some idioms have multiple idiomatic equivalents in a second language, while others have none, and this information is also of importance (see Tanaka and Baldwin (2003) for a discussion of English and Japanese compound nouns in the context of a machine translation task).
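The kinds of correspondence described above can be summarised as a small classification. The following Python sketch is only an illustration: the names `Equivalence` and `classify` are hypothetical and do not come from the paper, and the three yes/no judgements are assumed to be supplied by a human annotator.

```python
from enum import Enum


class Equivalence(Enum):
    """Hypothetical labels for the correspondence types discussed in the text."""
    FULL = "lexical, syntactic and semantic"
    SYNTACTIC_SEMANTIC = "syntactic and semantic, not lexical"
    SEMANTIC_ONLY = "semantic only"
    NONE = "no idiomatic equivalent"


def classify(lexical: bool, syntactic: bool, semantic: bool) -> Equivalence:
    """Map three annotator judgements about an idiom pair to one label.

    Assumption of this sketch: a pair that is not semantically equivalent
    is not treated as an equivalent at all, and lexical equivalence is only
    considered when the pair is also syntactically equivalent.
    """
    if not semantic:
        return Equivalence.NONE
    if syntactic and lexical:
        return Equivalence.FULL
    if syntactic:
        return Equivalence.SYNTACTIC_SEMANTIC
    return Equivalence.SEMANTIC_ONLY


# English/Portuguese pairs taken from the text, with hand-assigned judgements.
pairs = {
    ("in the red", "no vermelho"): classify(True, True, True),
    ("in the black", "no azul"): classify(False, True, True),
    ("in a corner", "encurralado"): classify(False, False, True),
}
```

Idioms with several equivalents in the second language would simply contribute several pairs, and idioms with none would contribute no pair at all.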
The challenge is then to define a database design which is capable of encoding all the variation found in these phenomena as well as the correspondences between them in a common format. We propose a database design that can be used for such a task, allowing the maximum amount of information to be stored about an idiom and its counterparts in different languages.
3.A Possible Architecture
A typical session starts with the user entering some identification information, specifying his/her native language and then choosing a source language to be mapped to the target language (by default the user’s native language). All idioms from the source language are then made available to the user, who can browse through them, and enter the idiomatic equivalent(s) in the target language. For each idiom, the user is presented with an explanation of the meaning of the idiom and an example (both in English). The user is then asked to provide information about its syntactic variation (e.g. Can the idiom be topicalised?, Does it allow internal modification?, etc.), and about its mapping to the source language (if it exists). As discussed in Section 2, for a particular language pair, there may be considerable variation in the realisation of equivalent idioms. In order to capture this variation, we adopt the following procedure:
1.If the idiom in the target language is lexically, syntactically and semantically equivalent to the idiom in the source language (e.g. in the red and no vermelho), the user is asked to provide a word-to-word mapping of the idiom;
2.Otherwise, if they are syntactically and semantically equivalent, but not lexically (e.g. in the black and no azul), the user is asked to provide the mapping between the corresponding words, and for those that are lexically distinct, a translation to the source language;
3.Otherwise, if they are only semantically equivalent, the user is asked to input each word of the idiom and its translation to the source language.
For each of these cases, the position of the word in the idiom is also recorded, to account for variations in word order.
One example is new blood in English, where the adjective precedes the noun, and its equivalent in Portuguese sangue novo (blood new), where the adjective follows the noun.
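The procedure above, including the recording of word positions, can be pictured with a small data structure. This is a minimal sketch under stated assumptions, not the schema used by the authors: the class names `WordLink` and `IdiomMapping`, their fields, and the 0-based positions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordLink:
    """One word of the target-language idiom and what the annotator said about it."""
    target_word: str
    target_pos: int                    # 0-based position in the target idiom
    source_word: Optional[str] = None  # aligned source word (cases 1 and 2)
    source_pos: Optional[int] = None   # its 0-based position in the source idiom
    gloss: Optional[str] = None        # translation into the source language (cases 2 and 3)


@dataclass
class IdiomMapping:
    """A source idiom, one target equivalent, and the word-level links between them."""
    source: str
    target: str
    links: List[WordLink] = field(default_factory=list)

    def is_reordered(self) -> bool:
        """True if the aligned words occur in a different order in the two idioms."""
        aligned = [link for link in self.links if link.source_pos is not None]
        by_target = sorted(aligned, key=lambda link: link.target_pos)
        source_positions = [link.source_pos for link in by_target]
        return source_positions != sorted(source_positions)


# Word-to-word mapping with a change of word order: new blood / sangue novo.
new_blood = IdiomMapping(
    source="new blood",
    target="sangue novo",
    links=[
        WordLink("sangue", 0, source_word="blood", source_pos=1),
        WordLink("novo", 1, source_word="new", source_pos=0),
    ],
)

# Case 3: only semantically equivalent, so each target word just gets a gloss.
cornered = IdiomMapping(
    source="in a corner",
    target="encurralado",
    links=[WordLink("encurralado", 0, gloss="cornered")],
)
```

Storing both positions for every aligned word is what lets a difference such as adjective-noun versus noun-adjective order be recovered later, here through `is_reordered()`. A target word that corresponds to more than one source word, such as the contraction no, would need a richer link than this sketch provides.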
If more than one equivalent exists, then the same process applies to each of the equivalents. After that, or if there are no equivalents, the next idiom is displayed and the user goes through the same process.
4.Test Data
In order to test the design, the database currently contains a sample of 100 high-frequency English idioms extracted randomly from the Collins Cobuild Dictionary of Idioms (Villavicencio and Copestake, 2002). This is used as the starting point (source language seed) to collect translation-equivalent idioms in other languages. Initially, it is this mapping between English and other languages that is being tested, but the goal is to extend the database to support mappings between idioms in any two languages. This database can be accessed locally and also through a web interface, allowing users in different locations to browse the database and provide information about idioms in their native language.
5.Web Interface
The first step in the annotation process is to stipulate the target language, and optionally select the English idiom index number from which to start the annotation. At the present time, language selection is string-based and not normalized in any way, to avoid restricting the scope of annotation to any closed set of languages. The interface additionally has a cookie-based facility to identify the annotator, both for data maintenance purposes and for consistency in multi-session annotations.
Figure 1: Providing a translation and basic idiom properties
Figure 2: Word alignment (1)
In the case that these conditions are not met, a warning message is issued. At