A typical session starts with the user entering some identification information, specifying his/her native language, and then choosing a source language to be mapped to the target language (by default the user’s native language). All idioms from the source language are then made available to the user, who can browse through them and enter the idiomatic equivalent(s) in the target language. For each idiom, the user is presented with an explanation of its meaning and an example (both in English). The user is then asked to provide information about its syntactic variation (e.g. Can the idiom be topicalised? Does it allow internal modification?) and about its mapping to the source language (if it exists). As discussed in Section 2, for a particular language pair there may be considerable variation in the realisation of equivalent idioms. In order to capture this variation, we adopt the following procedure:
1. If the idiom in the target language is lexically, syntactically and semantically equivalent to the idiom in the source language (e.g. in the red and no vermelho), the user is asked to provide a word-to-word mapping of the idiom;
2. Otherwise, if they are syntactically and semantically equivalent but not lexically equivalent (e.g. in the black and no azul), the user is asked to provide the mapping between the corresponding words and, for those that are lexically distinct, a translation to the source language;
3. Otherwise, if they are only semantically equivalent, the user is asked to input each word of the idiom and its translation to the source language.
For each of these cases, the position of the word in the idiom is also recorded, to account for variations in word order.
One example is new blood in English, where the adjective precedes the noun, and its equivalent in Portuguese sangue novo (blood new), where the adjective follows the noun.
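The three cases and the recorded word positions can be pictured as a small data structure. The following is only an illustrative sketch, not the schema actually used by the database: the names `Equivalence`, `WordLink` and `IdiomMapping` are hypothetical, positions are assumed to be 0-based, and treating no as aligned with both in and the is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Equivalence(Enum):
    LEXICAL = 1    # case 1: lexically, syntactically and semantically equivalent
    SYNTACTIC = 2  # case 2: syntactically and semantically equivalent only
    SEMANTIC = 3   # case 3: semantically equivalent only


@dataclass(frozen=True)
class WordLink:
    target_pos: int                   # 0-based position of the word in the target idiom
    source_pos: Optional[int] = None  # aligned source position; None if there is no word mapping
    gloss: Optional[str] = None       # source-language translation of a lexically distinct word


@dataclass
class IdiomMapping:
    source: List[str]
    target: List[str]
    equivalence: Equivalence
    links: List[WordLink]

    def aligned_pairs(self):
        """Return (source word, target word) pairs for every aligned link."""
        return [(self.source[link.source_pos], self.target[link.target_pos])
                for link in self.links if link.source_pos is not None]

    def reordered(self):
        """True when the aligned words occur in a different order in the two idioms."""
        aligned = sorted((link for link in self.links if link.source_pos is not None),
                         key=lambda link: link.source_pos)
        targets = [link.target_pos for link in aligned]
        return targets != sorted(targets)


# Case 1 with a word-order difference: new blood / sangue novo.
new_blood = IdiomMapping(
    source=["new", "blood"], target=["sangue", "novo"],
    equivalence=Equivalence.LEXICAL,
    links=[WordLink(target_pos=1, source_pos=0), WordLink(target_pos=0, source_pos=1)],
)

# Case 2: in the black / no azul, where azul is lexically distinct and glossed as "blue".
in_the_black = IdiomMapping(
    source=["in", "the", "black"], target=["no", "azul"],
    equivalence=Equivalence.SYNTACTIC,
    links=[WordLink(target_pos=0, source_pos=0),
           WordLink(target_pos=0, source_pos=1),
           WordLink(target_pos=1, source_pos=2, gloss="blue")],
)
```

Storing positions on each link, rather than relying on list order, is what lets a check such as `reordered()` detect the adjective-noun inversion in the new blood example.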
If more than one equivalent exists, then the same process applies to each of the equivalents. After that, or if there are no equivalents, the next idiom is displayed and the user goes through the same process.
4. Test Data
In order to test the design, the database currently contains a sample of 100 high-frequency English idioms extracted randomly from the Collins Cobuild Dictionary of Idioms (Villavicencio and Copestake, 2002). This is used as the starting point (source language seed) to collect translation-equivalent idioms in other languages. Initially, it is this mapping between English and other languages that is being tested, but the goal is to extend the database to support mappings between idioms in any two languages. This database can be accessed locally and also through a web interface, allowing users in different locations to browse the database and provide information about idioms in their native language.
5. Web Interface
The first step in the annotation process is to stipulate the target language, and optionally select the English idiom index number from which to start the annotation. At the present time, language selection is string-based and not normalized in any way, to avoid restricting the scope of annotation to any closed set of languages. The interface additionally has a cookie-based facility to identify the annotator for data maintenance purposes and also consistency in multi-session annotations.
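As a rough illustration of how a cookie-based annotator facility could work in a CGI setting, the sketch below reads an identifier from the request's cookie header and issues a new one when none is present. It is not the code behind the interface: the cookie name `annotator_id`, the function `identify_annotator` and the one-year lifetime are all assumptions made for the example.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "annotator_id"  # hypothetical cookie name


def identify_annotator(cookie_header):
    """Return (annotator id, Set-Cookie header line or None).

    cookie_header is the raw value of the request's Cookie header; in a CGI
    script this would typically come from the HTTP_COOKIE environment variable.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    if morsel is not None:
        # Returning annotator: reuse the stored identifier across sessions.
        return morsel.value, None

    # New annotator: generate an identifier and ask the browser to keep it.
    new_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = new_id
    out[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed lifetime of one year
    return new_id, out.output(header="Set-Cookie:")
```

Because the identifier is opaque, the target language can still be stored exactly as the annotator typed it, in keeping with the unnormalized, string-based language selection.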
Figure 1: Providing a translation and basic idiom properties
Figure 2: Word alignment (1)
In the case that these conditions are not met, a warning message is issued. At present, we do not attempt to make any further classification of the nature of mismatch for idioms that are not syntactically equivalent, nor do we attempt to classify the construction type of syntactically-equivalent idioms.
After annotating each idiom pair, the annotator is given the option of adding an additional translation for the source language idiom, or alternatively proceeding to the next idiom. Additionally, the annotator can flag a source language idiom as having no target language equivalent (see Figure 1).
Figure 3: Word alignment (2)
The web interface is publicly accessible at lingo.stanford.edu/cgi-bin/annotate/mli.cgi in the form of a CGI script.
6. Lexical Database
The work reported in the paper relates to a larger project to develop a lexical database (Copestake et al., 2004). This lexical database is primarily for use within a grammar development environment. It provides a resource for the association of stems with grammatical (that is, syntactic and semantic) information. In addition to grammatical information, entries are associated with bookkeeping information (such as language and dialect) and other information. For example, by linking to a semantic database containing detailed, fully-expanded lexical semantics, we can provide an efficient index for generation, or a data source for purposes. The existence of such a base lexical component within a grammar development environment provides a number of advantages over alternative approaches, including ease of maintenance, efficiency, and the benefits gained by utilising bookkeeping information and data from secondary sources.
By taking advantage of database functionality, we can link idioms in the database of idioms discussed in this paper with idiomatic entries in the lexical database.
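A minimal sketch of such a link, using an in-memory SQLite database, is given here. The table names (`idiom`, `lex_entry`), their columns and the identifier `spill_the_beans_1` are hypothetical and chosen only for illustration; the schema of the actual databases is not described at this level of detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lex_entry (          -- hypothetical lexical database table
        lex_id   TEXT PRIMARY KEY,
        stem     TEXT NOT NULL,
        language TEXT NOT NULL
    );
    CREATE TABLE idiom (              -- hypothetical idiom database table
        idiom_id INTEGER PRIMARY KEY,
        text     TEXT NOT NULL,
        language TEXT NOT NULL,
        lex_id   TEXT REFERENCES lex_entry(lex_id)  -- NULL when no lexical entry exists yet
    );
""")
conn.execute("INSERT INTO lex_entry VALUES ('spill_the_beans_1', 'spill', 'English')")
conn.execute("INSERT INTO idiom VALUES (1, 'spill the beans', 'English', 'spill_the_beans_1')")
conn.execute("INSERT INTO idiom VALUES (2, 'in the red', 'English', NULL)")


def lexical_info(idiom_text):
    """Follow the identifier from an idiom to its lexical database entry, if any."""
    return conn.execute(
        "SELECT l.lex_id, l.stem FROM idiom AS i "
        "JOIN lex_entry AS l ON l.lex_id = i.lex_id "
        "WHERE i.text = ?",
        (idiom_text,),
    ).fetchone()
```

Here `lexical_info("spill the beans")` follows the shared identifier to the lexical entry, while an idiom without a linked entry simply yields no row.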
As well as basic simplex lexical entries such as bombard, the lexical database supports multiword expressions. These we may divide into two classes: those which allow for internal variation, and those which do not.
Consider firstly those idioms which allow for internal variation, for example spill the beans and variations thereof. In the lexical database we associate each such idiom with a template. This template specifies the necessary syntactic and semantic components of the idiom. For example, spill the beans and rock the boat are syntactically composed of a verb and associated object; in the first case we require that the verb be (an idiomatic form of) the verb spill; in the second case, we require (an idiomatic form of) the verb rock; and so on. We also require that the simplex lexicon be augmented to include entries for these idiomatic word forms (these idiomatic simplex forms are generated by overriding certain grammatical information in the non-idiomatic basic simplex entry; e.g. the idiomatic spill differs from the non-idiomatic spill only in specifying an idiomatic semantics). For a discussion of a specific approach to encoding such idioms within a grammar see Copestake et al. (2002).
Those idioms which do not allow for internal variation (ad hoc being an example) may trivially be treated in the same manner as basic simplex entries.
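The treatment of the two classes can be sketched as follows. This is a simplified illustration under stated assumptions, not the lexical database's actual representation: the class names, the predicate-style semantics strings (`spill_rel`, `i_spill_rel`, `ad_hoc_rel`), the `fixed_mwe` category and the representation of the object as a plain stem are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimplexEntry:
    stem: str
    category: str
    semantics: str  # hypothetical predicate name standing in for the entry's semantics


def idiomatic_form(base, semantics):
    """Derive an idiomatic entry by overriding only the semantics of the base entry."""
    return replace(base, semantics=semantics)


@dataclass(frozen=True)
class VerbObjectTemplate:
    """Template for idioms composed of a verb and its associated object."""
    name: str
    verb: SimplexEntry  # must be the idiomatic form of the verb
    object_stem: str    # assumption: the object is identified by its stem alone

    def matches(self, verb_stem, object_stem):
        return (verb_stem, object_stem) == (self.verb.stem, self.object_stem)


# Non-idiomatic entry and the idiomatic form generated from it.
spill = SimplexEntry(stem="spill", category="verb", semantics="spill_rel")
i_spill = idiomatic_form(spill, "i_spill_rel")
spill_the_beans = VerbObjectTemplate("spill the beans", verb=i_spill, object_stem="beans")

# An idiom without internal variation is stored like any other simplex entry.
ad_hoc = SimplexEntry(stem="ad hoc", category="fixed_mwe", semantics="ad_hoc_rel")
```

Deriving `i_spill` with `replace` mirrors the overriding described above: everything except the semantics is inherited unchanged from the non-idiomatic entry.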
The two classes of idiom outlined above are stored within distinct tables in the lexical database, each idiom being indexed by a unique identifier. Using the identifiers of the idioms in the two data sources, entries in the database of idioms are linked to the grammatical and other information contained in the lexical database, and via the lexical database to further potentially useful sources of information.
7. Discussion
The multilingual idiom database provides important information for linguistic, computational linguistic and psycholinguistic use, and allows for the comparison of different phenomena in different languages. For instance, it may be the case that families of languages have very similar idiom equivalents and the same patterns of modification within them, and this can provide the basis for a better understanding of regularities in idioms across languages. Orthogonally, the semantic mappings may provide evidence supporting the claim that languages base idioms on common metaphors (Neumann, 1999). Moreover, the possibility of analyzing the different degrees of flexibility allowed by different languages for the same idiom is also valuable (e.g. in analysing idiom avoidance in bilinguals (Laufer, 2000)), and the presence (or absence) of certain idioms in different languages may also be of interest (e.g. for historical studies). Finally, such a database may contain data from different speakers of the same language, and provide grounds for investigation of the variation in individuals’ intuitions into, e.g. modification effects and semantic alignment.
8. Conclusion