Copyright © 2010-2012 University of Padua, Italy.
This work is licensed under a Creative Commons - NonCommercial - ShareAlike License. This copyright applies to the ASIt Linguistic Linked Dataset and accompanying documentation in RDF. This dataset uses W3C's RDF technology, an open Web standard that can be freely used by anyone.
The Atlante Sintattico d'Italia (Syntactic Atlas of Italy, ASIt) enterprise builds on a long-standing tradition of collecting and analysing linguistic corpora, which has given rise to several efforts and projects over the years. First, ASIt accounts for minimally different variants within a sample of closely related languages; it therefore does not need a thorough part-of-speech (POS) disambiguation, since the "trivial" identification of basic POS (e.g. nouns vs. verbs) is not enough to capture cross-linguistic differences between closely related languages. Secondly, the linguistic variants cannot be reduced to lexical distinctions only: syntactic differences are in general unpredictable on the basis of the properties of single lexical items. What is needed is a specific tag set designed to capture sentence-level phenomena without taking POS tags into consideration. As a consequence, while other tag sets are designed to carry out a coarse linguistic analysis of a vast corpus, the ASIt tag set aims to capture fine-grained grammatical differences by comparing various dialectal translations of the same sentence. Moreover, in order to pin down these subtle asymmetries, the linguistic analysis must be carried out manually.
To explain why the requirements of ASIt are so specific, we have to consider two aspects: the nature of Italian dialects, and the kind of linguistic theory ASIt aims to interact with. The Italian dialectal area presents a kind of variation that involves parametric choices affecting many general aspects of syntax, morphology, and phonology. The information we want to gather concerns not only the presence of a certain element, but also its absence; an element can be omitted only in some constructions and in conjunction with specific characteristics of the language. For this reason, ASIt proposed the creation of a specific set of tags starting from a universal core shared by all languages (on the basis of the work done by DynaSAND), and subsequently developing a language-specific periphery which is compatible with other projects.
Dialectal data stored in the ASIt were gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. These data were collected by means of questionnaires consisting of sets of Italian sentences: dialectal speakers were asked to translate the sentences into their dialects and write their translations in the questionnaire; therefore, each questionnaire is associated with many parallel dialectal translations. At present, there are eight different questionnaires written in Italian and almost 500 corresponding questionnaires written in more than 240 different dialects, for a total of more than 54,000 sentences and more than 40,000 tags stored in the data resource managed by the ASIt digital library system.
RDF Schema of the Dataset
The RDF/S modeling the ASIt Linguistic Linked Dataset is organized in three main conceptual areas:
Geographical area | It is the place where a given dialect is spoken and where a speaker is born. |
Derivation area | It focuses on the background of the speaker: the level of knowledge of the dialect, the particular variety of the dialect, the birthplace, the ancestors, the document that she/he translated. |
Tagging area | It is how the document is structured and how it has been tagged (at a sentence level and at a word level). |
A relevant aspect of the ASIt curated database is that it explicitly models sentence-level tagging, which is not modeled by any other of the presented linguistic projects. Furthermore, we have developed a language-specific set of POS tags which is suitable for ASIt dialectal data but, at the same time, allows these data to be linked to other databases of dialect syntax. We can therefore imagine the creation of a language-specific tag set as starting from a universal core, shared by all languages, and subsequently developing a language-specific periphery, which is compatible with other databases and able to classify language-specific structures.
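As an illustration of the two tagging levels described above, the following minimal Python sketch keeps word-level tags on each word and sentence-level tags on the sentence as a whole, then compares the sentence-level tags of two parallel translations. The class names, field names, tag labels, and placeholder texts are all hypothetical and chosen for illustration only; they are not the ASIt schema or its tag set.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """A single word of a translated sentence (hypothetical structure)."""
    text: str
    tags: list[str] = field(default_factory=list)  # word-level tags


@dataclass
class Sentence:
    """One dialectal translation of a questionnaire sentence (hypothetical)."""
    text: str
    dialect: str
    words: list[Word] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)  # sentence-level tags


def sentence_tag_differences(a: Sentence, b: Sentence) -> set[str]:
    """Return the sentence-level tags carried by exactly one of the two translations."""
    return set(a.tags) ^ set(b.tags)


# Placeholder translations of the same Italian sentence; the tag labels are invented.
translation_a = Sentence(
    text="<translation in dialect A>",
    dialect="dialect-A",
    tags=["hypothetical-tag-1", "hypothetical-tag-2"],
)
translation_b = Sentence(
    text="<translation in dialect B>",
    dialect="dialect-B",
    tags=["hypothetical-tag-1"],
)

difference = sentence_tag_differences(translation_a, translation_b)
print(difference)
```

Keeping the sentence-level tags separate from the word-level ones makes it possible to compare two translations without inspecting any individual word, which is the kind of comparison the tag set is meant to support.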
In the following Figure we report the main classes and properties defining the RDF/S of the dataset.
As far as the vocabulary adopted in this specification is concerned, we use the namespaces and prefixes reported in the following table; asit is the only vocabulary which is not inherited from other domains.
Prefix | Namespace | Description |
---|---|---|
asit | http://purl.org/asit/terms/ | ASIt vocabulary terms |
dcterms | http://purl.org/dc/terms/ | Dublin Core terms |
foaf | http://xmlns.com/foaf/0.1/ | Friend of a friend |
geo | http://www.w3.org/2003/01/geo/wgs84_pos# | WGS84 Geo Positioning |
gn | http://www.geonames.org/ontology# | GeoNames Ontology |
owl | http://www.w3.org/2002/07/owl# | OWL vocabulary terms |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF vocabulary terms |
rdfs | http://www.w3.org/2000/01/rdf-schema# | RDF Schema |
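The prefixes in the table can be used to expand prefixed names such as asit:sentence into full IRIs. The sketch below is a minimal illustration using only the Python standard library; the function name expand is an assumption introduced here, and the W3C and GeoNames namespaces are written in their published form, which ends in "#".

```python
# Prefix-to-namespace map taken from the table above.
PREFIXES = {
    "asit": "http://purl.org/asit/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "gn": "http://www.geonames.org/ontology#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'asit:sentence' into a full IRI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


print(expand("asit:sentence"))  # http://purl.org/asit/terms/sentence
print(expand("geo:lat"))        # http://www.w3.org/2003/01/geo/wgs84_pos#lat
```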
The ASIt curated database was the starting point for defining the RDF/S underlying the ASIt Linguistic Linked Dataset presented here. In the table below we report the OWL data type properties of the presented classes.
Class | OWL Datatype Properties |
---|---|
Region | gn:officialName, asit:geographicPartition, asit:note |
Province | gn:officialName, gn:shortName, asit:note |
Town | gn:officialName, geo:alt, geo:lat, geo:long, gn:population, asit:note, asit:provinceCapital, asit:provinceLittoral, asit:altimetricArea, asit:mountainTown, asit:surface |
Dialect | asit:dialectName |
Actor | foaf:firstName, foaf:lastName, foaf:name, foaf:birthday, foaf:gender, foaf:mailbox, asit:placeOfBirth, asit:education, asit:job, asit:country, asit:lang, asit:note, asit:affiliation |
Document | dcterms:title, dcterms:date |
Sentence | asit:sentence, asit:transcription, asit:note |
Word | asit:wordText, asit:transcription |
Tag | asit:tagDescription, asit:mandatory |
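One possible use of the table above is to check that a record only carries the datatype properties listed for its class. The following Python sketch encodes a few rows of the table as a dictionary and reports any property that is not listed. The function name unknown_properties and the example record with its placeholder values are hypothetical; only the class and property names come from the table.

```python
# A subset of the class/property table above, encoded as sets of prefixed names.
DATATYPE_PROPERTIES = {
    "Region": {"gn:officialName", "asit:geographicPartition", "asit:note"},
    "Province": {"gn:officialName", "gn:shortName", "asit:note"},
    "Dialect": {"asit:dialectName"},
    "Document": {"dcterms:title", "dcterms:date"},
    "Sentence": {"asit:sentence", "asit:transcription", "asit:note"},
    "Word": {"asit:wordText", "asit:transcription"},
    "Tag": {"asit:tagDescription", "asit:mandatory"},
}


def unknown_properties(class_name: str, record: dict[str, str]) -> list[str]:
    """Return, in sorted order, the properties of `record` not listed for `class_name`."""
    allowed = DATATYPE_PROPERTIES[class_name]
    return sorted(set(record) - allowed)


# Hypothetical record with placeholder values, not taken from the dataset.
example_sentence = {
    "asit:sentence": "<placeholder sentence text>",
    "asit:note": "<placeholder note>",
    "asit:wordText": "<placeholder>",  # listed for Word, not for Sentence
}

print(unknown_properties("Sentence", example_sentence))  # ['asit:wordText']
```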
We exploited this RDF/S to expose the linguistic data in the ASIt curated database as a Linked Dataset whose details are reported in the table below.
Version Number | 1.02 |
Version Date | 2012-08-03 |
Availability | Public |
Serialization | RDF/XML |
Triples | 421,948 |
Size | 38.3 MB |
The ASIt curated database is synchronized with the ASIt Linguistic Linked Dataset, where every entity in the database corresponds to a class in the linked dataset; therefore, the dataset is maintained following the same policies adopted for the database, ensuring the quality of the data exposed. As a consequence, the size of the ASIt Linguistic Linked Dataset grows proportionally to the size of the data in the curated database: the number of entries associated with a database entity is related to the number of instances of the RDF class we mapped from it. Since the research activities on the ASIt linguistic database are still ongoing, the number of documents and sentences is increasing over time, as is the number of tags the linguistic researchers associate with them.
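Since the dataset is serialized as RDF/XML, the number of instances per class can be estimated by counting typed nodes in the file. The sketch below does this with the Python standard library on a small embedded sample. The sample document, its example.org IRIs, and the assumption that the classes live in the asit namespace (e.g. asit:Sentence) are illustrative only; the function handles just top-level nodes, written either as typed elements or as rdf:Description with rdf:type, and is not a full RDF/XML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample; resource IRIs and literal values are placeholders.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:asit="http://purl.org/asit/terms/">
  <asit:Sentence rdf:about="http://example.org/sentence/1">
    <asit:sentence>placeholder text 1</asit:sentence>
  </asit:Sentence>
  <asit:Sentence rdf:about="http://example.org/sentence/2">
    <asit:sentence>placeholder text 2</asit:sentence>
  </asit:Sentence>
  <rdf:Description rdf:about="http://example.org/tag/1">
    <rdf:type rdf:resource="http://purl.org/asit/terms/Tag"/>
    <asit:tagDescription>placeholder description</asit:tagDescription>
  </rdf:Description>
</rdf:RDF>
"""


def count_instances(rdf_xml: str) -> Counter:
    """Count top-level resources per class IRI in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    counts: Counter = Counter()
    for node in root:
        if node.tag == f"{{{RDF_NS}}}Description":
            # Untyped node element: the class is given by explicit rdf:type children.
            for rdf_type in node.findall(f"{{{RDF_NS}}}type"):
                counts[rdf_type.get(f"{{{RDF_NS}}}resource")] += 1
        else:
            # Typed node element: ElementTree tag "{namespace}Local" maps to namespace + Local.
            namespace, _, local_name = node.tag[1:].partition("}")
            counts[namespace + local_name] += 1
    return counts


counts = count_instances(SAMPLE)
for class_iri, total in sorted(counts.items()):
    print(class_iri, total)
```

Running the same count on successive versions of the dataset would show how the number of instances per class follows the growth of the curated database.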
This work has been supported by the Project FIRB "Un'inchiesta grammaticale sui dialetti italiani: ricerca sul campo, gestione dei dati, analisi linguistica" (Bando FIRB - Futuro in ricerca 2008).
Website design and management:
Information Management Systems Research Group