The OpenAmplify Web Service is the first and only of its kind. It exposes, via an open API, 250 man-years of development effort in a web service based upon more than a dozen granted patents. OpenAmplify simply does a better job of surfacing the meaning of web content, at massive scale and speed. Here's an overview of the thinking and technology that make it possible.
TUT is a project for the development of a collection of morphologically, syntactically and semantically annotated Italian sentences; it includes:
the definition of a native representation format (i.e. TUT format), which is dependency-oriented and aims at capturing the richness of the predicate-argument structure, i.e. a crucial layer of representation for several NLP tasks, such as parsing, Information Extraction, Machine Translation and Question Answering.
the conversion to Penn Treebank format, to other constituency-based formats and to a format based on the Combinatory Categorial Grammar (see the TUTtoPENN converter web page and the CCG-TUT web page), which increases the possibilities of comparison/evaluation and portability of the resource.
Penn Treebank II tag set
Pattern and MBSP assign meaningful tags to words and groups of words in a sentence. Each tag is a short code (such as "DT" for "determiner").
Penn Treebank II Tags
Note: This information comes from "Bracketing Guidelines for Treebank II Style Penn Treebank Project" - part of the documentation that comes with the Penn Treebank.
Contents:
Bracket Labels
Clause Level
Phrase Level
Word Level
Function Tags
Form/function discrepancies
Grammatical role
Adverbials
Miscellaneous
Index of All Tags
The tagset used in tagging the demo corpus available here is the Penn Treebank Tag set, described for example in Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank, in Computational Linguistics, Volume 19, Number 2 (June 1993), pp. 313--330 (Special Issue on Using Large Corpora). The tagging was done at UPenn. The following part-of-speech tags are used in the corpus:
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
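A lookup table makes the tag list above easy to use in code. The following is a minimal Python sketch, not part of any of the tools mentioned here: the names `PENN_TAGS`, `parse_tagged` and `describe` are hypothetical, and the `word/TAG` input format is an assumption (a common convention for tagged text, but not one stated in the source).

```python
# The 36 tags exactly as listed above (this corpus uses NP, NPS, PP, PP$).
PENN_TAGS = {
    "CC": "Coordinating conjunction", "CD": "Cardinal number",
    "DT": "Determiner", "EX": "Existential there", "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective", "JJR": "Adjective, comparative",
    "JJS": "Adjective, superlative", "LS": "List item marker",
    "MD": "Modal", "NN": "Noun, singular or mass", "NNS": "Noun, plural",
    "NP": "Proper noun, singular", "NPS": "Proper noun, plural",
    "PDT": "Predeterminer", "POS": "Possessive ending",
    "PP": "Personal pronoun", "PP$": "Possessive pronoun",
    "RB": "Adverb", "RBR": "Adverb, comparative",
    "RBS": "Adverb, superlative", "RP": "Particle", "SYM": "Symbol",
    "TO": "to", "UH": "Interjection", "VB": "Verb, base form",
    "VBD": "Verb, past tense", "VBG": "Verb, gerund or present participle",
    "VBN": "Verb, past participle",
    "VBP": "Verb, non-3rd person singular present",
    "VBZ": "Verb, 3rd person singular present",
    "WDT": "Wh-determiner", "WP": "Wh-pronoun",
    "WP$": "Possessive wh-pronoun", "WRB": "Wh-adverb",
}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    Assumes the tag follows the last slash of each whitespace-separated token.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def describe(pairs):
    """Attach the human-readable description to each (word, tag) pair."""
    return [(w, t, PENN_TAGS.get(t, "unknown tag")) for w, t in pairs]


if __name__ == "__main__":
    for row in describe(parse_tagged("The/DT children/NNS play/VBP")):
        print(row)
```

Tags that are not in the table are reported as "unknown tag" rather than raising an error, which is convenient when a tagger emits punctuation tags or a variant of the tag set.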
Just as a noun functions as the Head of a noun phrase, a verb functions as the Head of a verb phrase, and an adjective functions as the Head of an adjective phrase, and so on. We recognise five phrase types in all:
Phrase Type            Head          Example
Noun Phrase            Noun          [the children in class 5]
Verb Phrase            Verb          [play the piano]
Adjective Phrase       Adjective     [delighted to meet you]
Adverb Phrase          Adverb        [very quickly]
Prepositional Phrase   Preposition   [in the garden]
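The idea that the Head determines the phrase type can be sketched in a few lines of Python. This is an illustration only: the mapping from Penn Treebank tag prefixes to the five phrase types is an assumption made here (for example, IN also covers subordinating conjunctions), and the names `HEAD_PREFIX_TO_PHRASE` and `phrase_type` are hypothetical.

```python
# Assumed mapping from part-of-speech tag prefixes to the five phrase types.
# Prefixes are checked in order; the first match wins.
HEAD_PREFIX_TO_PHRASE = [
    ("NN", "Noun Phrase"),           # NN, NNS
    ("NP", "Noun Phrase"),           # NP, NPS (proper nouns)
    ("VB", "Verb Phrase"),           # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "Adjective Phrase"),      # JJ, JJR, JJS
    ("RB", "Adverb Phrase"),         # RB, RBR, RBS
    ("IN", "Prepositional Phrase"),  # simplification: IN is also used for
                                     # subordinating conjunctions
]


def phrase_type(head_tag):
    """Return the phrase type projected by a head with the given tag,
    or None if the tag does not head one of the five phrase types."""
    for prefix, phrase in HEAD_PREFIX_TO_PHRASE:
        if head_tag.startswith(prefix):
            return phrase
    return None


if __name__ == "__main__":
    # Heads of the example phrases, with hand-assigned (assumed) tags.
    heads = [("children", "NNS"), ("play", "VB"), ("delighted", "JJ"),
             ("quickly", "RB"), ("in", "IN")]
    for word, tag in heads:
        print(word, tag, phrase_type(tag))
```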
A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.
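To make "knowledge of language gained from hand-parsed sentences" concrete, here is a minimal Python sketch. It is not the parser described above: it only estimates rule probabilities by relative frequency from a toy treebank and uses them to choose between two candidate analyses of the same tag sequence. The treebank is invented for illustration, trees are nested tuples whose leaves are part-of-speech tags, and all function names are hypothetical.

```python
from collections import Counter


def rules(tree):
    """Yield (parent, child-labels) for every internal node of a tree.

    A tree is a tuple (label, child, ...); a child is either another
    tuple or a string (a part-of-speech tag at the leaves).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate(treebank):
    """Relative-frequency estimate of P(children | parent)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    parent_counts = Counter()
    for (parent, _), n in rule_counts.items():
        parent_counts[parent] += n
    return {r: n / parent_counts[r[0]] for r, n in rule_counts.items()}


def score(tree, probs):
    """Probability of a tree: the product of its rule probabilities.
    Rules never seen in the treebank get probability 0."""
    p = 1.0
    for r in rules(tree):
        p *= probs.get(r, 0.0)
    return p


NP = ("NP", "DT", "NN")
PP = ("PP", "IN", NP)
simple = ("S", NP, ("VP", "VBD", NP))
noun_attach = ("S", NP, ("VP", "VBD", ("NP", NP, PP)))   # PP modifies the noun
verb_attach = ("S", NP, ("VP", "VBD", NP, PP))           # PP modifies the verb

# Toy "hand-parsed" data (invented): verb attachment is more frequent.
treebank = [simple, noun_attach, verb_attach, verb_attach]
probs = estimate(treebank)

# Both candidates cover the tag sequence DT NN VBD DT NN IN DT NN.
best = max([noun_attach, verb_attach], key=lambda t: score(t, probs))

if __name__ == "__main__":
    print("noun attachment:", score(noun_attach, probs))
    print("verb attachment:", score(verb_attach, probs))
```

A real statistical parser searches over all analyses licensed by its grammar rather than a fixed pair of candidates, but the ranking principle is the same.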
open source software capable of solving almost any text processing problem
Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries, professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries and other repositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (here called content tags or content labels) to organize their data and provide thematic access to it.
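Keyphrase indexing against a controlled vocabulary can be sketched very simply in Python. The approach below (count how often each vocabulary phrase occurs in the text and keep the most frequent) is a naive baseline chosen for illustration, not the method of any particular tool; the function name, the sample text and the vocabulary are all hypothetical.

```python
import re
from collections import Counter


def _normalize(text):
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def index_keyphrases(text, vocabulary, top_n=3):
    """Return up to top_n vocabulary phrases, most frequent in text first."""
    normalized = _normalize(text)
    counts = Counter()
    for phrase in vocabulary:
        # Match the phrase as whole words only.
        pattern = r"\b" + re.escape(_normalize(phrase)) + r"\b"
        hits = len(re.findall(pattern, normalized))
        if hits:
            counts[phrase] = hits
    return [phrase for phrase, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    doc = ("Statistical parsing and machine translation both rely on "
           "treebanks. Machine translation systems also use parsing. "
           "Parsing is hard, parsing is slow.")
    vocabulary = ["machine translation", "parsing",
                  "question answering", "treebanks"]
    print(index_keyphrases(doc, vocabulary))
```

Restricting candidates to the vocabulary keeps the assigned keyphrases consistent across documents, which is the point of a controlled vocabulary; free keyphrase extraction would instead have to generate candidate phrases from the text itself.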