This report describes the tools developed for multilingual text processing of social media. It details the linguistic approaches used, the scope of the tools, and the results of some performance evaluations. WP3 focuses on developing methods to detect relevant and informative posts, as well as content with clear validity, so that these can be dealt with efficiently without uninformative or irrelevant posts obscuring the important facts. To achieve these objectives, low-level linguistic processing components are first needed to generate the lexical, syntactic and semantic features of the text on which the informativeness and trustworthiness components, to be developed later in the project, depend. These tools are also required by the components developed in WP4 for the detection of emergency events, modelling and matchmaking. They take as input social media posts and other kinds of text-based messages, and produce as output additional information about the language of the message, the named entities it contains, and its syntactic and semantic properties.
In this report, we first describe the suite of tools we have developed for Information Extraction from social media in English, French and German. While English is the main language of the messages handled by the tools in this project, it is useful to be able to recognise and process messages in other languages as well. French and German are therefore used as examples to show the adaptability of our tools to other languages, and our multilingual components thus serve as a testbed for the new language adaptation techniques with which we have experimented during the project.
Various aspects of these tools are evaluated for accuracy. Second, we describe the tools we have developed for entity disambiguation and linking from social media, again for English, French and German. These ensure not only that we extract relevant instances of locations and of names of people and organisations, but also that we know which particular entity each instance refers to, since the same name may refer to different things. Linking to a semantic knowledge base provides both this disambiguation and additional knowledge about the entity (for example, the coordinates of a location). The tools are evaluated for accuracy as well as speed, since these techniques have traditionally been extremely slow and cumbersome to use in real-world scenarios. The tools are all made available as GATE Cloud services.
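To make the idea of disambiguation and linking concrete, the following is a minimal illustrative sketch in Python. It is not the method implemented in the tools described in this report: the knowledge base, its identifiers (`kb:Paris_France`, `kb:Paris_Texas`), the `Candidate` type and the `link_mention` function are all hypothetical, and the scoring is a simple word-overlap heuristic chosen only for brevity. It shows how an ambiguous location name can be resolved to one knowledge base entry, which then supplies extra information such as approximate coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical knowledge base entry that a name could refer to."""
    kb_id: str                 # illustrative identifier, not a real KB URI
    context_words: frozenset   # words that typically co-occur with this entity
    lat: float                 # approximate latitude
    lon: float                 # approximate longitude


# Toy knowledge base (assumption): maps a lower-cased name to its candidates.
TOY_KB = {
    "paris": [
        Candidate("kb:Paris_France",
                  frozenset({"france", "seine", "eiffel", "louvre"}),
                  48.8566, 2.3522),
        Candidate("kb:Paris_Texas",
                  frozenset({"texas", "usa", "lamar", "county"}),
                  33.6609, -95.5555),
    ],
}


def link_mention(mention, message):
    """Return the candidate whose context words overlap most with the message.

    Returns None if the mention is not in the toy knowledge base. Ties are
    resolved in favour of the candidate listed first.
    """
    candidates = TOY_KB.get(mention.lower(), [])
    if not candidates:
        return None
    # Crude tokenisation: split on whitespace and strip common punctuation.
    tokens = {t.strip(".,!?#@").lower() for t in message.split()}
    return max(candidates, key=lambda c: len(c.context_words & tokens))


best = link_mention(
    "Paris", "Flooding reported in Paris, Texas near the county courthouse"
)
print(best.kb_id, best.lat, best.lon)
```

In this sketch the words "Texas" and "county" in the message select the Texas entry over the French capital, and the linked entry carries the coordinates. A real system would draw candidates from a full knowledge base and use far richer context and ranking features.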