Friday, 12 August 2005

Concept-based search




IBM plans to offer its concept- and fact-based search technology to the open source community. According to the American giant, applying this technology could revolutionize how information is searched, especially in companies and organizations that handle large volumes of data every day.
Unlike the word-based search systems currently used by the most popular search engines, IBM's Unstructured Information Management Architecture (UIMA) technology can discover meanings, relationships, and facts by analyzing a wide variety of file types (text documents, images, e-mails, audio, video, among others). In other words, UIMA can find what is being sought, even when it is only implicit, and in unstructured data stores. For example, an insurance company could search its database for “diseases” and get a list of all clients who suffer from a disease, regardless of its type.
Fifteen companies that develop search and document management software have already announced that they will use UIMA as the basis for their applications.
---
What is the Unstructured Information Management Architecture (UIMA) SDK?
--
Unstructured information management (UIM) applications are software systems that analyze unstructured information (text, audio, video, images, etc.) to discover, organize, and deliver relevant knowledge to the user. In analyzing unstructured information, UIM applications make use of a variety of analysis technologies, including statistical and rule-based Natural Language Processing (NLP), Information Retrieval (IR), machine learning, and ontologies. IBM's UIMA is an architectural and software framework that supports creation, discovery, composition, and deployment of a broad range of analysis capabilities and the linking of them to structured information services, such as databases or search engines. The UIMA framework provides a run-time environment in which developers can plug in and run their UIMA component implementations, along with other independently-developed components, and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.

This technology, the UIMA SDK (Software Development Kit), is an all-Java™ implementation of the UIMA framework, and it supports the implementation, description, composition, and deployment of UIMA components and applications. It also supports the developer with an Eclipse-based development environment that includes a set of tools and utilities for using UIMA.
One large, but not the only, application area of text analysis is improving text search. By detecting important terms and topics within documents, semantic search engines provide the capability to search for concepts and relationships instead of keywords. IBM's enterprise search solution, WebSphere Information Integrator OmniFind Edition, has such semantic search capabilities. It allows UIMA annotators to be plugged into the OmniFind processing flow, enabling semantic search to be performed on the extracted concepts. Since UIMA is used and developed both by IBM research and development teams, there are two locations of the UIMA SDK:

  • The UIMA SDK on alphaWorks is the "early adopter" version of the SDK. It is intended for users who don't use OmniFind, or who want to use features of UIMA that may not be supported by OmniFind. The alphaWorks SDK is also a test bed to gather feedback on new features of the UIMA SDK. Its versions may evolve more rapidly, and are not tied to specific OmniFind releases. The SDK is supported on a "best can do" basis, via the alphaWorks forum.
  • The UIMA SDK on developerWorks is the "OmniFind-compatible" version of the SDK. It is intended for users who want to develop and deploy semantic search solutions with OmniFind or solutions that take advantage of OmniFind's capabilities for enterprise-scale document crawling and extraction. The developerWorks SDK is tested for compatibility with a specific OmniFind version and will be updated to keep it in sync with new OmniFind releases. As the SDK evolves, prior versions will still be available on developerWorks, to ensure that each supported OmniFind version has a corresponding SDK. For customers who have an OmniFind license, this SDK is supported via the IBM support channels and also via the developerWorks forum. Important note: The SDK is currently not available through developerWorks, but will be available soon. In the meantime, please use the alphaWorks SDK. It is the same version as what will be available through developerWorks soon. Hence, there will be no migration or compatibility issues if you start now to develop OmniFind solutions with the alphaWorks SDK.

How does it work?

UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work to analyze documents and record analysis results (for example, detecting person names). These algorithms are packaged within components that are called Annotators. AEs are the stackable containers for annotators and other analysis engines.
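
As a rough illustration of the composition described above, the following plain-Java sketch models analysis engines as stackable containers. It does not use the UIMA SDK API: every name in it (SketchResults, SketchEngine, PersonNameEngine, YearEngine, AggregateEngine, AnalysisEngineSketch) is hypothetical, and the two detection algorithms are deliberately naive stand-ins for real annotators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the shared analysis results (hypothetical, not a UIMA class).
final class SketchResults {
    final String documentText;
    final List<String> findings = new ArrayList<>();

    SketchResults(String documentText) {
        this.documentText = documentText;
    }
}

// Hypothetical interface: anything that can analyze a document and record results.
interface SketchEngine {
    void process(SketchResults results);
}

// Wraps one annotator-style algorithm: a naive person-name detector keyed on titles.
final class PersonNameEngine implements SketchEngine {
    private static final List<String> TITLES = Arrays.asList("Mr.", "Ms.", "Dr.");

    @Override
    public void process(SketchResults results) {
        String[] tokens = results.documentText.split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (TITLES.contains(tokens[i])) {
                // Strip trailing punctuation from the token that follows the title.
                results.findings.add("Person:" + tokens[i + 1].replaceAll("\\W+$", ""));
            }
        }
    }
}

// Wraps a second algorithm: records every four-digit token as a year.
final class YearEngine implements SketchEngine {
    @Override
    public void process(SketchResults results) {
        for (String token : results.documentText.split("\\s+")) {
            String cleaned = token.replaceAll("\\W+$", "");
            if (cleaned.matches("\\d{4}")) {
                results.findings.add("Year:" + cleaned);
            }
        }
    }
}

// An aggregate is itself an engine, so aggregates can be nested inside other aggregates.
final class AggregateEngine implements SketchEngine {
    private final List<SketchEngine> delegates;

    AggregateEngine(SketchEngine... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public void process(SketchResults results) {
        for (SketchEngine delegate : delegates) {
            delegate.process(results);
        }
    }
}

final class AnalysisEngineSketch {
    // Builds a nested pipeline and returns everything the engines recorded.
    static List<String> analyze(String text) {
        SketchEngine names = new AggregateEngine(new PersonNameEngine());
        SketchEngine pipeline = new AggregateEngine(names, new YearEngine());
        SketchResults results = new SketchResults(text);
        pipeline.process(results);
        return results.findings;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Dr. Silva met Ms. Costa in 2005."));
    }
}
```

Because the aggregate implements the same interface as the engines it contains, a caller cannot tell a single engine from a whole pipeline, which is the property that makes the containers stackable in this sketch.
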
How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values. Object types may be related to each other in a single-inheritance hierarchy. Annotators are given a CAS having the subject of analysis (the document), in addition to any previously created objects (from annotators earlier in the pipeline), and they add their own objects to the CAS. The CAS serves as a common data object, shared among the annotators that are assembled for an application.
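
To make the idea of a shared, typed container more concrete, here is a minimal plain-Java model of it. This is a sketch under stated assumptions, not the CAS interface shipped in the UIMA SDK: ToyType, ToyObject, ToyCas, and ToyCasDemo are hypothetical names, and the "annotators" simply look for fixed words. It shows typed objects with properties, a single-inheritance type hierarchy, and a later annotator reading objects created by an earlier one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A type with at most one parent, giving a single-inheritance hierarchy.
final class ToyType {
    final String name;
    final ToyType parent; // null for the root type

    ToyType(String name, ToyType parent) {
        this.name = name;
        this.parent = parent;
    }

    boolean isSubtypeOf(ToyType other) {
        for (ToyType t = this; t != null; t = t.parent) {
            if (t == other) {
                return true;
            }
        }
        return false;
    }
}

// A typed object carrying named properties and their values.
final class ToyObject {
    final ToyType type;
    final Map<String, Object> properties = new LinkedHashMap<>();

    ToyObject(ToyType type) {
        this.type = type;
    }

    ToyObject set(String property, Object value) {
        properties.put(property, value);
        return this;
    }
}

// Toy container: the subject of analysis plus every object added so far.
final class ToyCas {
    final String documentText;
    private final List<ToyObject> objects = new ArrayList<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    void add(ToyObject object) {
        objects.add(object);
    }

    // Returns a copy of all objects whose type is the given type or one of its subtypes.
    List<ToyObject> select(ToyType type) {
        List<ToyObject> matches = new ArrayList<>();
        for (ToyObject object : objects) {
            if (object.type.isSubtypeOf(type)) {
                matches.add(object);
            }
        }
        return matches;
    }
}

final class ToyCasDemo {
    static final ToyType TOP = new ToyType("Top", null);
    static final ToyType ENTITY = new ToyType("Entity", TOP);
    static final ToyType PERSON = new ToyType("Person", ENTITY);
    static final ToyType DISEASE = new ToyType("Disease", ENTITY);
    static final ToyType DIAGNOSIS = new ToyType("Diagnosis", TOP);

    // Adds an object of the given type covering the first occurrence of word, if any.
    static void mark(ToyCas cas, ToyType type, String word) {
        int begin = cas.documentText.indexOf(word);
        if (begin >= 0) {
            cas.add(new ToyObject(type).set("begin", begin)
                    .set("end", begin + word.length()).set("text", word));
        }
    }

    // First annotator: adds entity objects for two hard-coded words.
    static void entityAnnotator(ToyCas cas) {
        mark(cas, PERSON, "Costa");
        mark(cas, DISEASE, "asthma");
    }

    // Second annotator: reads the earlier objects and links them in a new object.
    static void diagnosisAnnotator(ToyCas cas) {
        for (ToyObject person : cas.select(PERSON)) {
            for (ToyObject disease : cas.select(DISEASE)) {
                cas.add(new ToyObject(DIAGNOSIS).set("patient", person).set("disease", disease));
            }
        }
    }

    static ToyCas run() {
        ToyCas cas = new ToyCas("Ms. Costa was treated for asthma.");
        entityAnnotator(cas);
        diagnosisAnnotator(cas);
        return cas;
    }
}
```

Selecting by the parent type ENTITY returns both the person and the disease objects, which hints at why a type hierarchy helps with queries such as "any disease, regardless of its type".
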
Many UIM applications analyze entire collections of documents. UIMA supports this analysis through its Collection Processing Architecture. This part of the architecture allows specification of a "source-to-sink" flow from a collection reader through a set of analysis engines and then to a set of CAS Consumers. The collection reader's job is to connect to and iterate through a source collection, acquiring documents and initializing CASes for analysis. After the analysis engines have added their information to the CAS, CAS consumers do the final CAS processing, for example, sending the CAS contents to a search engine or extracting elements of interest and populating a relational database. A Semantic Search engine is included in the UIMA SDK; it will allow the developer to experiment with indexing analysis results, which will enable semantic searches using the annotations in the CAS.
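
The "source-to-sink" flow can be pictured with a small plain-Java sketch. It is an illustration only and does not use the UIMA SDK or its Semantic Search engine: FlowCas, FlowReader, FlowEngine, FlowConsumer, InMemoryReader, DiseaseEngine, ConceptIndexConsumer, and FlowSketch are hypothetical names, the documents are made up, and the disease detector is a two-word lookup. The consumer builds a tiny concept index so that a query for the concept "Disease" finds documents mentioning any known disease.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy per-document container: the text plus the concepts found in it.
final class FlowCas {
    final String docId;
    final String text;
    final List<String> concepts = new ArrayList<>();

    FlowCas(String docId, String text) {
        this.docId = docId;
        this.text = text;
    }
}

interface FlowReader { boolean hasNext(); FlowCas next(); }
interface FlowEngine { void process(FlowCas cas); }
interface FlowConsumer { void consume(FlowCas cas); }

// Source: iterates over an in-memory collection and initializes one container per document.
final class InMemoryReader implements FlowReader {
    private final Iterator<Map.Entry<String, String>> entries;

    InMemoryReader(Map<String, String> documents) {
        this.entries = documents.entrySet().iterator();
    }

    @Override
    public boolean hasNext() { return entries.hasNext(); }

    @Override
    public FlowCas next() {
        Map.Entry<String, String> entry = entries.next();
        return new FlowCas(entry.getKey(), entry.getValue());
    }
}

// Analysis step: tags a document with the concept "Disease" if it names a known disease.
final class DiseaseEngine implements FlowEngine {
    private static final List<String> KNOWN = Arrays.asList("asthma", "diabetes");

    @Override
    public void process(FlowCas cas) {
        String lower = cas.text.toLowerCase(Locale.ROOT);
        for (String disease : KNOWN) {
            if (lower.contains(disease)) {
                cas.concepts.add("Disease");
                return; // one tag per document is enough for this sketch
            }
        }
    }
}

// Sink: indexes document ids by concept so they can be searched by concept later.
final class ConceptIndexConsumer implements FlowConsumer {
    private final Map<String, List<String>> index = new TreeMap<>();

    @Override
    public void consume(FlowCas cas) {
        for (String concept : cas.concepts) {
            index.computeIfAbsent(concept, key -> new ArrayList<>()).add(cas.docId);
        }
    }

    List<String> lookup(String concept) {
        return index.getOrDefault(concept, Collections.<String>emptyList());
    }
}

final class FlowSketch {
    // Drives every document from the reader through the engines and then the consumers.
    static void run(FlowReader reader, List<FlowEngine> engines, List<FlowConsumer> consumers) {
        while (reader.hasNext()) {
            FlowCas cas = reader.next();
            for (FlowEngine engine : engines) {
                engine.process(cas);
            }
            for (FlowConsumer consumer : consumers) {
                consumer.consume(cas);
            }
        }
    }

    static List<String> demo() {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("client-1", "Diagnosed with asthma in 2003.");
        documents.put("client-2", "No medical history on file.");
        documents.put("client-3", "Type 2 diabetes, managed with diet.");
        ConceptIndexConsumer index = new ConceptIndexConsumer();
        run(new InMemoryReader(documents),
                Arrays.<FlowEngine>asList(new DiseaseEngine()),
                Arrays.<FlowConsumer>asList(index));
        return index.lookup("Disease");
    }
}
```

In this sketch, FlowSketch.demo() returns the ids client-1 and client-3: neither document contains the word "disease", yet both are found by a search on the concept.
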

What's New in UIMA release 1.1
  • This version is incorporated into IBM's enterprise search solution, WebSphere Information Integrator OmniFind Edition, allowing search to be augmented using UIMA analytics.
  • Support for multiple Subjects of Analysis has been added - this is documented in a new chapter in the UIMA SDK User's Guide and Reference.
  • The Component Descriptor Editor has been greatly enhanced for Eclipse 3, allowing you to edit most UIMA descriptor files (the main exception being the Collection Processing Engine descriptors).
  • A new GUI-based tool allows interactive semantic search querying.
  • The Collection Processing Manager has been enhanced, and the SDK now includes examples and documentation describing how to use it.
Further information is available in the following PDF documents:

--
More information at:
