LM/PRU

The LiverMemories/Portale della Ricerca Umanistica (LM/PRU) pipeline is a multilingual-aware web service for processing archaeology texts. Currently, the input consists of documents in Portable Document Format (PDF) with machine-encoded text, for example text produced by optical character recognition (OCR) or written directly by the source application, and the output consists of tabular, multi-column files.

The project’s source code is available from https://bitbucket.org/clic/lm_pru, and its documentation from http://clic.bitbucket.org/lm_pru/. A RESTful interface is located at http://clic.cimec.unitn.it/egon/pru/documents.cgi, and an example use-case of the RESTful interface (upload file, run the pipeline, make results available for download) is available at http://clic.cimec.unitn.it/egon/pru/upload.html.

Pipeline

# THE pipeline
# handles all processing stages from PDF (with text content, i.e. already OCR-ed)
# to TextPro output, the latter enriched with some additional information.

The pipeline processes scientific articles from the humanities in PDF{32000-1.iso.org} format.

First, it converts the encoded text into plain text; then it splits the text into a header, a body, and a bibliography. The header contains meta-information about the document, such as the author, abstract, and keywords; the body contains the article’s main content; and the bibliography contains the works referenced by the article.

The header information and the bibliography are reformatted and made available for further processing outside the pipeline; the body is processed further within the pipeline. To this end, the main language of the body text is detected first, then its document structure is extracted, and finally the human language technology (HLT) processing steps of tokenisation, part-of-speech tagging, chunk parsing, and named-entity recognition are carried out.

PDF to TXT

The task of extracting text from PDF files is notoriously difficult because, in an actual PDF file, a run of text may be split into several chunks at arbitrary points. Text extraction therefore needs to reassemble these chunks: single characters and groups of characters have to be merged back into words.

Here, the extraction of running text from PDF files is done by PDFMiner{pdfminer}, a Python PDF parser and analyzer. Because of the difficulty of the task, we chose an application that provides fine-grained control over the reassembly process: PDFMiner has configurable options for the distance between characters, words, and lines. Even though the default settings achieve fair results, the possibility of tuning the process, e.g. by choosing values that minimise the number of unknown words in the output, is a valuable advantage.

The output of this processing step is plain text, ideally with the text arranged in the intended reading order, e.g. a left column before a right one.
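The tuning idea can be sketched as follows. The sketch assumes the pdfminer.six fork of PDFMiner, whose extract_text function and LAParams class expose the character, word, and line margins. The candidate values, the unknown_ratio scoring function, and the vocabulary are illustrative assumptions, not the settings used in the pipeline.

```python
import re


def unknown_ratio(text, vocabulary):
    """Share of word tokens in `text` that are not in `vocabulary`."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 1.0
    unknown = sum(1 for w in words if w not in vocabulary)
    return unknown / len(words)


def pick_best_margins(candidates, extract, vocabulary):
    """Return the candidate whose extracted text has the fewest unknown words.

    `candidates` is a list of dicts with LAParams keyword arguments;
    `extract` is any callable mapping such a dict to extracted text.
    """
    return min(candidates, key=lambda c: unknown_ratio(extract(c), vocabulary))


def extract_with_pdfminer(pdf_path, margins):
    """Extract text with pdfminer.six (assumed to be installed)."""
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(pdf_path, laparams=LAParams(**margins))


# Hypothetical candidate settings; the second set is illustrative only.
CANDIDATES = [
    {"char_margin": 2.0, "word_margin": 0.1, "line_margin": 0.5},  # pdfminer.six defaults
    {"char_margin": 1.0, "word_margin": 0.2, "line_margin": 0.3},
]
```

With a word list for the document language, pick_best_margins(CANDIDATES, lambda m: extract_with_pdfminer("article.pdf", m), vocabulary) would select one of the candidate settings; the file name is a placeholder.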

Post-Processing the Text Output

The plain text output is post-processed to remove several artefacts: (a) hyphenated words are recombined, and the running text of each paragraph is rearranged into a single line; (b) page headers and footers, e.g. authors, article titles, and page numbers, are removed.

For example, the following text from the end of one page and the beginning of the next

368-369). Anche i ritrovamenti nei territori della Peni- sola italiana sono in generale rari. Le scoperte nell’Eu- ropa mediterranea occidentale sembrano quindi poco numerose e mal documentate.

9 Borrello Micheli.p65

72

01/06/2006, 15.15

^L73

Fig. 2. Carta di distribuzione dei rinvenimenti europei di ornamenti preistorici in Spond s sp. (da MÜLLER, 1997).

  1. PROVENIENZA DELLA MATERIA PRIMA.

I rinvenimenti italiani, quelli della costa dalmata e della Grecia settentrionale non pongono problemi di determinazione delle aree di approvvigionamento, per- ché si collocano in corrispondenza delle zone costiere

will become

(TABORIN, 1974: 368-369). Anche i ritrovamenti nei territori della Penisola italiana sono in generale rari. Le scoperte nell’Europa mediterranea occidentale sembrano quindi poco numerose e mal documentate. Fig. 2. Carta di distribuzione dei rinvenimenti europei di ornamenti preistorici in Spondylus sp. (da MÜLLER, 1997).

  1. PROVENIENZA DELLA MATERIA PRIMA.

I rinvenimenti italiani, quelli della costa dalmata e della Grecia settentrionale non pongono problemi di determinazione delle aree di approvvigionamento, perché si collocano in corrispondenza delle zone costiere o
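The recombination of hyphenated words and the joining of paragraph lines can be sketched in Python as below. This is a minimal illustration, not the pipeline's implementation: the function names and the regular expression are assumptions, and the removal of page headers and footers is not covered.

```python
import re

# A hyphen followed by whitespace between two letters, e.g. "Peni- sola".
HYPHEN_BREAK = re.compile(r"(?<=[^\W\d_])-\s+(?=[^\W\d_])")


def paragraph_to_line(paragraph):
    """Collapse the line breaks inside one paragraph into single spaces."""
    return " ".join(paragraph.split())


def join_hyphenated(text):
    """Recombine words that were split by a hyphen at a line end."""
    return HYPHEN_BREAK.sub("", text)


def postprocess(paragraph):
    """Put a paragraph on one line, then rejoin its hyphenated words."""
    return join_hyphenated(paragraph_to_line(paragraph))


print(postprocess("nei territori della Peni-\nsola italiana"))
```

A rule this naive also joins genuine compounds that happen to break at a line end, so a real implementation would probably need a dictionary check. Page ranges such as 368-369 are left untouched because the pattern requires letters on both sides and whitespace after the hyphen.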

Meta-Information and Bibliography Extraction

For extracting the meta-information and for splitting the article we use ParsHead and ParsCit, respectively, both modules within the ParsCit project{CouncillGilesKan2008}. In addition, we added rules for recognising Italian abstracts, keywords, and bibliographies.

Language Identification

For identifying the language of given parts we use the *chromium-compact-language-detector*{chromium-cld}, a C++ library with Python bindings for language identification of UTF-8 text, extracted from the Chromium browser. The language identification is carried out on the whole body of the article, and on each paragraph individually. In case of inconclusive results on individual paragraphs, the language of the text body is assumed.
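The fallback rule for inconclusive paragraphs can be sketched as follows. The detector is passed in as a callable returning a language code and a reliability flag; this interface and the toy detector are assumptions made for illustration and do not reproduce the API of the chromium-compact-language-detector bindings.

```python
def assign_languages(paragraphs, detect):
    """Label each paragraph with a language code.

    `detect` is any callable returning (language_code, is_reliable) for a
    text; it stands in for the real detector and is an assumption here.
    """
    body_lang, _ = detect("\n".join(paragraphs))
    labels = []
    for paragraph in paragraphs:
        lang, reliable = detect(paragraph)
        # Inconclusive paragraphs inherit the language of the whole body.
        labels.append(lang if reliable else body_lang)
    return labels


def toy_detect(text):
    """Toy stand-in detector: counts a few Italian and English function words."""
    words = text.lower().split()
    it = sum(w in {"il", "la", "di", "e", "sono", "della"} for w in words)
    en = sum(w in {"the", "of", "and", "is", "are"} for w in words)
    if it == en:
        return ("un", False)  # inconclusive
    return ("it", True) if it > en else ("en", True)


paragraphs = [
    "Le scoperte della costa sono rare e mal documentate",
    "Spondylus sp.",
    "The finds are rare",
]
print(assign_languages(paragraphs, toy_detect))
```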

Document Structure Analysis

For extracting the document structure we use SectLabel, a module within the ParsCit project that detects the structure with the help of conditional random fields. We trained a new model on manually crafted training data for the domain of Italian articles from the humanities.

HLT Pipeline

For the HLT processing steps of tokenisation, part-of-speech tagging, chunk parsing, and named-entity recognition we use TextPro{textpro}, a suite of modular Natural Language Processing (NLP) tools for the analysis of Italian and English texts. The suite has been designed to integrate and reuse state-of-the-art NLP components developed by researchers at the Fondazione Bruno Kessler (FBK). In addition, we first refine the named-entity (NE) results with a gazetteer, and then apply a purpose-built, machine-learning-based NE detection framework trained on manually annotated data from our domain.

API

The LM/PRU pipeline is accessible via an Application Programming Interface (API). The interface uses a RESTful protocol on top of the Hypertext Transfer Protocol (HTTP) version 1.1; clients submit source documents via POST requests, access results via GET requests, and request deletion of resources via DELETE requests. Results are either Extensible Markup Language (XML) documents, or documents of the requested type.

Requests for running the pipeline on new source documents are handled independently, and a new resource is created per request. That is, only one document can be submitted per request, but multiple requests can be made in parallel, and the server can process multiple source documents in parallel.

A typical client-server interaction looks like this: upon successful POST of a new source document by the client, the server responds with a new resource URI. The client uses this URI to poll the status of the processing until successful completion is signalled. The client can then retrieve the final output in any one of the available formats.

Protocol

The REST architectural style implies that URLs are used to represent available resources on a server. This section gives a summary of the resources that are available from the LM/PRU interface, together with the HTTP methods that can be used on them.

Available Resources and Supported Methods

{<BASE_URL>} denotes the location at which the LM/PRU pipeline’s interface is exposed on a web server, e.g. http://clic.cimec.unitn.it/egon/pru. Likewise, {<ID>} denotes a specific resource returned by the server upon successful POST of a source document.

<BASE_URL>
    /documents            -- POST   : creation of new resource
        /<ID>             -- DELETE : deletion of resource
            /status       -- GET    : status of backend processing
            /log          -- GET    : log file of backend processing
            /pdf          -- GET    : previously uploaded PDF document
            /bib          -- GET    : extracted references in BibTeX format
            /pretxp       -- GET    : structure-annotated pre-TextPro file
            /txp          -- GET    : TextPro processed file
            /zip          -- GET    : packed directory content
            /
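The resource layout above can be mirrored by a small Python helper that builds the URIs. This is a sketch under assumptions: the function names are hypothetical, and <ID> is treated as a single path segment, which the text does not state explicitly.

```python
BASE_URL = "http://clic.cimec.unitn.it/egon/pru"  # example location from the text

# Sub-resources of /documents/<ID> that are retrieved with GET.
SUB_RESOURCES = ("status", "log", "pdf", "bib", "pretxp", "txp", "zip")


def documents_url(base_url=BASE_URL):
    """URI that accepts the POST of a new source document."""
    return base_url.rstrip("/") + "/documents"


def resource_url(doc_id, kind=None, base_url=BASE_URL):
    """URI of a document resource, or of one of its sub-resources."""
    url = documents_url(base_url) + "/" + doc_id
    if kind is None:
        return url  # the target of DELETE
    if kind not in SUB_RESOURCES:
        raise ValueError("unknown sub-resource: %r" % kind)
    return url + "/" + kind


print(resource_url("abc", "status"))
```

The identifier "abc" is a placeholder for whatever the server returns on upload.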

Expected and Available Formats

/documents

To request the creation of a new resource, the client calls this URI via HTTP POST using the Content-Type multipart/form-data according to RFC 2388. Only a single file, sent in the form field named “upfile”, will be considered.

An example HTML form:

    <form method="POST" enctype="multipart/form-data" action="./documents">
        <input type="file" name="upfile" maxlength="1" />
        <input type="submit" id="sendForm" value="upload" />
    </form>

/documents/<ID>

To request the deletion of the resource, i.e. the whole processed directory, the client calls this URI via HTTP DELETE.

/documents/<ID>/status

To request the state of the processing the client calls this URI via HTTP GET. A single text/plain document with the status will be returned.

/documents/<ID>/log

To request the log file of the processing the client calls this URI via HTTP GET. A single text/plain document with the log file as content will be returned.

/documents/<ID>/pdf

To request the original, unmodified file, the client calls this URI via HTTP GET. The response will have “Content-Type: application/pdf”; the actual content will be binary data. Calling this URI from a browser will usually trigger a ‘Save File’ dialogue.

/documents/<ID>/bib

To request the extracted references in BibTeX format the client calls this URI via HTTP GET. A single text/plain document will be returned.

/documents/<ID>/pretxp

To request the pre-TextPro file the client calls this URI via HTTP GET. A single text/plain document will be returned.

/documents/<ID>/txp

To request the TextPro processed file the client calls this URI via HTTP GET. A single text/plain document will be returned.

/documents/<ID>/zip

To request the gzip-compressed tar archive of the processed directory, the client calls this URI via HTTP GET. The response will have “Content-Type: application/x-gzip” and “Content-Disposition: attachment”; the actual content will be binary data. Calling this URI from a browser will usually trigger a ‘Save File’ dialogue.

Processing Status

The client may query the latest status of the processing via HTTP GET on the status URI.

The status responses are:
 Q:queued    -- upon successful resource creation, no processing has happened
 R:running   -- document is being processed, continuously growing log file
                available
 S:succeeded -- document has been processed successfully, results available
 F:failed    -- document processing has failed, log file available

The status codes ‘Q’ and ‘R’ are intermediate, and changes can be detected by polling again. The status codes ‘S’ and ‘F’ are final, i.e. no further status change will occur, and clients must not continue polling.
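A polling loop that respects these rules might look like the sketch below. It assumes that the text/plain status response begins with the one-letter code; the function names, the polling interval, and the poll limit are illustrative choices, not part of the interface.

```python
import time
import urllib.request

FINAL = {"S", "F"}         # succeeded, failed: stop polling
INTERMEDIATE = {"Q", "R"}  # queued, running: poll again


def http_get_text(url):
    """Fetch a text/plain resource and return its body as a string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def wait_for_completion(status_url, fetch=http_get_text, interval=5.0, max_polls=120):
    """Poll `status_url` until a final status code is seen and return that code.

    Assumes the response body starts with the one-letter status code.
    """
    for _ in range(max_polls):
        code = fetch(status_url).strip()[:1]
        if code in FINAL:
            return code
        if code not in INTERMEDIATE:
            raise ValueError("unexpected status response: %r" % code)
        time.sleep(interval)
    raise TimeoutError("no final status after %d polls" % max_polls)
```

The fetch parameter exists so that the loop can be exercised without a server; a client would normally call wait_for_completion with only the status URI and fetch the results if the returned code is "S", or the log if it is "F".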

Examples

Some examples of how to interact with the RESTful API via curl.

Upload a File
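An upload can also be scripted in Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand, placing the file in the “upfile” form field. The function names and the file name article.pdf are hypothetical, and how the server reports the new resource URI in its response is not specified here, so the commented usage simply prints the response.

```python
import uuid
import urllib.request


def build_multipart(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{f}"\r\n'
        "Content-Type: {t}\r\n\r\n"
    ).format(b=boundary, n=field_name, f=filename, t=content_type)
    tail = "\r\n--{b}--\r\n".format(b=boundary)
    body = head.encode("utf-8") + data + tail.encode("utf-8")
    return body, "multipart/form-data; boundary=" + boundary


def upload_request(documents_url, filename, data):
    """Build the POST request that submits `data` in the "upfile" form field."""
    body, content_type = build_multipart("upfile", filename, data)
    return urllib.request.Request(
        documents_url, data=body, method="POST",
        headers={"Content-Type": content_type},
    )


# Sending the request needs network access and a running server:
# with open("article.pdf", "rb") as f:
#     request = upload_request(
#         "http://clic.cimec.unitn.it/egon/pru/documents", "article.pdf", f.read())
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```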

Delete all processed Files
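In Python, the deletion of a resource can be requested with the standard library as sketched below. The function name is hypothetical, and the <ID> placeholder in the commented usage must be replaced by the identifier returned at upload time.

```python
import urllib.request


def delete_request(resource_url):
    """Build the DELETE request for a /documents/<ID> resource."""
    return urllib.request.Request(resource_url, method="DELETE")


# Sending the request needs network access and a running server:
# request = delete_request("http://clic.cimec.unitn.it/egon/pru/documents/<ID>")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```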

The Modules

lib_textpro: commonly needed functionality for dealing with TextPro files.

lib_textpro.add_col(comment, content, col_label, col_lines_list, col_num=None)

Set or add column col_label/col_num in comment and content blocks, and return the altered comment and content.

Arguments:

    comment – list of strings: a TextPro comment block
    content – list of strings: a TextPro content block
    col_label – string: name of the column
    col_lines_list – list of strings
    col_num – int of the column to set

lib_textpro.add_comment(comment, key, value)

Add a key-value line to the comment block (right before the “# FIELDS:” line), and return the altered comment block.

Arguments:

    comment – list of strings: a TextPro comment block
    key – string: new key to add
    value – string: corresponding value for the key

lib_textpro.add_content_col(content, col_lines_list)

Add column at the end of content block, set it to the value of col_lines_list, and return the altered content.

Arguments:

    content – list of strings: a TextPro content block
    col_lines_list – list of strings

lib_textpro.add_fields_col(comment, col_label)

Add a column col_label at the end of the “# FIELDS:” line of the comment block, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label – string: name of the column

lib_textpro.del_comments_by_key(comment, key)

Delete (all) key-value line(s) with the given key from the comment block, and return the altered comment block.

lib_textpro.del_content_col(content, col_num)

Delete col_num column from content block, and return the altered content.

Arguments:

    content – list of strings: a TextPro content block
    col_num – int of the column to delete

lib_textpro.del_fields_col_by_int(comment, col_num)

Delete col_num column from the “# FIELDS:” line of the comment block, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_num – int of the column to delete

lib_textpro.del_fields_col_by_name(comment, col_label)

Delete col_label column from the “# FIELDS:” line of the comment block, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label – string: name of the column to delete

lib_textpro.find_patterns(patterns, tokens)

For each list of tokens in patterns, find the longest exactly matching sequences within the list of tokens, and return a dictionary built of tuples (token_start_id, length_of_match).

Arguments:

    patterns – list of lists of tokens
    tokens – list of tokens
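
The longest-match idea behind find_patterns can be illustrated with the independent sketch below. It is not the implementation in lib_textpro: the function name is hypothetical, and the return value, a dictionary from start index to match length, is only one possible reading of the description above.

```python
def longest_matches(patterns, tokens):
    """Map each start index in `tokens` to the length of the longest pattern matching there."""
    matches = {}
    for start in range(len(tokens)):
        best = 0
        for pattern in patterns:
            n = len(pattern)
            # Only longer patterns can improve on the current best match.
            if n > best and tokens[start:start + n] == pattern:
                best = n
        if best:
            matches[start] = best
    return matches


tokens = ["Fondazione", "Bruno", "Kessler", "di", "Trento"]
patterns = [["Bruno"], ["Bruno", "Kessler"], ["Trento"]]
print(longest_matches(patterns, tokens))
```

The sketch reports a match at every start position and does not resolve overlaps between matches that begin at different positions.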

lib_textpro.get_comment_value(comment, key)

Return the (first) value for the comment key from the comment block.

Arguments:

    comment – list of strings: a TextPro comment block
    key – string: the key to look for

lib_textpro.get_content_col(content, col)

Return the column col from the content block.

Arguments:

    content – list of strings: a TextPro content block
    col – int: index of the column

lib_textpro.get_fields_col_number(comment, col_label)

Return the index number of col_label in the “# FIELDS:” line of the comment block.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label – string: name of the column

lib_textpro.insert_content_col_after_col(content, col_lines_list, col_num)

Insert col_lines_list as a column after col_num in the content block, and return the altered content.

Arguments:

    content – list of strings: a TextPro content block
    col_lines_list – list of strings
    col_num – int of the column to insert /after/

lib_textpro.insert_fields_col_after_int(comment, col_label, col_int)

Insert col_label after column number col_int of the “# FIELDS:” line of the comment block, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label – string: name of the column to insert
    col_int – int of the column to insert /after/

lib_textpro.insert_fields_col_after_name(comment, col_label_new, col_label_old)

Insert col_label_new after the col_label_old column of the “# FIELDS:” line of the comment block, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label_new – string: name of the new column to insert
    col_label_old – string: name of the column to insert /after/

lib_textpro.set_comment(comment, key, value)

Set the line for key in the comment block (re-using the old line if available), and return the altered comment block.

Arguments:

    comment – list of strings: a TextPro comment block
    key – string: key to set
    value – string: corresponding value for the key

lib_textpro.set_content_col(content, col_lines_list, col_num)

Set column col_num in content block to the values of col_lines_list, and return the altered content. Raises IndexError if col_num is out of range.

Arguments:

    content – list of strings: a TextPro content block
    col_lines_list – list of strings
    col_num – int

lib_textpro.set_fields_col_by_int(comment, col_label, col_num)

Set the col_num column of the “# FIELDS:” line of the comment block to col_label, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label – string: name of the column
    col_num – int: the column number to set

lib_textpro.set_fields_col_by_name(comment, col_label_new, col_label_old)

Set the col_label_old column of the “# FIELDS:” line of the comment block to col_label_new, and return the altered comment.

Arguments:

    comment – list of strings: a TextPro comment block
    col_label_new – string: new name of the column
    col_label_old – string: name of the column to set

lib_textpro.tp_blocks(stream)

Return a tuple of the list of comment lines and the list of content lines.

Arguments:

    stream – a ‘TextPro file’ input stream (e.g. sys.stdin)
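
To make the comment and content blocks concrete, the following self-contained sketch splits a TextPro-style stream and reads one column by name. It does not use or reproduce lib_textpro: the helper names are hypothetical, the sample file is invented for illustration, and the layout (leading “#” lines ending in a tab-separated “# FIELDS:” line, followed by tab-separated token lines) is an assumption about the file format.

```python
import io

FIELDS_PREFIX = "# FIELDS:"


def split_blocks(stream):
    """Split a TextPro-style stream into (comment lines, content lines)."""
    comment, content = [], []
    in_header = True
    for raw in stream:
        line = raw.rstrip("\n")
        # Only the leading run of "#" lines counts as the comment block.
        if in_header and line.startswith("#"):
            comment.append(line)
        else:
            in_header = False
            content.append(line)
    return comment, content


def column_index(comment, col_label):
    """Index of `col_label` in the tab-separated "# FIELDS:" line."""
    for line in comment:
        if line.startswith(FIELDS_PREFIX):
            return line[len(FIELDS_PREFIX):].strip().split("\t").index(col_label)
    raise ValueError("no '# FIELDS:' line in comment block")


def column(content, index):
    """Values of one column; empty lines stay empty."""
    return [line.split("\t")[index] if line else "" for line in content]


# Invented sample data; the tags are placeholders, not real TextPro output.
SAMPLE = "# FILE: sample.txt\n# FIELDS: token\tpos\nLe\tDET\nscoperte\tNOUN\n"
comment, content = split_blocks(io.StringIO(SAMPLE))
print(column(content, column_index(comment, "pos")))
```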
