The LiveMemories/Portale della Ricerca Umanistica (LM/PRU) pipeline is a multilingual-aware web service for processing archaeology texts. Currently, the input consists of documents in Portable Document Format (PDF) with machine-encoded text, for example, text produced by optical character recognition (OCR) or written directly by the source application, and the output consists of tabular, multi-column files.
The project’s source code is available from https://bitbucket.org/clic/lm_pru, and its documentation from http://clic.bitbucket.org/lm_pru/. A RESTful interface is located at http://clic.cimec.unitn.it/egon/pru/documents.cgi, and an example use-case of the RESTful interface (upload file, run the pipeline, make results available for download) is available at http://clic.cimec.unitn.it/egon/pru/upload.html.
# THE pipeline
# handles all processing stages from PDF (with text content, i.e. already OCR-ed)
# to TextPro output; the latter is enriched with some additional information.
#
The pipeline processes scientific articles from the humanities in PDF{32000-1.iso.org} format.
First, it converts the encoded text into plain text; then it splits the text into a header, a body, and a bibliography. The header contains meta-information about the document such as authors, abstract, and keywords; the body contains the article’s main content; and the bibliography contains the works referenced by the article.
The header information and the bibliography are reformatted and made available for further processing outside the pipeline, whereas the body is processed further within the pipeline. To this end, the main language of the body text is detected first, then its document structure is extracted, and finally the human language technology (HLT) processing steps of tokenisation, part-of-speech tagging, chunk parsing, and named-entity recognition are carried out.
The task of extracting text from PDF files is notoriously difficult because, in an actual PDF file, a passage of running text may be split into several chunks, down to single characters. Text extraction therefore needs to reassemble these chunks: characters and groups of characters have to be merged back into words.
Here, the extraction of running text from PDF files is done by PDFMiner{pdfminer}, a Python PDF parser and analyzer. Because of the difficulty of the task, we chose an application that provides fine-grained control over the reassembly process: PDFMiner has configurable options for the distances between characters, words, and lines. Even though the default settings achieve fair results, the possibility of tuning the process, e.g. by choosing values that minimise the number of unknown words in the output, is a clear advantage.
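The tuning idea mentioned above, choosing the settings that minimise the number of unknown words, can be sketched as a small grid search. This is only an illustration under stated assumptions: `extract` is a hypothetical callable that wraps the PDF extractor and returns plain text for a given parameter dictionary, `vocabulary` is a hypothetical set of known lower-case word forms, and the function names are not taken from the LM/PRU code. The keys of `grid` would typically be PDFMiner layout parameters such as `char_margin`, `word_margin`, and `line_margin`.

```python
import re
from itertools import product


def unknown_word_count(text, vocabulary):
    """Count alphabetic tokens of `text` that are not in `vocabulary`."""
    tokens = re.findall(r"[^\W\d_]+", text)
    return sum(1 for token in tokens if token.lower() not in vocabulary)


def choose_layout_params(extract, grid, vocabulary):
    """Try every combination in `grid` and keep the one with fewest unknown words.

    extract    -- hypothetical callable: dict of parameters -> extracted text
    grid       -- dict mapping a parameter name to a list of candidate values
    vocabulary -- set of known lower-case word forms
    Returns a tuple (best_params, best_score).
    """
    names = sorted(grid)
    best_params, best_score = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = unknown_word_count(extract(params), vocabulary)
        if best_score is None or score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `choose_layout_params(extract, {"word_margin": [0.05, 0.1, 0.2]}, vocabulary)` runs the extractor once per candidate value, so in practice the grid has to stay small.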
The output of this processing step is plain text, ideally with the text arranged in the intended reading order, e.g. a left column before a right one.
The plain text output is post-processed to remove several artefacts: (a) hyphenated words are recombined, and the running text of each paragraph is rearranged into a single line; (b) page headers and footers, e.g. authors, article titles, and page numbers, are removed.
For example, the following end of one page and beginning of the next
368-369). Anche i ritrovamenti nei territori della Peni- sola italiana sono in generale rari. Le scoperte nell’Eu- ropa mediterranea occidentale sembrano quindi poco numerose e mal documentate.
9 Borrello Micheli.p65
72
01/06/2006, 15.15
^L73
Fig. 2. Carta di distribuzione dei rinvenimenti europei di ornamenti preistorici in Spondylus sp. (da MÜLLER, 1997).
- PROVENIENZA DELLA MATERIA PRIMA.
I rinvenimenti italiani, quelli della costa dalmata e della Grecia settentrionale non pongono problemi di determinazione delle aree di approvvigionamento, per- ché si collocano in corrispondenza delle zone costiere
will become
(TABORIN, 1974: 368-369). Anche i ritrovamenti nei territori della Penisola italiana sono in generale rari. Le scoperte nell’Europa mediterranea occidentale sembrano quindi poco numerose e mal documentate. Fig. 2. Carta di distribuzione dei rinvenimenti europei di ornamenti preistorici in Spondylus sp. (da MÜLLER, 1997).
- PROVENIENZA DELLA MATERIA PRIMA.
I rinvenimenti italiani, quelli della costa dalmata e della Grecia settentrionale non pongono problemi di determinazione delle aree di approvvigionamento, perché si collocano in corrispondenza delle zone costiere o
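The two post-processing steps illustrated above can be sketched as follows. This is a minimal illustration, not the pipeline's actual rules: the function names are hypothetical, and the footer patterns are assumptions derived only from the sample page shown here (a bare page number, a print timestamp, and a `.p65` file-name footer). Real documents would need their own patterns.

```python
import re

# A hyphen between two letters, followed by whitespace (a space or a line break).
_BREAK = re.compile(r"(?<=[^\W\d_])-\s+(?=[^\W\d_])")

# Assumed page-furniture patterns, modelled on the sample page only.
FURNITURE = [
    re.compile(r"^\d+$"),                              # bare page number
    re.compile(r"^\d{2}/\d{2}/\d{4}, \d{2}\.\d{2}$"),  # e.g. '01/06/2006, 15.15'
    re.compile(r"\.p65$"),                             # e.g. '9 Borrello Micheli.p65'
]


def dehyphenate(text):
    """Rejoin words split by end-of-line hyphenation ('Peni- sola' -> 'Penisola')."""
    def fix(match):
        # Only join when the continuation starts with a lower-case letter.
        return "" if match.string[match.end()].islower() else match.group(0)
    return _BREAK.sub(fix, text)


def strip_furniture(lines, patterns=FURNITURE):
    """Drop lines that look like page headers or footers."""
    # str.strip() also removes a leading form feed, so '\f73' is seen as '73'.
    return [line for line in lines
            if not any(p.search(line.strip()) for p in patterns)]


def clean_paragraph(lines):
    """Put the lines of one paragraph on a single line and rejoin split words."""
    joined = " ".join(line.strip() for line in lines if line.strip())
    return dehyphenate(joined)
```

The lower-case check is a heuristic: it leaves ranges such as `368-369` alone, but it would also wrongly join a genuine compound whose hyphen happens to fall at a line end.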
For extracting the meta-information and splitting the article we use ParsHead and ParsCit, respectively, both modules within the ParsCit project{CouncillGilesKan2008}. In addition, we added rules for recognising Italian abstracts, keywords, and bibliographies.
For identifying the language of parts of the text we use the *chromium-compact-language-detector*{chromium-cld}, a C++ library with Python bindings for language identification of UTF-8 text, extracted from the Chromium browser. The language identification is carried out on the whole body of the article and on each paragraph individually. If the result for an individual paragraph is inconclusive, the language of the whole text body is assumed for that paragraph.
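The fallback rule described above can be sketched as follows. The detector is passed in as a parameter: `detect` is a hypothetical callable returning a pair `(language_code, is_reliable)`, standing in for the language detection library, whose exact Python API is not assumed here. The function name is likewise illustrative.

```python
def label_paragraph_languages(paragraphs, detect):
    """Assign a language code to every paragraph.

    paragraphs -- list of strings, the paragraphs of the article body
    detect     -- hypothetical callable: text -> (language_code, is_reliable)

    The whole body is classified once; a paragraph whose own result is
    not reliable inherits the language of the body.
    """
    body_language, _ = detect("\n".join(paragraphs))
    labels = []
    for paragraph in paragraphs:
        language, reliable = detect(paragraph)
        labels.append(language if reliable else body_language)
    return labels
```

Short paragraphs such as figure captions are the typical case in which a detector reports an unreliable result and the body language is used instead.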
For extracting the document structure we use SectLabel, a module within the ParsCit project; the detection is based on conditional random fields. We trained a new model on manually crafted training data for the domain of Italian articles from the humanities.
For the HLT processing steps of tokenisation, part-of-speech tagging, chunk parsing, and named-entity recognition we use TextPro{textpro}, a suite of modular Natural Language Processing (NLP) tools for the analysis of Italian and English texts. The suite has been designed to integrate and reuse state-of-the-art NLP components developed by researchers at the Fondazione Bruno Kessler (FBK). In addition, we refine the named-entity (NE) results: first with a gazetteer dictionary, and then with a purpose-built, machine-learning-based NE detection framework trained on manually annotated data from our domain.
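A gazetteer lookup of this kind is commonly done by longest exact matching of token sequences. The sketch below shows that technique only; it is not confirmed to be the project's implementation. The function name, the greedy left-to-right scan, and the returned dictionary shape (start index mapped to match length) are assumptions made for illustration.

```python
def longest_matches(tokens, patterns):
    """Find the longest exact, non-overlapping pattern matches in `tokens`.

    tokens   -- list of strings
    patterns -- iterable of token lists, e.g. gazetteer entries
    Returns a dict {token_start_id: length_of_match}.
    """
    # Index the patterns by their first token, longest pattern first.
    by_first = {}
    for pattern in patterns:
        if pattern:
            by_first.setdefault(pattern[0], []).append(tuple(pattern))
    for candidates in by_first.values():
        candidates.sort(key=len, reverse=True)

    matches, i = {}, 0
    while i < len(tokens):
        for pattern in by_first.get(tokens[i], ()):
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                matches[i] = len(pattern)
                i += len(pattern)   # continue after the match
                break
        else:
            i += 1                  # no pattern starts at this token
    return matches
```

Because candidates are tried longest first, an entry such as "Grecia settentrionale" takes precedence over the shorter entry "Grecia" at the same position.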
The LM/PRU pipeline is accessible via an Application Programming Interface (API). The interface uses a RESTful protocol on top of the Hypertext Transfer Protocol (HTTP) version 1.1; clients submit source documents via POST requests, access results via GET requests, and request deletion of resources via DELETE requests. Results are either Extensible Markup Language (XML) documents or documents of the requested type.
Each request to run the pipeline on a new source document is handled independently and creates a new resource. That is, only one document can be submitted per request, but multiple requests may be issued in parallel, and the server can process multiple source documents in parallel.
A typical client-server interaction looks like this: upon successful POST of a new source document by the client, the server responds with a new resource URI. The client uses this URI to poll the status of the processing until successful completion is signalled. The client can then retrieve the final output in any one of the available formats.
The REST architectural style implies that URLs are used to represent the resources available on a server. This section gives a summary of the resources that are available from the LM/PRU interface, together with the HTTP methods that can be used on them.
The placeholder <BASE_URL> denotes the location where the LM/PRU pipeline’s interface is exposed on a web server, e.g. http://clic.cimec.unitn.it/egon/pru. Likewise, <ID> denotes a specific resource returned by the server upon successful POST of a source document.
<BASE_URL>
  /documents          -- POST   : creation of new resource
    /<ID>             -- DELETE : deletion of resource
      /status         -- GET    : status of backend processing
      /log            -- GET    : log file of backend processing
      /pdf            -- GET    : previously uploaded PDF document
      /bib            -- GET    : extracted references in BibTeX format
      /pretxp         -- GET    : structure-annotated pre-TextPro file
      /txp            -- GET    : TextPro processed file
      /zip            -- GET    : packed directory content
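A client can assemble these URIs with a small helper. The sketch below assumes that the sub-resources are nested under `<BASE_URL>/documents/<ID>`, as the listing suggests; the function names are illustrative, and the `documents_name` parameter is an assumption added because the deployed instance mentioned earlier exposes the collection as `documents.cgi`.

```python
# Sub-resources of a processed document, taken from the listing above.
RESOURCES = {"status", "log", "pdf", "bib", "pretxp", "txp", "zip"}


def documents_url(base_url, documents_name="documents"):
    """Return the collection URI to which new source documents are POSTed."""
    return base_url.rstrip("/") + "/" + documents_name


def resource_url(base_url, doc_id, resource=None, documents_name="documents"):
    """Return the URI of a document resource or of one of its sub-resources.

    With resource=None the URI of the document itself is returned
    (the target of DELETE); otherwise `resource` must be in RESOURCES.
    """
    url = documents_url(base_url, documents_name) + "/" + doc_id
    if resource is None:
        return url
    if resource not in RESOURCES:
        raise ValueError("unknown resource: %r" % (resource,))
    return url + "/" + resource
```

For example, `resource_url(base, doc_id, "status")` yields the URI that a client would poll after uploading a document.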
To request the creation of a new resource, the client calls <BASE_URL>/documents via HTTP POST using the Content-Type multipart/form-data according to RFC 2388. Only a single file, submitted under the form field name “upfile”, is considered.
An example HTML form:
<form method="POST" enctype="multipart/form-data" action="./documents">
  <input type="file" name="upfile" maxlength="1" />
  <input type="submit" id="sendForm" value="upload" />
</form>
To request the deletion of a resource, i.e. the whole processed directory, the client calls <BASE_URL>/documents/<ID> via HTTP DELETE.
To request the state of the processing, the client calls the /status URI of the resource via HTTP GET. A single text/plain document with the status is returned.
To request the log file of the processing, the client calls the /log URI of the resource via HTTP GET. A single text/plain document with the log file as content is returned.
To request the original, unmodified file, the client calls the /pdf URI of the resource via HTTP GET. The response has “Content-Type: application/pdf”, and the actual content is binary data. Calling this URI from a browser will usually trigger a ‘Save File’ dialogue.
To request the extracted references in BibTeX format, the client calls the /bib URI of the resource via HTTP GET. A single text/plain document is returned.
To request the pre-TextPro file, the client calls the /pretxp URI of the resource via HTTP GET. A single text/plain document is returned.
To request the TextPro processed file, the client calls the /txp URI of the resource via HTTP GET. A single text/plain document is returned.
To request the gzip-compressed tar archive of the processed directory, the client calls the /zip URI of the resource via HTTP GET. The response has “Content-Type: application/x-gzip” and “Content-Disposition: attachment”, and the actual content is binary data. Calling this URI from a browser will usually trigger a ‘Save File’ dialogue.
The client may query the latest status of the processing via HTTP GET on the status URI.
The status responses are:
Q:queued -- upon successful resource creation, no processing has happened
R:running -- document is being processed, continuously growing log file available
S:succeeded -- document has been processed successfully, results available
F:failed -- document processing has failed, log file available
The status codes ‘Q’ and ‘R’ are intermediate, and changes can be detected by re-polling. The status codes ‘S’ and ‘F’ are final, i.e. no further status change will occur, and clients must not continue polling.
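The polling rule above can be sketched as a small loop that stops as soon as a final code is seen. This is an illustration under assumptions: `get_status` is a hypothetical callable that performs the HTTP GET on the status URI and returns the response text, the response is assumed to begin with the one-letter status code, and the interval and poll limit are arbitrary defaults.

```python
import time

FINAL = {"S", "F"}          # succeeded, failed: stop polling
INTERMEDIATE = {"Q", "R"}   # queued, running: poll again


def wait_until_final(get_status, interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the status is final and return the final code ('S' or 'F').

    get_status -- hypothetical callable returning the status document text
    interval   -- seconds to wait between two polls
    max_polls  -- give up with TimeoutError after this many polls
    sleep      -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(max_polls):
        code = get_status().strip()[:1]
        if code in FINAL:
            return code
        if code not in INTERMEDIATE:
            raise ValueError("unexpected status response: %r" % (code,))
        if attempt + 1 < max_polls:
            sleep(interval)
    raise TimeoutError("no final status after %d polls" % max_polls)
```

On ‘S’ the client would go on to fetch the results; on ‘F’ it would fetch the log file instead.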
Some examples of how to interact with the RESTful API via curl:
# upload a source document; the server creates a new resource
curl -F "upfile=@/tmp/foo_bar.pdf" "http://clic.cimec.unitn.it/egon/pru/documents.cgi" -F "comment=none"
# delete the resource with the given ID
curl "http://clic.cimec.unitn.it/egon/pru/documents.cgi/c70accef-c924-464a-a9d2-7f81bf12a0c4" --request DELETE
# issue DELETE against the collection URI itself (no resource ID)
curl "http://clic.cimec.unitn.it/egon/pru/documents.cgi" --request DELETE
Commonly needed functionality for dealing with TextPro files.
Set or add column col_label/col_num in comment and content blocks, and return the altered comment and content.
comment – list of strings: a TextPro comment block
content – list of strings: a TextPro content block
col_label – string: name of the column
col_lines_list – list of strings
col_num – int: index of the column to set
Add key-value-line to comment (right before the “# FIELDS:” line), and return the altered comment block.
Add column at the end of content block, set it to the value of col_lines_list, and return the altered content.
Add a column col_label at the end to the “# FIELDS:” line of the comment block, and return the altered comment.
Delete (all) key-value-line(s) from comment, and return the altered comment block.
Delete col_num column from content block, and return the altered content.
Delete col_num column from the “# FIELDS:” line of the comment block, and return the altered comment.
Delete col_label column from the “# FIELDS:” line of the comment block, and return the altered comment.
For each list of tokens in patterns find the longest, exact matching sequences within the list of tokens, and return a dictionary built of tuples (token_start_id, length_of_match).
Return the (first) value for the comment key from the comment block.
Return the column col from the content block.
Return col_label’s index number of the “# FIELDS:” line of the comment block.
Insert col_lines as column after col_num in content block, and return the altered content.
Insert col_label_new after the col_label_old column of the “# FIELDS:” line of the comment, and return the altered comment.
Set line of comment (re-use old line if available), and return the altered comment block.
Set column col_num in content block to the values of col_lines_list, and return the altered content. Raises IndexError if col_num is out of range.
Set the col_num column of the “# FIELDS:” line of the comment block to col_label, and return the altered comment.
Set the col_label_old column of the “# FIELDS:” line of the comment block to col_label_new, and return the altered comment.
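To make the comment block and content block operations concrete, the following sketch shows how reading and appending a column might look. It is not the module's code: the function names are hypothetical, and it assumes that the column labels on the “# FIELDS:” line are separated by whitespace, that content lines are tab-separated, and that lines carry no trailing newline and no blank rows.

```python
FIELDS = "# FIELDS:"


def field_labels(comment):
    """Return the list of column labels from the '# FIELDS:' line."""
    for line in comment:
        if line.startswith(FIELDS):
            return line[len(FIELDS):].split()
    raise ValueError("comment block has no '# FIELDS:' line")


def get_column(comment, content, col_label):
    """Return the values of the column named `col_label`."""
    index = field_labels(comment).index(col_label)
    return [row.split("\t")[index] for row in content]


def append_column(comment, content, col_label, col_lines_list):
    """Return copies of both blocks with a new last column."""
    if len(col_lines_list) != len(content):
        raise ValueError("need exactly one value per content line")
    new_comment = [line + "\t" + col_label if line.startswith(FIELDS) else line
                   for line in comment]
    new_content = [row + "\t" + value
                   for row, value in zip(content, col_lines_list)]
    return new_comment, new_content
```

Returning new lists rather than modifying the arguments matches the "return the altered comment and content" wording used throughout the descriptions above.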