MetaMed — Meta Data Extraction and Manipulation Project

About

MetaMed is a meta data extraction tool, used primarily for medical data file formats. It is a non-interactive, open source command line application. It can also serve as a simple SPARQL query tool and as an RDF serialisation format converter.

Features

Input and output format support

Implemented support of file formats:

RDF serialisation formats support:

Supported (query) output formats:

Supported RDF load methods for Oracle Database:

SPARQL is supported for RDF data queries.

Installation

MetaMed is a command line tool and needs no installation process. Extract the downloaded archive file, then run the tool from a command line/terminal (see Examples).

Requirements

Download

Download the latest production version (with JavaDoc).

You also need to get these files (Oracle Database support will be removed in the next version):

Archive: the first public version (2012) is available from the download page on our department web page.

Usage

Command line usage


usage: java -Xms512m -Xmx2048m -cp ./ -jar metamed-<version>.jar  [-batch
       | -batchnd | -bulk | -bulknd]    [-d <dir>] [-ds <code>] [-e
       <file>] [-ef <format>] [-empty] [-fs <rootdir> | -mem | -ora <auth>
       | -virt <auth>] [-h] [-i <file>] [-if <format>] [-list <file>] [-m
       <name>]  [-o <dir>] [-of <fmt>]  [--oracle-flags <flag>] [-pm
       <file>] [-q <file>] [-r <file>] [-s] [-stn <table>] [-tbs
       <tablespace>] [-tc <count>] [-tmm] [-v]

Options

 -batch,--batch-load                        Enable Batch load instead of
                                            Bulk load mode.
 -batchnd,--batch-load-without-drop-index   Enable Batch load, but do not
                                            drop application index data.
                                            (default)
 -bulk,--bulk-load                          Enable Bulk load instead of
                                            Batch load mode.
 -bulknd,--bulk-load-without-drop-index     Enable Bulk load, but do not
                                            drop application index data.
 -d,--input-data <dir>                      Input data directory or file.
                                            The whole directory is process
                                            recursively.
 -ds,--data-source <code>                   Set the data source name.
 -e,--export <file>                         Export the whole model to a
                                            file.
 -ef,--export-format <format>               File serialization format for
                                            an output RDF file.
 -empty,--empty-model                       Empty the model.
 -fs,--model-file-system <rootdir>          Model is in-memory only, but
                                            it is automatically backed
                                            from/to file system root
                                            directory. You have to specify
                                            a root directory for model.
                                            Model is always backed to the
                                            root directory. You can use -e
                                            with -ef which will duplicate
                                            output to an another
                                            serialization file format.
                                            Please be patient, the model
                                            is written twice in this case.
 -h,--help                                  Print this message.
 -i,--import <file>                         Import RDF files(s) to the
                                            model.
 -if,--import-format <format>               Import serialization format
                                            for RDF files(s).
 -list,--file-list <file>                   File with a list of input data
                                            files for (selective)
                                            processing. Each file name
                                            with (relative or absolute)
                                            path is on one line. An input
                                            data (-d) option can append
                                            additional files for
                                            processing.
 -m,--model <name>                          Set the model name. The model
                                            name a mandatory option for
                                            all supported operations.
                                            (default is 'metamedModel')
 -mem,--model-in-memory                     Model is in-memory only and it
                                            is destroyed when application
                                            exits. You can use -e there
                                            which will output the model to
                                            a file with serialization
                                            (-ef) you need. It is useful
                                            e.g. for a conversion between
                                            two serialization formats.
                                            (default)
 -o,--output <dir>                          Output directory. Query
                                            results will be stored there
                                            with an appropriate file
                                            extension based on an output
                                            format.
 -of,--output-format <fmt>                  Output format - csv, json, ...
 -ora,--model-oracle <auth>                 Model is in Oracle database
                                            with semantic technologies
                                            support enabled. You have to
                                            specify a connection
                                            configuration property file
                                            for the Oracle database.
    --oracle-flags <flag>                   Oracle database flags for
                                            invoking Bulk Load from a
                                            staging table.
 -pm,--prefix-map <file>                    Export an application built-in
                                            prefix map into a text file.
 -q,--query <file>                          File or directory with files
                                            that contains a SPARQL query
                                            string. Prefixes are append
                                            automatically to each query.
 -r,--remove <file>                         Remove statements from the
                                            model. One statement per line.
                                            Subject, predicate and value
                                            are divided by a tabulator
                                            char only. Only URI without
                                            prefixes are allowed. Any
                                            content in s, p or o is
                                            possible when you use ? or $
                                            char (before name).
 -s,--size                                  Print model size.
 -stn,--staging-table-name <table>          Set staging table name.
 -tbs,--tablespace <tablespace>             Set tablespace name to use.
 -tc,--thread-count <count>                 Number of threads used for
                                            input files meta data
                                            extraction. One thread is a
                                            default value.
 -tmm,--thread-models-merge                 Enable merging models when
                                            more threads are used. It is
                                            automatically enabled when
                                            running queries or removing
                                            any statements. It is more
                                            efficient work without merging
                                            when you are extracting meta
                                            data.
 -v,--verbose                               Verbose output.
 -virt,--model-virtuoso <auth>              Model is in Virtuoso database.
                                            You have to specify a
                                            connection configuration
                                            property file.
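
The statement file expected by -r/--remove (one statement per line, subject, predicate and value separated by a tabulator) can be sketched as below. This is only an illustration: all URIs are hypothetical placeholders, the file name remove.tsv is an assumption, and the help text does not say how literal values should be written, so only URIs and a ?variable are used. The metamed.sh wrapper is the script from the Examples section.

```shell
#!/bin/sh
# Write a statement file for --remove: one statement per line,
# subject, predicate and value separated by a single TAB character.
# All URIs below are hypothetical placeholders, not MetaMed vocabulary.
REMOVE_FILE="${REMOVE_FILE:-remove.tsv}"

{
  # Fully specified statement (full URIs, no prefixes).
  printf '%s\t%s\t%s\n' \
    'http://example.org/file/1' \
    'http://example.org/vocab#modality' \
    'http://example.org/modality/MR'
  # ?o is a variable: per the option text it stands for any value.
  printf '%s\t%s\t%s\n' \
    'http://example.org/file/2' \
    'http://example.org/vocab#modality' \
    '?o'
} > "$REMOVE_FILE"

# Run the removal only when the wrapper script is available.
if [ -x ./metamed.sh ]; then
  ./metamed.sh --model-file-system rdfStore/ --model modelName \
    --remove "$REMOVE_FILE" --size
fi
```

Generating the file with printf keeps the separators as real tab characters, which editors often replace with spaces.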

Examples

A script file metamed.sh is used in all examples below and its content is:

#!/bin/sh

java -Xms512m -Xmx1024m -jar metamed-* "$@"

You can change the maximum amount of memory available to MetaMed with the -Xmx option of the Java Virtual Machine. The example option -Xmx1024m sets the limit to 1024 MB. The maximum amount of memory you can use depends on your machine and operating system.
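
If the memory limit has to change often, the wrapper can read it from the environment. The following is a minimal sketch, not part of MetaMed: METAMED_XMX is a hypothetical variable name, and the metamed-*.jar pattern assumes the metamed-<version>.jar naming from the usage line.

```shell
#!/bin/sh
# Variant of metamed.sh with a configurable heap limit.
# METAMED_XMX is a hypothetical environment variable, not a MetaMed option.

# Print the JVM memory options; the default matches the script above.
heap_opts() {
  printf '%s %s' '-Xms512m' "-Xmx${METAMED_XMX:-1024m}"
}

# Start MetaMed only if a metamed-*.jar is present in the current directory.
if ls metamed-*.jar >/dev/null 2>&1; then
  # heap_opts is left unquoted on purpose so it splits into two options.
  java $(heap_opts) -jar metamed-*.jar "$@"
fi
```

With this variant, a call such as METAMED_XMX=2048m ./metamed.sh --model modelName --size raises the limit for a single run without editing the script.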

Example 1 — Print model size

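Connect to the database (configuration is in the file connection.properties) and print the size of the model modelName. The command below uses an Oracle database; for a Virtuoso database use --model-virtuoso instead of --model-oracle.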

./metamed.sh --model-oracle connection.properties --model modelName --size

Example 2 — Extract meta data

Extract all meta data from files in the data directory into the model modelName, which is backed by the file system directory rdfStore.

./metamed.sh --model-file-system rdfStore/ --model modelName --input-data data/

All extracted meta data in the in-memory RDF model will be automatically saved to the RDF/XML file rdfStore/modelName.xml.
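
To process only selected files, the --file-list option takes a file with one path per line. A possible way to build such a list is sketched below. The *.dcm pattern, the files.list name and the thread count of 4 are illustrative assumptions, and this document does not confirm which file extensions MetaMed recognises.

```shell
#!/bin/sh
# Build a file list for --file-list: one path per line.
# The directory, the name pattern and the list file name are illustrative.

# make_file_list DIR PATTERN LIST_FILE
make_file_list() {
  find "$1" -type f -name "$2" | sort > "$3"
}

# Create the list only when the input directory exists.
if [ -d data ]; then
  make_file_list data '*.dcm' files.list
  echo "$(wc -l < files.list) file(s) listed"
fi

# Extract meta data from the listed files, using several threads.
if [ -x ./metamed.sh ] && [ -s files.list ]; then
  ./metamed.sh --model-file-system rdfStore/ --model modelName \
    --file-list files.list --thread-count 4
fi
```

Sorting the list makes repeated runs process files in the same order, which helps when comparing verbose output between runs.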

Example 3 — Concatenate/Merge RDF data files

Import RDF files in N-TRIPLE format into the RDF store.

./metamed.sh --model-file-system rdfStore/ --model modelName --import file1.nt file2.nt file3.nt -if N-TRIPLE

The in-memory RDF model with all imported files will be automatically saved to the RDF/XML file rdfStore/modelName.xml.

Example 4 — Query RDF model with SPARQL

Query the RDF model modelName stored in the file system (rdfStore directory) using the query in the file query.sparql.

./metamed.sh --model-file-system rdfStore/ --model modelName --query query.sparql --output outputDirectory/ --output-format csv

The query output in CSV format will be stored in the file outputDirectory/query.sparql.csv.
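
A query file for this example could look like the sketch below. The query is a generic placeholder that lists ten arbitrary statements. It deliberately uses no prefixed names, because the set of prefixes MetaMed adds automatically is not listed in this document.

```shell
#!/bin/sh
# Create a small SPARQL query file for --query.
# The query uses no prefixed names, so it does not depend on which
# prefixes MetaMed prepends automatically.
cat > query.sparql <<'EOF'
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
}
LIMIT 10
EOF

# Run the query only when the wrapper script is available.
if [ -x ./metamed.sh ]; then
  ./metamed.sh --model-file-system rdfStore/ --model modelName \
    --query query.sparql --output outputDirectory/ --output-format csv
fi
```

The built-in prefixes can be written to a text file with the --prefix-map option, which shows the prefixed names a query may rely on.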

Example 5 — Export RDF model

Export the RDF model modelName stored in the file system (rdfStore directory) into the file export.ttl using the TURTLE serialisation format.

./metamed.sh --model-file-system rdfStore/ --model modelName --export export.ttl --export-format TURTLE

Note: With a file system backed model, the model is written to a file twice: (1) on export and (2) before the application exits, because the model is backed by the file system. This is noticeable with a large model.
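
When only a conversion between two serialisation formats is needed, the in-memory model (--model-in-memory) is written once, on export, because it is not backed by the file system. The sketch below wraps such a conversion in a helper. The function name convert and the file names are hypothetical; the options and format names are the ones used elsewhere in this document.

```shell
#!/bin/sh
# Convert an RDF file between serialisation formats with an in-memory model.
# The helper name and the file names are placeholders.

# convert IN_FILE IN_FORMAT OUT_FILE OUT_FORMAT
convert() {
  ./metamed.sh --model-in-memory --model modelName \
    --import "$1" --import-format "$2" \
    --export "$3" --export-format "$4"
}

# Convert N-TRIPLE to TURTLE when the wrapper script and the input exist.
if [ -x ./metamed.sh ] && [ -f file1.nt ]; then
  convert file1.nt N-TRIPLE file1.ttl TURTLE
fi
```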

Example 6 — Extract, import, query and export

Use the RDF model modelName stored in the file system (rdfStore directory) and perform the following operations. The size of the model is printed before and after each operation.

  1. First, extract meta data from files in the directory data/.
  2. Import all files (file1.nt, file2.nt, file3.nt) into the model (merging files).
  3. Run the SPARQL query from the file query.sparql and write the results to outputDirectory/query.sparql.csv (CSV format).
  4. After query processing, export the whole in-memory RDF model into the file export.ttl in the TURTLE serialisation format.

./metamed.sh --model-file-system rdfStore/ --model modelName \
  --input-data data/ --import file1.nt file2.nt file3.nt -if N-TRIPLE \
  --query query.sparql --output outputDirectory/ --output-format csv \
  --export export.ttl --export-format TURTLE --size

Note: With a file system backed model, the model is written to a file twice: (1) on export and (2) before the application exits, because the model is backed by the file system. This is noticeable with a large model.

ChangeLog

License

The GNU General Public License v3

Contact

You can contact the author, Petr Vcelak (vcelak@kiv.zcu.cz), for more information about this tool.