Text mining

"Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools" (Feldman & Sanger: The Text Mining Handbook, 2007):
  • Extract useful information from data sources through the identification and exploration of interesting patterns
  • Data sources are document collections
  • Interesting patterns are found in the unstructured textual data in the documents in these collections

• High-level text mining functional architecture:

(Feldman & Sanger, 2007)
"Text mining ... involves analysing a large collection of documents to discover previously unknown information. The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover" (JISC: Value and benefits of text mining, 2012):

• Components of text mining:

[Figure 1: Components of text mining]

• Schematic overview of the processes involved in text mining of scholarly content:

[Figure 2: Processes involved in text mining of scholarly content]

For instance, PubMed, the National Library of Medicine’s online repository of citation-related information for biomedical research papers, is an interesting target for text mining techniques because this online service contains text-based document abstracts for more than 12 million research papers on topics in the life sciences.


KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes).

KNIME can be downloaded and used free of charge on the desktop (Windows, Linux, and Mac OS X).

Text processing with KNIME:

The KNIME text processing feature lets you read, process, mine and visualize textual data in a convenient way. It provides functionality from
  • Natural language processing (NLP)
  • Text mining
  • Information retrieval
The generic workflow for processing texts with the KNIME Text Processing plugin usually includes six steps, as follows:
  1. Reading and parsing: Text input/output
  2. Enrichment: Named entity recognition
  3. Preprocessing: Filtering and manipulation
  4. Frequencies: Word counting and keyword extraction
  5. Transformation: Bag of words (BoW) and vector representation
  6. Visualization: Tag cloud
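
To make the six steps more concrete, here is a minimal sketch in plain Python. It is not KNIME code and does not use KNIME nodes: the toy documents, the stop-word list, the entity dictionary and all function names are assumptions made for illustration, and each step is reduced to its simplest form (for example, named entity recognition is just a dictionary lookup).

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real workflow would read files or database rows.
DOCUMENTS = [
    "KNIME reads documents and KNIME parses text.",
    "Weka provides classification and clustering for text mining.",
]
# Hypothetical resources, kept tiny for illustration.
STOP_WORDS = {"and", "for", "the", "a", "of"}
KNOWN_ENTITIES = {"KNIME", "Weka"}


def parse(text):
    """Step 1, reading and parsing: split raw text into word tokens."""
    return re.findall(r"[A-Za-z]+", text)


def enrich(tokens):
    """Step 2, enrichment: tag tokens found in the entity dictionary."""
    return [(tok, "ENTITY" if tok in KNOWN_ENTITIES else "TERM") for tok in tokens]


def preprocess(tagged):
    """Step 3, preprocessing: lowercase terms and filter out stop words."""
    return [tok.lower() for tok, _tag in tagged if tok.lower() not in STOP_WORDS]


def frequencies(terms):
    """Step 4, frequencies: count how often each term occurs."""
    return Counter(terms)


def to_vectors(counts_per_doc):
    """Step 5, transformation: bag-of-words vectors over a shared vocabulary."""
    vocabulary = sorted(set().union(*counts_per_doc))
    vectors = [[counts[term] for term in vocabulary] for counts in counts_per_doc]
    return vocabulary, vectors


def tag_cloud(counts, top=5):
    """Step 6, visualization: a text stand-in for a tag cloud."""
    return [f"{term} {'*' * n}" for term, n in counts.most_common(top)]


def run_pipeline(documents):
    """Chain steps 1 to 5 and return the intermediate results."""
    tagged_docs = [enrich(parse(doc)) for doc in documents]
    entities = {tok for doc in tagged_docs for tok, tag in doc if tag == "ENTITY"}
    counts_per_doc = [frequencies(preprocess(doc)) for doc in tagged_docs]
    vocabulary, vectors = to_vectors(counts_per_doc)
    return entities, counts_per_doc, vocabulary, vectors


entities, counts_per_doc, vocabulary, vectors = run_pipeline(DOCUMENTS)
print("Entities:", sorted(entities))
print("Vocabulary:", vocabulary)
for vector in vectors:
    print(vector)
print("\n".join(tag_cloud(counts_per_doc[0], top=3)))
```

Each document ends up as one row of term counts over the same vocabulary, which is the form that clustering and classification algorithms typically expect as input.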

Additional resources:

Network mining with KNIME - Usable Customer Intelligence from Social Media Data:

WEKA: Data Mining Software in Java

Weka from the University of Waikato is a collection of open source machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

KNIME makes a large part of the functionality of the machine learning library WEKA available through a plug-in, which integrates around 100 algorithms for clustering, classification, regression, and association rule mining.