Text Analysis Tools

DiRT (Digital Research Tools) has a new home! Please visit Bamboo DiRT to explore this excellent collection of research tools.

Definition: 

Text analysis software enables users to determine the frequency with which words or phrases are used, create concordances, view words in context, and otherwise study patterns in texts.
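
As a rough illustration of two of the operations named in the definition, the sketch below counts word frequencies and builds simple keyword-in-context (KWIC) lines in plain Python. It is not taken from any tool listed on this page; the function names, the tokenizing rule (lowercase runs of letters and apostrophes), and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (assumed rule: letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Map each word to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, used only to exercise the functions.
sample = "The cat sat on the mat. The dog sat on the log."

if __name__ == "__main__":
    print(word_frequencies(sample).most_common(3))
    for left, word, right in kwic(sample, "sat", width=2):
        print(f"{left:>12} [{word}] {right}")
```

Dedicated concordancers typically add sorting, lemmatization, and proper handling of punctuation and non-English scripts, none of which this sketch attempts.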

Tools:

 

  • AntConc 3.2.1: concordance program, "can generate KWIC concordance lines and concordance distribution plots...has tools to analyze word clusters (lexical bundles), n-grams, collocates, word frequencies, and keywords" (Free, Windows/Mac OS X/Linux)

  • Attensity:  text analytic and data extraction framework: "delivers the power of sophisticated data and semantic analytics in an integrated suite of easy-to-use business applications, allowing business leaders, customer support personnel and customers to get relevant and actionable answers fast" (Commercial, cross-platform)
  • Basis Technology: natural language processing technology for the analysis of unstructured multilingual text (Commercial, cross-platform)
  • CATMA (Computer Aided Textual Markup and Analysis): "an open source software with a focus on textual markup and analysis." (Open source)
  • ClearForest: text tagging, extraction and analytics software; "Compare Suite compares texts by keywords, highlights common and unique keywords. Connexor Machine discovers natural language grammatical and semantic information." (Commercial)
  • Coding Analysis Toolkit (CAT): a platform-independent, open source system consisting of a web-based suite of tools custom-built to facilitate efficient and effective analysis of text datasets that have been coded using either an internal coding module or the commercial off-the-shelf package ATLAS.ti; CAT computes interrater reliability and supports adjudication and validity measurement (Free, web-based)
  • Cypher: a "software program available which generates the RDF graph and SPARQL/SeRQL query representation of a plain language input, allowing users to speak plain language to update and query semantic databases...With robust definition languages, Cypher's grammar and lexicon can quickly and easily be extended to process highly complex sentences and phrases of any natural language, and can cover any vocabulary" (Free, Windows/Mac OS X/Linux)
  • Data for Research (DfR): Mine and analyze JSTOR's collections.  Supports fielded searching; provides ngrams, word frequencies, citations, and tag clouds of key terms; offers API for "content selection and retrieval." (Free, web-based)

  • DICTION 5.0: computer-aided text analysis for determining the tone of a verbal message: certainty, activity, optimism, realism, and commonalty (Commercial, Windows)
  • DiscoverText.com: import data, search and analyze, code documents, generate reports. (Commercial, cloud-based)
  • Edition Production & Presentation Technology (EPPT): "an integrated set of XML tools designed to help humanities editors prepare image-based electronic editions...makes image-based encoding, the laborious process of linking descriptive markup to material evidence through XML, a relatively easy and error-proof task" (Free, PC XP/Mac OS X)
  • HyperPo: "a user-friendly text exploration and analysis program"; supports word frequencies, KWIC (Keyword in Context), cooccurrence and distribution lists, comparison, etc. (Free, web-based)
  • IBM AeroText: "a suite of text mining applications that are used for content analysis...Sample target applications include automatic database generation, document routing, browsing, summarization, enhanced full text search, and targeted document search in addition to link analysis" (Commercial, Windows/Linux/Solaris)
  • IBM InfoSphere: a product line from IBM which "provides a unified data warehouse delivering access to structured and unstructured information and operational and transactional data in real time" (Commercial, Windows/Linux/UNIX)
  • ICTA: "a web-based system for Automated Text Analysis and Discovery of Social Networks from text. It was originally designed to work with email-based and forum-based data. But it can also be used to analyze other types of electronic communication such as blogs and chats." (Free, web-based)
  • JGAAP: "Java-based, modular program for textual analysis, text categorization, and authorship attribution" (Free)
  • Juxta: "tool for comparing and collating multiple witnesses to a single textual work. The software allows users to set any of the witnesses as the base text, to add or remove witness texts, to switch the base text at will, and to annotate Juxta-revealed comparisons and save the results." (Free, cross-platform)
  • Lextek: offers a range of services and software for full-text indexing search and retrieval; automatic classification, routing, and filtering electronic text according to user defined profiles (Commercial, cross-platform)
  • LIWC:  "Linguistic Inquiry and Word Count (LIWC) is a text analysis software program...LIWC is able to calculate the degree to which people use different categories of words across a wide array of texts." (Commercial, with free and limited web analysis available; Windows/Mac)
  • MALLET: "a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text." (open source using the Common Public License, Java-based)
  • MAXQDA: a tool for qualitative data analysis, evaluation, and text analysis: "supports all individuals performing qualitative data analysis and helps to systematically evaluate and interpret texts...also a powerful tool for developing theories and testing the theoretical conclusions of the analysis" (Commercial, Windows)
  • MONK: "a digital environment designed to help humanities scholars discover and analyze patterns in the texts they study. It supports both micro analyses of the verbal texture of an individual text and macro analyses that let you locate texts in the context of a large document space consisting of hundreds or thousands of other texts" (Free, web-based)
  • MonoConc: a "concordance (text searching) program...used in the analysis of English or other texts...also produces wordlists and collocation information" (Commercial, Windows)
  • MorphAdorner: "a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text...Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and place." (Free, cross-platform)
  • NORA: "a text-mining application intended to allow the exploration of verbal patterns in text collections" (superseded by MONK; source code & demo available)
  • NVivo: software which "removes many of the manual tasks associated with analysis, like classifying, sorting and arranging information, so you have more time to explore trends, build and test theories and ultimately arrive at answers to questions." (Commercial, Windows)
  • PAIR (Pairwise Alignment for Intertextual Relations): "a simple implementation of a sequence alignment algorithm for humanities text analysis designed to identify "similar passages" in large collections of texts. These may include direct quotations, plagiarism and other forms of borrowings, commonplace expressions and the like." (Open source, Mac/Linux)
  • PhiloLogic: "primary full-text search, retrieval and analysis tool developed by the ARTFL Project and the Digital Library Development Center (DLDC) at the University of Chicago"; support for TEI, DocBook, & plain text (Free, Mac/Linux)
  • Power Text Solutions: service for "free text" analysis and summarization, offers two different approaches: "capable of excerpting the most relevant and informative text passages that are semantically and grammatically complete"  (Free or commercial service available)
  • Public Comment Analysis Toolkit: "assists agencies and researchers in searching, analyzing, and responding to citizen comments submitted to federal regulatory agencies as well as generic text datasets" (Version 2.0 of CAT - See above)
  • Readware Information Processor: "the only software designed to identify abstract objects (from natural language expressions) and use them for contextualizing the human experience and understanding encoded in a text or message": classifies documents by content; provides literal and conceptual search; includes a ConceptBase with English, French or German lexicons (Commercial, Windows/Linux)
  • Recommind MindServer: "the industry's most powerful search-powered platform with supporting applications that redefine how enterprises can proactively manage their information paradigm"; uses PLSA (Probabilistic Latent Semantic Analysis) for retrieval and categorization of texts (Commercial)
  • Saplo: a text analysis API with text recommendations, text filtering, text categorization, automatic tagging, automatic related articles, and sentiment analysis. Read more in the text analysis API documentation. (Free to try, with special offers for researchers and universities; web-based)
  • SEASR: tools & frameworks for sharing data and research (including text analysis) in virtual work environments (Free; open source, Windows/Mac/Linux)
  • Stanford POS Tagger [Review]: "piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.  This software is a Java implementation of the log-linear part-of-speech taggers described in...[articles]" (GPL, requires Java)
  • TACT: "a text-analysis and retrieval system for MS-DOS that permits inquiries on text databases in European languages" (Free, MS-DOS)
  • TagCrowd: "a web application for visualizing word frequencies in any user-supplied text by creating what is popularly known as a tag cloud or text cloud...Create your own tag cloud from any text to visualize word frequency." (Free, web-based)
  • TAMS Analyzer: "an open source qualitative package for the analysis of textual themes. It can be used for transcribing digital media and for conducting discourse analysis in the social and cultural sciences." (Free; open source, Mac/Linux)
  • TAPoR Tools: a searchable list of tools available through the Text Analysis Portal for Research that can be used online.  TAPoR is "a gateway to tools for sophisticated analysis and retrieval, along with representative texts for experimentation...manage electronic texts, experiment with online text tools, [and] learn about digital textuality."  The TAPoRware tools are also available separately. (Free, web-based)

  • TAToo (Text Analysis for me Too): "a Flash widget that you can embed in web pages to call basic text analysis tools from the TAPoR project." (Free, web-based)

  • TextGridLab: "TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts"; includes XML Editor, Search Tool, Project Management Tools, and Metadata Annotator (Free, Mac)

  • Textometry (TXM): "helps you to build and analyze tagged and structured corpora"; offers "a full text search engine; a statistics engine;  an import environment;  a scripting engine."  (Open source; Windows/Linux)

  • Textpresso: "a text-mining system for scientific literature. Textpresso's two major elements are (1) access to full text, so that entire articles can be searched, and (2) introduction of categories of biological concepts and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., methods, etc)" (Open source, Linux)
  • TextSTAT - Simple Text Analysis Tool: "a simple programme for the analysis of texts. It reads ASCII/ANSI texts (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus." (Free; cross-platform, via New History Lab)
  • Token-X: "text visualization, analysis, and play tool" (Free, web-based)
  • Vivisimo/Clusty: web search and text clustering engine (see e.g. Shakespeare Searched) (Free, web-based)
  • Visual Text: "integrated development environment for building information extraction systems, natural language processing systems, and text analyzers" (Free for academic use)
  • Voyeur: text analysis suite (Creative Commons, web-based)
  • Whatizit: "a text processing system that allows you to do textmining tasks on text. The tasks come defined by the pipelines in the drop down list of the above window and the text can be pasted in the text area."  Focused on biosciences.  (Free, web-based)
  • WMatrix: "a software tool for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains." (Annual subscription, web-based)
  • Word Hoard: "applies to highly canonical literary texts the insights and techniques of corpus linguistics, that is to say, the empirical and computer-assisted study of large bodies of written texts or transcribed speech. In the WordHoard environment, such texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria" (Free; open source, cross-platform)
  • Wordle: a tool for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes. (Free, web-based)
  • WordSmith: lexical analysis software tools; includes components for building concordances, locating and identifying keywords in a text, and generating word lists from plain text files  (Commercial, PC)
  • Wordstat: computer-aided text analysis: "Whether you need a text mining tool for fast extraction of themes and trends or achieve careful and precise measurement with a state-of-the-art quantitative content analysis method, WordStat provides a unique combination of both approaches in a flexible and easy to use text analysis software." (Commercial)

  • XAIRA: A text analysis and indexing system designed for large scale XML encoded texts including but not limited to TEI-conformant language corpora. (Open source, now has platform-independent PHP interface as well as Windows client)
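
Several entries above mention n-grams and word clusters. As a hedged sketch of what such a count involves (not the method of any listed tool), the following Python snippet extracts contiguous n-grams from a token list and ranks them by frequency. The function names and the sample phrase are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)


if __name__ == "__main__":
    # Hypothetical sample phrase, split on whitespace for simplicity.
    words = "to be or not to be".split()
    print(top_ngrams(words, 2, 3))
```

Raw counts like these are only a starting point; collocation measures usually compare observed counts against what chance would predict, which this sketch leaves out.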

 

Resources:

  • caqdas Networking Project: "We provide practical support, training and information in the use of a range of software programs designed to assist qualitative data analysis."

  • KDNuggets: A text analysis software directory

  • TAPoR Text Analysis Recipes: Straightforward, step-by-step instructions for using text analysis tools to accomplish particular research tasks.

  • WikiTADA: The collaborative website of the Text Analysis Developers Alliance.

 
