After reading Chapter 9 in your textbook, please provide a brief response to the following assessment question.
Based on what you have learned so far over the weeks, what is the important component of text analysis? What is text mining? What are some of the benefits of text mining? Explain and provide examples.
Advanced Analytical Theory
and Methods: Text Analysis
Week 9
Dr. Mensah
Text Analysis
• Sometimes called text analytics
• Refers to the representation, processing, and modeling of textual data
to derive useful insights.
• An important component includes text mining: the process of
discovering relationships and interesting patterns in large text
collections.
• Deals with textual data that is far more complex than structured, numerical data
• A corpus (plural: corpora) is a large collection of texts used for various
purposes in Natural Language Processing (NLP)
Example Corpora in Natural Language
Processing
Major Challenges of Text Analysis
• Major challenge with text analysis is that most of the time the text is
not structured.
• This may include quasi-structured, semi-structured, or unstructured
data.
Example Data Sources and Formats for Text
Analysis
Text Analysis Steps
• Parsing is the process that takes unstructured text and imposes a structure for
further analysis. The unstructured text could be a plain text file, a weblog, an
Extensible Markup Language (XML) file, a HyperText Markup Language (HTML)
file, or a Word document. Parsing deconstructs the provided text and renders it in
a more structured way for the subsequent steps.
• Search and retrieval is the identification of the documents in a corpus that
contain search items such as specific words, phrases, topics, or entities like
people or organizations. These search items are generally called key terms.
Search and retrieval originated from the field of library science and is now used
extensively by web search engines.
• Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest. Text
mining may utilize methods and techniques from various fields of study, such as
statistical analysis, information retrieval, data mining, and natural language
processing.
Part-of-Speech (POS) Tagging, Lemmatization,
and Stemming
• The goal of POS tagging is to build a model whose input is a sentence
and whose output is a tag sequence.
• Both lemmatization and stemming are techniques to reduce the
number of dimensions and reduce inflections or variant forms to the
base form to more accurately measure the number of times each
word appears.
• With the use of a given dictionary, lemmatization finds the correct
dictionary base form of a word. For example, given the sentence:
• obesity causes many problems
• the output of lemmatization would be: obesity cause many problem
Part-of-Speech (POS) Tagging, Lemmatization,
and Stemming
• Different from lemmatization, stemming does not need a dictionary, and it
usually refers to a crude process of stripping affixes based on a set of
heuristics with the hope of correctly achieving the goal to reduce
inflections or variant forms. After the process, words are stripped to
become stems.
• A stem is not necessarily an actual word defined in the natural language,
but it is enough to differentiate itself from the stems of other words. A
well-known rule-based stemming algorithm is Porter’s stemming algorithm.
It defines a set of production rules to iteratively transform words into their
stems.
• For the sentence shown previously:
• obesity causes many problems
• the output of Porter’s stemming algorithm is: obes caus mani problem
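The kind of suffix stripping Porter’s algorithm performs can be sketched with a few illustrative rules. This is a crude approximation tuned to the example sentence, not the full Porter algorithm, which applies its rules in measured, ordered phases.

```python
def crude_stem(word):
    """Strip common suffixes with a few Porter-style rules.

    Illustrative sketch only; the real Porter algorithm has many
    more rules and conditions on the shape of the remaining stem.
    """
    for suffix, replacement in [("ity", ""), ("sses", "ss"),
                                ("ies", "i"), ("es", ""), ("s", "")]:
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
            break
    # Porter rewrites a terminal y as i when the stem contains a vowel
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

print(" ".join(crude_stem(w) for w in "obesity causes many problems".split()))
# → obes caus mani problem
```

Note that the stems (obes, caus, mani) are not dictionary words, which is exactly the point made above: a stem only needs to be distinguishable from the stems of other words.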
A Text Analysis Example
• To further describe the three text analysis steps, consider the
fictitious company ACME, maker of two products: bPhone and
bEbook.
• ACME can monitor the social media buzz using a simple process
based on the three steps
ACME’s Text Analysis Process
• Collect raw text – This corresponds to Phase 1 and Phase 2 of the Data
Analytic Lifecycle. In this step, the Data Science team at ACME monitors
websites for references to specific products.
• Represent text – Convert each review into a suitable document
representation with proper indices, and build a corpus based on these
indexed reviews.
• Compute the usefulness of each word in the reviews using methods such as
TFIDF
• Categorize documents by topics
• Determine sentiments of the reviews. Identify whether the reviews are
positive or negative.
• Review the results and gain greater insights
Web Scraper
• A web scraper is a software program (bot) that systematically browses the World
Wide Web, downloads web pages, extracts useful information, and stores it
somewhere for further study.
• It is nearly impossible to write a one-size-fits-all web scraper.
• To build a web scraper for a specific website, one must study the HTML source
code of its web pages to find patterns before extracting any useful content.
• It is common to customize a web scraper for a specific website.
• The team can then construct the web scraper based on the identified patterns.
• The scraper can use the curl tool [7] to fetch HTML source code given specific
URLs, use XPath [8] and regular expressions to select and extract the data that
match the patterns, and write them into a data store.
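A minimal sketch of the extraction step, using Python’s standard library rather than curl and XPath. The HTML snippet and the class names in the pattern are hypothetical stand-ins for what a team would find by inspecting a real site’s source code.

```python
import re

# Hypothetical review markup; a real scraper would first fetch the page
# (e.g. with curl or urllib.request), then match patterns discovered by
# studying the site's actual HTML source.
html = """
<div class="review"><span class="product">bPhone</span>
<p>Love the new bPhone, the screen is brilliant.</p></div>
<div class="review"><span class="product">bEbook</span>
<p>The bEbook battery life is awful.</p></div>
"""

# The class names and structure here are assumptions for illustration.
pattern = re.compile(
    r'<span class="product">(.*?)</span>\s*<p>(.*?)</p>', re.S)

for product, review in pattern.findall(html):
    print(product, "->", review)
```

In practice the matched records would be written to a data store for the later text analysis steps rather than printed.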
Regular Expressions
• Regular expressions can find words and strings that match patterns in
the text effectively and efficiently.
• The general idea is that once text from the fields of interest is
obtained, regular expressions can help identify if the text is
interesting for the project.
Representing Text
• In this data representation step, raw text is first transformed with text
normalization techniques such as tokenization and case folding. Then it is
represented in a more structured way for analysis.
• Tokenization is the task of separating (also called tokenizing) words from
the body of text. Raw text is converted into collections of tokens after the
tokenization, where each token is generally a word.
• tokenizing based on punctuation marks might not be well suited to certain
scenarios.
• Tokenization is a much more difficult task than one may expect. For
example, should words like state-of-the-art, Wi-Fi, and San Francisco be
considered one token or more? Should words like Résumé, résumé, and
resume all map to the same token?
Case Folding
• Another text normalization technique is called case folding, which
reduces all letters to lowercase (or the opposite if applicable).
• One needs to be cautious applying case folding to tasks such as
information extraction, sentiment analysis, and machine translation.
If implemented incorrectly, case folding may reduce or change the
meaning of the text and create additional noise.
• If case folding must be present, one way to reduce such problems is
to create a lookup table of words not to be case folded.
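Tokenization and case folding with a lookup table of exceptions might be sketched as follows. The token pattern and the do-not-fold entries are illustrative assumptions, not a production tokenizer.

```python
import re

# Words that should keep their case (e.g. acronyms); entries are illustrative.
DO_NOT_FOLD = {"US", "IT"}

def tokenize(text):
    # Split on word characters plus internal hyphens and apostrophes,
    # so "state-of-the-art" stays a single token.
    return re.findall(r"[\w'-]+", text)

def case_fold(tokens):
    return [t if t in DO_NOT_FOLD else t.lower() for t in tokens]

tokens = case_fold(tokenize("The state-of-the-art bPhone ships in the US."))
print(tokens)
# → ['the', 'state-of-the-art', 'bphone', 'ships', 'in', 'the', 'US']
```

Keeping US unfolded avoids conflating the country with the pronoun us, one of the pitfalls noted above.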
• After normalizing the text by tokenization and case folding, it needs to
be represented in a more structured way. A simple yet widely used
approach to represent text is called bag-of-words.
Bag-of-words
• Given a document, bag-of-words represents the document as a set of terms,
ignoring information such as order, context, inferences, and discourse. Each word
is considered a term or token (which is often the smallest unit for the analysis). In
many cases, bag-of-words additionally assumes every term in the document is
independent. The document then becomes a vector with one dimension for
every distinct term in the space, and the terms are unordered. The permutation
D* of a document D contains the same words exactly the same number of times
but in a different order. Therefore, using the bag-of-words representation,
document D and its permutation D* would share the same representation.
• Bag-of-words takes quite a naïve approach, as order plays an important role in
the semantics of text. With bag-of-words, many texts with different meanings are
combined into one form. For example, the texts “a dog bites a man” and “a man
bites a dog” have very different meanings, but they would share the same
representation with bag-of-words.
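The dog/man example can be verified directly: a bag-of-words built from term counts alone cannot distinguish a document from its permutation.

```python
from collections import Counter

def bag_of_words(text):
    # Order, context, and discourse are discarded; only per-term counts remain.
    return Counter(text.lower().split())

d1 = bag_of_words("a dog bites a man")
d2 = bag_of_words("a man bites a dog")
print(d1 == d2)
# → True
```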
Corpus
• It is important not only to create a representation of a document but
also to create a representation of a corpus. As introduced earlier in
the chapter, a corpus is a collection of documents. A corpus could be
so large that it includes all the documents in one or more languages,
or it could be smaller or limited to a specific domain, such as
technology, medicine, or law. For a web search engine, the entire
World Wide Web is the relevant corpus. Most corpora are much
smaller.
Categories of the Brown Corpus
Corpus
• Many corpora focus on specific domains
• Most corpora come with metadata, such as the size of the corpus and the
domains from which the text is extracted. Some corpora (such as the
Brown Corpus) include the information content of every word appearing in
the text.
• Information content (IC) is a metric to denote the importance of a term in a
corpus. The conventional way [19] of measuring the IC of a term is to
combine the knowledge of its hierarchical structure from an ontology with
statistics on its actual usage in text derived from a corpus.
• Terms with higher IC values are considered more important than terms
with lower IC values. For example, the word necklace generally has a higher
IC value than the word jewelry in an English corpus because jewelry is
more general and is likely to appear more often than necklace.
Information Content
• Corpus statistics such as IC can help identify the importance of a term
from the documents being analyzed.
• IC values included in the metadata of a traditional corpus (such as the Brown Corpus), sitting externally as a knowledge base, cannot satisfy the need to analyze dynamically changing, unstructured data from the web.
• Traditional corpora and their IC metadata do not change over time. Any term not existing in the corpus text and any newly invented words would automatically receive a zero IC value.
Term Frequency—Inverse Document
Frequency (TFIDF)
• TFIDF is a measure widely used in information retrieval and text analysis.
Instead of using a traditional corpus as a knowledge base, TFIDF directly
works on top of the fetched documents and treats these documents as the
“corpus.” TFIDF is robust and efficient on dynamic content, because
document changes require only the update of frequency counts.
• Given a term t and a document d = {t1, t2, t3, …, tn} containing n terms, the simplest form of the term frequency of t in d can be defined as the number of times t appears in d:
• tf(t, d) = sum over i = 1 … n of f(t, ti), where f(t, ti) = 1 if t = ti and 0 otherwise
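One common TFIDF formulation (term frequency multiplied by the logarithm of inverse document frequency) can be sketched as follows. The toy corpus is made up, and real systems use various smoothed variants of the IDF term.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """One common formulation: tf(t, d) * log(N / n_t), where N is the
    number of documents and n_t the number of documents containing t.
    Illustrative sketch; smoothed variants are common in practice."""
    tf = Counter(doc)[term]
    n_t = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / n_t) if n_t else 0.0

corpus = [doc.split() for doc in
          ["the bPhone screen is brilliant",
           "the bEbook screen is dim",
           "the battery is awful"]]
# "screen" appears in 2 of 3 documents; "the" appears in all 3,
# so its IDF (and hence its TFIDF) is 0.
print(round(tfidf("screen", corpus[0], corpus), 3))  # → 0.405
print(tfidf("the", corpus[0], corpus))               # → 0.0
```

A term common to every document scores zero, which is how TFIDF downweights uninformative words without an external knowledge base; updating the counts is all that is needed when documents change.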
Frequency Vector
• A term frequency vector can become very high dimensional because the
bag-of-words vector space can grow substantially to include all the words
in English. The high dimensionality makes it difficult to store and parse the
text and contributes to performance issues related to text analysis.
• For the purpose of reducing dimensionality, not all the words from a given
language need to be included in the term frequency vector. In English, for
example, it is common to remove words such as the, a, of, and, to, and
other articles that are not likely to contribute to semantic understanding.
These common words are called stop words
• Another simple yet effective way to reduce dimensionality is to store a
term and its frequency only if the term appears at least once in a
document. Any term not existing in the term frequency vector by default
will have a frequency of 0. Therefore, the previous term frequency vector
would be simplified
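A sparse term frequency vector with stop-word removal can be sketched as follows; the stop word list here is a tiny illustrative subset.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}  # tiny illustrative list

def sparse_tf(tokens):
    # Store a term only if it actually occurs; absent terms default to 0.
    return Counter(t for t in tokens if t not in STOP_WORDS)

vec = sparse_tf("the bPhone screen of the bPhone is brilliant".split())
print(vec["bPhone"], vec["screen"], vec["battery"])
# → 2 1 0
```

A Counter stores only observed terms yet still answers 0 for any absent term, which is exactly the simplification described above.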
Topic Modeling
• Document grouping can be achieved with clustering methods such as k-means clustering [24] or classification methods such as support vector
machines [25], k-nearest neighbors [26], or naïve Bayes [27]. However, a
more feasible and prevalent approach is to use topic modeling. Topic
modeling provides tools to automatically organize, search, understand, and
summarize from vast amounts of information.
• Topic models [28, 29] are statistical models that examine words from a set
of documents, determine the themes over the text, and discover how the
themes are associated or change over time. The process of topic modeling
can be simplified to the following.
• 1. Uncover the hidden topical patterns within a corpus.
• 2. Annotate documents according to these topics.
• 3. Use annotations to organize, search, and summarize texts.
Topic
• A topic is formally defined as a distribution over a fixed vocabulary of
words [29]. Different topics would have different distributions over the
same vocabulary. A topic can be viewed as a cluster of words with related
meanings, and each word has a corresponding weight inside this topic.
• Topic models do not necessarily require prior knowledge of the texts. The
topics can emerge solely based on analyzing the text.
• The simplest topic model is latent Dirichlet allocation (LDA) [29], a
generative probabilistic model of a corpus proposed by David M. Blei and
two other researchers. In generative probabilistic modeling, data is treated
as the result of a generative process that includes hidden variables. LDA
assumes that there is a fixed vocabulary of words, and the number of the
latent topics is predefined and remains constant. LDA assumes that each
latent topic follows a Dirichlet distribution [30] over the vocabulary, and
each document is represented as a random mixture of latent topics.
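The generative process LDA assumes can be sketched with Python’s standard library. The vocabulary and topic distributions below are invented for illustration, and a real LDA implementation would also need an inference step (e.g. Gibbs sampling or variational methods) to recover the latent topics from data.

```python
import random

random.seed(42)

# Two hypothetical topics, each a distribution over the same fixed vocabulary.
topics = {
    "hardware": {"screen": 0.5, "battery": 0.4, "movie": 0.1},
    "media":    {"screen": 0.1, "battery": 0.1, "movie": 0.8},
}

def sample_dirichlet(alpha, k):
    # A Dirichlet draw is k gamma draws normalized to sum to 1.
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words):
    # 1. Draw this document's random mixture over the latent topics.
    mixture = sample_dirichlet(0.5, len(topics))
    names = list(topics)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for each word position, then a word from that topic.
        topic = random.choices(names, weights=mixture)[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=dist.values())[0])
    return words

print(generate_document(6))
```

Each document gets its own topic mixture, while the topics themselves are shared across the corpus, matching the intuition that a document is a random mixture of latent topics.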
The intuitions behind LDA
Distributions of ten topics over nine scientific
documents from the Cora dataset
Topic Models
• Topic models can be used in document modeling, document
classification, and collaborative filtering [29]. Topic models not only
can be applied to textual data, they can also help annotate images.
Just as a document can be considered a collection of topics, images
can be considered a collection of image features.
Determining Sentiments
• Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective
information from texts.
• Intuitively, to conduct sentiment analysis, one can manually construct lists of
words with positive sentiments (such as brilliant, awesome, and spectacular) and
negative sentiments (such as awful, stupid, and hideous). Related work has
pointed out that such an approach can be expected to achieve accuracy around
60% [35], and it is likely to be outperformed by examination of corpus statistics
[43].
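A word-list approach of this kind can be sketched in a few lines. The lexicons here are just the example words above; real word lists are far larger, and the modest coverage is exactly why such classifiers top out around 60% accuracy.

```python
# Hand-built sentiment lexicons using the example words from the text.
POSITIVE = {"brilliant", "awesome", "spectacular"}
NEGATIVE = {"awful", "stupid", "hideous"}

def lexicon_sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("the bPhone screen is brilliant"))
# → positive
```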
• Classification methods such as naïve Bayes as introduced in Chapter 7, maximum
entropy (MaxEnt), and support vector machines (SVM) are often used to extract
corpus statistics for sentiment analysis. Related research has found that these
classifiers can score around 80% accuracy [35, 41, 42] on sentiment analysis over
unstructured data. One or more of such classifiers can be applied to unstructured
data, such as movie reviews or even tweets.
Confusion Matrix
• Confusion Matrix is a specific table layout that allows visualization of
the performance of a model over the testing set. Every row and
column corresponds to a possible class in the dataset. Each cell in the
matrix shows the number of test examples for which the actual class
is the row and the predicted class is the column. Good results
correspond to large numbers down the main diagonal (TP and TN)
and small, ideally zero, off-diagonal elements (FP and FN)
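Tallying a confusion matrix amounts to counting (actual, predicted) pairs over the testing set. The labels below are a made-up example, not the chapter’s testing set.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    # Tally (actual, predicted) pairs; diagonal cells are correct predictions.
    return Counter(zip(actual, predicted))

actual    = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
cm = confusion_counts(actual, predicted)
print(cm[("pos", "pos")], cm[("neg", "neg")],  # TP, TN on the diagonal
      cm[("neg", "pos")], cm[("pos", "neg")])  # FP, FN off the diagonal
# → 2 1 1 1
```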
Confusion Matrix for the Example Testing Set
Precision and Recall
• Precision and recall are two measures commonly used to evaluate
tasks related to text analysis. Definitions of precision and recall are
given in Equations 9-8 and 9-9.
• Precision is defined as the percentage of documents in the results that are relevant. If by entering the keyword bPhone, the search engine returns 100 documents and 70 of them are relevant, the precision of the search engine result is 70/100 = 0.7.
• Recall is the percentage of returned documents among all relevant documents in the corpus. If by entering the keyword bPhone, the search engine returns 100 documents, only 70 of which are relevant, while failing to return 10 additional relevant documents, the recall is 70/(70+10) = 0.875.
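The two bPhone examples above can be computed directly; the helper names are illustrative.

```python
def precision(relevant_returned, total_returned):
    # Fraction of returned documents that are relevant.
    return relevant_returned / total_returned

def recall(relevant_returned, relevant_missed):
    # Fraction of all relevant documents that were returned.
    return relevant_returned / (relevant_returned + relevant_missed)

# The bPhone search example: 100 documents returned, 70 of them relevant,
# and 10 relevant documents that were not returned.
print(precision(70, 100))  # → 0.7
print(recall(70, 10))      # → 0.875
```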
Precision and Recall
• Precision and recall are important concepts, whether the task is about
information retrieval of a search engine or text analysis over a finite
corpus. A good classifier ideally should achieve both precision and
recall close to 1.0. In information retrieval, a perfect precision score of
1.0 means that every result retrieved by a search was relevant (but
says nothing about whether all relevant documents were retrieved),
whereas a perfect recall score of 1.0 means that all relevant
documents were retrieved by the search (but says nothing about how
many irrelevant documents were also retrieved). Both precision and
recall are therefore based on an understanding and measure of
relevance.
Classifiers
• Classifiers determine sentiments solely based on the datasets on
which they are trained. The domain of the datasets and the
characteristics of the features determine what the knowledge
classifiers can learn. For example, lightweight is a positive feature for
reviews on laptops but not necessarily for reviews on wheelbarrows
or textbooks. In addition, the training and the testing sets should
share similar traits for classifiers to perform well. For example,
classifiers trained on movie reviews generally should not be tested on
tweets or blog comments.