HIS 218 Data Mining and Text Mining

Well, now we are getting into some genuinely complicated digital tools. Data mining and text mining are both fairly new fields of investigation, and historians have so far done very little with these tools. Let's start with some definitions.

The Wikipedia entry on data mining is always a good place to start, and from it I've come up with a simplified definition: the "process of discovering patterns in large sets of numerical data." Going a bit further, we might say, "the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data." The wiki entry has a great list of tools (both paid and free) and references. Also take a look at Doug Alexander, Data Mining. From that, you might conclude that this goes hand in hand with data visualization, and you would be right.

Text mining is similar to data mining, except that the source being analyzed is text and not numerical data. The computer analysis helps to "find relationships and patterns in a set of textual data." The Wikipedia entry on text mining is good, and there is an older definition that is still useful: Marti Hearst, 2003, What Is Text Mining?

Both data mining and text mining require sophisticated software algorithms that are beyond what we can do in this course.

Beyond those simple definitions, go next to:

The National Centre for Text Mining (NaCTeM)
Ted Underwood, 2012, Where to Start with Text Mining
See also Dan Cohen, 2012, A Conversation with Data: Prospecting Victorian Words and Ideas
Jonathan Hagood, 2012, A Brief Introduction to Data Mining Projects in the Humanities

For data mining, you must start with a data set. We have already covered some data sets in the unit on data visualizations, but I also found:

Datasets for Data Mining
University of Pittsburgh, World-Historical Dataverse, External Data Sets
BYU Corpora
Of course, one of the biggest data sets available is the U.S. census.

Also to explore:

Google Books Ngram Viewer
Wordle (create word clouds). A word cloud can be very useful for quickly seeing which words are most prominent in a sample. For example, you could create a word cloud based on the UN Charter and see which words appear most frequently, which will help you interpret the document.
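
For the curious, here is a rough idea of what a word cloud tool is doing behind the scenes: it counts how often each word appears and then draws the frequent words larger. The short Python sketch below does only the counting step. It is an illustration, not how Wordle itself is written. The function name word_frequencies, the tiny stop-word list, and the sample sentence are all made up for this example; the sample is placeholder text, not a quotation from the UN Charter.

```python
import re
from collections import Counter

# A very small, hand-picked list of common words to ignore.
# This list is an assumption for the example; real tools use much longer ones.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "for", "is", "that", "we"}


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent words in text, skipping stop words."""
    # Lowercase everything and pull out runs of letters (and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Placeholder text invented for this example, not taken from a real document.
sample = (
    "Peace and security for the nations. "
    "The nations seek peace, and peace requires law."
)

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

To try this on a real document, you would paste its full text into sample (or read it from a file) and raise top_n. The words at the top of the list are the ones a word cloud would draw largest.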

Notes on Data Mining and Text Mining