Saturday Links

The NEH Office of Digital Humanities is offering grants to sustain short-term activities related to the objectives of the digital humanities community. Applications due May 11th.

Hannah Alpert-Abrams on alt-ac and post-PhD resources and their urgency in the current academic climate.

  • From the above article, a spreadsheet of ‘alt-ac’ individuals who are willing to support and mentor current academics.

Upcoming Workshop: “Locative media, climate change and (im)mobility during the COVID19 pandemic.” Online via King’s College London Department of Digital Humanities. Fri, 29 May 2020. 09:00 – 17:15 BST. See program.

Friday Links

A discussion on going alt-ac from the classics. Especially relevant in today’s job market.

“The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America.” A report on the extracted dataset of visual content from 16.3 million pages of newspapers in the Chronicling America dataset.

Here’s a list of all the university presses, the books and journals of which have gone open access since the pandemic.

Academic Job Market Support Network: A repository of sample application docs for those on the job market.

“Exploring Data Visualization with Flourish.” Presentation by Kristen Mapes and Kate Topham.

Has Your Area Been Violating Quarantine? Google Knows.

Around the beginning of April, Google released the first “Community Mobility Report” for each US state, as well as for every country for which data exists. The report utilizes Google Maps data to determine the change in activity in certain regions over a baseline. For each jurisdiction, Google analyzes the change in movement within certain predetermined zones:

  • Retail & recreation
  • Grocery & pharmacy
  • Parks
  • Transit stations
  • Workplace
  • Residential

For almost every jurisdiction, we can see the predictable drop-off in activity in areas that are most directly affected by the lockdown (Retail & recreation, Transit Stations, and Workplace showing declines) and a corresponding increase in activity in residential areas as more people stay at home.

However, as you search through the data for your region, you may find significant discrepancies in activity in the outlying area ‘Parks.’ In places in which sufficient data has been collected, we can often see enormous differences between states, with some areas showing massive increases in attendance at parks, whereas others show declines. Consider the difference between Arlington County, VA and Bell County, TX.

In Arlington, VA, everyone is staying in their residential communities.
In rural Bell County, Texas, the decline in retail traffic has not been nearly as severe. Park activity is up.

Looking through the data, it strikes me that rural counties are much less likely to witness steep declines in retail areas due to any lockdown policies. Moreover, we can also see that park activity is down in affluent urban neighborhoods, whereas rural areas see wavering park activity or even steep increases.

At first glance, we can think of several explanations:

  • What Google designates as a ‘park’ may differ greatly from one region to another. In urban areas, parks are small, designated spaces in which one is likely to encounter other people. In rural places, parks are wider, less defined spaces offering greater distance between persons. In fact, your home might not be located in a ‘residential’ place at all but rather in what appears to be a ‘park.’
  • Rural areas being more conservative overall, residents might be more skeptical of lockdown policies and instead may consider parks as a safe alternative to staying indoors.
  • Rural residents need to travel through ‘park’ areas to get from one place to another, which may register as ‘park’ activity under Google’s analysis.

Another observation is that these rural areas see much less significant declines in activity toward workplaces and retail areas. Again, this may be attributed to the overall lack of ‘urban’ areas to begin with, so it may be harder to avoid pockets of ‘retail’ space when traveling between one place and another.

It would be interesting to use this data to explore the hypothesis that conservative regions are violating stay-at-home orders. At first glance, we could compare Vermont and West Virginia: Both are generally rural places, but the blue state is seeing larger declines in retail (-40%), while WV is down only 22%, with both areas being up in park activity (although very volatile). Perhaps someone could compare this data set to political opinion polls.
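As a minimal sketch of how such a state-by-state comparison might begin, the following averages one mobility category per region. The column names are assumed to match the published Community Mobility Report CSV and should be checked against the actual file; the rows are made-up placeholders, not real measurements.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows in the shape of the mobility CSV. The values are invented
# for illustration, and the column names are assumptions to verify.
SAMPLE = """sub_region_1,date,retail_and_recreation_percent_change_from_baseline,parks_percent_change_from_baseline
Vermont,2020-04-01,-42,10
Vermont,2020-04-02,-38,30
West Virginia,2020-04-01,-20,15
West Virginia,2020-04-02,-24,5
"""

def mean_change_by_region(csv_text, column):
    """Average one percent-change column per region, skipping blank cells."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[column] != "":
            values[row["sub_region_1"]].append(float(row[column]))
    return {region: mean(vals) for region, vals in values.items()}

retail = mean_change_by_region(
    SAMPLE, "retail_and_recreation_percent_change_from_baseline"
)
for region, change in sorted(retail.items()):
    print(f"{region}: {change:+.1f}% retail vs. baseline")
```

The same function could be pointed at the parks column, and the per-region averages could then be joined to polling data by region name.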

The Hidden Secrets of Wine Reviews

In the fall of 2018, when I was just getting started in the world of text analysis, I signed up for a Cornell course Intro to Data Science where I attempted to get a grasp of the basics of data visualization, regression analysis, and categorization. As part of my final project, I turned to a data set of 130,000 wine reviews from Wine Enthusiast magazine, which was shared on Kaggle a while back. At the time, I only approached this topic as a simple demonstration of the skills I learned in the course. However, over time this project evolved into a different kind of endeavor, revealing strange trends that led me to question some of the assumptions I held about the regularity of aesthetic taste.

Recently, I had the opportunity to share some of this work with the podcast Microform – a project of the Humboldt-Universität zu Berlin. You can listen to the recording here, along with the accompanying visualizations.

Wine Review/Weinempfehlung (engl.)

Consider the following short wine reviews, which you might find posted on little cards in your local wine store. At most, they are a few sentences long and are printed with a ‘score’ on a 100-point scale. Most of the time, these texts are relatively benign, but every once in a while you’ll find a few that really pack a lot of content into their limitations of form:

Ripe tangerine, peach, green apple and vanilla flavors combine with brisk, citrusy acidity in this wine. This wine saw a touch of neutral oak, which seems to have contributed a smoky toast character. It picks up complexities as it warms and airs, so don’t drink this too cold.[1] [Points: 87]

This ’06 Zin is as spicy and fruity as you could desire, with masses of currants, cocoa, orange peel, licorice and pepper. But the tannin-acid balance is great, giving it a wonderful structure. Best now for youthful exuberance.[2] [Points: 87]

The wine is a velvet glove in an iron fist. The smooth surface of ripe fruits and rich blackberry flavors, masks the dense tannins that will allow this very great wine to age for many, many years. The acidity and the rich fruit combine with the fine dusty tannins. The wine will surely not be ready to drink before 2027.[3] [Points: 96]

For six bucks you get a dry, rich, creamy Chardonnay with nice flavors of pineapples, pears, apples and smoky vanilla. This is a great price for a wine of this polished appeal.[4] [Points: 84]

In these examples, we see a wide range of short reviews (approximately 50 words each) covering all types of wines, and which are written by a wide variety of authors at the magazine.

Grouped under the rubric of the ‘review,’ these small forms exemplify an often-overlooked genre of writing that now dominates the world of professional wine tasting. Yet apart from its overt commercialization, the review has subtly reshaped the way the public sees and engages with the subjective experience of taste. In terms of form, the reviews are stringently short in word count, calling to mind the unsurpassable word limit of a Tweet or the punctuating brevity of a news bulletin. Most interesting, however, is the text’s claim to communicate mastery of the wine’s flavor profile in a few words. In doing so, it seeks to grasp the wine’s essential elements and communicate an aspect of experience that is typically resistant to the simplifications of language. As professional sommeliers, the authors are highly trained and experienced in all aspects of wine and supposedly have developed a unique ability to identify distinguishing features of a particular wine that seem almost hidden to the average consumer. In this respect, the goal of the review is precision: capture all and only those subtleties of taste that give the wine its unique value. Waste no space on unnecessary thoughts or indistinct features.

Furthermore, these reviews are all accompanied by a particular point value that is assigned by the reviewer who writes the text. The point system is based on a 100-point scale, with 100 being the perfect wine, although in practice most wines do not receive fewer than 80 points. On the one hand, the advantage of this point system lies in its efficiency in presenting the judgments of critics to the marketplace: If a wine receives a high point value, the winemaker may use this review in marketing the product, and indeed one often sees these short blurbs posted on the shelf of the local wine shop. However, this use of points seems to belie the text’s attempt to capture the particularity of experience, as it gives a common rubric by which wines can be compared, sorted, organized, and ranked independently of the unique descriptors provided in the text. As you can see, the two 87-point wines mentioned earlier are hardly identical in terms of taste. The first review (above) emphasizes the wine’s acidic character; the second highlights a tannin-acid balance with “spicy” elements. Yet when both wines are assigned the same value, these nuances of the tasting experience are lost, and what remains is an abstract degree of quality that is not rooted in any particular empirical feature of the variety.

Nevertheless, while the score may seem to be in contrast to the text, the two are not entirely divorced from one another. After all, these values are not arbitrarily assigned but generated by the authors themselves to justify the opinions presented in the text or otherwise capture the meaning that they wish to express. Moreover, these scores are not based on a single, universally applicable set of criteria, but are loosely generated by the author’s own set of priorities – one author’s idea of the perfect, 100-point wine may be entirely different from that of another. In this sense, the scores are not always reductive simplifications of experience, but may occasionally present a concise representation that goes beyond the particularity of the text.

In taking these factors together, I am interested in looking at this ‘review’ as a short-form genre that offers unique tensions between description and quantification.

The contemporary form of the wine review can be traced back to the work of Robert Parker, founder of The Wine Advocate and widely considered to be the most influential wine critic in the western world. Beginning in the 1970s, Parker standardized the point-based model to counter what he perceived to be elitism and conflicts of interest in the wine industry. In his view, wine reviews lacked integrity and authenticity, making judgments based on reputation and other factors that are extraneous to the experience of the wine itself. In resisting the Continental decorum that glorified established wineries, Parker embodied a particular American brand of brute straightforwardness and individualism that was both foreign to the European wine community and yet instrumental to popularizing wine culture in the United States. His conception of the wine critic is one of intense courage in defending an independent viewpoint, and as he argues, the only opinions that truly matter are those of the consumers themselves. [Parker, Robert. Parker’s Wine Buyer’s Guide. 6th ed. New York: Simon & Schuster, 2002. Print.]

In this light, Parker sought to develop a system of writing in which wines could be compared on their empirical qualities alone and contextualized along common points of reference.

Yet this emphasis on democratizing wine has hardly diminished the significance of the reviewer in the industry as a whole. The 100-point system has brought about a ‘Parkerization’ of wine, a state of affairs in which newer, lesser-known wines can be re-classified alongside older reputable brands. This may give critics like Parker enormous power in deciding the fate of any new wine that reaches the market. In fact, some recent statistical studies have suggested that an influential critic may have a significant impact on the price of a wine, and indeed there are a number of cases in which a winemaker has adjusted prices in response to a positive or negative review written by Mr. Parker alone. [Hay, Colin. “Globalisation and the Institutional Re-Embedding of Markets: The Political Economy of Price Formation in the Bordeaux En Primeur Market.” New Political Economy 12.2 (2007): 185–209. Web.] Given this increase in authority of influential wine critics, it seems evident that the numerical value of a review contributes some degree of persuasiveness and brings together the variety of textual responses into a unified system of comparison.

In recent years, the emergence of data science has led many researchers to explore questions of wine quality and prediction from a statistical perspective. In one popular data set, researchers gathered the chemical properties of thousands of wines (e.g. pH levels, sulphates, chlorides, etc.) along with ‘quality’ labels set by humans. [Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science. http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/] Users would then apply machine learning techniques to generate models to predict the ‘quality’ of the wine on the basis of its chemical properties. For example, one user modeled wine data and determined the chemical properties that had the greatest influence on the human’s judgment: In this case, the ‘residual sugar’ was most positively correlated with quality, whereas ‘density’ was most negatively correlated with that outcome. As self-trained data scientists have made use of this mass data, these models have become increasingly successful in predicting the human’s qualitative judgment on the basis of those empirical inputs. This therefore raises the possibility of a paradigm shift in the way we assess matters of taste. Unlike some earlier paradigms of aesthetic criticism, which emphasize the skill and education of the expert in uncovering underlying layers of the work, the new paradigm sees hidden regularities when one observes the accumulation of all aesthetic judgments in large numbers.

My own interest in this area of research began with the posting of the aforementioned wine review data set in which the user raised the possibility of predicting a wine on the basis of the words used in its description. Whereas most other data sets dealt primarily with numerical data, here was an example in which information extracted from texts could be combined with those numerical features to predict something entirely non-numerical. Moreover, looking at this question from the perspective of a literature scholar, I was intrigued by the possibility that statistical methods could demonstrate a correlation between our descriptions of our mental states (as formulated in the review) and the empirical reality of the wine’s identification. To me, this kind of study was hardly a trivial Fingerübung, but rather a significant development in the way we read small forms, giving us insight into new forms of knowledge that arise through a ‘distant reading’ perspective.

In examining the wine data set, a similar phenomenon emerges in which the contingency of experience in the review gives way toward lawfulness and even predictability at a large scale. When we read one of the reviews, it seems that the review embraces the particular experience of the wine as a singularity. However, when we compare these reviews on a larger scale, a number of interesting observations emerge. For example, consider the way the length of the reviews correlates with the score assigned by the critic:

First, if we collect the lengths of each review (some being shorter than others), then we see a highly linear relationship between that feature and points. In other words, critics tend to write more words about wines that they like than those that they don’t. There could be a number of explanations for this finding, which could lead us back toward a close reading. On the one hand, critics could simply enjoy writing about delicious wines and are more willing to commit their time toward those reviews. Alternatively, we can also see a tendency of the critic’s language to mark differentiations in positive experiences that they otherwise would not with negative ones. A highly rated wine might be characterized by a number of analogous flavors that the critic wants to tease out from the experience, whereas the negative wine might contain a single flavor that overshadows the others [One 80-point review reads: “After the sulfur smell blows off, this wine remains tough in acids and tannins.”].
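The relationship between review length and points can be checked with a plain Pearson correlation. The sketch below is illustrative only: the three (description, points) pairs are invented stand-ins rather than rows from the data set, and word count is assumed as the measure of length.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def word_count(text):
    """Length of a review in whitespace-separated words (an assumed measure)."""
    return len(text.split())

# Hypothetical (description, points) pairs, invented for illustration.
reviews = [
    ("Thin and sour.", 80),
    ("Fresh citrus with a clean finish.", 85),
    ("Layers of ripe blackberry, cocoa and spice over firm, dusty tannins.", 92),
]

lengths = [word_count(text) for text, _ in reviews]
points = [score for _, score in reviews]
r = pearson(lengths, points)
print(f"length vs. points: r = {r:.3f}")
```

On the full data set the coefficient would of course be far lower than on three hand-picked examples; the point is only to show the calculation.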

Another tool to measure the qualities of texts is ‘Textblob,’ [https://textblob.readthedocs.io/en/dev/ . The ‘sentiment’ scores range from -1 (most negative) to 1 (most positive) with 0 being neutral sentiment. The ‘subjectivity’ scores range from 0 (objective language) to 1 (entirely subjective language).] a natural language processing toolkit that can measure the sentiment and subjectivity of any text. The tool is based on a large corpus of words that are tagged with numerical values corresponding to their level of sentiment or subjectivity, as the case may be. For example, a word like ‘sunflower’ may be labelled with positive sentiment, while ‘pretty’ would indicate a high degree of subjectivity. This tool then takes a given text and assigns it values by comparing the words in that text against those in the corpus on which it is trained. In the case of this data set, I used this tool to generate sentiment and subjectivity scores for each of the 130,000 reviews, which would offer a text-based numerical feature that could be used to compare against other numerical features, such as price or score.
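To make the idea of lexicon-based scoring concrete, here is a deliberately simplified sketch. It is not Textblob’s implementation, which also handles things like negation and intensifiers; the lexicon entries and their values are invented for illustration.

```python
import re

# Toy lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
# Entries and values are invented; a real lexicon has thousands of words.
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "wonderful": (1.0, 1.0),
    "tough": (-0.4, 0.8),
    "smoky": (0.0, 0.3),
}

def score(text, lexicon=TOY_LEXICON):
    """Average the polarity and subjectivity of the words found in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return 0.0, 0.0  # no known words: treat as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

polarity, subjectivity = score("A great, wonderful wine with a smoky finish.")
print(f"polarity={polarity:.2f} subjectivity={subjectivity:.2f}")
```

With the library itself installed, the equivalent call is `TextBlob(text).sentiment`, which returns a polarity and a subjectivity value on the scales described above.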

The sentiment score of texts against point value.
…However, it seems that writers are not especially more likely to assign higher points to wines that they describe in subjective terms.

When we plot Textblob-generated data, we can see a certain correlation emerge between the computer-generated characteristics of the text and the numerical values assigned by the human. For example, as the sentiment of the texts increases from the most negative to the most positive, the reviewer’s score is also likely to increase from the average 80-range up to the more prestigious 90s. A similar trend is also evident in the measurements of subjectivity. The more a text registers as highly subjective in its language, the higher the average point value. Therefore, to some extent, the machine learning model can ‘read’ the texts in similar ways as humans do, meaning that the apparent sentiment of the texts corresponds to some degree to the author’s own assessment of that wine. To be sure, there is a large amount of variation, but as with any statistical analysis, there is a regularity that emerges on a large scale.
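One simple way to look for this kind of trend is to group reviews into sentiment bins and average the points in each bin. The sketch below assumes a bin width of 0.5 and uses only the four sample reviews quoted earlier (sentiment and points taken from their footnotes), which are far too few to show any regularity on their own.

```python
from collections import defaultdict

def mean_points_by_bin(records, width=0.5):
    """Average points per sentiment bin.

    records: iterable of (sentiment, points) with sentiment in [-1, 1].
    Returns {bin_lower_edge: mean_points}, ordered from lowest bin upward.
    """
    last_bin = int(2 / width) - 1
    bins = defaultdict(list)
    for sentiment, points in records:
        # A sentiment of exactly 1.0 is placed in the top bin.
        index = min(int((sentiment + 1) / width), last_bin)
        bins[index].append(points)
    return {-1 + i * width: sum(v) / len(v) for i, v in sorted(bins.items())}

# (sentiment, points) for the four reviews quoted at the start of the post.
sample = [(-0.4, 87), (0.933, 87), (0.34, 96), (0.427, 84)]
binned = mean_points_by_bin(sample)
for lower_edge, avg in binned.items():
    print(f"sentiment >= {lower_edge:+.1f}: mean points {avg:.1f}")
```

Run over all 130,000 reviews, the same grouping would give one average per bin that can be plotted against sentiment.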

Wines with higher point values tend to be more expensive than those that are poorly reviewed.
…However, Textblob does not notice a corresponding correlation between the sentiment of the texts and the price of the wines.

While these computational tools may succeed in ‘reading’ the sentiments of the writers, this method does not always produce intuitive results. If we compare the correlations associated with price, for example, we appear to confront a paradoxical distinction between the computer’s reading and the actual human-generated score. Here it seems critics tend to assign higher points to wines that have a higher price, despite these reviews being blind tastings that do not reveal the product’s market value. If this is accurate, then it seems the critics may be exposed to certain biases leading them to rank pricier wines more highly than others. However, the ‘sentiment’ score shows little correlation with price, despite the sentiment and points showing a positive correlation to each other. To me, this result shows an interesting gap between the sentiment of the text and the score. While the critic may be willing to write favorably about a cheap wine, they may ultimately assign a score that corresponds more closely to its reputation or market value.

The distribution of wine prices by country of origin. The countries are ordered left to right from greater to lesser number of varieties.

Thesis

When accumulated into large numbers, certain forms of small texts often show similar kinds of statistical regularity as other numerical points of information. This regularity seems to hold true not just for matters of empirical reality (the number of births, deaths, crimes, etc.), but also for matters of aesthetic taste. In what I’ve described above, I have given only the most surprising regularities and trends that occur for writing about wines. However, there is no reason to assume that this law of large numbers is limited only to commercial industries; it could perhaps occur in many other contexts of short-form writing, including literary ones. In this data-driven mode of distant reading, it is increasingly essential for the literary scholar to understand and interpret ‘big data’ of textual information and use those insights to return to the close reading with new questions and hypotheses.


[1] #61995. “Epiphany 2011 Camp Four Vineyard Grenache Blanc (Santa Barbara County).” Sentiment: -0.4, Subjectivity: 0.65.

[2] #63202. “Courtney Benham 2006 Zinfandel (Dry Creek Valley).” Sentiment: 0.933, Subjectivity: 0.683.

[3] #15840. “Château Pétrus 2014 Pomerol.” Sentiment: 0.34, Subjectivity: 0.619 ($2500)

[4] #61694. “Leaping Horse 2008 Chardonnay (California).” Sentiment: 0.427, Subjectivity: 0.775 ($6)

Topic Modeling the Magazin zur Erfahrungsseelenkunde

Overview

In a recent project, developed in cooperation with the Cornell University Digital Humanities Fellowship, I have created a topic model of the eighteenth-century journal Das Magazin zur Erfahrungsseelenkunde (1783-1793). This project represents my first attempt at text analysis using MALLET and constitutes the first item in a growing portfolio of work in digital humanities. In what follows, I will describe each of the key visualizations below and provide some key steps taken in processing this data.


Historical Background

Karl Philipp Moritz (1756-1793) was an author, journalist, and philosopher who has in recent years gained increased relevance as a transitional figure between the Enlightenment and Romantic periods. In the early 1780s, Moritz became fascinated with the emerging intellectual study of the mind, known as ‘empirical psychology,’ and recognized the importance of carefully documenting and observing cases of psychic deviance from the norms of average life. Thus, over the following ten years, Moritz edited the journal GNOTHI SAUTON oder das Magazin zur Erfahrungsseelenkunde als ein Lesebuch für Gelehrte und Ungelehrte (GNOTHI SAUTON [Know Thyself] or the Magazine for Empirical Psychology as a Reading Book for the Educated and Uneducated). Instead of limiting the scope of this journal to the dominant intellectual figures of his time, Moritz opened the range of authorship to any authoritative figure who could narrate direct experiences of the abnormal, admonishing his readers to stick only to the facts themselves and to refrain from indulging in any moral nonsense. This proposal seems to have been accepted with great enthusiasm, and over the coming years, the journal produced a diverse number of cases, from melancholia, criminality, and violence to speculations on the inner workings of the mind.

For a more complete discussion of Moritz, I recommend reading the digital edition of the journal, provided by Sheila Dickson and Christof Wingertszahn. In this project, my aim is to seek out trends in the wide, unorganized collection of text data that comprises the Magazin (hereafter MzE). In particular, I am interested in questions such as the following:

  • What are the commonly recurring linguistic patterns in the corpus?
  • Which authors demonstrate a narrow range of specialization over others?
  • How did author interests develop over time?


Topic Interpretation

After pre-processing the text data into segmented files, I then used MALLET to generate 25 topics within the corpus. These topics, shown above, consist of the top fifty words that occur in similar contexts throughout the journal. To better understand these results, I then needed to generate labels that might capture in a single term the focus of those words. Sometimes this would be relatively easy, especially for topics that were confined to a narrow section of the work. However, in most cases, the topic represents a very general approximation of something that arises throughout the entire work and is not easy to pin down to a single idea. In this sense, the topics should not be understood as areas of discourse, but rather as rhetorical methods that might occur in some similar contexts. For example, topic 23 ‘heart’ might not seem so instructive, but it does reveal a recurring language about the ‘heart’ and gives clues that might point the close reader toward a direction of analysis. Perhaps the term ‘heart’ has some hidden significance and this result might provide statistical grounds for an investigation of cases in which those terms occur.
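For readers unfamiliar with MALLET, a run with these settings (25 topics, fifty keywords each) might be assembled as follows. This is a sketch rather than the project’s actual invocation: the directory and file names are hypothetical, and the path to the `mallet` launcher depends on the local installation.

```python
def mallet_commands(corpus_dir, work_dir, num_topics=25, top_words=50,
                    mallet="bin/mallet"):
    """Build the two MALLET command lines: corpus import, then topic training.

    All paths are hypothetical placeholders; adjust them to the local setup.
    """
    corpus_file = f"{work_dir}/corpus.mallet"
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,        # one text file per segment
        "--output", corpus_file,
        "--keep-sequence",            # required for topic modeling
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--num-top-words", str(top_words),
        "--output-topic-keys", f"{work_dir}/topic_keys.txt",
        "--output-doc-topics", f"{work_dir}/doc_topics.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("segments", "out")
for cmd in (import_cmd, train_cmd):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once MALLET is installed. The topic-keys file then holds the keyword lists, and the doc-topics file holds the per-segment probabilities.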

Topic Probability

In this central visualization, the top 50 segments of each topic are plotted according to their probability. In other words, if a segment is plotted in topic 23 with a 0.30 probability, this means that the model attributes roughly 30% of that segment’s words to that topic. Additionally, when the user hovers the mouse over each segment, one can see the topic label, keywords, and body text of that segment, as well as its location in the corpus. At first glance, we can see the following basic trends in the data:

  • The model does not produce topics of equal distribution in the corpus. Some topics have a few segments that seem to play a leading role in defining the topic (e.g. 15), which helps to indicate certain texts that stand out in their uniqueness.
  • Segments clustered near the higher end of the spectrum suggest a high degree of overlap with the keywords of the topic. In many cases, these may all be segments of a single longer article, all of which contain language that distinguishes the article from others.
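Selecting the top segments for a topic amounts to sorting the per-segment topic probabilities. A minimal sketch, assuming the probabilities have already been read into a dictionary; the segment names and values here are hypothetical:

```python
def top_segments(doc_topics, topic, n=50):
    """Return the n segments with the highest probability for one topic.

    doc_topics: {segment_id: [probability of topic 0, topic 1, ...]}
    """
    ranked = sorted(doc_topics.items(),
                    key=lambda item: item[1][topic],
                    reverse=True)
    return [(segment, probs[topic]) for segment, probs in ranked[:n]]

# Hypothetical three-topic distributions for three segments.
doc_topics = {
    "seg_a": [0.7, 0.2, 0.1],
    "seg_b": [0.1, 0.3, 0.6],
    "seg_c": [0.3, 0.5, 0.2],
}
best = top_segments(doc_topics, topic=0, n=2)
print(best)
```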

Full Database

The third page contains an index of each segment according to volume, book, and article. The bars corresponding to each segment comprise the probabilities of each topic occurring in that segment. With this information, one can see how a given article changes in topic from beginning to end. For example, the article “Sprache in Psychologischer Rücksicht” in vol. 1, book 1, contains segments with ~80% probability of topic 3 (Philosophy of Language). When the user hovers over one of these segments, the pop-up shows the top three topics of that segment and the sample text. Unlike the previous visualizations, each word of the sample text is tagged with a number corresponding to the topic to which it is most closely associated.

While this visualization may simply function as an index of the corpus, it is also possible to see transitions within a single article, indicating key turning points. For example, see “Etwas aus Robert G…s Lebensgeschichte” in Vol 1, Book 3. Approximately halfway through the article, the dominant topic shifts from ‘heart’ (comprising the various expressions of emotion) toward ‘duration’ (expressions indicating turn of events, such as ‘day,’ ‘long,’ ‘night,’ etc.). In this regard, I find this chart much more instructive about the corpus, and it may prove more useful for a researcher aiming to select articles on the basis of certain topical criteria.
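Turning points of this kind can also be located programmatically by tracking the dominant topic from one segment to the next. The following is a sketch under the assumption that an article’s segments are available in reading order as lists of topic probabilities; the numbers are hypothetical.

```python
def dominant_topic(probs):
    """Index of the topic with the highest probability in one segment."""
    return max(range(len(probs)), key=probs.__getitem__)

def topic_shifts(segments):
    """Find where the dominant topic changes between consecutive segments.

    segments: list of per-segment topic distributions, in reading order.
    Returns a list of (segment_index, previous_topic, new_topic).
    """
    shifts = []
    previous = None
    for index, probs in enumerate(segments):
        current = dominant_topic(probs)
        if previous is not None and current != previous:
            shifts.append((index, previous, current))
        previous = current
    return shifts

# Hypothetical two-topic article that changes focus halfway through.
article = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]
print(topic_shifts(article))
```

A single noisy segment would register as two shifts, so in practice one might smooth the sequence or require the new topic to persist for several segments.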

Volume-Topic Attribution

Following the logic of the previous chart, this image illustrates the breakdown of topic probabilities by volume. A close examination of this chart lends credibility to the intuition that this journal increases in theoretical reflection near the end of its life. This can be seen in the increasing share of probability given to topic 12, which contains words related to logical inference (e.g. ‘explain,’ ‘object,’ ‘form’), and the decreasing share attributed to speculative topics, such as topic 2, affects of the soul (‘imagination,’ ‘image,’ ‘impression’), or topic 23, heart (‘heart,’ ‘death,’ ‘unhappiness’).

In some respects, this image gets closer to the proposal that I originally intended for this project, which would track the occurrence of key terms over a ten-year period. While this does not single out any particular term, it is possible to generate some hypotheses about the text, which might be useful for further investigation.
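The per-volume breakdown described in this section can be approximated by averaging the topic distributions of all segments in each volume. A minimal sketch with hypothetical values; an unweighted mean is assumed, although weighting by segment length would be a reasonable alternative.

```python
from collections import defaultdict

def topic_share_by_volume(segments):
    """Average topic distribution per volume.

    segments: iterable of (volume, [probability per topic]).
    Returns {volume: [mean probability per topic]}.
    """
    grouped = defaultdict(list)
    for volume, probs in segments:
        grouped[volume].append(probs)
    return {
        volume: [sum(column) / len(rows) for column in zip(*rows)]
        for volume, rows in sorted(grouped.items())
    }

# Hypothetical two-topic distributions for segments in volumes 1 and 10.
segments = [
    (1, [0.25, 0.75]),
    (1, [0.75, 0.25]),
    (10, [0.875, 0.125]),
]
shares = topic_share_by_volume(segments)
print(shares)
```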

Author-Topic Attribution

In a similar vein as the previous image, this shows the breakdown of topic probabilities for each author of the Magazin (Note that there are still some repeated entries that should be accounted for). The most common writers are Karl Philipp Moritz, Carl Friedrich Pockels, Salomon Maimon, and Marcus Herz. In the future, I would like to continue developing this image to provide improved functionality. Here, it might be helpful to implement functions that allow comparisons relative to the number of articles that each author wrote. That way, it would be easier to infer the authors who made the greatest contributions to the development of a topic.

Topic Author Count

In this final visualization, each topic is measured according to the number of authors associated with this topic. Some topics appear within a small number of authors (8: the Absolute) whereas others exist among a much wider range of writers (11: Military service). Note the occasionally inverse measurements in comparison with the chart Topic Probability. As shown in topics 2 and 8 (Affects of the Soul and the Absolute), some topics contain articles with a high probability of connection and a corresponding small number of authors. Accordingly, topics 10 (mental sickness) and 11 (military service), which both entail the largest number of authors, have a lower range of article probabilities. Thus, this seems to suggest that there are some topics that appear as “catch-alls,” i.e. they convey all those articles that are not easily determined by another, narrower topic, and are therefore defined by the most general of language. This contrasts with those articles that stand out in their uniqueness and are only conveyed by a minority of contributors.
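Counting the authors associated with a topic requires some rule for what counts as ‘associated.’ The sketch below assumes one possible rule, a minimum probability threshold of 0.2 in at least one of the author’s segments; both the threshold and the sample values are hypothetical and are not taken from the project.

```python
from collections import defaultdict

def authors_per_topic(segments, threshold=0.2):
    """Count distinct authors linked to each topic.

    segments: iterable of (author, [probability per topic]).
    An author counts toward a topic if any of their segments reaches the
    threshold for it (an assumed rule, chosen for illustration).
    """
    authors = defaultdict(set)
    for author, probs in segments:
        for topic, probability in enumerate(probs):
            if probability >= threshold:
                authors[topic].add(author)
    return {topic: len(names) for topic, names in sorted(authors.items())}

# Hypothetical two-topic distributions.
segments = [
    ("Moritz", [0.9, 0.1]),
    ("Pockels", [0.6, 0.4]),
    ("Maimon", [0.95, 0.05]),
]
counts = authors_per_topic(segments)
print(counts)
```

Here topic 0 behaves like a ‘catch-all’ shared by all three authors, while topic 1 is carried by a single contributor.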

Conclusions and Future Directions

The production and labeling of topics can often be an intriguing process, revealing curious points of interpretation and unique possibilities of further research. However, as with many computational approaches to literature, these methods should be taken with due skepticism and an avoidance of quick generalization. Ultimately, the ‘topics’ I have proposed here are highly dependent upon the initial conditions by which these outputs were generated. If any of these parameters were slightly different, the list of topics could be entirely different, and might suggest an entirely different outcome for the project.

Despite these limitations, I believe this project was successful to the extent that it identified key turning points in the corpus that might otherwise not be easily found. As demonstrated in the images of the Full Database and Topic Probability, it is possible to identify longer segments that transition between topics, as well as texts that offer a more distinct focus than others. Moreover, I am encouraged by the success of measuring authors by their topic compositions. Going forward, it seems possible that one could account for the authorship of a large number of anonymous reports in the Magazin, some of which could be attributed to one of the several major known contributors.

For a more complete understanding of the processes behind this project, I encourage you to visit my github page, where you will find all the relevant code involved in developing these results. Generally speaking, the project consists of two parts: Scraping and Processing. First, in scraping the data, I used the beautifulsoup Python module to collect the text data from the site linked above. This text is then tokenized and stemmed before finally being segmented into text files of roughly equal length. Then, in the processing section, you will find the code responsible for cleaning the text and tagging each word according to its German part of speech. From here, the data should be processed through MALLET, which generates the list of topic keywords and other important tables, such as the relative weight of each word in the model.
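As a rough illustration of the tokenizing and segmenting step, the following splits a text into chunks of roughly equal length. It is a sketch and not the code from the repository: the chunk size and the rule for merging a short final chunk are assumptions, and stemming is omitted.

```python
import re

def tokenize(text):
    """Lowercase word tokens; \\w also matches German umlauts and ß."""
    return re.findall(r"\w+", text.lower())

def segment(tokens, size):
    """Split tokens into consecutive chunks of `size` tokens.

    A final chunk shorter than half of `size` is merged into the one
    before it, so that segments stay roughly equal in length.
    """
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if len(chunks) > 1 and len(chunks[-1]) < size // 2:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks

tokens = tokenize("Erkenne dich selbst! Über die Sprache in psychologischer Rücksicht.")
for number, chunk in enumerate(segment(tokens, size=4), start=1):
    print(number, " ".join(chunk))
```

Each chunk would then be written to its own text file so that MALLET treats it as a separate document.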

Finally, I would be happy to answer any questions that you may have about this work. Also, I welcome any constructive feedback that any experienced contributors may have in advancing this project. Thank you.