
How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

In the new issue of the journal Computational Communication Research, Dr. Gregor Wiedemann and three co-authors examine how topic modeling can be simplified. Topic modeling is an increasingly popular method for analyzing texts in the social sciences, but for very large text collections the computation of topic models can quickly become prohibitively time-consuming. The article therefore addresses the question of whether and how topic models can be computed reliably, and without loss of quality, on smaller samples that stand in for the full corpus.
 
You can download the article here (PDF)

Abstract
Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires the calculation of multiple models, which can become infeasibly costly in terms of time and computing resources. In order to circumvent this problem, we test and propose a strategy introducing two easy-to-implement modifications to the modeling process: Instead of modeling the full corpus and the whole vocabulary, we (1) use random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and Tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models as compared to fully modeled corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies to accelerate model specification. Sample-based topic models closely resemble corpus-based models, if the sample size is large enough (usually >10%). Also, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling leads to increased performance without impairing model quality.
 
Maier, D.; Niekler, A.; Wiedemann, G.; Stoltenberg, D. (2020): How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models. In: Computational Communication Research, 2(2), pp. 139–152. https://doi.org/10.5117/CCR2020.2.001.MAIE
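The strategy itself is straightforward to sketch. The following minimal Python example (using gensim purely for illustration; the toy corpus, the 10% sample rate, and the pruning thresholds no_below/no_above are assumptions, not the settings used in the paper) shows how random document sampling and vocabulary pruning can be combined before fitting an LDA topic model:

    import random
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy stand-in for a large tokenized corpus (one list of tokens per document).
    documents = [
        ["climate", "policy", "energy", "climate"],
        ["election", "party", "vote", "campaign"],
        ["energy", "price", "market", "policy"],
        ["vote", "turnout", "election", "party"],
    ] * 50  # repeated to mimic a larger collection

    # (1) Draw a random document sample, here 10% of the corpus.
    sample_size = max(1, int(0.10 * len(documents)))
    sample = random.sample(documents, sample_size)

    # (2) Prune the vocabulary: drop very rare and very frequent terms.
    dictionary = Dictionary(sample)
    dictionary.filter_extremes(no_below=2, no_above=0.8)
    bow = [dictionary.doc2bow(doc) for doc in sample]

    # Fit an LDA topic model on the sampled, pruned data.
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=4,
                   passes=10, random_state=42)
    print(lda.print_topics())

The same pattern carries over to other toolkits (e.g. scikit-learn's CountVectorizer with min_df/max_df before LatentDirichletAllocation); only the sampling step and the pruning thresholds are essential to the strategy.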

About this publication

Year of publication: 2020
