In this assignment, you’ll need to use the following dataset:
text_train.json: This file contains a list of documents. It is used for training models.
text_test.json: This file contains a list of documents and the labels of each document. It is used for testing performance. The file has the format shown below. Note that each document has a list of labels.
Text                                      | Labels
faa issues fire warning for lithium …     | [T1, T3]
rescuers pull from flooded coal mine …    | [T1]
…                                         | …
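As a rough illustration of reading this format, the sketch below assumes each record in text_test.json is a two-element [text, labels] pair. The actual JSON layout is not spelled out here, so that layout and the helper names split_records and load_test_file are assumptions.

```python
import json


def split_records(records):
    """Split [text, labels] pairs into parallel lists of texts and label lists."""
    texts = [text for text, _ in records]
    label_lists = [labels for _, labels in records]
    return texts, label_lists


def load_test_file(path):
    """Read a JSON file of [text, labels] pairs (assumed layout) from disk."""
    with open(path, encoding="utf-8") as f:
        return split_records(json.load(f))


# Inline stand-in for a couple of records; not real data from the file.
sample = [["first document text", ["T1", "T3"]], ["second document text", ["T1"]]]
texts, label_lists = split_records(sample)
print(label_lists)
```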
Q1: K-Means Clustering
Define a function cluster_kmean() as follows:
Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json
Uses KMeans to cluster all documents in both train_file and test_file into 3 clusters by cosine similarity. Note: combine the documents from these two files and train a single clustering model on the combined documents.
Tests the clustering model performance using test_file:
Use only the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.
Apply the majority vote rule to map the clusters to the labels in test_file, i.e., T1, T2, T3
Calculate precision/recall/f-score for each label
Check the centroids/samples in each cluster to interpret it, and give it a meaningful name (instead of T1, T2, T3).
This function has no return value. Print out the precision/recall/f-score. Write down the meaningful cluster names in a document. In the same document, also include one sample document from train_file for each cluster.
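One way to get K-Means behavior under cosine similarity is to scale every document vector to unit length and assign each document to the centroid with the largest dot product (often called spherical k-means). The following is a minimal pure-Python sketch of that idea, not the required implementation. It assumes the documents have already been turned into numeric vectors (e.g. TF-IDF weights), and the names normalize, dot, and cosine_kmeans are hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(vectors, k, n_iter=20, seed=0):
    """Cluster vectors into k groups by cosine similarity to unit-length centroids."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    # Initialize centroids from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its most similar centroid.
        assignments = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        # Update step: average the members, then rescale to unit length.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignments, centroids


# Toy vectors standing in for document features (made up for illustration).
toy = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
labels, centers = cosine_kmeans(toy, k=3)
print(labels)
```

Because the result depends on the random initialization, it is common to run several restarts and keep the best one. Library clusterers that accept a distance function, such as NLTK's KMeansClusterer with its cosine_distance, handle this through a repeats setting.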
Q2: LDA Clustering
Define a function cluster_lda() as follows:
Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json
Uses LDA to train a topic model with only the documents in train_file and the number of topics K = 3
Predicts the topic distribution of each document in test_file, and selects only the top topic (i.e. the topic with the highest probability)
Evaluates the topic model performance using the topic predictions for the documents in test_file:
Use the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.
Apply the majority vote rule to map the topics to the labels in test_file, i.e., T1, T2, T3
Calculate precision/recall/f-score for each label
Based on the word distribution of each topic, give the topic a meaningful name (instead of T1, T2, T3).
This function has no return value. Print out the precision/recall/f-score. Also, provide a document which contains:
the meaningful topic names
one document sample from train_file for each topic
a performance comparison between Q1 and Q2.
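The evaluation steps can be sketched without any modeling library: pick the highest-probability topic per document, map each topic to the most frequent ground-truth label among its documents, then score each label. This is a hedged illustration only. The helper names (top_topic, majority_vote_map, per_label_scores) are hypothetical, and the topic distributions and labels are made up rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def top_topic(distribution):
    """Index of the highest-probability topic in one document's distribution."""
    return max(range(len(distribution)), key=lambda i: distribution[i])


def majority_vote_map(groups, truth):
    """Map each group id to the ground-truth label most common among its members."""
    counts = defaultdict(Counter)
    for g, t in zip(groups, truth):
        counts[g][t] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


def per_label_scores(predicted, truth):
    """Return {label: (precision, recall, f_score)} for every ground-truth label."""
    scores = {}
    for label in sorted(set(truth)):
        pairs = list(zip(predicted, truth))
        tp = sum(p == label and t == label for p, t in pairs)
        fp = sum(p == label and t != label for p, t in pairs)
        fn = sum(p != label and t == label for p, t in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


# Made-up topic distributions and first labels, purely for illustration.
doc_topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1],
              [0.2, 0.1, 0.7], [0.5, 0.4, 0.1]]
truth = ["T1", "T1", "T2", "T3", "T2"]

topics = [top_topic(d) for d in doc_topics]
mapping = majority_vote_map(topics, truth)
predicted = [mapping[t] for t in topics]
for label, (p, r, f) in per_label_scores(predicted, truth).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

The mapping step works for any integer group ids, so cluster ids from a clustering model can be evaluated the same way. Note that majority voting may map two groups to the same label, in which case some label receives no predictions and scores 0.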