In this assignment, you’ll need to use the following dataset:

text_train.json: This file contains a list of documents. It’s used for training models.

text_test.json: This file contains a list of documents and the labels of each document. It’s used for testing performance. The file is in the format shown below. Note that each document has a list of labels.

Text | Labels
faa issues fire warning for lithium … | [T1, T3]
rescuers pull from flooded coal mine … | [T1]
…. | ….
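A minimal sketch of loading the two files. The exact JSON layout is not spelled out above, so this assumes text_train.json is a JSON array of strings and text_test.json is a JSON array of [text, labels] pairs; check the real files before relying on it. The helper name load_data is hypothetical.

```python
import json


def load_data(train_file, test_file):
    """Load training documents and labeled test documents.

    Assumed layouts (an assumption, verify against the real files):
      train_file: ["doc one ...", "doc two ...", ...]
      test_file:  [["doc text ...", ["T1", "T3"]], ...]
    """
    with open(train_file, encoding="utf-8") as f:
        train_docs = json.load(f)
    with open(test_file, encoding="utf-8") as f:
        test_rows = json.load(f)

    test_docs = [text for text, labels in test_rows]
    # Keep only the first label of each test document; the questions
    # below use it as the ground_truth label.
    first_labels = [labels[0] for text, labels in test_rows]
    return train_docs, test_docs, first_labels
```

Returning the first labels as a separate list keeps them aligned by position with test_docs, which makes the later evaluation step a simple zip over two lists.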

Q1: K-Means Clustering

Define a function cluster_kmean() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json.

Uses KMeans to cluster all documents in both train_file and test_file into 3 clusters by cosine similarity. Note: please combine the documents in these two files and train a single clustering model from the combined documents.

Tests the clustering model performance using test_file:

Let’s only use the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.

Apply the majority vote rule to map the clusters to the labels in test_file, i.e., T1, T2, T3.

Calculate precision/recall/f-score for each label

Check centroids/samples in each cluster to interpret it, and give a meaningful name (instead of T1, T2, T3) to it.

This function has no return value. Print out precision/recall/f-score. Write down the meaningful cluster names in a document. Also include in the document one document sample from train_file for each cluster.
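The majority vote mapping and the per-label scores can be sketched with the standard library alone. The helper names majority_vote_map and per_label_scores are hypothetical, and the zero-division handling (scoring 0.0 when a label is never predicted) is a choice made here, not something the assignment specifies.

```python
from collections import Counter, defaultdict


def majority_vote_map(cluster_ids, gold_labels):
    """Map each cluster id to the most frequent ground-truth label in it."""
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, gold_labels):
        votes[cid][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


def per_label_scores(gold_labels, predicted_labels):
    """Return {label: (precision, recall, f_score)} for each ground-truth label."""
    pairs = list(zip(gold_labels, predicted_labels))
    scores = {}
    for label in sorted(set(gold_labels)):
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


if __name__ == "__main__":
    # Toy example: cluster ids for eight test documents and their first labels.
    cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
    gold = ["T1", "T1", "T2", "T2", "T2", "T3", "T3", "T1"]
    mapping = majority_vote_map(cluster_ids, gold)
    predicted = [mapping[c] for c in cluster_ids]
    for label, (p, r, f) in per_label_scores(gold, predicted).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

Note that majority voting can map two clusters to the same label, in which case some label receives no predictions and scores zero. scikit-learn's classification_report produces the same per-label figures if that library is already in use.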

Q2: LDA Clustering

Define a function cluster_lda() as follows:

Takes two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json

Uses LDA to train a topic model with only documents in train_file and the number of topics K = 3

Predicts the topic distribution of each document in test_file, and selects only the top one topic (i.e. the topic with highest probability)

Evaluates the topic model performance using topic prediction from documents in test_file:

Let’s use the first label in the label list of each test document as the ground_truth label, e.g. the first document in the table above will have the ground_truth label “T1”.

Apply majority vote rule to map the topics to the labels in test_file, i.e., T1, T2, T3

Calculate precision/recall/f-score for each label

Based on the word distribution of each topic, give the topic a meaningful name (instead of T1, T2, T3).

This function has no return value. Print out precision/recall/f-score. Also, provide a document which contains:

the meaningful topic names

one document sample from train_file for each topic

performance comparison between Q1 and Q2.

 