Assignment Task:

1 Description 

For this assignment, you will need to create a complete program to perform sentiment classification for movie reviews, from feature extraction to classification. For a given review, your program should be able to predict whether it is positive, i.e. the reviewer likes the movie, or negative, i.e. the reviewer dislikes the movie.

 

3 Task 1. Feature extraction

Use the MapReduce model to convert all text data into matrices. Convert ratings to vectors. These will be used for classification in Task 2. Use TF-IDF to vectorise the text files. See previous practical classes and lecture materials for TF-IDF. One step further, though, is to represent each text file (review) as a very long and sparse vector, as follows. Assume wordlist is the final list of distinct words contained in all reviews and its length is N. Then each review will be a vector of length N, with each position associated with a word in wordlist and the value being either 0, if the corresponding word is absent from the review, or the word’s TF-IDF. For example, if wordlist = [‘word1’, ‘word2’, ‘word3’, ‘word4’] and review 1 contains word1 and word4, then the vector representation of review 1 is [0.1, 0, 0, 0.4], assuming the TF-IDF of word1 and word4 in review 1 is 0.1 and 0.4 respectively. Note that TF is calculated from one single document while IDF is obtained from all documents in the collection.
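The vector representation described above can be sketched in plain Python. This is only an illustration on three made-up reviews, not the required MapReduce implementation. It assumes one particular TF-IDF variant (TF = count / document length, IDF = log(M / df), with M the number of documents), and the function name tfidf_vectors is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (wordlist, vectors) for a list of tokenised documents.

    Assumed variant: TF = count / document length, IDF = log(M / df),
    where M is the number of documents and df the number containing the word.
    """
    # Final list of distinct words across all documents, in a fixed order.
    wordlist = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(wordlist)}
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        vec = [0.0] * len(wordlist)  # absent words stay at 0
        for word, count in Counter(doc).items():
            tf = count / len(doc)              # from this document only
            idf = math.log(n_docs / df[word])  # from the whole collection
            vec[index[word]] = tf * idf
        vectors.append(vec)
    return wordlist, vectors

# Made-up toy reviews, already split into words.
reviews = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great fun".split(),
]
wordlist, vectors = tfidf_vectors(reviews)
```

With this variant, a word that occurs in every document gets an IDF of log(1) = 0, so its entry is 0 even though the word is present. Other TF-IDF variants add smoothing to avoid this.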

 

3.1 Requirements: 

3.1.1 Req. 1 A MapReduce model is a must. Implement it using Hadoop streaming. All data are available on SCEM HDFS. The recommendation is to work on the tiny version of the data to get the code working. You may try your code on the full version; however, applying it to the full version is not required.
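As a rough sketch of what a Hadoop streaming job for this task might look like, the mapper and reducer below count how often each word occurs in each document, which is the raw ingredient for TF. They are written as ordinary Python generators so they can be tried locally. In a real streaming job each would be a separate script reading sys.stdin and printing to stdout, and the document identifier would have to come from the job itself (not shown here). The key layout "word doc_id" and the names mapper, reducer and review_1 are assumptions made for illustration.

```python
from itertools import groupby

def mapper(lines, doc_id):
    """Emit 'word doc_id<TAB>1' for every token in the input lines."""
    for line in lines:
        for word in line.lower().split():
            # The whole 'word doc_id' string is the key, '1' is the value.
            yield f"{word} {doc_id}\t1"

def reducer(sorted_lines):
    """Sum the counts for each key; the input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"

# Local simulation of the streaming pipeline: mapper | sort | reducer
mapped = list(mapper(["Great movie", "great cast"], doc_id="review_1"))
reduced = list(reducer(sorted(mapped)))
```

Further map and reduce steps of the same shape would be needed to obtain document frequencies and combine them with these counts into TF-IDF values.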

3.1.2 Req. 2 Generate two matrices, training data and test data, and two vectors, training targets and test targets. training data should have N rows and D columns, with each row corresponding to one review in the training set (N is the total number of reviews in the training set and D is the total number of words). N and D vary depending on which version of the data you use. training targets should have N elements, each of which is the rating of the corresponding review. test data and test targets are similarly defined.

 

Notes: 

1. If feature extraction is too difficult for you, you can use the pre-computed bag-of-words features included in this data set. Refer to the appendix and README file for details. If pre-computed features are used, a 60% penalty will be incurred for this task, i.e. the maximum mark you can get for this task is 6 if you do so.

2. Using a MapReduce model to extract TF-IDF is mandatory. If it is not used, a 20% penalty for this task will be incurred. There is no constraint on how to form the training and test matrices and vectors. There are many versions of TF-IDF, and there is no preference for which version to use.

3. You can use data frames (from the pandas package) instead of matrices and vectors to store training and test data and targets.

 

Marking scheme for task 1: 

• Text file reading (1pt): read the text files for TF-IDF extraction.

• Rating scores extraction (3pts): parse the names of the text files to extract ratings.

• TF-IDF extraction (8pts): use MapReduce class to extract TF-IDF for each text file.

• Forming matrices and target vectors (or data frames) (3pts): collect TF-IDFs to form training and test data for task 2. 
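The rating extraction step could look roughly like the sketch below. The brief does not state how ratings are encoded in the file names, so the pattern "<id>_<rating>.txt" used here is purely an assumption, as are the example paths and the helper name rating_from_filename. Check the README of the data set for the real convention.

```python
import os
import re

# Hypothetical naming pattern "<id>_<rating>.txt"; adjust to the real data set.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")

def rating_from_filename(path):
    """Return the integer rating encoded in a review file name."""
    name = os.path.basename(path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(2))

# Made-up example paths following the assumed pattern.
paths = ["train/pos/12_9.txt", "train/neg/7_3.txt"]
training_targets = [rating_from_filename(p) for p in paths]
```

Keeping the targets in the same order as the rows of the data matrix is what ties each rating to its review.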

 

4 Task 2. Classification

Construct a classification model for review sentiment prediction, meaning that, given a customer movie review (taken from the test set), your program should be able to predict whether it is positive or negative.

There is no limitation on how many classifiers or which specific model you use. You can simply pick one that works for you for this task, either from those covered in lecture and practical class materials or any other classifier from any Python package. A good starting point is the scikit-learn (i.e. sklearn) package.

A few things you need to address in your Python program are listed as requirements below.

 

4.1 Requirements: 

4.1.1 Req. 1 Data pre-processing. In Task 1, you extracted the rating vectors for training and test. These are raw ratings. As we are interested in sentiment prediction, i.e. predicting whether a review is positive or negative, you need to convert all ratings > 5 to the positive class and all ratings <= 5 to the negative class. Choose a coding scheme, e.g. 0 for positive, 1 for negative.
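A minimal sketch of this conversion, using the example coding scheme from the requirement (0 for positive, 1 for negative). The function name rating_to_label and the sample ratings are made up for illustration.

```python
def rating_to_label(rating, positive=0, negative=1):
    """Map a raw rating to a class code: > 5 is positive, <= 5 is negative."""
    return positive if rating > 5 else negative

# Made-up raw ratings, including the boundary value 5.
raw_ratings = [9, 3, 5, 6, 10, 1]
labels = [rating_to_label(r) for r in raw_ratings]
```

Note that a rating of exactly 5 falls into the negative class under this rule.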

4.1.2 Req. 2 Normalisation. Apply at least one normalisation scheme and compare the performance of the classifier(s) with and without normalisation.
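One possible normalisation scheme, shown here only as an example, is to scale every review vector to unit Euclidean (L2) length. The requirement does not prescribe a scheme, so this choice, the helper name l2_normalise and the small matrix are assumptions.

```python
import math

def l2_normalise(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalised = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalised.append([x / norm for x in row] if norm > 0 else list(row))
    return normalised

# Made-up feature matrix: one row per review.
training_data = [[3.0, 0.0, 4.0], [0.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
normalised = l2_normalise(training_data)
```

The comparison then amounts to training the same classifier once on the raw matrix and once on the normalised one, and reporting both scores.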

4.1.3 Req. 3 Training and model selection. Use cross validation to select the best parameters for your classifier. There may be many parameters to tune in some classifiers (such as the random forest classifier, RFC). You can focus on the most important one(s), such as max_depth and n_estimators in RFC. Refer to the scikit-learn package documentation for details.
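Parameter selection by cross validation might be sketched with scikit-learn's GridSearchCV as below. The synthetic data from make_classification only stands in for the real TF-IDF training matrix and labels, and the grid values and the 5-fold setting are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF training matrix and binary labels.
X_train, y_train = make_classification(
    n_samples=200, n_features=20, random_state=0
)

# Candidate values for the two parameters named in the requirement.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross validation on the training data only
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_    # parameter combination with the best mean score
best_cv_score = search.best_score_   # mean cross-validated accuracy of that combination
```

By default GridSearchCV refits the best combination on the whole training set, so search.best_estimator_ can be used directly afterwards.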

4.1.4 Req. 4 Test on test data. After model selection, apply the best model, i.e. the model with the parameters that produce the best cross validation scores, to the test data, make a prediction for each review, and record the prediction accuracy.
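The final evaluation step could be sketched as follows. The data are again synthetic, and train_test_split is used only to create a stand-in test set (the assignment already provides separate training and test sets). The parameter values passed to RandomForestClassifier are placeholders for whatever cross validation selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real task has its own training and test sets.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Placeholder for the parameters chosen by cross validation.
best_model = RandomForestClassifier(
    n_estimators=50, max_depth=None, random_state=0
)
best_model.fit(X_train, y_train)  # train on training data only

predictions = best_model.predict(X_test)  # one prediction per test review
test_accuracy = accuracy_score(y_test, predictions)
```

The test set is touched exactly once here, after all tuning decisions have been made.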

 

Note: 

1. Always train your classifier(s) ONLY on training data, including cross validation. After model selection, apply the best model to the test data to evaluate its performance.

2. Good performance, i.e. higher accuracy on test data, is not essential for this task. However, if your classifier has accuracy lower than about 60%, it usually means that there are some mistakes somewhere in your code. So try to score as high an accuracy as possible.

3. You are encouraged to try many classifiers. If the coding is right, this should not be too difficult.

 

Marking scheme for task 2: 

• Data pre-processing (1pt): convert ratings to the positive and negative coding scheme.

• Normalisation and comparison (3pts): apply normalisation and compare the performance difference with and without it.

• Training on training data (3pts): training performed on training data. 

• Cross validation (6pts): apply cross validation on training data. 

• Testing on test data (2pts): best model applied to test data and accuracy produced. 

 


  • Uploaded By : Grace
  • Posted on : June 02nd, 2019