Assignment Task:

Problem Statement:

As the problem of drug abuse intensifies in the U.S., many studies use machine learning as a powerful tool for classifying and filtering social media data, such as Twitter posts, to study drug abuse-related activities. However, given the wide range of topics discussed by Twitter users, tweets related to drug abuse are rare in most datasets. This class imbalance remains a major obstacle to building effective tweet classifiers, and it is especially pronounced for studies that include abuse-related slang terms.
 

In this final challenge, we would like to explore two methods to capture drug abuse activities more effectively using Twitter data:
1. Generate a visualization showing the distribution of drug-abuse-related tweets throughout the country
2. Discover keywords in drug-abuse-related tweets using term frequency-inverse document frequency

 

CHALLENGE 1: Visualizing the distribution of drug-abuse-related tweets
The geo-location information tagged in drug-related tweets is very useful for capturing the distribution of drug abuse-risk behaviors. An example of the tweet distribution across the US is shown in the figure below. According to this visualization, the greatest potential drug threat regions could have been the Florida region, the Great Lakes region, the Mid-Atlantic region, the New York/New Jersey region, the New England region, the Pacific region, the Southeast region, the Southwest region, and the West Central region. However, this information might be biased, since the geo-location distribution should be normalized by population.
Your task is to produce the data needed for a less-biased visualization at the census tract level, where each tract's value represents the number of drug-related tweets normalized by its population. In other words, for each census tract, you need to compute the ratio of the number of drug-related tweets to the population of the tract.
 

Your Objective: spatially join tweets that contain drug-related terms with the census tracts, and compute the normalized number of tweets using the provided boundary and population data. You must implement this using Spark, and demonstrate that you can do this task in a scalable way, so that if additional tweets or census tract data become available, your code can still run efficiently (perhaps at the expense of using more cores).
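A minimal sketch of the per-tract normalization logic, in plain Python rather than Spark. The tract IDs, boundaries, populations, and tweet coordinates below are made up for illustration, and real census tracts are polygons that would need a proper point-in-polygon test (e.g. with a geometry library) instead of the bounding-box check used here. In Spark, the same logic would map each tweet to a tract key, reduce by key to count tweets, and join the counts with the population table.

```python
# Hypothetical tract boundaries, simplified to bounding boxes:
# tract_id -> (min_lon, min_lat, max_lon, max_lat)
tracts = {
    "36061000100": (-74.02, 40.70, -73.99, 40.72),
    "36061000200": (-73.99, 40.70, -73.96, 40.72),
}
# Hypothetical tract populations: tract_id -> population
population = {"36061000100": 5000, "36061000200": 2000}

# (lon, lat) of tweets that already passed the drug-term filter
tweets = [(-74.00, 40.71), (-74.01, 40.705), (-73.97, 40.71)]

def find_tract(lon, lat):
    """Return the tract containing the point, or None (the 'map' step)."""
    for tract_id, (x0, y0, x1, y1) in tracts.items():
        if x0 <= lon <= x1 and y0 <= lat <= y1:
            return tract_id
    return None

def normalized_counts(tweets, tracts, population):
    """Count tweets per tract, then divide by population (the 'join' step)."""
    counts = {}
    for lon, lat in tweets:
        tract_id = find_tract(lon, lat)
        if tract_id is not None:
            counts[tract_id] = counts.get(tract_id, 0) + 1
    return {t: counts.get(t, 0) / population[t] for t in tracts}

ratios = normalized_counts(tweets, tracts, population)
```

Because each tweet is assigned to a tract independently, the map step parallelizes cleanly across cores or partitions, which is what makes the Spark version scale with more data.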
 

CHALLENGE 2: Identifying keywords for drug-abuse-related tweets
We would like to build a classifier of drug-abuse-related tweets to automatically determine whether a tweet is related to drug abuse. We already have labeled data for this classifier, but in order to efficiently extract tweet features, we would like to use a form of Term Frequency-Inverse Document Frequency (TF-IDF). This metric can help us determine the importance of a word in a tweet, and whether we should use that word as a feature in our model. In this challenge, to simplify the problem, you are asked to compute a simpler form of TF-IDF.
 

Your Objective: for each tweet that contains drug-related terms, compute the top 3 words with the smallest document frequency. From these words, provide the top 100 words and their respective tweet counts as your output. For example, assume the following tweet message has passed your filter of drug-related terms and is within the 500 largest cities in the US: this is drug-related message
For each word in the message, you need to compute its document frequency, defined as the number of tweets (including those that do not pass your filter) that contain the word. For example:

  • the word “is” is included in 1 million tweets
  • the word “this” is included in 50 thousand tweets
  • the word “drug-related” is included in a thousand tweets
  • the word “message” is included in 10 thousand tweets

In this case, the top 3 words are “drug-related”, “message”, and “this”, in that exact order. Given the top 3 words for each tweet, you then need to compute which 100 words appear most often across the tweets' top-3 lists.
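The steps above can be sketched in plain Python on toy data. The tweet texts and the drug-term filter below are made up for illustration; document frequency is counted over all tweets, but only filtered tweets contribute a top-3 list. In Spark, the document-frequency pass would be a reduceByKey over (word, 1) pairs from the distinct words of every tweet, and the final tally a reduceByKey over the per-tweet top-3 words.

```python
from collections import Counter

# Hypothetical corpus and filter, for illustration only
all_tweets = [
    "this is drug-related message",
    "this is another message",
    "is this real",
]
drug_terms = {"drug-related"}

# Document frequency: the number of tweets (filtered or not) that
# contain each word. Using set() counts each word once per tweet.
df = Counter()
for tweet in all_tweets:
    df.update(set(tweet.split()))

# For each tweet passing the filter, keep the 3 words with the
# smallest document frequency (ties broken alphabetically here).
top3_counts = Counter()
for tweet in all_tweets:
    words = set(tweet.split())
    if drug_terms & words:
        top3 = sorted(words, key=lambda w: (df[w], w))[:3]
        top3_counts.update(top3)

# The 100 words appearing most often among the per-tweet top-3 lists
top100 = top3_counts.most_common(100)
```

On this toy corpus only the first tweet passes the filter, so its three rarest words (“drug-related”, “message”, and one of the tied common words) make up the whole tally; with real data, many tweets contribute and the top-100 cut becomes meaningful.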
 

 


  • Uploaded By : Grace
  • Posted on : December 01st, 2018
