The following tools are suggested for pre-processing and preparing Big Data before data analysis. Some of them can also be used to analyse the data. Any MSc student can access them from the MSc lab in rooms 2.25abc of the Kilburn Building. As the information below shows, each tool name is accompanied by one or more links to suggested tutorials on how to use the tool. Other tutorials for these tools (and for other tools) can be found on the Web. You may choose to use a combination of these tools to pre-process the data set you are supposed to use in your coursework. In fact, you are required to work with at least two tools, and you should be able to compare one against the other and evaluate them using criteria such as ease of use, friendliness, completeness and robustness of functionality in the context of the tasks described in this document.
Download from the Blackboard website of this course unit the zip files containing one month's worth of road sensor-collected data. The data was collected during January and March 1998 and reflects the traffic situation of the city of Mansfield in the UK during part of the Winter and Spring of that year. Note that the data describes observations collected during the first half of January 1998 and the first half of March 1998. There are 30 files in total, and each file corresponds to one day in one of the two months. The data appears to be fairly complete. Each file contains hundreds of thousands of observations (records) in text format.
Table 1: Examples of Questions to be Answered During Data Analysis
Which sensors are associated with the highest volumes of traffic?
What times of the day are the busiest for the city location covered by sensor N60311E?
What range of traffic volume variation can you expect between 15:00 and 16:00 at the location covered by sensor N60311E?
Which days of the week are associated with the highest traffic volumes in general?
Is January a busier month in terms of traffic volume than March?
Which city locations (identified by the sensor Ids) present the same traffic patterns?
Can you generate any graphical representation of patterns for any of the attributes? For example, a BarPlot to display the occupancy patterns of the Winter Fridays (January) for sensor N60311E, considering the rush hours.
Can you compare the occupancy patterns of two different sensors, considering a month, week day and time of day?
What suspicious values (outliers) for any of the attributes have you found?
Can you find any faulty sensors?
Is average a meaningful measure of traffic? Why or why not?
What times of the day, considering either of the two months, are associated with the lowest traffic?
Can you generate the mode for a given attribute (such as occupancy)?
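Several of the questions in Table 1 (busiest sensors, busiest times of day, busiest week days) come down to grouping observations by some key and summing the traffic volume. The following is a minimal Python sketch of that idea, not a prescribed solution. The record layout (sensor ID, timestamp, vehicle count), the sample values and the helper names are assumptions made for illustration only; the real files may be structured differently.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (sensor id, timestamp, vehicle count).
# The layout and the values are assumptions, not taken from the real files.
records = [
    ("N60311E", "1998-01-05 08:15", 42),
    ("N60311E", "1998-01-05 08:45", 58),
    ("N60311E", "1998-01-05 15:30", 31),
    ("N60311E", "1998-01-06 08:20", 50),
    ("N60421W", "1998-01-05 08:10", 12),
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def parse_time(record):
    """Turn the timestamp string of a record into a datetime object."""
    return datetime.strptime(record[1], "%Y-%m-%d %H:%M")


def total_volume_by(rows, key):
    """Sum the vehicle counts of all rows that share the same key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Total volume per sensor, per hour of day (one sensor), and per week day.
by_sensor = total_volume_by(records, key=lambda r: r[0])
one_sensor = [r for r in records if r[0] == "N60311E"]
by_hour = total_volume_by(one_sensor, key=lambda r: parse_time(r).hour)
by_weekday = total_volume_by(
    records, key=lambda r: DAY_NAMES[parse_time(r).weekday()]
)

busiest_sensor = max(by_sensor, key=by_sensor.get)
busiest_hour = max(by_hour, key=by_hour.get)

print(by_sensor, by_hour, by_weekday, busiest_sensor, busiest_hour)
```

The same grouping can be done in any of the suggested tools. A comparison between January and March would use the month as the key instead of the hour or the week day.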
To structure the data as suggested in Task 2, you can use any of the suggested tools (R, OpenRefine, Excel, etc.). During this process some data cleaning may be performed, for example, removing redundant columns, renaming columns, and separating attribute names from their values, which in some cases come concatenated, such as ‘EB’ and ‘0’. You can also remove some ‘M08’ records and leave only the ‘M14’ ones. Remember to document every step of the process, explaining all you have done and why, using screenshots.
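As an illustration of the cleaning steps just described, the sketch below keeps only the ‘M14’ records and splits tokens in which an attribute name and its value are fused together. It rests on assumptions: the line layout (whitespace-separated tokens with the record type first), the sample lines and the helper names `split_fused` and `clean` are all hypothetical, so the pattern will need adjusting once you have inspected the real files.

```python
import re

# Assumed layout (hypothetical): whitespace-separated tokens, record type
# first, and some tokens fusing an attribute name with its value ("EB0").
raw_lines = [
    "M14 N60311E EB0 12",
    "M08 N60311E EB1 7",
    "M14 N60421W EB1 30",
]

# Letters followed by digits and nothing else, e.g. "EB0" -> ("EB", "0").
FUSED = re.compile(r"([A-Za-z]+)(\d+)")


def split_fused(token):
    """Split a fused name/value token; return other tokens unchanged."""
    match = FUSED.fullmatch(token)
    return list(match.groups()) if match else [token]


def clean(lines, keep_type="M14"):
    """Keep records of one type and separate fused attribute tokens."""
    rows = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != keep_type:
            continue  # drop blank lines and records of other types
        row = [tokens[0]]  # the record type itself is never split
        for token in tokens[1:]:
            row.extend(split_fused(token))
        rows.append(row)
    return rows


cleaned = clean(raw_lines)
print(cleaned)
```

Sensor IDs such as N60311E are left intact because they do not match the letters-then-digits pattern. Whichever tool you use, record each transformation and the reason for it, as the task requires.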
You can start profiling the data with the aim of detecting errors and outliers. You should be able to find your own way to profile this data, as there is no ready-to-use recipe. Your performance will be judged using the following criteria:
Use statistics such as average, median (50th percentile), mode, midrange, quartiles (1st and 3rd), inter-quartile range, variance, standard deviation, sum, count, maximum and minimum to understand the data and to detect errors and outliers, explaining how you found them, which tool you used to find each error and why. For example, compare and discuss the results you obtain (1) when you calculate the average occupancy for each sensor, considering each month, each week day and each time period (in one-hour intervals), and (2) when you calculate the median occupancy in the same manner.
Read about the suggested tools and experiment with a subset of them, so that you can comment on the advantages and disadvantages of each (you can choose to work with a tool not suggested in this document, if it allows you to process the data in the way suggested in this document). For example, which of the tools is best suited to cope with the amount of data you are working with? Which is the least suited? Which tool allows you to profile your data with the least interaction/effort from you, so that the profiling task is almost fully automated? Which one requires more effort from you to obtain a good profile of the data? What are the main steps for obtaining a good data profile using each of the tools?
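The descriptive statistics named in the profiling criteria above are available in most of the suggested tools. As a point of comparison, here is a minimal sketch using only Python's standard `statistics` module. The occupancy readings are invented for illustration, and the outlier test (values more than 1.5 inter-quartile ranges outside the quartiles) is one common convention, assumed here rather than prescribed by the coursework.

```python
import statistics

# Hypothetical occupancy readings for one sensor in one hourly slot.
# The value 95 stands in for a suspicious reading.
occupancy = [10, 12, 12, 13, 15, 16, 18, 95]


def profile(values):
    """Return common descriptive statistics and IQR-based outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed 1.5 * IQR fences
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),  # list, as ties are possible
        "midrange": (min(values) + max(values)) / 2,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
        "outliers": [v for v in values if v < low or v > high],
    }


result = profile(occupancy)
for name, value in result.items():
    print(f"{name}: {value}")
```

In this invented sample the single reading of 95 pulls the mean up to 23.875 while the median stays at 14.0. This is the kind of difference the comparison between average and median occupancy is meant to bring out.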