Monday, 13 January 2025

Additional Comments

Overall I have really enjoyed learning about Big Data and never realised how important a role it plays in everyday life. From its conceptualisation to its applications, advantages and disadvantages, it is a very interesting subject to learn about.

I particularly enjoyed making our own artificial intelligence and seeing it in operation, as well as making our own visualisations in class.

I am very interested in seeing the development of Big Data in the future and seeing what new discoveries will emerge. I think Big Data and AI will revolutionise the way we live, but only time will tell.

Types of Problem Suited for Big Data Analysis

Big Data is suited to problems that generate very large volumes of data. Human civilisation has always been built on knowledge and intelligence, and the progression of Big Data will only benefit our advancement further. However, Big Data only works with adequate samples; where too little data exists, gaps remain in our knowledge that cannot yet be filled, although efforts are being made to close them. Here are some of the problems that Big Data aims to solve, some more extreme than others:

Science/Astronomy

There is a lot we don't know about our universe. New data collection efforts are under way beyond our planet, although the data is generated only slowly. As new technologies progress we can gather more information about stars, planets, and asteroids. Big Data can be applied well beyond planet Earth to find potentially habitable planets and new elements, as well as give a deeper understanding of things we cannot observe with our own eyes. It can also be used to chart our universe and recognise patterns in unfamiliar celestial objects. Big Data can also help in scientific areas like healthcare. Diseases still affect many people every day, and horrible illnesses like cancer and Alzheimer's still have no cure. Developments in Big Data may yield a cure for these afflictions, or at least treatments that lessen their effects.

Environment

Another area that has scientists baffled is our own oceans. They cover over 70% of our planet, yet most of them remain unexplored due to the immense pressure of the deep ocean. Some creatures have yet to be discovered, and advancements in Big Data could help develop new ocean technology to withstand that pressure. Furthermore, it could also identify how marine animals survive in such a hostile environment, giving key insight into areas we cannot observe physically. Big Data can also help with the ongoing climate crisis by predicting where and when natural disasters are likely to occur. By measuring typical weather patterns and other environmental factors, Big Data can alert people to these disasters before they even happen. It can also drive new ways to fight the climate crisis and make life on the planet more sustainable.

Urban Planning

Traffic is the number one thing people like to complain about, and rightfully so. Poor infrastructure, bad planning, and overcrowding have left streets packed with vehicles, and even with people. Big Data can inform urban planning by finding optimal ways to construct and redesign cities, distributing traffic more effectively and keeping it flowing steadily. This benefits our emergency services as well, as less congestion means quicker response times. Big Data can find ways of building cities that benefit everyone! How cool is that?

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-13

Strategies for Limiting the Negative Effects of Big Data

Big Data has positive and negative applications, and reducing the frequency and intensity of the negative effects is key to ensuring these tools are used correctly. In class, we learned about some of the ways these negative effects have been limited. Here are a few examples:

GDPR and Legislation

Many ideas have surfaced about how to protect ourselves online, but unless an idea is fully implemented and recognised in law, little change will occur. The GDPR was adopted in 2016 and reformed data protection and privacy by ensuring data is used only for legitimate reasons and only when necessary. It gives people peace of mind when using the internet and reduces the risk of cyber attacks leaking valuable, sensitive information. The Data Protection Act 2018 gives people the right to know how their data is being used and provides transparency about what kind of information is being held. These strategies have helped prevent Big Data from being mishandled or used maliciously, and the legislation is constantly reviewed to keep up with the ever-changing digital world.

Anonymisation/Encryption

Anonymisation is a method of protecting individuals online. It makes the people generating data harder to trace or identify, which reduces the risk of leaked data being used to target individuals for someone else's gain. It adds a layer of privacy, however it is not perfect: cross-referencing data sets can still identify people, so anonymisation has flaws unless it is combined with sufficient encryption. Encryption is a technique that has been used since ancient times, however with the emergence of computers and Big Data it has grown far more complex and advanced. It limits access to data by encoding information and making it unreadable. It uses ciphers that put the data off limits to the general population, so it can only be decrypted by certain people who hold a decryption key. Encryption played a crucial role in World War 2, where the Enigma system used by the Axis forces was decrypted by the Allies, giving them a significant upper hand during the conflict! How cool is that? Authentication and the other tools used alongside encryption have become a lot more powerful since then with the introduction of Big Data. This allows safety and privacy to be maintained and protects sensitive information.
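To make both ideas a little more concrete, here is a minimal Python sketch. It assumes the third-party cryptography library is installed for the encryption part, and the names and record are invented purely for illustration.

import hashlib
from cryptography.fernet import Fernet   # third-party library, assumed installed

# Anonymisation (pseudonymisation): replace a name with a one-way hash
# so the raw identifier never appears in the data set.
name = "Jane Smith"                                   # hypothetical data subject
pseudonym = hashlib.sha256(name.encode()).hexdigest()
print(pseudonym[:16])                                 # shortened, untraceable ID

# Encryption: encode the record with a key so only key-holders can read it.
key = Fernet.generate_key()                           # the key must be kept secret
cipher = Fernet(key)
token = cipher.encrypt(b"Jane Smith, 12 High Street, Fife")
print(token)                                          # unreadable ciphertext
print(cipher.decrypt(token))                          # recovered with the key

Note that hashing the same name always produces the same pseudonym, which is exactly why cross-referencing can still re-identify people if anonymisation is not combined with other safeguards.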

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-12

Sunday, 12 January 2025

Limitations of Predictive Analytics

Being able to predict future trends is undoubtedly useful. Knowing an outcome ahead of time sounds great, but making an effective prediction is complicated. Here are some of the limitations of using prediction in Big Data:

Reliant on Accurate Data

Predictive algorithms are highly reliant on accurate, testable data, so veracity, and an awareness of variability, are essential when making predictions. If the input data is skewed even slightly, it can offset the entire prediction, meaning the outcome can change drastically if anomalies aren't picked up on. Furthermore, data that fluctuates may also change the outcome, as some algorithms aren't advanced enough to analyse complex shifts or spikes in patterns.
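As a quick, made-up illustration of how a single anomaly can throw a naive prediction off, the numbers below are invented purely to show the effect:

from statistics import mean

# Hypothetical daily sales figures used to forecast tomorrow's demand
clean  = [102, 98, 105, 101, 99]    # consistent, accurate readings
skewed = [102, 98, 105, 101, 990]   # one mis-recorded value (990 instead of 99)

print(mean(clean))    # 101.0  -> a sensible forecast
print(mean(skewed))   # 279.2  -> one anomaly drags the prediction far off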

Reliant on Historical Data

Another limitation is the dependence on previous data. This might not be a problem for everyone, but it can be problematic for businesses: without any previous data, predictions cannot be made, which means the launch of new stores or products cannot benefit from prediction. The same can be said of any application without former data; a foundation must be laid first.

Consequences of Inaccurate Prediction

Inaccurate prediction can result in many issues, and in some cases it may even cause harm. This is not cool. One area of application is healthcare, where assumptions and shortcuts cannot afford to be wrong, as errors can lead to physical harm or misdiagnosed patients. It can also pose a threat in science, which has little room for error; slight mistakes in prediction can have adverse and unforeseen results that cost a lot of time and money. Lastly, it can pose a financial threat. Investors rely heavily on prediction, which means that if a predictive algorithm fails, their capital and assets are put at risk.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751

Technological Requirements

Big Data has a lot of potential, however it depends on a range of technologies. Here are some of the technological requirements we looked at in class:

Storage

To use Big Data effectively, there must be enough capacity to handle large data sets. Big Data can involve exabytes of information which, with current technology, requires a lot of physical space. Cloud storage is well suited to Big Data as it allows mass storage without the user needing the hardware on site. Certain applications are not ideal for Big Data because they lack the capacity required for large-scale storage; traditional databases are a prime example. By contrast, frameworks like Hadoop are tailored to handle large volumes of data. The sheer amount of data generated today creates a constant need for additional storage and for more compact servers capable of holding more data.

Processing/Analysis

Once data has been compiled and stored, it can be managed and processed. This requires technology like data mining algorithms. Data mining can be conducted through various applications, many of them offered by Apache. The information is drawn from data lakes and data warehouses, where large-scale data sets are stored. Data mining allows data points to be extracted for their value, giving insight to data analysts. Machine learning can also be used to process data automatically and efficiently. The physical hardware required for larger scale data processing is quite advanced and expensive, however for small and medium scale processing you can most likely use the device you're reading this on! How cool is that?
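As a small-scale example of the kind of processing described above, here is a rough sketch using the pandas library; the file name and column names are hypothetical:

import pandas as pd

# Load a modest data set; an ordinary laptop handles this comfortably.
df = pd.read_csv("sales.csv")   # hypothetical file with 'region' and 'revenue' columns

# A simple data-mining style aggregation: average revenue per region.
summary = df.groupby("region")["revenue"].mean().sort_values(ascending=False)
print(summary)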

Visualisation

After data has been processed, it must be displayed. Putting data into graphs and charts can be done easily through a number of applications and websites, as long as you can access the internet. For larger data sets, specialised applications still allow the data to be displayed effectively. This allows conclusions to be drawn in an aesthetic, streamlined manner.
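Continuing the same hypothetical example, a basic chart can be produced with matplotlib; this is only a sketch of one possible approach, with invented figures:

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]   # hypothetical processed results
avg_revenue = [5400, 4800, 6100, 3900]

plt.bar(regions, avg_revenue)
plt.title("Average revenue by region")
plt.ylabel("Revenue (£)")
plt.savefig("revenue_by_region.png")           # or plt.show() for an on-screen chart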

In summary, large data sets require special software and hardware that can be expensive for the average person, however this probably won't be an issue for companies and government institutions. Smaller sample sizes can still be tested by anyone with an average computer and access to the internet.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-17

Limitations of Traditional Statistics

Recently in class we learned about the limitations of traditional statistics. While they can be useful in some respects, there are disadvantages that limit their use, especially when compared with Big Data. Here are a few we learned about:

Volume

One of the main setbacks of traditional statistics is that they cannot handle large data sets. When we consider the volume of Big Data, some applications cannot compare every single data point, which hinders the accuracy of the analysis. One example would be Microsoft applications like Access and Excel, which are built around traditional techniques and held back by their storage capacity. Even then, these applications struggle well before they reach their limit if the hardware being used is not powerful enough. These limitations mean the data cannot be analysed to its full potential.
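By contrast, Big Data tooling works around these capacity limits by streaming data in pieces rather than loading it all at once. As a rough sketch (the file name and column are invented), pandas can process a file far larger than a spreadsheet allows by reading it in chunks:

import pandas as pd

total = 0
rows = 0
# Read a huge CSV 100,000 rows at a time instead of loading it all into memory.
for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
    total += chunk["value"].sum()
    rows += len(chunk)

print(total / rows)   # overall mean computed without exhausting memory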

Variety

Traditional statistics struggle to handle qualitative data, whereas Big Data contains unstructured, semi-structured and structured data. While traditional statistics are well equipped to analyse numerical data, they struggle with unstructured data in particular and may find it difficult when faced with text, audio or images. This is not cool.

Velocity

Another limitation is velocity. Traditional statistics take time to complete, compile, and conclude, which means they cannot be used for real-time information like the stock market. By the time results are produced, the original variables will have changed, so we cannot rely on them.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-6

Traditional Statistics

In class we learned about statistics. Traditional statistics are methods of analysing data using measures such as averages and standard deviation, which describe and test variables numerically. By using statistics, data can be compared and tabulated so that value can be extracted. The two types of statistics used are:

Descriptive

These are used to summarise data. They use measures of central tendency to find 'typical' expected results and averages. The mean is calculated by adding up the values of all the variables and dividing by the number of variables. The median is found by sorting the variables into ascending or descending order and taking the number in the middle of the sample; if the sample has an even number of values, the two middle numbers are added together and divided by 2. The mode can also be used, where the number that occurs most often in the sample is chosen as the average. All measures of central tendency have their uses and may yield different results, but they give data analysts insight into the contents of their data.

Measures of variability are another type of descriptive statistic. These include the range, where the smallest value is subtracted from the largest. Standard deviation is another measure of variability, describing how far values 'deviate' on average from the mean. Measures of variability focus on how the data points differ from each other.
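Python's built-in statistics module covers all of these descriptive measures; here is a short worked example using an invented sample:

import statistics

sample = [4, 8, 6, 5, 3, 8, 9]            # invented data

print(statistics.mean(sample))            # mean   = 43 / 7 ≈ 6.14
print(statistics.median(sample))          # median = 6 (middle value once sorted)
print(statistics.mode(sample))            # mode   = 8 (appears most often)
print(max(sample) - min(sample))          # range  = 9 - 3 = 6
print(statistics.stdev(sample))           # standard deviation of the sample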

Inferential

Inferential statistics use probability to calculate the reliability of results and to decide how much room there is for randomness. The p value is used to draw conclusions about the wider population from a sample, estimating how likely the observed result is to have occurred by chance. If the p value is less than 0.05, the results are generally considered reliable; a p value of 0.05 corresponds to a 5% chance that the outcome is a product of pure chance. Confidence intervals are used to estimate a range in which the true value is likely to fall. For example, the p value of me passing this course is 0.01. How cool is that?
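As a hedged sketch of the inferential side, the scipy library (assumed installed) can compute a p value and a confidence interval from a sample; the measurements below are invented:

from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]   # invented measurements

# One-sample t-test: is the true mean different from 5.0?
result = stats.ttest_1samp(sample, popmean=5.0)
print(result.pvalue)          # below 0.05 means the difference is unlikely to be chance

# 95% confidence interval for the mean
mean = sum(sample) / len(sample)
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=stats.sem(sample))
print(ci)                     # a range we are 95% confident contains the true mean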

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-6
