Monday, 13 January 2025

Additional Comments

Overall I have really enjoyed learning about Big Data and never realised how important a role it plays in everyday life. From learning about its conceptualisation to its applications, advantages and disadvantages, it's a very interesting subject to learn about.

I particularly enjoyed making our own artificial intelligence and seeing it in operation, as well as making our own visualisations in class.

I am very interested in seeing the development of Big Data in the future and seeing what new discoveries will emerge. I think Big Data and AI will revolutionise the way we live, but only time will tell.

Types of Problem Suited for Big Data Analysis

Big Data is tailored to tackle problems that generate large volumes of data. The foundation of human civilisation has relied heavily on knowledge and intelligence, and similarly, the progression of Big Data will only advance us further. Big Data can only do this with adequate samples, though. Where data is scarce, gaps appear in our knowledge that cannot be filled, and efforts are being made to tackle these problems. Here are some of the problems that Big Data aims to solve, some more extreme than others:

Science/Astronomy

There is a lot that we don't know about our universe; however, with new data collection efforts being made outside of our planet, data is being generated, just very slowly. With the progression of new technologies we can gather more information about stars, planets, and asteroids. Big Data can be applied well beyond planet Earth to find potentially habitable planets and new elements, as well as give a deeper understanding of the things we cannot observe with our own eyes. It can also be used to chart our universe and understand patterns in foreign celestial objects. Big Data can also help in scientific areas like healthcare. Diseases still affect many people every day, and horrible illnesses like cancer and Alzheimer's still have no cure. Developments in Big Data may lead to a cure for these afflictions, or at least a remedy that lessens their negative effects.

Environment

Another area that has scientists baffled is our own oceans. They cover over 70% of our planet, and most of them have not been fully explored due to the immense pressure of the deep ocean. Some creatures have yet to be discovered, and advancements in Big Data could help develop new ocean technology to withstand the pressure in the deep. Furthermore, it could also identify how these marine animals survive in such a hostile environment, giving key insight into areas we cannot observe physically. Big Data can also help with the ongoing climate crisis by predicting where and when natural disasters are likely to occur. By measuring typical weather patterns and other environmental factors, Big Data can alert people to these disasters before they even occur. It can also inspire new ways to fight the climate crisis and make our living on the planet more sustainable.

Urban Planning

Traffic is the number one thing people like to complain about, and rightfully so. Poor infrastructure, bad planning, and overcrowding have resulted in mass amounts of vehicles, and even people, in the streets. Through urban planning, Big Data can find optimal ways to construct and redesign cities, distributing traffic effectively and keeping its flow stable. This can benefit our emergency services as well, as less traffic means quicker response times and less congestion. Big Data can find optimal ways of constructing cities that benefit everyone! How cool is that?

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-13

Strategies for Limiting the Negative Effects of Big Data

Big Data has positive and negative applications, and reducing the frequency and intensity of the negative effects is key to maintaining stability and ensuring these tools are used correctly. In class, we learned about some of the ways these negative effects have been limited. Here are a few examples:

GDPR and Legislation

Many ideas have surfaced about how to protect ourselves online, but unless they are fully implemented and acknowledged legally, little change will occur. The GDPR was introduced in 2016 and reformed data protection and privacy by ensuring data is used for legitimate reasons and only when necessary. It gives people peace of mind when using the internet and reduces the risk of cyber attacks leaking valuable, sensitive information. The Data Protection Act 2018 gives people the right to know how their data is being used, and gives transparency on what kind of information is being held. These strategies have prevented Big Data from being mishandled or used maliciously. Legislation is constantly being reviewed to keep up with the ever-changing digital world.

Anonymisation/Encryption

Anonymisation is a method of protecting individuals online by making those generating data more difficult to trace or identify. This reduces the number of data leaks that can be used to target people individually for personal gain. It adds a layer of privacy; however, it is not perfect, as cross-referencing data sets can still identify people, so it has flaws unless combined with sufficient encryption. Encryption is a technique that has been used since ancient times, but with the emergence of computers and Big Data it has grown more complex and advanced. It limits access to data by encoding information and making it unreadable. It uses ciphers, which make the data off limits to the general population; it can only be decrypted by certain people with a decryption key. Encryption played a crucial role in World War 2, where the Enigma system used by the Axis forces was decrypted by the Allies, giving them a significant upper hand during the conflict! How cool is that? Authentication and other tools used alongside encryption have become a lot more powerful since then with the introduction of Big Data. This allows safety and privacy to be maintained and protects sensitive information.
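As a rough sketch of how anonymisation might work in practice, the snippet below replaces a name with a salted hash using Python's built-in hashlib. The record fields and the salt are made up purely for illustration, and as noted above, a determined attacker could still cross-reference the remaining fields.

```python
import hashlib

def anonymise(record, salt="example-salt"):
    """Replace a direct identifier with a salted SHA-256 hash (pseudonymisation).

    The field names and salt here are invented for this example.
    """
    digest = hashlib.sha256((salt + record["name"]).encode()).hexdigest()
    return {"user_id": digest[:16], "age": record["age"]}

original = {"name": "Alice Smith", "age": 34}
safe = anonymise(original)
print(safe)  # the name is gone; only a pseudonymous id remains
```

The same input always produces the same id, so records can still be linked together for analysis without revealing who they belong to.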

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-12

Sunday, 12 January 2025

Limitations of Predictive Analytics

Being able to predict future trends is undoubtedly useful. Knowing an outcome ahead of time sounds great, but conducting an effective prediction is complicated. Here are some of the limitations of using prediction in Big Data:

Reliant on Accurate Data

Predictive algorithms are highly reliant on accurate, testable data. Veracity and variability are essential when conducting prediction. If the input data is skewed even slightly, it can offset the entire prediction. This means the outcome can change drastically if anomalies aren't picked up on. Furthermore, data that fluctuates may also change the outcome, as some algorithms aren't advanced enough to analyse complex shifts or spikes in patterns.

Reliant on Historical Data

Another limitation is the dependence on previous data. This might not be a problem for everyone, but it can be problematic for businesses. Without any previous data, predictions cannot be made, which means launching new stores or products cannot benefit from prediction. The same can be said of any application without former data: a foundation must be laid first.

Consequences of Inaccurate Prediction

Inaccurate prediction can result in many issues, and in some cases it may even cause harm. This is not cool. Areas of application include healthcare, where assumptions and shortcuts cannot afford to be wrong, as mistakes can cause physical harm to patients or misdiagnoses. It can also pose a threat in science, which has little room for error; slight mistakes in prediction can have adverse and unforeseen results which cost a lot of time and money. Lastly, it can pose a threat to people financially. Investors rely heavily on prediction, which means that if a predictive algorithm fails, it puts their capital and assets at risk.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751

Technological Requirements

Big Data has a lot of potential, however many technologies are needed. Here are some of the technological requirements we looked at in class:

Storage

To effectively use Big Data, there must be enough capacity to handle large data sets. This is because Big Data handles exabytes of information which, with current technology, requires a lot of physical space. Cloud storage is well suited to Big Data usage as it allows mass storage without a physical link. Certain applications are not ideal for Big Data as they lack the capacity required for large-scale storage; traditional databases are a prime example of this. On the contrary, applications like Hadoop are tailored to handle large volumes of data. The surplus of generated data in modern times creates a constant need for additional storage and more compact servers capable of holding more data.

Processing/Analysis

Once data has been compiled and stored, it can be managed and processed. This requires technology like data mining algorithms. Data mining can be conducted through various applications, many offered by Apache. The information is drawn from data lakes and data warehouses, where large-scale data sets are stored. Data mining allows data points to be extracted for their value, giving insight to data analysts. Machine learning can also be utilised to process data automatically and efficiently. The physical hardware required for larger-scale data processing is quite advanced and costs a lot; however, for small and medium scale data processing you can most likely use the device you're reading this on! How cool is that?

Visualisation

After data has been processed, it must be displayed. This can be done by putting data into graphs and charts, easily, through a number of applications and websites, as long as you can access the internet. Specialised applications can be used for larger data sets that still allow data to be displayed effectively. This allows conclusions to be drawn in an aesthetic, streamlined manner.

In summary, large data sets require special software and hardware, which can be expensive for an average person; however, this probably won't be an issue for companies and government institutions. Smaller sample sizes can still be tested by anyone with an average computer and access to the internet.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-17

Limitations of Traditional Statistics

Recently in class we learned about the limitations of traditional statistics. While they can be useful in some aspects, there are some disadvantages that limit their uses, especially when compared with Big Data. Here are a few we learned about:

Volume

One of the main setbacks of traditional statistics is that they cannot handle large data sets. When we consider the volume of Big Data, some applications cannot compare every single data point, which hinders the accuracy of the analysis. One example would be Microsoft, which uses traditional statistics in applications like Access and Excel. These are held back by their storage capacity, and even then, they struggle well before they reach their limit if the hardware being used is not powerful enough. These limitations result in data that cannot be analysed to its full potential.

Variety

Traditional statistics struggle to handle qualitative data. Furthermore, Big Data contains unstructured, semi-structured and structured data. While traditional statistics are well equipped to analyse numerical data, they struggle with unstructured data in particular and may find it difficult when presented with sentences, audio, or images. This is not cool.

Velocity

Another limitation would be velocity. When traditional statistics are used, it can take time to complete, compile, and conclude. This means traditional statistics cannot be used for real-time information like the stock market: by the time the results are ready, the original variables will have changed, meaning we cannot rely on them.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-6

Traditional Statistics

In class we learned about statistics. Traditional statistics are methods of analysing data using measures like averages and standard deviation. These are used to describe and test variables numerically. By using statistics, data can be compared and put into a table where value can be extracted. There are two types of statistics used:

Descriptive

These are used to summarise data using measures of central tendency to find 'typical' expected results and averages. The mean is calculated by adding up the total value of all the variables and then dividing by the number of variables. The median is found by sorting the variables into ascending or descending order and choosing the number in the middle of the sample; if the sample has an even number of values, the two middle numbers are added and then divided by 2. The mode can also be used, where the value with the most instances in the sample is chosen as the average. All measures of central tendency have their uses and may yield different results, but each gives data analysts insight into the contents of their data.
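The three averages above can be computed directly with Python's built-in statistics module; the sample numbers below are made up for illustration:

```python
from statistics import mean, median, mode

sample = [4, 7, 7, 2, 9, 7, 5, 2]

print(mean(sample))    # total of all values / number of values -> 5.375
print(median(sample))  # middle of the sorted sample (average of the two middles here) -> 6.0
print(mode(sample))    # the most frequent value -> 7
```

Notice how the three measures give three different "averages" for the same data, which is exactly why analysts look at more than one.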

Measures of variability are another type of descriptive statistic. These include the range, where the smallest value is subtracted from the largest value. Standard deviation is another measure of variability, measuring how far values 'deviate' on average from the mean. Measures of variability focus on how data points differ from each other.
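Likewise, the range and standard deviation only take a few lines of Python (again with made-up numbers):

```python
from statistics import stdev

sample = [4, 7, 7, 2, 9, 7, 5, 2]

# range: largest value minus smallest value
data_range = max(sample) - min(sample)
print(data_range)  # 9 - 2 -> 7

# sample standard deviation: how far values sit from the mean on average
print(round(stdev(sample), 2))  # -> 2.56
```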

Inferential

Inferential statistics use probability to calculate the reliability of results and decide how much room there is for randomness to occur. The P value is used to make inferences about wider populations, estimating how likely it is that a result appeared by chance. A P value of 0.05 translates to a 5% chance that the outcome is a product of pure chance, so if the P value is less than 0.05, the results are generally considered reliable. Confidence intervals are used to estimate and predict future events. For example, the P value of me passing this course is 0.01. How cool is that?
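As a rough illustration of what a P value means, the sketch below estimates one by simulation for an entirely made-up scenario: how surprising would 60 heads in 100 flips of a fair coin be?

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable

def simulated_p_value(observed_heads, flips=100, trials=10_000):
    """Estimate P(result at least this extreme) under a fair-coin null hypothesis
    by simulating many runs of `flips` coin tosses."""
    extreme = sum(
        1 for _ in range(trials)
        if sum(random.random() < 0.5 for _ in range(flips)) >= observed_heads
    )
    return extreme / trials

p = simulated_p_value(60)
print(p)  # roughly 0.03: under 0.05, so 60 heads would count as unlikely to be pure chance
```

By contrast, 50 heads out of 100 gives a P value around 0.5, i.e. completely unremarkable for a fair coin.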

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-6

Value of Data (Including Future Value)

In class we learned about what makes Big Data so valuable. The short answer is that it can provide information when its value is extracted; however, this can only be done if the right tools are used. On the surface, unstructured data has no value unless it is extracted and compared.

We learned about Meta, the company that owns many social media platforms like Facebook, Instagram, and WhatsApp. The company's physical assets are estimated to be worth between 30 and 40 billion dollars. However, with the inclusion of intangible assets, the total value of the company skyrockets to between 600 and 700 billion dollars. Obviously this figure isn't entirely data, but a lot of this wealth is generated by user data. Companies will pay money for this data and actually make a profit from it.

The value of data arises when it is sold to advertisers and data miners, where data will be consolidated into databases and used to create demographics. When the value is extracted, targeted adverts can be recommended to users who show interest in similar products. The irony is, a lot of these adverts will be shown on sites like Facebook or Instagram causing the process to repeat itself.

While data may not hold physical value, the information it gives when extracted properly does. It's true what they say: knowledge is power. Forecasts have predicted that the Big Data market alone will surpass 500 billion dollars by 2027, valued approximately the same as the GDP of the United Arab Emirates. How cool is that?

The value of data has many applications and can be generated/used in:

  • Business (E-commerce, marketing, fraud detection, inventory management)
  • Society (Education, government, crime)
  • Science (Healthcare, astronomy, AI, new technologies)

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-7

Friday, 10 January 2025

Reasons for the Growth of Data

During class we studied the reasons behind Big Data and how it emerged. There are multiple factors that contributed to this 'data explosion' but here are a few examples:

E-commerce

Online shopping is becoming more common with companies like Amazon, Shein, and eBay. This has allowed an excess of user data to flood in based on customer shopping habits. This was especially true during the COVID-19 pandemic, in which most people did their shopping online. This contributed massively to the rapid growth of Big Data and changed the way businesses market their products.

Automation

People traditionally used paper products in schools, universities and the workplace; however, over time we have become increasingly reliant on technology. This has generated a lot of new data and allows us to apply Big Data in new ways. This can be done on a national level and is applicable to most countries around the world that are making the switch to digital. Estonia, a small country in the Baltic region of Europe, proclaims to be the first country to become fully digital and paperless, as well as one of the most digitally advanced societies. Documents are submitted electronically, and even their voting system is digital! How cool is that? With the impending climate crisis and the automation of data, more countries are likely to make the switch, which will generate a lot more data.

Social Media and Accessibility

Similarly, the world has also been connected via the use of social media. This is primarily done through apps like Facebook, Instagram, TikTok, and Snapchat among others. These small-scale interactions are part of a larger network that interlinks user interests and behaviour. Digitisation in recent years means there is an increasing number of people using social media every single day. This has contributed massively to the growth of Big Data as more and more people sign up for accounts and interact on these apps.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-4

Thursday, 9 January 2025

Applications of Big Data

Today in class we covered the different ways data can be portrayed. We used two different applications to display this information.

Word cloud

Below is a word cloud I created that contains all the lyrics for 'She's Electric' by Oasis. The bigger words represent the words that are mentioned more frequently throughout the song. This was very quick to make and is very effective at displaying information.
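Under the hood, a word cloud is basically a word frequency count: the most common words get drawn the biggest. Here is a tiny sketch using Python's collections.Counter, with a made-up line standing in for the actual lyrics:

```python
from collections import Counter

# placeholder text standing in for the song lyrics
text = "she is electric she can be cool she is electric"

counts = Counter(text.lower().split())
for word, n in counts.most_common(3):
    print(word, n)  # the biggest words in the cloud are the most frequent ones
```

A word cloud site does exactly this counting step, then just maps each count to a font size.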


Machine Learning/Teachable AI

I personally find machine learning to be super exciting and interesting. Being able to create our own AI in class was a blast. It uses classification techniques and recognition to distinguish between certain criteria and give a certain response. For our example, we created an AI that was able to recognise if someone was wearing glasses or not and read a text-to-speech response saying "glasses" or "no glasses." I wanted to recreate this myself with a basic prompt. This time the AI would distinguish between apples and oranges, which were themselves created with AI imagery. I then generated a brand new image of an apple, an orange, and a mango to test the accuracy of the model. The results were surprisingly good and are listed below.

Some of the samples used for apples and oranges.

Apple test.

Orange test.

Mango test.
Overall, this was a lot of fun and I was surprised to see how accurate the results were. It's interesting how the model classifies the mango as majority apple but still has a small percentage of orange. How cool is that?

References:

https://www.wordclouds.com/

https://deepai.org/machine-learning-model/text2img

https://teachablemachine.withgoogle.com/models/574549pNR/

Monday, 6 January 2025

Data Mining Methods

Data mining is the process of extracting information from data. There are many different methods that serve different purposes, each with their own pros and cons. Here are some examples we learned about in class:

Classification

This method is used to sort data into different groups, categorising information in order to 'classify' it. It must be supervised when training. One application of this method would be a decision tree. This is very similar to a flow chart and goes through a series of yes or no questions to draw conclusions. It is also able to detect spam and fraudulent emails, how cool is that?
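A decision tree like this can be sketched as a chain of yes/no questions in code; the questions and the threshold below are invented purely for illustration, not a real spam filter:

```python
def is_spam(email):
    """A hand-written decision tree: each branch is a yes/no question
    about the email, just like following a flow chart."""
    if email["sender_known"]:
        return False  # known senders pass straight through
    if email["contains_link"] and email["mentions_money"]:
        return True   # classic phishing combination
    return email["exclamation_marks"] > 3  # shouty messages look suspicious

msg = {"sender_known": False, "contains_link": True,
       "mentions_money": True, "exclamation_marks": 1}
print(is_spam(msg))  # -> True
```

A real classifier learns these questions and thresholds from labelled training data rather than having them written by hand, which is why training must be supervised.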

Clustering

This is where data is grouped into small clusters, with each cluster containing points of similar value. It starts by branching out gradually and grouping each data point using the mean value. K-means is often used to handle large data sets (see diagram 1).
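A minimal one-dimensional version of k-means can be written in a few lines; the data points and starting centres below are made up:

```python
def k_means_1d(points, centres, rounds=10):
    """Minimal 1-D k-means: assign each point to its nearest centre,
    then move each centre to the mean of its group, and repeat."""
    for _ in range(rounds):
        groups = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centres = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centres)

data = [1, 2, 3, 20, 21, 22]
print(k_means_1d(data, centres=[0, 10]))  # -> [2.0, 21.0]
```

The two centres settle on the means of the two obvious groups; real k-means does the same thing but with distances in many dimensions at once.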

Prediction

This method takes existing data and uses it to 'forecast' future results. This can identify trends and has many applications in modern society, like predicting weather patterns or sales projections. It operates by looking at sequences within large data sets to confidently predict the next value. Pattern recognition has become very effective due to the influx of raw data and progression of data mining capabilities.
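A very simple forecast fits a straight line through past values and extends it one step forward; the sales figures below are made up for illustration:

```python
def forecast_next(values):
    """Fit a straight line (least squares) through past values and
    extend it one step to predict the next one."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values)) \
            / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return intercept + slope * n  # the line's value at the next time step

sales = [10, 12, 14, 16]  # made-up monthly sales figures
print(forecast_next(sales))  # -> 18.0
```

Real predictive systems use far more sophisticated models, but the idea is the same: learn the pattern in the sequence, then extrapolate it.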

Neural Networks

This is a method that is meant to simulate the inner workings of the human brain. It creates pathways using nodes, laid out in an input layer, hidden layer, and output layer. Neural networks have multiple applications and surpass other methods in dealing with unstructured data (see diagram 2).
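A single forward pass through a tiny network with one hidden layer can be sketched like this; the weights are arbitrary made-up numbers, whereas a real network would learn them from data during training:

```python
import math

def forward(inputs, hidden_weights, output_weights):
    """One forward pass: input layer -> one hidden layer of sigmoid
    nodes -> a single sigmoid output node."""
    sigmoid = lambda x: 1 / (1 + math.exp(-x))
    # each hidden node takes a weighted sum of the inputs
    hidden = [sigmoid(sum(w * i for w, i in zip(ws, inputs)))
              for ws in hidden_weights]
    # the output node takes a weighted sum of the hidden nodes
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

out = forward(inputs=[1.0, 0.5],
              hidden_weights=[[0.4, -0.6], [0.9, 0.1]],
              output_weights=[1.2, -0.8])
print(round(out, 3))  # a value between 0 and 1, read as a score or probability
```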

Outlier Detection

These operate by looking at data sets and finding anomalies within the data. This can be done by using standard deviation, but on a large scale. Outstanding figures can be identified and examined quickly, which helps purify a data set, resulting in more accurate results.
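A simple way to sketch this is the z-score: flag any value more than a couple of standard deviations from the mean. The sensor readings below are made up:

```python
from statistics import mean, stdev

def outliers(data, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

readings = [21, 22, 20, 23, 21, 22, 95]  # one suspicious spike
print(outliers(readings))  # -> [95]
```

Removing or double-checking the flagged value before analysis is exactly the 'purifying' step described above.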




References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-14


Implications of Big Data for Individuals

Big Data is an effective tool and has its uses. However, some of its effects have a lasting impact on individuals. Here are some of the implications we learned about in class:

News sources

Unreliable news websites contain biased information or may be misinformed. This can be used to fearmonger and generate a large number of clicks. Individuals may come across these sites in their spare time and become misinformed, especially young people, who are impressionable, as well as older people who aren't as familiar with technology. This directly targets people at the source and makes it hard to distinguish true from false, misleading people and spreading inaccurate information.

Digital Exclusion

This is a phenomenon in which certain people are unable to access technologies, typically affecting people with disabilities, the elderly, and those with poor finances. This means there are not as many opportunities available for these people, especially nowadays where everything is mostly online like job applications. Overall, this can result in limited access to educational materials, health services, and social interaction. This is not cool.

Lack of freedom

Being constantly monitored can induce a 'chilling effect', in which people act with less freedom because they are aware their movements are being recorded. This can alter behaviour, and people don't necessarily act as they would normally. People are limited in their choices and do not like to be restricted. The surge of Big Data over the years has accelerated so much that it tracks everything we do on a daily basis, our habits, everything about us. This is daunting. As a result, everyone acts with a little bit more caution. It's weird to imagine what the world would look like today without the acceleration of Big Data and surveillance.

References:

https://ilearn.fife.ac.uk/course/view.php?id=9751#section-11

Implications of Big Data for Society

In class we looked at the potential implications for society caused by Big Data. While Big Data can be a beneficial tool, it is very unpredictable and can cause issues when utilised. Its relatively new implementation can have adverse effects on society as a whole. Here are a few examples that we looked at:

Automated decision making

Algorithms make assumptions and aren't always accurate. When deciding an outcome, automation can overlook certain aspects. Consider an automated system conducting an online job interview, or marking an exam. Artificial intelligence in particular is poor at making deductions based on user input, and even when trained it still makes errors. Look at the YouTube video at the end for a funny example where an AI has to collect user input/clues and determine a target country. These errors can hinder society by hiring the wrong people for the job or handing out incorrect passes and fails. Automated decisions can also lead to filter bubbles or echo chambers, in which users only see information they are already acclimated to. This leads to fragmentation: lots of small groups with opposing ideas, in contrast to one big group where everyone is on the same page.

Cultural shifts

The use of Big Data is changing how people behave and how they socialise within their groups. Big Data can be weaponised to sculpt societies into behaving a certain way. The release of Facebook in Myanmar sparked controversy when the platform promoted hate speech against local ethnic groups in the region, perhaps in pursuit of profit, clicks, and user interactions. This caused violence and resulted in thousands taking refuge in neighbouring countries. This shows that Big Data can be used to shape society and can be dangerous. This is not cool.

Instability

Many uses of Big Data are controversial, even with the presence of the GDPR. Most people using Big Data are large companies or incredibly wealthy individuals. With such a powerful tool at their disposal, they can exert their influence in many ways, whether that is providing better opportunities to certain groups in society, oppression, or using data maliciously. This can weaken the social structure while they maintain complete control, resulting in social instability.

References:

https://youtu.be/iOfYZ-wMfNA?si=-f0Yhj4KmhzEE2Qi
https://www.bbc.co.uk/news/blogs-trending-45449938
https://ilearn.fife.ac.uk/course/view.php?id=9751#section-10
