Unstructured data  

What is unstructured data?

Data can be classified according to its origin, its rank, the type of language we want to work with, and so on. The most practical and general classification for working effectively in the digital world is based on structure. By this criterion there are three types of data: structured, semi-structured and unstructured.

Unstructured data is commonly estimated to account for around 80% of the volume of all data generated, and that share keeps growing. This data may have an internal structure, but it does not follow any predefined schema or data model.

It can be text or non-text data, it can be generated by a machine or by a person, and it can be stored in a NoSQL database or directly in a data lake.

The best-known examples are:

  • Text files: Word documents, spreadsheets, presentations, logs...
  • E-mails: the body of the message is unstructured; the rest of the information is usually semi-structured, as indicated above
  • Data from social networks such as Facebook, Twitter and LinkedIn
  • Data from websites such as YouTube, Instagram, etc.
  • Mobile data: messages, location, chats...
  • Pictures, video, audio, etc.
  • Weather data, satellite images, sensor data, etc.
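
The e-mail case in the list is a handy illustration of how one item can mix both kinds of data. The following is a minimal sketch using Python's standard `email` module; the sample message, the addresses and the `split_email` helper are invented for illustration. The headers come out as name/value pairs (semi-structured), while the body is just free text with no schema.

```python
from email import message_from_string

# Hypothetical sample message, invented for this example.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob,

The numbers look good this quarter. Let's talk tomorrow.
"""


def split_email(raw):
    """Return (headers, body) for a simple, non-multipart message."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # semi-structured: named fields
    body = msg.get_payload()      # unstructured: free text
    return headers, body


headers, body = split_email(RAW_EMAIL)
print(headers["Subject"])
print(body)
```

Querying the headers is straightforward (for example, filtering by sender), whereas extracting meaning from the body requires text-processing or machine-learning techniques.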

Working not just with unstructured data but with huge volumes of it is a real challenge. We are responding to it with new tools based on machine learning, new storage and computing models based on cloud systems, changes in traditional data engineering strategies (from ETL to ELT models), integration of native and open-source solutions, etc. On top of all this comes the complexity of responding in real time to a growing number of applications, such as those based on IoT devices, online commerce, etc.
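
To make the ETL-to-ELT shift more concrete, here is a small sketch of the ELT idea: raw data is loaded as-is first, and the transformation happens later, inside the storage system. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for a data lake or warehouse, and the log lines, the `raw_logs` table and the `count_levels` helper are hypothetical.

```python
import sqlite3

# Hypothetical raw log lines, invented for this example.
RAW_LOGS = [
    "2024-01-01 10:00:00 ERROR disk full",
    "2024-01-01 10:00:05 INFO backup started",
    "2024-01-01 10:01:00 ERROR disk full",
]


def count_levels(lines):
    """Load raw lines untouched, then transform them with SQL inside the store."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data lake or warehouse
    # Load: keep each line exactly as it arrived, with no schema imposed.
    conn.execute("CREATE TABLE raw_logs (line TEXT)")
    conn.executemany("INSERT INTO raw_logs (line) VALUES (?)",
                     [(line,) for line in lines])
    # Transform: extract the log level (the word starting at character 21)
    # at query time.
    rows = conn.execute(
        """
        SELECT substr(line, 21, instr(substr(line, 21), ' ') - 1) AS level,
               COUNT(*) AS n
        FROM raw_logs
        GROUP BY level
        ORDER BY level
        """
    ).fetchall()
    conn.close()
    return rows


print(count_levels(RAW_LOGS))
```

In a classic ETL pipeline the level would be parsed before loading, and only the parsed fields would be stored. Keeping the raw lines means that a different transformation can be applied later without re-ingesting the data.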

It is important to know what type of data is handled in each case in order to decide which resources and tools are most appropriate. This allows us to define the most efficient architectures that meet a company's needs with the best cost-benefit ratio.