Semi-structured data

What is semi-structured data?

Data can be classified by origin, by rank, by the type of language we want to work with, and so on. The most practical and general classification for working effectively in the digital world is based on structure. By this criterion, there are three types of data: structured, semi-structured and unstructured.

Semi-structured data does not have a rigid, predefined schema. It does not fit into a table/row/column format; instead, it is organised using labels or "tags" that group elements and create hierarchies. It is often associated with non-relational (NoSQL) storage.
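To make the idea of labels and hierarchies concrete, here is a minimal Python sketch. The two records and the helper name key_paths are hypothetical and used only for illustration. Both records describe the same kind of entity, but each carries different fields, which is what a fixed table layout would struggle to hold.

```python
import json

# Two hypothetical records of the same kind with different fields.
records_json = """
[
  {"name": "Ana", "contact": {"email": "ana@example.com"}},
  {"name": "Luis", "contact": {"phone": "555-0100"}, "tags": ["vip", "beta"]}
]
"""

def key_paths(value, prefix=""):
    """Return the set of dotted label paths found in a nested value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            # Recurse so that nested labels form a hierarchy.
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        # List items share the label of the list that contains them.
        for item in value:
            paths |= key_paths(item, prefix)
    return paths

records = json.loads(records_json)
paths_per_record = [key_paths(record) for record in records]

for paths in paths_per_record:
    print(sorted(paths))
```

Each record carries its own structure: the first exposes contact.email, while the second exposes contact.phone and tags. No shared schema has to be declared in advance.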

This type of data represents about 5-10% of the overall data volume. However, it has commercially important use cases in large data infrastructures and real-time web applications. Well-known services based on this type of data include the Amazon recommendation system and LinkedIn's services.

Many of the use cases are related to data transport, sensor data sharing, electronic data exchange, social media platforms, and NoSQL databases.

The best-known examples of semi-structured data are:

  • Emails, whose native metadata can be sorted and searched by keyword.
  • The XML markup language, whose flexible, tag-based structure allows data structure, storage and transport to be standardised across the Web.
  • The open standard JSON (JavaScript Object Notation), another semi-structured data exchange format, widely used to transmit data between web applications and servers.
  • NoSQL databases, which do not separate the schema from the data itself and are therefore more flexible. They make it possible to store information that does not fit the record/table format well, such as variable-length text. They also facilitate the exchange of data between different databases.
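As a rough illustration of the formats listed above, the following Python sketch reads the same small record from XML and from JSON, and then reads the metadata of an email. It uses only the standard library. The sample data and the helper name xml_to_dict are assumptions made up for this example, and the XML helper only handles a flat, one-level document.

```python
import json
import xml.etree.ElementTree as ET
from email import message_from_string

# Hypothetical sample data: the same record expressed in XML and in JSON.
xml_text = "<user><name>Ana</name><city>Madrid</city></user>"
json_text = '{"name": "Ana", "city": "Madrid"}'

def xml_to_dict(text):
    """Map each child tag of the root element to its text (flat documents only)."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

from_xml = xml_to_dict(xml_text)
from_json = json.loads(json_text)
print(from_xml == from_json)

# A hypothetical email: labelled headers (metadata) followed by free text.
email_text = (
    "From: ana@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "See attached.\n"
)
msg = message_from_string(email_text)
print(msg["From"], "|", msg["Subject"])
```

In all three cases the labels (tags, keys, header names) travel with the data itself, which is why the content can be searched or exchanged without a separately defined table schema.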