
MDM as a Competitive Advantage in Organizations

June 18, 2024 by Bluetab


Maryury García

Cloud | Data & Analytics

Just like natural resources, data acts as the driving fuel for innovation, decision-making, and value creation across various sectors. From large tech companies to small startups, digital transformation is empowering data to become the foundation that generates knowledge, optimizes efficiency, and offers personalized experiences to users.

Master Data Management (MDM) plays an essential role in providing a solid structure to ensure the integrity, quality, and consistency of data throughout the organization.

Although this discipline has existed since the mid-90s, some organizations have not fully adopted MDM. This may be due to various factors, such as a limited understanding of its benefits, cost, complexity, and/or maintenance effort.

According to a Gartner survey, the global MDM market was valued at $14.6 billion in 2022 and is expected to reach $24 billion by 2028, with a compound annual growth rate (CAGR) of 8.2%.

Figure 01: CAGR in the global MDM market

Before diving into the world of MDM, it is important to understand some relevant concepts. To manage master data, the first question we ask is: What is master data? Master data constitutes the set of shared, essential, and critical data for business execution. It has a lifecycle (validity period) and contains key information for the organization’s operation, such as customer data, product information, account numbers, and more.

Once master data is defined, it is important to understand its characteristics: it is unique, persistent, and integral, and has broad coverage, among other qualities. Understanding this is vital to ensuring consistency and quality.

Therefore, it is essential to have an approach that considers both organizational aspects (identification of data owners, impacted users, matrices, etc.) and processes (policies, workflows, procedures, and mappings). Bluetab's proposed approach is summarized across these dimensions.

Figure 02: Use case: master data approach

Another aspect to consider from our experience with master data, which is key to starting an organizational implementation, is understanding its "lifecycle." This includes:

  • The business areas on the input side of the master data (the areas that will consume the information).
  • The processes associated with the master data (those that create, block, report on, and update its attributes; in other words, the treatment the master data will undergo).
  • The areas on the output side of the master data (the areas that ultimately maintain it).
  • All of this is intertwined with the data owners and supported by the associated policies, procedures, and documentation.
Figure 03: Use case: master data lifecycle matrix

Why is Master Data Management (MDM) a "discipline"? Because it brings together a body of knowledge, policies, practices, processes, and technology (the tooling used to collect, store, manage, and analyze master data). This allows us to conclude that it is much more than just a tool.

Below, we provide some examples that will help to better understand the contribution of proper master data management in various sectors:

  • Retail Sector: Retail companies, for example a bakery, would use MDM to manage master data for product catalogs, customers, suppliers, employees, recipes, inventory, and locations. This creates a detailed customer profile that ensures a consistent, personalized shopping experience across all sales channels.
  • Financial Sector: Financial institutions could manage customer data, accounts, financial products, pricing, availability, historical transactions, and credit information. This helps improve the accuracy and security of financial transactions and operations, as well as verify customer identities before opening an account.
  • Healthcare Sector: In healthcare, MDM is used to manage patient, procedure, diagnostic, imaging, medical facility, and medication data, ensuring the integrity and privacy of confidential information. For example, a hospital can use MDM to generate an EMR (Electronic Medical Record) for each patient.
  • Telecommunications Sector: In telecommunications, companies use MDM to manage master data for their devices, services, suppliers, customers, and billing.

In Master Data Management, the following fundamental operations are performed: data cleaning, which removes duplicates; data enrichment, which ensures complete records; and the establishment of a single source of truth. The time it may take depends on the state of the organization’s records and its business objectives. Below, we can visualize the tasks that are carried out:

Figure 04: Key MDM tasks
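
To ground these tasks with something concrete, the sketch below shows how the cleaning and deduplication step could look in PySpark, the same engine used later in this blog. The bucket path and the column names (customer_id, name, email) are hypothetical placeholders, so treat this as a minimal illustration rather than a full MDM pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CustomerMasterCleaning").getOrCreate()

# Hypothetical raw customer extract; real sources and schemas will differ.
raw_customers = spark.read.csv("gs://example-bucket/raw/customers/*.csv", header=True, inferSchema=True)

# Data cleaning: normalise key fields and remove duplicates on the business key.
clean_customers = (
    raw_customers
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("name", F.initcap(F.trim(F.col("name"))))
    .dropDuplicates(["customer_id"])
)

# Completeness check: publish complete records as the single source of truth
# and set aside incomplete ones for data stewards to enrich.
complete = clean_customers.filter(F.col("email").isNotNull() & F.col("name").isNotNull())
to_review = clean_customers.subtract(complete)

complete.write.mode("overwrite").parquet("gs://example-bucket/master/customers/")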

Now that we have a clearer concept, it’s important to keep in mind that the strategy for managing master data is to keep it organized: up-to-date, accurate, non-redundant, consistent, and integral.

What benefits does implementing an MDM provide?

  • Data Quality and Consistency: Improves the quality of master data by eliminating duplicates and correcting errors, ensuring the integrity of information throughout the organization.
  • Efficiency and Resource Savings: Saves time and resources by automating tasks of data cleaning, enrichment, and updating, freeing up staff for more strategic tasks.
  • Informed Decision-Making: Allows the identification of patterns and trends from reliable data, driving strategic and timely decision-making.
  • Enhanced Customer Experience: Improves the customer experience by providing a 360-degree view of the customer, enabling more personalized and relevant interactions.

At Bluetab, we have helped clients from various industries with their master data strategy, from the definition, analysis, and design of the architecture to the implementation of an integrated solution. From this experience, we share these 5 steps to help you start managing master data:

List Your Objectives and Define a Scope

First, identify which data entities are of commercial priority within the organization. Once identified, evaluate the number of sources, definitions, exceptions, and volumes that the entities have.

Define the Data You Will Use

Which part of the data is important for decision-making? It could be all of the record's fields or just a few, such as name, address, and phone number. Get support from data governance personnel for these definitions.

Establish Processes and Owners

Who will be responsible for the rights to modify or create the data? For what, and how, will this data be used to reinforce or enhance the business? Once these questions are answered, it is important to have a process for how the information will be handled, from master data registration to its final sharing with users or applications.

Seek Scalability

Once you have defined the processes, try to ensure they can absorb future changes. Take the time to define your processes well now and avoid having to make drastic changes in the future.

Find the Right Data Architecture, Don’t Take Shortcuts

Once the previous steps are defined and completed, it's time to approach your Big Data & Analytics strategic partner to ensure these definitions are compatible with the systems or databases that house your company's information.

Figure 05: First steps with MDM

Final Considerations

Based on our experience, we suggest considering the following aspects when assessing/defining the process for each domain in master data management, subject to the project scope:

  • Management of Routes:
    • Consider how the owner of master data creation registers it (automatically, eliminating manual data entry from any other application), and how any current process in which one area or person manually centralizes the information from the other areas involved (emails, calls, Excel sheets, etc.) should be automated in a workflow.
  • Alerts & Notifications:
    • It is recommended to establish deadlines for each area and responsible party to complete their part of the data when a master data record is updated.
    • The time required to complete each data entry should be agreed upon by all the areas involved, and alerts should be configured to communicate when the master data has been updated.
  • Blocking and Discontinuation Processes:
    • A viable alternative is to make these changes operationally and then communicate them to the MDM through replication.
  • Integration:
    • Evaluate the possibility of integrating with third parties to automate the registration process for clients, suppliers, etc., and avoid manual entry: RENIEC, SUNAT, Google (coordinates X, Y, Z), or other agents, evaluating suitability for the business.
  • Incorporation of Third Parties:
    • Consider the incorporation of clients and suppliers at the start of the master data creation flows and at the points of updating.
Figure 06: MDM aspects to consider

In summary, master data is the most important common data for an organization and serves as the foundation for many day-to-day processes at the enterprise level. Master data management helps ensure that data is up-to-date, accurate, non-redundant, consistent, integral, and properly shared, providing tangible benefits in data quality, operational efficiency, informed decision-making, and customer experience. This contributes to the success and competitiveness of the organization in an increasingly data-driven digital environment.

If you found this article interesting, we appreciate you sharing it. At Bluetab, we look forward to hearing about the challenges and needs you have in your organization regarding master and reference data.

Maryury García

Cloud | Data & Analytics


Posted in: Blog, Tech

Oscar Hernández, new CEO of Bluetab LATAM.

May 16, 2024 by Bluetab


Bluetab

  • Oscar assumes the responsibility of developing, leading, and executing Bluetab's strategy in Latin America, with the aim of expanding the company's products and services.
  • Bluetab in the Americas began operations in 2012 and has a presence in Colombia, Mexico, and Peru.

Oscar Hernández Rosales takes on the responsibility as CEO of Bluetab LATAM and will be in charge of developing, leading, and executing Bluetab’s strategy in the region, with the objective of expanding the company’s products and services to support the continuous digital transformation of its clients and the creation of value.

During this transition, Oscar will continue to serve as Country Manager of Mexico, ensuring effective coordination between our local operations and our regional strategy, further strengthening our position in the market.

"This new challenge is a privilege for me. I am committed to leading with vision, continuing to strengthen the Bluetab culture, and working for the well-being of our collaborators and the success of the business. An important challenge in an industry that constantly adapts to the evolution of new technologies. Bluetab innovates, anticipates, and is dedicated to providing the best customer experience, supported by a professional, talented, and passionate team that understands the needs of organizations," says Oscar.


Posted in: Blog, News

Leadership changes at Bluetab EMEA

April 3, 2024 by Bluetab


Bluetab

  • Luis Malagón, as the new CEO of Bluetab EMEA, assumes the highest position in the company in the region.
  • Meanwhile, Tom Uhart will continue to drive the development of the Data and Artificial Intelligence Offering, enhancing Bluetab's positioning.

Photo: Luis Malagón, CEO of Bluetab EMEA, and Tom Uhart, Co-Founder and Data & AI Offering Lead

Luis Malagón becomes the new CEO of Bluetab EMEA after more than 10 years of experience within the company, having contributed significantly to its success and positioning. His proven leadership qualities position him perfectly to drive Bluetab in its next phase of growth.

‘This new challenge leading the EMEA region is a great opportunity to continue fostering a customer-oriented culture and enhancing their transformation processes. Collaboration is part of our DNA, and this, combined with an exceptional team, positions us in the right place at the right time. Together with IBM Consulting, we will continue to lead the market in Data and Artificial Intelligence solutions’, states Luis.

At Bluetab, we have been leading the data sector for nearly 20 years. Throughout this time, we have adapted to various trends and accompanied our clients in their digital transformation, and now we continue to do so with the arrival of Generative AI.

Tom Uhart’s new journey

Tom Uhart, Co-Founder of Bluetab and until now CEO of EMEA, will continue to drive the project from his new role as Data & AI Offering Lead. In this way, Tom will continue to enhance the company’s positioning and international expansion hand in hand with the IBM group and other key players in the sector.

‘Looking back, I am very proud to have seen Bluetab grow over all these years. A team that stands out for its great technical talent, rebellious spirit, and culture of closeness. We have achieved great goals, overcome obstacles, and created a legacy of which we can all be proud. Now it's time to leave the next stage of Bluetab's growth in Luis's hands, which I am sure will be a great success and will take the company to the next level’, says Tom.


Posted in: Blog, News


Boost Your Business with GenAI and GCP: Simple and for Everyone

March 27, 2024 by Bluetab

Alfonso Zamora
Cloud Engineer

Introduction

The main goal of this article is to present a solution for data analysis and engineering from a business perspective, without requiring specialized technical knowledge.

Companies run a large number of data engineering processes to extract the most value from their business, and these sometimes involve very complex solutions for the use case at hand. Here we propose to simplify that work, so that a business user who previously could not develop and implement the technical part becomes self-sufficient and can build their own technical solutions using natural language.

To fulfill our goal, we will make use of various services from the Google Cloud platform to create both the necessary infrastructure and the different technological components to extract all the value from business information.

Before we begin

Before we begin with the development of the article, let’s explain some basic concepts about the services and different frameworks we will use for implementation:

  1. Cloud Storage[1]: It is a cloud storage service provided by Google Cloud Platform (GCP) that allows users to securely and scalably store and retrieve data.
  2. BigQuery[2]: It is a fully managed data analytics service that allows you to run SQL queries on massive datasets in GCP. It is especially effective for large-scale data analysis.
  3. Terraform[3]: It is an infrastructure as code (IaC) tool developed by HashiCorp. It allows users to describe and manage infrastructure using configuration files in the HashiCorp Configuration Language (HCL). With Terraform, you can define resources and providers declaratively, making it easier to create and manage infrastructure on platforms like AWS, Azure, and Google Cloud.
  4. PySpark[4]: It is a Python interface for Apache Spark, an open-source distributed processing framework. PySpark makes it easy to develop parallel and distributed data analysis applications using the power of Spark.
  5. Dataproc[5]: It is a cluster management service for Apache Spark and Hadoop on GCP that enables efficient execution of large-scale data analysis and processing tasks. Dataproc supports running PySpark code, making it easy to perform distributed operations on large datasets in the Google Cloud infrastructure.

What is an LLM?

An LLM (Large Language Model) is a type of artificial intelligence (AI) algorithm that utilizes deep learning techniques and massive datasets to comprehend, summarize, generate, and predict new content. An example of an LLM could be ChatGPT, which makes use of the GPT model developed by OpenAI.

In our case, we will be using the Codey model (code-bison), a model implemented by Google as part of the VertexAI stack and optimized for code generation, as it has been trained specifically for this specialization.

However, what matters is not only which model we use but also how we use it: we need to understand the input parameters that directly affect the responses the model will return. The most relevant ones are listed below, and a short sketch after the list shows how they are typically passed to the model:

  • Temperature: This parameter controls the randomness in the model’s predictions. A low temperature, such as 0.1, generates more deterministic and focused results, while a high temperature, such as 0.8, introduces more variability and creativity in the model’s responses.
  • Prefix (Prompt): The prompt is the input text provided to the model to initiate text generation. The choice of prompt is crucial as it guides the model on the specific task expected to be performed. The formulation of the prompt can influence the quality and relevance of the model’s responses, although the length should be considered to meet the maximum number of input tokens, which is 6144.
  • Output Tokens (max_output_tokens): This parameter limits the maximum number of tokens that will be generated in the output. Controlling this value is useful for avoiding excessively long responses or for adjusting the output length according to the specific requirements of the application.
  • Candidate Count: This parameter controls the number of candidate responses the model generates before selecting the best option. A higher value can be useful for exploring various potential responses, but it will also increase computational cost.
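
As a reference, the snippet below sketches how these parameters are typically passed to code-bison through the Vertex AI Python SDK. The project ID, region, and prompt text are placeholders, and the exact SDK surface may change between versions, so read it as an orientation rather than the exact implementation used in ai_gen.py.

import vertexai
from vertexai.language_models import CodeGenerationModel

# Placeholder project and region; replace with your own GCP settings.
vertexai.init(project="my-cloud-project", location="europe-southwest1")

model = CodeGenerationModel.from_pretrained("code-bison")

prompt = "Generate a PySpark job that reads gs://my-bucket/input/*.csv and writes it to BigQuery."

response = model.predict(
    prefix=prompt,           # the prompt (prefix) that guides the generation
    temperature=0.2,         # low value: more deterministic output
    max_output_tokens=2048,  # upper bound on the length of the generated code
)
# candidate_count can also be configured in the underlying API (we keep it at 1 in this article).

print(response.text)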

Development of the prompt

Now that we have defined the parameters, understood what each of them is for, and grasped what a prompt is, let's focus on how to use one and implement a prompt that adapts to our needs.

As mentioned earlier, the goal is to generate both PySpark code and Terraform in order to perform infrastructure creation and data processing tasks. Since these are completely different tasks, as a first important decision for our prompt, we have chosen to divide it into two specific parts so that each prompt is trained with examples to generate one language or the other.

Each prompt starts with an introduction that specifies the objective and the kind of requests that will be made, followed by a series of examples in which a natural-language input simulates a request and the desired output maps that text to the corresponding code. The goal is to produce a structured prompt that the model can process efficiently, so that in subsequent cases it can associate the available examples with appropriate responses.
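
As a minimal sketch of this structure, the helper below assembles a few-shot prompt from an instruction plus input/output pairs; the instruction text and the single example are illustrative placeholders, not the actual prompt stored in the repository.

# Hypothetical few-shot prompt builder: an instruction followed by input/output example pairs.
TERRAFORM_INSTRUCTIONS = (
    "You generate Terraform (HCL) code for Google Cloud from natural-language requests. "
    "Answer only with code."
)

examples = [
    {
        "input": "Create a bucket called `application-data-input` in `europe-southwest1`.",
        "output": 'resource "google_storage_bucket" "bucket1" { ... }',
    },
    # ...more input/output pairs covering the requests we expect from business users
]

def build_prompt(user_request: str) -> str:
    parts = [TERRAFORM_INSTRUCTIONS]
    for example in examples:
        parts.append(f"input: {example['input']}\noutput: {example['output']}")
    parts.append(f"input: {user_request}\noutput:")
    return "\n\n".join(parts)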

Let's put these tips into practice and look at one of the entries in the Terraform prompt:

input: I need to read a file named `myfile.csv` inside the folder `data/input` in a bucket called `application-data-input`, and I need to create a new dataset and table in BigQuery called `tst_tf_ai`. The process has to read the file, remove the first three characters of the name, and write the result in BigQuery. The schema of the table will be ‘name’ of String type and ‘surname’ of String type. The infrastructure has to be deployed in the project `project-cloud-223` in GCP, and region `europe-southwest1`.
output: ```hcl
terraform {
 required_version = ">= 0.15"
 required_providers {
  google = "~> 5.0"
 }
}

resource "google_storage_bucket" "temporary_files" {
 project       = "project-cloud-223"
 name          = "temporary-files-dataproc"
 location      = "europe-southwest1"
 force_destroy = true
}

resource "google_storage_bucket" "bucket1" {
 project       = "project-cloud-223"
 name          = "application-data-input"
 location      = "europe-southwest1"
 force_destroy = true
}

resource "google_storage_bucket_object" "folder1" {
 name    = "data/input/"
 bucket  = google_storage_bucket.bucket1.name
 content = " "
}

resource "google_bigquery_dataset" "dataset" {
 project       = "project-cloud-223"
 dataset_id    = "tst_tf_ai"
 friendly_name = "tst_tf_ai"
 description   = "Description"
 location      = "europe-southwest1"
}

resource "google_bigquery_table" "table" {
 project              = "project-cloud-223"
 dataset_id           = google_bigquery_dataset.dataset.dataset_id
 deletion_protection  = false
 table_id             = "tst_tf_ai"
 schema               = <<EOF
[
 {
  "name": "name",
  "type": "STRING",
  "mode": "NULLABLE",
  "description": "The name"
 },
 {
  "name": "surname",
  "type": "STRING",
  "mode": "NULLABLE",
  "description": "The surname"
 }
]
EOF
}
```


It's important to write examples as close as possible to your use case so that the responses are more accurate, and to provide plenty of examples covering a variety of requests so the model is smarter when returning responses. One practice that makes prompt development more iterative is to try different requests and, whenever the model can't do what was asked, refine the instructions.

As we have observed, developing the prompt does require technical knowledge to translate requests into code, so this task should be tackled by a technical person to subsequently empower the business user. In other words, we need a technical person to generate the initial knowledge base so that business users can then make use of these types of tools.

It has also been noticed that generating Terraform code is more complex than generating PySpark code, so more input examples were required when creating the Terraform prompt to tailor it to our use case. For example, the examples enforce that the Terraform output always creates a temporary bucket (temporary-files-dataproc) so that it can be used by Dataproc.

Practical Cases

Three examples have been carried out with different requests, requiring more or less infrastructure and transformations to see if our prompt is robust enough.

In the file ai_gen.py, we see the necessary code to make the requests and the three examples, in which it is worth highlighting the configuration chosen for the model parameters:

  • It has been decided to set the value of candidate_count to 1 so that it has no more than one valid final response to return. Additionally, as mentioned, increasing this number also entails increased costs.
  • The max_output_tokens has been set to 2048, which is the highest number of tokens for this model, as if it needs to generate a response with various transformations, it won’t fail due to this limitation.
  • The temperature has been varied between the Terraform and PySpark code. For Terraform, we have opted for 0 so that it always gives the response that is considered closest to our prompt, ensuring it doesn’t generate more than strictly necessary for our objective. In contrast, for PySpark, we have opted for 0.2, which is a low temperature to prevent excessive creativity, yet still allowing it to provide diverse responses with each call, enabling performance testing among them.

We are going to run one of the example requests, which is available in the GitHub repository[8], where the README details step by step how to execute it yourself. The request is as follows:

In the realm of ‘customer_table,’ my objective is the seamless integration of pivotal fields such as ‘customer_id’, ‘name’, and ‘email’. These components promise to furnish crucial insights into the essence of our valued customer base.

Conversely, when delving into the nuances of ‘sales_table,’ the envisioned tapestry includes essential elements like ‘order_id’, ‘product’, ‘price’, ‘amount’ and ‘customer_id’. These attributes, meticulously curated, will play a pivotal role in the nuanced exploration and analysis of sales-related data.

The ‘bigtable_info’ table will have all the fields resulting from the union of the two tables, ‘customer_table’ and ‘sales_table.’ Here, the outcome of joining the two tables by the ‘customer_id’ numeric field will be stored.

Furthermore, as part of our meticulous data collection strategy, I plan to inaugurate a dedicated Google Cloud Storage bucket christened ‘sales-etl-bucket.’ This repository is strategically designed to serve as a robust container for collating data, particularly hailing from CSV files. This endeavor is set to enhance the efficiency and organizational prowess of our data management processes.

To enhance the structure, two distinct subfolders, ‘input/sales’ and ‘input/customers’ have been ingeniously incorporated within the ‘sales-etl-bucket,’ ensuring a refined organization of the incoming data streams.

You will need to read the CSV files within the ‘input/sales’ folder to write the data into the ‘sales_table’. Additionally, you should perform the same operation with the files from the ‘input/customers’ folder to write the data into the ‘customer_table’. Finally, you are required to perform a join between the sales and customers information based on the ‘customer_id’ number field and write the merged information into the ‘bigtable_info’.


We execute it and analyze the outputs generated in the files generated/customers/python/etl.py and generated/terraform/main.tf.

terraform {
 required_version = ">= 0.15"
 required_providers {
   google = "~> 5.0"
 }
}


resource "google_storage_bucket" "temporary_files" {
 project       = "my-cloud-project"
 name          = "temporary-files-dataproc"
 location      = "europe-southwest1"
 force_destroy = true
}


resource "google_storage_bucket" "sales_etl_bucket" {
 project       = "my-cloud-project"
 name          = "sales-etl-bucket"
 location      = "europe-southwest1"
 force_destroy = true
}


resource "google_storage_bucket_object" "sales_folder" {
 name    = "input/sales/"
 bucket  = google_storage_bucket.sales_etl_bucket.name
 content = " "
}


resource "google_storage_bucket_object" "customers_folder" {
 name    = "input/customers/"
 bucket  = google_storage_bucket.sales_etl_bucket.name
 content = " "
}


resource "google_bigquery_dataset" "raw_sales_data" {
 project       = "my-cloud-project"
 dataset_id    = "raw_sales_data"
 friendly_name = "Raw Sales Data"
 description   = "This dataset contains the raw sales data."
 location      = "europe-southwest1"
}


resource "google_bigquery_table" "customer_table" {
 project              = "my-cloud-project"
 dataset_id           = google_bigquery_dataset.raw_sales_data.dataset_id
 deletion_protection  = false
 table_id             = "customer_table"
 schema               = <<EOF
[
 {
   "name": "customer_id",
   "type": "INT64",
   "mode": "REQUIRED",
   "description": "The customer ID."
 },
 {
   "name": "name",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The customer's name."
 },
 {
   "name": "email",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The customer's email address."
 }
]
EOF
}


resource "google_bigquery_table" "sales_table" {
 project              = "my-cloud-project"
 dataset_id           = google_bigquery_dataset.raw_sales_data.dataset_id
 deletion_protection  = false
 table_id             = "sales_table"
 schema               = <<EOF
[
 {
   "name": "order_id",
   "type": "INT64",
   "mode": "REQUIRED",
   "description": "The order ID."
 },
 {
   "name": "product",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The product name."
 },
 {
   "name": "price",
   "type": "FLOAT64",
   "mode": "NULLABLE",
   "description": "The product price."
 },
 {
   "name": "amount",
   "type": "INT64",
   "mode": "NULLABLE",
   "description": "The product amount."
 },
 {
   "name": "customer_id",
   "type": "INT64",
   "mode": "REQUIRED",
   "description": "The customer ID."
 }
]
EOF
}


resource "google_bigquery_dataset" "master_sales_data" {
 project       = "my-cloud-project"
 dataset_id    = "master_sales_data"
 friendly_name = "Master Sales Data"
 description   = "This dataset contains the master sales data."
 location      = "europe-southwest1"
}


resource "google_bigquery_table" "bigtable_info" {
 project              = "my-cloud-project"
 dataset_id           = google_bigquery_dataset.master_sales_data.dataset_id
 deletion_protection  = false
 table_id             = "bigtable_info"
 schema               = <<EOF
[
 {
   "name": "customer_id",
   "type": "INT64",
   "mode": "REQUIRED",
   "description": "The customer ID."
 },
 {
   "name": "name",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The customer's name."
 },
 {
   "name": "email",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The customer's email address."
 },
 {
   "name": "order_id",
   "type": "INT64",
   "mode": "REQUIRED",
   "description": "The order ID."
 },
 {
   "name": "product",
   "type": "STRING",
   "mode": "NULLABLE",
   "description": "The product name."
 },
 {
   "name": "price",
   "type": "FLOAT64",
   "mode": "NULLABLE",
   "description": "The product price."
 },
 {
   "name": "amount",
   "type": "INT64",
   "mode": "NULLABLE",
   "description": "The product amount."
 }
]
EOF
}


We can see that it adds a bucket with two folders, so the data can be uploaded to the corresponding folder depending on whether it is sales or customer data. It has also created two datasets: one for the raw layer, holding the customer and sales tables with the data as it arrived from the source, and a second dataset for the final table that will store the result of joining the information from both tables.

Let’s review the PySpark code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


# Configure Spark session
spark = SparkSession.builder.appName("SalesETL").getOrCreate()


# Define file paths
sales_path = "gs://sales-etl-bucket/input/sales/*.csv"
customers_path = "gs://sales-etl-bucket/input/customers/*.csv"


# Read CSV files into DataFrames
sales_df = spark.read.csv(sales_path, header=True, inferSchema=True)
customers_df = spark.read.csv(customers_path, header=True, inferSchema=True)


# Write tables to BigQuery
sales_df.write.format("bigquery") \
   .option("temporaryGcsBucket", "temporary-files-dataproc") \
   .option("table",  "raw_sales_data.sales_table") \
   .mode("overwrite") \
   .save()
customers_df.write.format("bigquery") \
   .option("temporaryGcsBucket", "temporary-files-dataproc") \
   .option("table",  "raw_sales_data.customer_table") \
   .mode("overwrite") \
   .save()


# Join sales and customers tables
bigtable_info_df = sales_df.join(customers_df, on="customer_id", how="inner")


# Write joined table to BigQuery
bigtable_info_df.write.format("bigquery") \
   .option("temporaryGcsBucket", "temporary-files-dataproc") \
   .option("table",  "master_sales_data.bigtable_info") \
   .mode("overwrite") \
   .save()


# Stop the Spark session
spark.stop()

It can be observed that the generated code reads from each of the folders and inserts the data into its corresponding table.

To make sure the example has been carried out correctly, we can follow the steps in the README of the GitHub repository[8] to apply the changes to the Terraform code, upload the sample files available in the example_data folder, and run a Batch job in Dataproc.

Finally, we check if the information stored in BigQuery is correct:

  • Customer table:
  • Sales table:
  • Final table:

In this way, we have achieved a fully operational, functional process driven by natural language. There is another example that can be executed, and I also encourage you to create more examples, or even improve the prompt, incorporating more complex cases and adapting it to your use case.

Conclusions and Recommendations

As the examples are very specific to particular technologies, any change to an example in the prompt, or even to a single word in the input request, can affect the results. This means the prompt is not yet robust enough to absorb different expressions without affecting the generated code. To reach a production-ready prompt and system, more training material and a wider variety of solutions, requests, and expressions are needed. With all this, we will finally have a first version to present to our business users so that they can be autonomous.

Specifying the maximum possible detail to an LLM is crucial for obtaining precise and contextual results. Here are several tips to keep in mind to achieve appropriate results:

  • Clarity and Conciseness:
    • Be clear and concise in your prompt, avoiding long and complicated sentences.
    • Clearly define the problem or task you want the model to address.
  • Specificity:
    • Provide specific details about what you are looking for. The more precise you are, the better results you will get.
  • Variability and Diversity:
    • Consider including different types of examples or cases to assess the model’s ability to handle variability.
  • Iterative Feedback:
    • If possible, iterate on your prompt based on the results obtained and the model’s feedback.
  • Testing and Adjustment:
    • Before using the prompt extensively, test it with examples and adjust as needed to achieve desired results.

Future Perspectives

In the field of LLMs, future lines of development focus on improving the efficiency and accessibility of language model implementation. Here are some key improvements that could significantly enhance user experience and system effectiveness:

1. Use of different LLM models:

The inclusion of a feature that allows users to compare the results generated by different models would be essential. This feature would provide users with valuable information about the relative performance of the available models, helping them select the most suitable model for their specific needs in terms of accuracy, speed, and required resources.

2. User feedback capability:

Implementing a feedback system that allows users to rate and provide feedback on the generated responses could be useful for continuously improving the model’s quality. This information could be used to adjust and refine the model over time, adapting to users’ changing preferences and needs.

3. RAG (Retrieval-augmented generation)

RAG (Retrieval-augmented generation) is an approach that combines text generation and information retrieval to enhance the responses of language models. It involves using retrieval mechanisms to obtain relevant information from a database or textual corpus, which is then integrated into the text generation process to improve the quality and coherence of the generated responses.

Links of Interest

Cloud Storage[1]: https://cloud.google.com/storage/docs

BigQuery[2]: https://cloud.google.com/bigquery/docs

Terraform[3]: https://developer.hashicorp.com/terraform/docs

PySpark[4]: https://spark.apache.org/docs/latest/api/python/index.html

Dataproc[5]: https://cloud.google.com/dataproc/docs

Codey[6]: https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/code-generation

VertexAI[7]: https://cloud.google.com/vertex-ai/docs

GitHub[8]: https://github.com/alfonsozamorac/etl-genai


Posted in: Blog, Practices, Tech

Container vulnerability scanning with Trivy

March 22, 2024 by Bluetab


Ángel Maroco

AWS Cloud Architect

Within the framework of container security, the build phase is of vital importance, as we need to select the base image on which applications will run. Not having automatic mechanisms for vulnerability scanning can lead to production environments running insecure applications, with the risks that this involves.

In this article we will cover vulnerability scanning using Aqua Security’s Trivy solution, but before we begin, we need to explain what the basis is for these types of solutions for identifying vulnerabilities in Docker images.

Introduction to CVE (Common Vulnerabilities and Exposures)

CVE is a list maintained by the MITRE Corporation that centralises the records of known security vulnerabilities. Each entry has a CVE-ID, a description of the vulnerability, the software versions affected, a possible fix for the flaw (if any) or configuration guidance to mitigate it, and references to publications or posts in forums or blogs where the vulnerability has been made public or its exploitation demonstrated.

The CVE-ID provides a standard naming convention for uniquely identifying a vulnerability. Vulnerabilities are classified into five severity levels, which we will look at in the Interpreting the analysis section. These levels are assigned based on different metrics (if you are curious, see CVSS Calculator v3).

CVE has become the standard for vulnerability recording, so it is used by the great majority of technology companies and individuals.

There are various channels for keeping informed of all the news related to vulnerabilities: official blog, Twitter, cvelist on GitHub and LinkedIn.

If you want more detailed information about a vulnerability, you can also consult the NIST website, specifically the NVD (National Vulnerability Database).

We invite you to search for one of the following critical vulnerabilities; it is quite possible that they have affected you directly or indirectly. We should forewarn you that they have been among the most talked about:

  • CVE-2017-5753
  • CVE-2017-5754

If you detect a vulnerability, we encourage you to register it via the official CVE request form.

Aqua Security – Trivy

Trivy is an open source tool focused on detecting vulnerabilities in OS-level packages and dependency files for various languages:

  • OS packages: (Alpine, Red Hat Universal Base Image, Red Hat Enterprise Linux, CentOS, Oracle Linux, Debian, Ubuntu, Amazon Linux, openSUSE Leap, SUSE Enterprise Linux, Photon OS and Distroless)

  • Application dependencies: (Bundler, Composer, Pipenv, Poetry, npm, yarn and Cargo)

Aqua Security, a company specialising in development of security solutions, acquired Trivy in 2019. Together with a substantial number of collaborators, they are responsible for developing and maintaining it.

Installation

Trivy has installers for most Linux and MacOS systems. For our tests, we will use the generic installer:

curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/master/contrib/install.sh | sudo sh -s -- -b /usr/local/bin 

If we do not want to persist the binary on our system, we have a Docker image:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/trivycache:/root/.cache/ aquasec/trivy python:3.4-alpine 

Basic operations

  • Local images

We build a local image and analyse it:

#!/bin/bash
docker build -t cloud-practice/alpine:latest -<<EOF
FROM alpine:latest
RUN echo "hello world"
EOF

trivy image cloud-practice/alpine:latest 
  • Remote images
#!/bin/bash
trivy image python:3.4-alpine 
  • Local projects:
    Enables you to analyse dependency files (lock files):
    • Pipfile.lock: Python
    • package-lock_react.json: React
    • Gemfile_rails.lock: Rails
    • Gemfile.lock: Ruby
    • Dockerfile: Docker
    • composer_laravel.lock: PHP Laravel
    • Cargo.lock: Rust
#!/bin/bash
git clone https://github.com/knqyf263/trivy-ci-test
trivy fs trivy-ci-test 
  • Public repositories:
#!/bin/bash
trivy repo https://github.com/knqyf263/trivy-ci-test 
  • Private image repositories:
    • Amazon ECR (Elastic Container Registry)
    • Docker Hub
    • GCR (Google Container Registry)
    • Private repositories with BasicAuth
  • Cache database
    The vulnerability database is hosted on GitHub. To avoid downloading this database in each analysis operation, you can use the --cache-dir <dir> parameter:
#!/bin/bash
trivy --cache-dir .cache/trivy image python:3.4-alpine3.9 
  • Filter by severity
#!/bin/bash
trivy image --severity HIGH,CRITICAL ruby:2.4.0 
  • Filter unfixed vulnerabilities
#!/bin/bash
trivy image --ignore-unfixed ruby:2.4.0 
  • Specify exit code
    This option is very useful in continuous integration processes, as we can make the pipeline end in error when critical vulnerabilities are found, while medium and high severities still let it finish successfully.
#!/bin/bash
trivy image --exit-code 0 --severity MEDIUM,HIGH ruby:2.4.0
trivy image --exit-code 1 --severity CRITICAL ruby:2.4.0 
  • Ignore specific vulnerabilities
    You can specify those CVEs you want to ignore by using the .trivyignore file. This can be useful if the image contains a vulnerability that does not affect your development.
#!/bin/bash
cat .trivyignore
# Accept the risk
CVE-2018-14618

# No impact in our settings
CVE-2019-1543 
  • Export output in JSON format:
    This option is useful if you want to automate a process based on the output, display the results in a custom front end, or persist the output in a structured format.
#!/bin/bash
trivy image -f json -o results.json golang:1.12-alpine
cat results.json | jq 
  • Export output in SARIF format:
    There is a standard called SARIF (Static Analysis Results Interchange Format) that defines the format for outputs that any vulnerability analysis tool should have.
#!/bin/bash
wget https://raw.githubusercontent.com/aquasecurity/trivy/master/contrib/sarif.tpl
trivy image --format template --template "@sarif.tpl" -o report-golang.sarif  golang:1.12-alpine
cat report-golang.sarif   

VS Code has the sarif-viewer extension for viewing vulnerabilities.

Continuous integration processes

Trivy has templates for the leading CI/CD solutions:

  • GitHub Actions
  • Travis CI
  • CircleCI
  • GitLab CI
  • AWS CodePipeline
#!/bin/bash
$ cat .gitlab-ci.yml
stages:
  - test

trivy:
  stage: test
  image: docker:stable-git
  before_script:
    - docker build -t trivy-ci-test:${CI_COMMIT_REF_NAME} .
    - export VERSION=$(curl --silent "https://api.github.com/repos/aquasecurity/trivy/releases/latest" | grep '"tag_name":' | sed -E 's/.*"v([^"]+)".*/\1/')
    - wget https://github.com/aquasecurity/trivy/releases/download/v${VERSION}/trivy_${VERSION}_Linux-64bit.tar.gz
    - tar zxvf trivy_${VERSION}_Linux-64bit.tar.gz
  variables:
    DOCKER_DRIVER: overlay2
  allow_failure: true
  services:
    - docker:stable-dind
  script:
    - ./trivy --exit-code 0 --severity HIGH --no-progress --auto-refresh trivy-ci-test:${CI_COMMIT_REF_NAME}
    - ./trivy --exit-code 1 --severity CRITICAL --no-progress --auto-refresh trivy-ci-test:${CI_COMMIT_REF_NAME} 

Interpreting the analysis

#!/bin/bash
trivy image httpd:2.2-alpine
2020-10-24T09:46:43.186+0200    INFO    Need to update DB
2020-10-24T09:46:43.186+0200    INFO    Downloading DB...
18.63 MiB / 18.63 MiB [---------------------------------------------------------] 100.00% 8.78 MiB p/s 3s
2020-10-24T09:47:08.571+0200    INFO    Detecting Alpine vulnerabilities...
2020-10-24T09:47:08.573+0200    WARN    This OS version is no longer supported by the distribution: alpine 3.4.6
2020-10-24T09:47:08.573+0200    WARN    The vulnerability detection may be insufficient because security updates are not provided

httpd:2.2-alpine (alpine 3.4.6)
===============================
Total: 32 (UNKNOWN: 0, LOW: 0, MEDIUM: 15, HIGH: 14, CRITICAL: 3)

+-----------------------+------------------+----------+-------------------+------------------+--------------------------------+
|        LIBRARY        | VULNERABILITY ID | SEVERITY | INSTALLED VERSION |  FIXED VERSION   |             TITLE              |
+-----------------------+------------------+----------+-------------------+------------------+--------------------------------+
| libcrypto1.0          | CVE-2018-0732    | HIGH     | 1.0.2n-r0         | 1.0.2o-r1        | openssl: Malicious server can  |
|                       |                  |          |                   |                  | send large prime to client     |
|                       |                  |          |                   |                  | during DH(E) TLS...            |
+-----------------------+------------------+----------+-------------------+------------------+--------------------------------+
| postgresql-dev        | CVE-2018-1115    | CRITICAL | 9.5.10-r0         | 9.5.13-r0        | postgresql: Too-permissive     |
|                       |                  |          |                   |                  | access control list on         |
|                       |                  |          |                   |                  | function pg_logfile_rotate()   |
+-----------------------+------------------+----------+-------------------+------------------+--------------------------------+
| libssh2-1             | CVE-2019-17498   | LOW      | 1.8.0-2.1         |                  | libssh2: integer overflow in   |
|                       |                  |          |                   |                  | SSH_MSG_DISCONNECT logic in    |
|                       |                  |          |                   |                  | packet.c                       |
+-----------------------+------------------+----------+-------------------+------------------+--------------------------------+ 
  • Library: the library/package identifying the vulnerability.

  • Vulnerability ID: vulnerability identifier (according to CVE standard).

  • Severity: there is a classification with five levels, each assigned a CVSS (Common Vulnerability Scoring System) score (a small helper after this list encodes these ranges):

    • Critical (CVSS Score 9.0-10.0): flaws that could be easily exploited by a remote unauthenticated attacker and lead to system compromise (arbitrary code execution) without requiring user interaction.

    • High (CVSS score 7.0-8.9): flaws that can easily compromise the confidentiality, integrity or availability of resources.

    • Medium (CVSS score 4.0-6.9): flaws that may be more difficult to exploit but could still lead to some compromise of the confidentiality, integrity or availability of resources under certain circumstances.

    • Low (CVSS score 0.1-3.9): all other issues that may have a security impact. These are the types of vulnerabilities that are believed to require unlikely circumstances to be able to be exploited, or which would give minimal consequences.

    • Unknown (CVSS score 0.0): allocated to vulnerabilities with no assigned score.

  • Installed version: the version installed in the system analysed.

  • Fixed version: the version in which the issue is fixed. If the version is not reported, this means the fix is pending.

  • Title: A short description of the vulnerability. For further information, see the NVD.
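
When post-processing Trivy's JSON output (see the -f json example above), it can be handy to encode these ranges programmatically. The small Python helper below is just a sketch of that mapping and is not part of Trivy itself.

def cvss_to_severity(score: float) -> str:
    """Map a CVSS v3 base score to the severity levels described above."""
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    if score > 0.0:
        return "LOW"
    return "UNKNOWN"  # no score assigned

assert cvss_to_severity(9.8) == "CRITICAL"
assert cvss_to_severity(5.3) == "MEDIUM"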

Now you know how to interpret the analysis information at a high level. So, what actions should you take? We give you some pointers in the Recommendations section.

Recommendations

This section describes some of the most important aspects within the scope of container vulnerabilities:

  • Avoid (wherever possible) using images in which critical and high severity vulnerabilities have been identified.
  • Include image analysis in CI processes:
    Security in development is not optional; automate your testing and do not rely on manual processes.
  • Use lightweight images, fewer exposures:
    Images of the Alpine / BusyBox type are built with as few packages as possible (the base image is 5 MB), resulting in reduced attack vectors. They support multiple architectures and are updated quite frequently.
REPOSITORY  TAG     IMAGE ID      CREATED      SIZE
alpine      latest  961769676411  4 weeks ago  5.58MB
ubuntu      latest  2ca708c1c9cc  2 days ago   64.2MB
debian      latest  c2c03a296d23  9 days ago   114MB
centos      latest  67fa590cfc1c  4 weeks ago  202MB 

If, for dependency reasons, you cannot build on an Alpine base image, look for slim-type images from trusted software vendors. Apart from the security component, people who share a network with you will appreciate not having to download 1 GB images.

  • Get images from official repositories: using DockerHub is recommended, preferably with images from official publishers (see DockerHub and CVEs).

  • Keep images up to date: the following example shows an analysis of two different Apache versions:

    Image published in 11/2018

httpd:2.2-alpine (alpine 3.4.6)
 Total: 32 (UNKNOWN: 0, LOW: 0, MEDIUM: 15, **HIGH: 14, CRITICAL: 3**) 

Image published in 01/2020

httpd:alpine (alpine 3.12.1)
 Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, **HIGH: 0, CRITICAL: 0**) 

As you can see, if a development was completed in 2018 and no maintenance was performed, you could be exposing a relatively vulnerable Apache. This is not an issue resulting from the use of containers. However, because of the versatility Docker provides for testing new product versions, we now have no excuse.

  • Pay special attention to vulnerabilities affecting the application layer:
    According to the study conducted by the company edgescan, 19% of vulnerabilities detected in 2018 were associated with Layer 7 (OSI Model), with XSS (Cross-site Scripting) type attacks standing out above all.

  • Select latest images with special care:
    Although this advice is closely related to the use of lightweight images, we consider it worth inserting a note on latest images:

Latest Apache image (Alpine base 3.12)

httpd:alpine (alpine 3.12.1)
 Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 0, CRITICAL: 0) 

Latest Apache image (Debian base 10.6)

httpd:latest (debian 10.6)
 Total: 119 (UNKNOWN: 0, LOW: 87, MEDIUM: 10, HIGH: 22, CRITICAL: 0) 

We are using the same version of Apache (2.4.46) in both cases; the difference is in the number of vulnerabilities reported.
Does this mean that the Debian 10 base image makes the application running on it vulnerable? It may or may not. You need to assess whether those vulnerabilities could compromise your application. The recommendation is to use the Alpine image.

  • Evaluate the use of Docker distroless images
    The distroless concept is from Google and consists of Docker images based on Debian9/Debian10, without package managers, shells or utilities. The images are focused on programming languages (Java, Python, Golang, Node.js, dotnet and Rust), containing only what is required to run the applications. As they do not have package managers, you cannot install your own dependencies, which can be a big advantage or in other cases a big obstacle. Do testing and if it fits your project requirements, go ahead; it is always useful to have alternatives. Maintenance is Google’s responsibility, so the security aspect will be well-defined.

Container vulnerability scanner ecosystem

In our case we have used Trivy as it is a reliable, stable, open source tool that is being developed continually, but there are numerous tools for container analysis:
  • Clair
  • Snyk
  • Anchore Cloud
  • Docker Bench
  • Docker Scan
Ángel Maroco
AWS Cloud Architect

My name is Ángel Maroco and I have been working in the IT sector for over a decade. I started my career in web development and then moved on for a significant period to IT platforms in banking environments and have been working on designing solutions in AWS environments for the last 5 years.

I now combine my role as an architect with being head of /bluetab Cloud Practice, with the mission of fostering Cloud culture within the company.


Posted in: Blog, Practices, Tech
