Hello, my name is André Luiz. I am a computer scientist with experience as a software
engineer and data scientist across multiple industries: Edtech, Public Prosecution and Court of
Law, Oil & Gas Distribution, and Agribusiness.
Activities:
Working on models for Visual Learning Analytics that help instructors analyze student logs from Virtual Learning Environments and support their decision-making.
Activities:
Working on the Cacuriá development project, an authoring tool for multimedia Learning Objects (LOs) that allows teachers to create multimedia educational content for interactive TV and the Web without programming skills. Cacuriá is integrated with the iVoD (Interactive Video on Demand) service from RNP, the Brazilian National Research and Educational Network, which is responsible for promoting the development of networks in Brazil, including innovative applications and services.
Supporting Product Managers and Designers in their decision-making
In my role as the Data Science Manager at Gaivota
in 2022 and Seedz in 2023, I led a critical
project focused on analyzing user interactions within the company's web system, known as Seedz Mercator.
This project played a central role in enhancing the user experience, determining the efficacy of content and
design elements, ascertaining the most visited pages, identifying engaging components, spotting technical glitches,
gaining insights into individual user preferences and behaviors, and facilitating data-driven decisions to align
with business objectives and regulatory requirements. This initiative helped the company gain
a deeper understanding of its user base, enabling it to continually refine its online presence to meet
evolving user expectations and needs.
Our initial step in this project involved the integration of a third-party software solution to meticulously
track each user's actions and manage logs. We established a comprehensive set of tags to map relevant events,
including pageviews, clicks on elements, and interactions with hover buttons. Subsequently, we constructed a
robust data pipeline utilizing AWS services to store raw data, perform preprocessing tasks such as data cleansing,
filtering, and formatting, and then deliver this refined data for visualization. These visualizations served as a
foundation for answering critical questions related to user access and pageviews, among others, and were primarily
driven by insights gathered from discussions with product managers and designers.
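To make the preprocessing stage concrete, here is a minimal sketch in the spirit of that step; the event schema, column names, and URL normalization rule are illustrative assumptions, not the actual Seedz Mercator export format:

import pandas as pd

# Tiny illustrative batch of tracked events (hypothetical schema).
raw = pd.DataFrame([
    {"user_id": "u1", "event_type": "pageview", "page_url": "/home?utm=x",
     "timestamp": "2023-01-01T10:00:00Z"},
    {"user_id": "u1", "event_type": "click", "page_url": "/home",
     "timestamp": "2023-01-01T10:00:05Z"},
    {"user_id": "u2", "event_type": "pageview", "page_url": "/reports",
     "timestamp": "2023-01-01T11:30:00Z"},
])

# Cleansing and formatting: drop incomplete rows, parse timestamps, and
# normalize URLs so variants of the same page aggregate together.
events = raw.dropna(subset=["user_id", "timestamp"]).copy()
events["timestamp"] = pd.to_datetime(events["timestamp"], utc=True)
events["page"] = events["page_url"].str.replace(r"[?#].*$", "", regex=True)

# One tidy, analysis-ready table for the visualization layer.
daily_pageviews = (
    events[events["event_type"] == "pageview"]
    .groupby([pd.Grouper(key="timestamp", freq="D"), "page"])
    .size()
    .rename("views")
    .reset_index()
)
print(daily_pageviews)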
In addition to these efforts, I conducted an in-depth study employing process mining techniques to
elucidate the flow of events generated by users within the system. This analysis allowed us to pinpoint key flows,
bottlenecks, and the average time spent on each flow. As a result, we unearthed unexpected patterns in pageviews
that diverged from managerial expectations and identified potential reasons why certain pages were underutilized.
Moreover, we successfully clustered users based on their behavioral patterns, a pivotal development that enabled
us to craft a targeted action plan grounded in data.
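For readers curious about the mechanics, the sketch below shows the general shape of such an analysis using the pm4py library and scikit-learn; the toy event log, the directly-follows graph, and the cluster count are illustrative assumptions, not our exact production setup:

import pandas as pd
import pm4py
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy event log (column names are assumptions, not the production schema).
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["login", "open_report", "export", "login", "open_map"],
    "timestamp": pd.to_datetime([
        "2023-01-01 10:00", "2023-01-01 10:02", "2023-01-01 10:05",
        "2023-01-01 11:00", "2023-01-01 11:01"]),
})

# pm4py expects case/activity/timestamp columns under its standard names.
log = pm4py.format_dataframe(df, case_id="user_id",
                             activity_key="action", timestamp_key="timestamp")

# A directly-follows graph shows which action tends to follow which, and how
# often -- a quick way to surface dominant flows and bottlenecks.
dfg, starts, ends = pm4py.discover_dfg(log)
pm4py.view_dfg(dfg, starts, ends)

# Cluster users by how often they perform each action type.
profiles = df.pivot_table(index="user_id", columns="action",
                          values="timestamp", aggfunc="count").fillna(0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(profiles))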
One notable outcome of this project was the initiation of user interviews, strategically conducted with
representatives from each user cluster. These interviews aimed to gain a comprehensive understanding of
their event flows within the system, their goals, and any pain points they encountered. The insights gathered
from these interviews played a vital role in shaping subsequent actions and adjustments to enhance the user
experience within Seedz Mercator.
Furthermore, the meticulous process we established for tracking and analyzing user interactions served as a
valuable foundation for enabling our product managers and designers to conduct A/B tests effectively. A/B testing
played a pivotal role in refining our web systems, allowing us to assess the impact of design and content changes
on user engagement and satisfaction. I take immense pride in the outcomes we achieved through this project, as
they not only provided us with actionable insights but also drove significant improvements in user experience,
ultimately leading to enhanced user retention and business growth.
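As a simple illustration of how such an A/B test can be evaluated, the sketch below compares conversion rates for two variants with a two-proportion z-test; the counts are invented for the example:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcome: conversions out of sessions for the control (A)
# and the redesigned variant (B).
conversions = [412, 480]
sessions = [10000, 10000]

# Two-sided z-test on the difference between the two proportions.
z_stat, p_value = proportions_ztest(count=conversions, nobs=sessions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests the engagement difference is
# unlikely to be random noise.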
Rural Property Valuation
Artificial Intelligence to Support Rural Property Valuation
As Data Science Manager at Gaivota in 2022
and Seedz in 2023,
I led a project focused on rural property valuation. This process involves determining the price of
rural properties such as farms and agricultural land, which serves various purposes including
securing credit, property insurance, estate division, and property transactions.
We began the project with a comprehensive literature review to gain valuable
insights into the field. During this phase, we identified pertinent research papers specifically
addressing this issue within the Brazilian context. These papers not only shed light on critical
features and methodologies but also illuminated valuable considerations for tackling this complex
problem.
Subsequent to the literature review, we proceeded with an extensive exploration of publicly
available datasets related to the valuation of bare land. It's important to note that our foremost
challenge at this stage was the absence of a comprehensive property price database. To overcome this
problem, we implemented a web crawler to extract pricing information and property specifications
(e.g., location, area) from real estate listing websites. Furthermore, we conducted geocoding to
obtain latitude and longitude coordinates, thereby enriching our dataset with geographic attributes
such as terrain and altitude. During the data distribution analysis, we identified a significant
data concentration in the Southeast region of Brazil. We implemented a rigorous feature selection
process, enabling us to retain only the most pertinent data while eliminating noise. Additionally,
we conducted meticulous data cleaning and outlier removal to ensure the quality of our dataset.
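A minimal sketch of the collection step appears below, assuming a hypothetical listings page; the URL, CSS selectors, and field names are placeholders, since every real estate portal uses its own markup:

import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim

# Fetch one hypothetical listings page (placeholder URL and selectors).
html = requests.get("https://example.com/rural-listings?page=1", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

listings = []
for card in soup.select(".listing-card"):
    listings.append({
        "price": card.select_one(".price").get_text(strip=True),
        "area_ha": card.select_one(".area").get_text(strip=True),
        "location": card.select_one(".location").get_text(strip=True),
    })

# Geocode each location so geographic layers (terrain, altitude) can be
# joined later. Nominatim is a free service; real crawls must respect
# its rate limits and terms of use.
geolocator = Nominatim(user_agent="rural-valuation-study")
for item in listings:
    hit = geolocator.geocode(f'{item["location"]}, Brazil')
    if hit:
        item["lat"], item["lon"] = hit.latitude, hit.longitude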
One key aspect of our analysis involved examining the correlation between the value of bare land
and other features, such as the prices of commodities like soybeans, corn, and cattle. Over the
period of our study, we observed commodity prices falling throughout the year while rural land
prices rose. Given this negative correlation over such a short window, commodity prices did not
prove to be a reliable predictor for our model.
In our pursuit of predicting rural property values from the chosen features, we conducted a series
of experiments employing diverse machine learning methodologies and attained reasonably promising
results. Nevertheless, an important lesson from this effort was the need to improve data quality
and broaden data coverage nationwide; both enhancements are essential to raise the precision of
our predictive model.
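The experiments followed the usual pattern of benchmarking several regressors under cross-validation; the sketch below shows that pattern on synthetic stand-in data (the models and metric are illustrative, not the exact set we tried):

import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the engineered features and price-per-hectare target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 10.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_percentage_error")
    print(f"{name}: MAPE = {-scores.mean():.2%}")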
Furthermore, we deployed a RESTful API to provide an interface for internal applications to access
our model (a minimal sketch of such an interface appears below). This project was quite challenging
due to the limited data we had to work with. My experience as a researcher was essential in helping
us navigate and overcome the obstacles we faced. I'm incredibly proud of the dedication and hard
work our team put into this project.
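For illustration, a minimal sketch of such an internal interface is shown below, using FastAPI; the endpoint name, input fields, and the constant standing in for the trained model are all assumptions:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Property(BaseModel):
    # Illustrative inputs; the real feature set was much richer.
    area_ha: float
    lat: float
    lon: float

@app.post("/valuation")
def valuate(prop: Property) -> dict:
    # In production this would call the trained regressor; a constant
    # price per hectare stands in for model.predict(...) in this sketch.
    price_per_ha = 25000.0
    return {"estimated_value_brl": prop.area_ha * price_per_ha}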
Soybean Yield Forecast
Artificial Intelligence to Support Agribusiness
As Data Science Manager at Gaivota in 2021, I led an important project early in my role. The
project aimed to predict soybean yields for cities in Mato Grosso, a Brazilian state renowned
worldwide for its exceptional soybean productivity. Yield prediction is a long-standing problem in
agriculture and one of great interest to farmers, industry, and government stakeholders seeking
strategies to increase crop production. Yield refers to the amount of grain produced in a region.
To increase it, farmers use several techniques, such as selecting high-quality seeds, preparing the
soil adequately, applying fertilizers, irrigating efficiently, and controlling pests and diseases.
Predicting agricultural productivity is a multifaceted endeavor, involving the analysis of
multiple variables such as climate, soil attributes, technological advancements, and agricultural
management practices. Typically, mathematical and statistical models that account for these
variables are employed. However, it's vital to recognize that these forecasts are estimates, and
producers must remain prepared for unforeseen challenges, such as extreme weather events,
diseases, and pest infestations.
Given that my team lacked prior experience in this domain, we conducted an extensive literature review of papers
published in the last five years (between 2016 and 2021). Our aim was to gain insights into methodologies for
predicting crop yields, critical information required for accurate predictions, and prevalent models used in
this context. The literature review adhered to a well-defined protocol, evaluating papers for key issues,
objectives, methodologies, case studies, and outcomes. Following a thorough analysis, we identified two papers
that stood out for their meaningful results and their provision of source code on GitHub (paper 1 and
paper 2).
Subsequently, we compiled a comprehensive dataset comprising temperature records, satellite imagery, land cover
and rainfall data, soil attributes, and municipal-level soybean production data. Our main goal was to forecast
soybean yields in kilograms per hectare for 89 cities in Mato Grosso, focusing exclusively on this specific
geographical area. Based on the selected papers, we implemented an ensemble model utilizing Deep Learning
techniques, including Convolutional Neural Networks (CNN) and Random Forests (RF). Through meticulous
experimentation with various data combinations and sliding window parameters, we achieved great results.
To evaluate and quantify the model performance, we relied on established regression metrics such as Mean
Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE).
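To make the evaluation concrete, the snippet below computes both metrics with scikit-learn on invented yield numbers (kg/ha):

import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Hypothetical observed vs. predicted yields (kg/ha) for a few cities.
y_true = np.array([3400.0, 3150.0, 3580.0, 2990.0])
y_pred = np.array([3310.0, 3240.0, 3495.0, 3105.0])

mape = mean_absolute_percentage_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)  # RMSE
print(f"MAPE = {mape:.2%}, RMSE = {rmse:.1f} kg/ha")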
Furthermore, we established an AWS-based pipeline to facilitate continuous data updates. This pipeline
encompassed data storage in raw format, data cleansing, model training, soybean yield forecasting for each
Mato Grosso city, and storage of these predictions in a relational database. We deployed a RESTful
API to provide an interface for internal applications to access our data (a sketch of one pipeline
stage appears below). This project represented a formidable
challenge, one that I'm particularly proud of. Its successful execution was a testament to the collaborative
efforts of a multidisciplinary team of data scientists. It's worth noting that this project not only contributed
to expanding our customer base but also showcased the company's product potential to a broader audience.
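One stage of such a pipeline might look like the sketch below; the bucket, keys, connection string, and the constant standing in for the ensemble's output are placeholders, not the production configuration:

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Pull cleaned features from S3 (placeholder bucket and key).
s3 = boto3.client("s3")
s3.download_file("yield-pipeline", "clean/mt_features.parquet", "/tmp/mt.parquet")

features = pd.read_parquet("/tmp/mt.parquet")
features["predicted_yield_kg_ha"] = 3300.0  # stand-in for the ensemble's output

# Persist city-level forecasts in a relational database (placeholder DSN).
engine = create_engine("postgresql://user:password@host/yield_db")
features[["city", "predicted_yield_kg_ha"]].to_sql(
    "soybean_forecasts", engine, if_exists="replace", index=False)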
Digital Inspector
Artificial Intelligence for Equipment Inspection
This project was developed under my technical leadership in collaboration with Petrobras while I was a data scientist at ExACTa PUC-Rio in 2020. The problem was related to
the integrity of equipment in an oil and gas refinery, which is ensured by a set of standardized management
processes. One of these processes is equipment inspection, where equipment analyses are conducted by
inspection experts and documented in technical reports. Just like any other written text, these reports
can occasionally contain inaccuracies or gaps in critical information related to the management process.
Such inaccuracies in reports must be promptly identified and corrected to ensure an effective inspection
process, accurate information, and compliance with maintenance deadlines.
Our proposal, therefore, was to build an artificial intelligence model to assist inspectors in report
writing. To accomplish this, we started by constructing a database through the collection of reports
produced by all Petrobras refineries over the last few years. In essence, each report states
whether the equipment is in proper working condition; if not, the inspector must describe the
damage, its cause, the damage mechanism, and the required maintenance action. A group of selected
inspectors assisted us as consultants, clarifying potential doubts and validating our proposal.
After constructing our database, we performed meticulous text extraction, data structuring, cleaning,
and standardization. We made several discoveries after extensive data exploration and the application of
statistical methods. One of these discoveries was that the terms used in the reports were quite limited.
Due to this, we were able to classify all possible damages, causes, mechanisms, and actions present in the
reports. Furthermore, we identified a strong correlation between them. We also found a correlation between
the location of the refinery and the damages and causes, which made perfect sense since rust damage, for
example, was more frequent in refineries located near the sea.
Next, we built a decision tree model that, based on the damage and cause, suggests the mechanism and action
to be taken. This machine learning model was chosen for its interpretability. Based on our tests, the
harmonic mean of precision and recall (F1 score) exceeded 95%. At the request of the
Petrobras team, we also conducted a test with Azure's AutoML service, which yielded a positive result very
close to our model's performance. Consequently, we decided to keep the AutoML model to facilitate the
solution's maintenance by the Petrobras team.
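The sketch below illustrates the general approach on invented categorical data; the damage, cause, and action labels are made up for the example and are not Petrobras terminology:

import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented report data: damage/cause pairs mapped to a maintenance action.
df = pd.DataFrame({
    "damage": ["rust", "crack", "rust", "leak"] * 50,
    "cause": ["salinity", "fatigue", "humidity", "seal_wear"] * 50,
    "action": ["repaint", "weld", "repaint", "replace_seal"] * 50,
})

# Encode the categorical inputs and fit an interpretable tree.
X = OrdinalEncoder().fit_transform(df[["damage", "cause"]])
y = df["action"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

print(f1_score(y_te, tree.predict(X_te), average="weighted"))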
Subsequently, we deployed our model as a service called the "Digital Inspector," designed to detect
inaccuracies or gaps in the wording of reports, providing real-time alerts to inspectors and suggesting
the most appropriate terms for damages, causes, mechanisms, and actions in the report. This service was
made available through a RESTful API that allowed integration with the equipment inspection management
system used by Petrobras inspectors. Additionally, the digital inspector can improve its learning as more
reports are added to our database, validated by inspectors, and made available for continuous model
retraining.
This project received the Petrobras Inventor Award 2022 in recognition of the results achieved and the
impact on the efficiency of equipment maintenance in refineries across Brazil. Undoubtedly, the project
was challenging but provided ample opportunities to learn. I am very proud of the results we achieved.
As a data scientist at ExACTa PUC-Rio in 2020, I executed a project focused on diagnosing and forecasting high levels of hydrogen sulfide (H2S), a toxic and hazardous gas frequently released in Petrobras oil refineries. This problem posed a significant threat to the well-being and safety of a local community near a refinery in the northeast region of Brazil, and the issue had garnered extensive coverage in Brazilian news portals, including g1. The primary aim of this initiative was to provide essential information through an intuitive dashboard, supporting end users in their decision-making.
The project began with a thorough exploration and analysis of refinery historical data using
statistical methods and visualization techniques. Through these analyses, we uncovered significant
correlations and insights, particularly in conjunction with meteorological data, including wind
direction and speed, humidity, and temperature.
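The shape of that correlation analysis is easy to convey with a short sketch; the synthetic numbers below merely mimic the kind of relationship we looked for and are not refinery data:

import numpy as np
import pandas as pd

# Synthetic stand-in: in this toy example H2S rises when wind is slow
# and humidity is high.
rng = np.random.default_rng(0)
n = 500
wind_speed = rng.uniform(0, 15, n)
humidity = rng.uniform(30, 100, n)
temperature = rng.uniform(18, 38, n)
h2s_ppm = 5 - 0.3 * wind_speed + 0.05 * humidity + rng.normal(0, 1, n)

df = pd.DataFrame({"h2s_ppm": h2s_ppm, "wind_speed": wind_speed,
                   "humidity": humidity, "temperature": temperature})

# Correlations with the H2S reading highlight which meteorological
# variables move together with emission peaks.
print(df.corr()["h2s_ppm"].sort_values())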
Then, we developed a decision tree model to provide prescriptive information regarding H2S emissions. This machine learning model was deliberately chosen for its interpretability, allowing us to elucidate its learning process and gain insights into its decision-making mechanisms. Our model was engineered to forecast H2S levels based on a threshold defined by refinery operators.
To effectively present our model's results, we adopted Power BI as the platform for developing our dashboard. The central visualization component featured a bar and line chart, providing a visual representation of the predicted H2S emission rates. Additionally, we included a "Cards" tab within the dashboard to display meteorological data and potential factors contributing to H2S levels at any given time. It is worth emphasizing that the dashboard's interface was a collaborative effort, involving feedback from refinery operators, ensuring an intuitive and user-friendly interface.
This project was particularly demanding due to the tight timeline for delivering a solution to mitigate the issue and protect the local population near the refinery. Fortunately, our model yielded favorable results, and the dashboard was well-received by refinery operators, facilitating their understanding of the factors contributing to high H2S levels. Subsequently, our solution was adopted across various Petrobras refineries throughout Brazil.
EDUVIS
An online tool to assemble dashboards based on
instructors’ preferences
This project was my Ph.D. research, carried out between 2016 and 2020, in which we
shed light on how to support instructors in analyzing student logs from Virtual
Learning Environments. Our main goal was to enable Virtual Learning Environments
to assist instructors in gaining insights into both students’ behavior and
performance.
First, we conducted interviews with instructors working in Brazil and a
systematic mapping of the state of the art in Educational Data Mining and
Learning Analytics.
This study aimed to identify which kinds of information about students the
instructors regard as meaningful (e.g., performance, behavior, engagement); how
these kinds of information are gathered; and how they drive requirements for
improving their analyses.
Then, we analyzed logs from online courses offered in Brazil and compared our
findings with results presented in the literature.
We explored and analyzed these logs using statistical methods and machine learning
techniques.
Furthermore, we did not find works in the literature about instructors’
visualization preferences for student logs.
Therefore, we conducted a study to identify how much the instructors take into
account topics related to both students’ behavior and performance, as well as their
visualization preferences.
We also noted a lack of work showing models to support the development of learning
analytics tools.
In order to bridge this gap, we presented a model connecting Visual Analytics
theories and models with instructors’ requirements, their visualization
preferences, literature guidelines, and methods for analyzing student logs.
We developed EDUVIS as an open source online tool to assemble dashboards based on
our proposed model.
This tool was built using Python and JavaScript.
In particular, all the visualizations (total of 141) were designed using Plotly,
which provides interactivity, such as zoom in, zoom out, pan, select, toggle spike
lines, and mouse hover.
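A minimal example of the kind of Plotly chart EDUVIS assembles is shown below; the weekly access counts are invented for illustration:

import pandas as pd
import plotly.express as px

# Invented weekly activity counts for one course (not EDUVIS data).
df = pd.DataFrame({"week": list(range(1, 9)),
                   "accesses": [120, 95, 140, 88, 160, 150, 170, 90]})

# Plotly charts ship with zoom, pan, spike lines, and hover out of the box.
fig = px.line(df, x="week", y="accesses",
              title="Student accesses per course week")
fig.show()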
We captured evidence of instructors’ acceptance of our proposal and obtained
their feedback about the tool, including both their analysis and visualization
preferences.
The combination of the answers to the research questions yields a framework to
enable Virtual Learning Environments to assist instructors in gaining insights about
both students’ behavior and performance.
We hope that our proposed model might be a guide to the development of new
dashboards and ground future research.
During my time as a data scientist at ECOA PUC-Rio in 2019, I initiated a project with the
objective of assisting prosecutors in locating legal texts with similar contextual content.
This project involved in-depth analysis and exploration of data obtained from both the
Public Prosecution and the Court of Law of Rio de Janeiro.
The documents, mainly in PDF format, underwent a text extraction process, followed by a meticulous
pattern identification and analysis. To construct a vector representation of these documents,
we explored various methodologies and ultimately opted for a pre-trained model using word embeddings specifically
designed for Portuguese text, developed by USP (Universidade de São Paulo).
We used cosine distance calculations to determine the similarity of texts and sorted the results accordingly.
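A minimal sketch of this ranking step is shown below; the random vectors stand in for the Portuguese word-embedding representations of the documents:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-ins for the document embeddings and the query text's vector.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 300))
query_vec = rng.normal(size=(1, 300))

# Cosine distance = 1 - cosine similarity; smaller means more similar.
distances = 1.0 - cosine_similarity(query_vec, doc_vectors)[0]
ranking = np.argsort(distances)  # most similar documents first
print(ranking[:10])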
To make this feature more accessible, we transformed it into a RESTful API service, which streamlined the process
for our users. The result is a collection of legal texts sorted in ascending order of cosine distance, making it
easier to identify and access documents with similar contextual content. This improvement has increased the
efficiency of the Public Prosecution of Rio de Janeiro.
Distribution of Court Proceedings
A recommendation system to assign prosecutors in court
proceedings
The objective of this project was to optimize the allocation of prosecutors to individual legal cases,
ensuring that each case was assigned to the most suitable prosecutor. To achieve this, we embarked on a
comprehensive data-driven initiative.
The project began with the acquisition of extensive historical lawsuit data, including a diverse
array of metadata such as case type, cause, and origin. This dataset tracked the legal journey of
lawsuits from the Court of Law of Rio de Janeiro to the Public Prosecution.
To prepare this data for analysis, we meticulously executed a data cleansing process, which proved to be a
labor-intensive yet indispensable phase of the project.
Through statistical methods, we unearthed critical correlations within the dataset. These correlations were
instrumental in the development of a robust decision tree model to automate the task of matching the most
appropriate prosecutor to each lawsuit. This machine learning model was deliberately chosen for its
interpretability, allowing us to elucidate its learning process and gain insights into its decision-making
mechanisms.
We operationalized our model by deploying it as a RESTful API service. By doing so, we enabled seamless
integration into the existing workflow of the Public Prosecution, allowing them to input relevant case
features and receive instant recommendations for the most suitable prosecutor. This project not only
streamlined the allocation process but also improved the efficiency of the Public Prosecution from
Rio de Janeiro.
Ginga NCL
A middleware of the Japanese-Brazilian Digital TV
System
During my time as a researcher at TeleMidia Lab. - PUC-Rio, I engaged in
a significant project focused on enhancing Ginga's reference code.
Ginga® is the middleware of the Japanese-Brazilian Digital TV System (ISDB-TB) and
an ITU-T Recommendation for IPTV services.
Ginga-NCL Presentation Environment is the required logical subsystem of Ginga,
responsible for running NCL applications.
NCL is an XML application language that provides support for specifying
spatio-temporal synchronization among media objects, media content and presentation
alternatives, exhibition on multiple devices, and live producing of interactive
non-linear programs.
In this project, I worked mainly on rebuilding the video player used by Ginga.
It was quite a challenge, but I managed to develop a new video player using C++, GStreamer
for audio and video decoding, and Cairo for 2D rendering. I also enabled some audio and
video properties described in NCL, including balanceLevel, trebleLevel,
bassLevel, and freeze. It was great to see the final result, and the figure below shows
Ginga playing an application using the new video player.
During my time as an M.Sc. candidate at LAWS lab. - UFMA between 2013 and 2015,
I undertook a significant project that formed the basis of my research.
This project aimed to support instructors in the creation of educational interactive
videos. Typically, developing such content necessitates a multidisciplinary team,
as it can be a complex, costly, and time-consuming process. Software developers
are required to write the source code, designers contribute to the visual identity,
education experts formulate and assess teaching goals, and at the heart of the team
lies the content specialist, often a teacher or tutor, who provides the subject matter
to be taught.
It is interesting to compare this scenario with contemporary content authoring on
the Web. In the early days, web pages were predominantly constructed by experts
proficient in markup languages and internet protocols. Over time, the web has
democratized, leading to the emergence of new roles, like web designers, whose
primary responsibility is to design and develop web pages. Today, a wide array of
web content is created by non-developers, such as blogs that can be authored and
managed by end users with no knowledge of web programming languages. Additionally,
various users, including journalists and writers, establish profiles on social
networks containing texts, videos, images, and various multimedia components.
The facilitation of content authoring on the web for end users is arguably a key
factor behind its widespread popularity.
Based on this scenario, we developed
Cacuriá, an authoring tool designed for the creation of educational content,
also known as learning objects, for web and Interactive Digital TV environments.
This content predominantly centers around educational videos, which can be enriched
by teachers directly, without the need for a programmer or designer, through the
addition of multimedia elements like images, audio, and text. Cacuriá was crafted
together with teachers, who played an integral role in each step of the project,
ensuring an intuitive and user-friendly design.
As the principal developer of Cacuriá, I utilized C++ and the Qt framework to create
a versatile tool compatible with various operating systems, including Ubuntu, macOS,
and Windows. The figure below presents the current Cacuriá interface, which drew
inspiration from popular applications used by teachers, such as Microsoft PowerPoint,
making it particularly accessible to end users without specific programming knowledge.
One of Cacuriá's most exciting features is its integration with the iVoD service from
RNP, the National Research
and Educational Network responsible for advancing network development in Brazil. RNP is
dedicated to providing innovative applications and services that enhance the learning
experience, and Cacuriá stands as a shining example of this commitment. I am so proud
to have been a part of its development.
This project had a profound impact on my career as a researcher. As I was completing my
M.Sc. degree, I co-founded Mediabox Technologies with a friend in 2014. Our company's primary
focus was to promote Cacuriá in Brazil and provide ongoing maintenance for the tool in
support of RNP. Additionally, we developed Mestrar, an online platform designed to store
and deliver to students the content created with Cacuriá. The architecture is
illustrated below. Over two years, our company expanded to include seven more dedicated
colleagues. Notably, this venture marked my first experience as a manager, a significant
challenge and my first foray into a leadership role.