5. Sprint 2 | Scraping multiple PDFs, cleaning, and consolidating into one Excel file

Project Info

Created By Diego Oroza
Date 2022
Category Programming

Project Description

This is the third data source for the Data Pipeline Project. Using Python, we designed a script to scrape hundreds of PDF files that share the same layout. Some of their graphs, however, generate unstructured data that we need to capture and refine.

Using several Python libraries, we parse the content and enrich it to feed a data structure, which is finally consolidated into a single Excel file and subsequently loaded into our Cloud Schema.
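The flow described above (extract text from each PDF, keep only the parsable lines, consolidate everything into one Excel file) can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the choice of pypdf and pandas, the "Label: value" line pattern, the column names, and the function names are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical pattern: lines such as "Revenue: 1,234.50" in the extracted text.
# The real PDFs may need a different pattern.
LINE_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][\w ]*?)\s*:\s*(?P<value>-?[\d.,]+)\s*$"
)


def extract_text(pdf_path):
    """Return the text of every page in one PDF (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily so parsing can be used alone

    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_records(text, source_name):
    """Turn raw text into clean rows, skipping lines that do not match."""
    records = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # unstructured noise (titles, page footers, stray labels)
        try:
            value = float(match.group("value").replace(",", ""))
        except ValueError:
            continue  # looked numeric but was not, e.g. "1.2.3"
        records.append(
            {
                "source": source_name,
                "label": match.group("label").strip(),
                "value": value,
            }
        )
    return records


def consolidate(pdf_dir, output_path):
    """Parse every PDF in a folder and write all rows to one Excel file."""
    import pandas as pd  # writing .xlsx also requires openpyxl

    rows = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        rows.extend(parse_records(extract_text(pdf_path), pdf_path.name))
    frame = pd.DataFrame(rows, columns=["source", "label", "value"])
    frame.to_excel(output_path, index=False)
    return len(rows)
```

Keeping the parsing step separate from PDF reading and Excel writing means the cleaning rules can be checked on plain strings before running them over hundreds of files.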

Note: To access the Google Colab link below, you may need to sign in with a Google account.

Review Code on Google Colab

DIEGO OROZA