DIEGO OROZA

5. Sprint 2 | Scraping multiple PDFs, cleaning and consolidating into one Excel file

Project Info

  • Created By Diego Oroza
  • Date 2022
  • Category Programming

Project Description

This is the third data source for the Data Pipeline Project. Using Python, we designed a script to scrape hundreds of PDF files that share the same layout. Some of their graphs, however, produce unstructured data that we need to capture and refine.

Using several Python libraries, we parse the content and enrich it to populate a data structure, which is then consolidated into a single Excel file and subsequently loaded into our Cloud Schema.
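The parse-and-consolidate step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes the text has already been extracted from each PDF, and the "label: value" line layout, the file names, and the helper names `parse_text` and `consolidate` are all hypothetical.

```python
import re

# Assumed layout (hypothetical): a label, a colon, then a number such as "1,250.50".
LINE_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>-?[\d,]+(?:\.\d+)?)\s*$"
)


def parse_text(text):
    """Turn the raw text of one PDF into a {label: float} record."""
    record = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        # Normalize the label into a column-friendly name.
        label = match.group("label").strip().lower().replace(" ", "_")
        # Drop thousands separators before converting to a number.
        record[label] = float(match.group("value").replace(",", ""))
    return record


def consolidate(texts_by_file):
    """Build one row per file, ordered by file name, ready for a single sheet."""
    rows = []
    for filename in sorted(texts_by_file):
        row = {"source_file": filename}
        row.update(parse_text(texts_by_file[filename]))
        rows.append(row)
    return rows


# Made-up sample text standing in for the content extracted from two PDFs.
sample = {
    "report_b.pdf": "Total Sales: 1,250.50\nnoise from a chart\nUnits: 40",
    "report_a.pdf": "  Total Sales : 980\nUnits: 31\n",
}
rows = consolidate(sample)
```

If pandas and an Excel writer such as openpyxl are available, a list of rows like this could be written to one workbook with `pandas.DataFrame(rows).to_excel("consolidated.xlsx", index=False)`. The output file name here is only a placeholder.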

Note: To access the Google Colab link below, you may need to sign in with a Google account.