Python Scraping, PDF2Text Conversion – first steps

At the beginning of this semester, I joined Manisha Goel, one of Pomona’s economics professors, to help with the technical side of her research. The project aims to analyze the effects of government actions on businesses and business management. To do this, we needed to analyze the tone and diction in the business reports of tens of thousands of listed businesses, searching for hints of doubt or uncertainty. Before we could analyze the text, we first had to gather it and turn it into a form that later stages of the analysis could use. The business reports start in PDF format, but we needed plain text in order to process the language used.

There are many ways to convert the text in a PDF to plain text, but as I’ve found, some work better than others. Initially, my team was throwing around the idea of using an optical character recognition (OCR) tool developed within ITS, but we eventually decided to solve the problem in Python. I used the PyPDF2 library, while a fellow researcher, Maxwell Rose, used pdftotext. Both libraries serve the same purpose, but the pdftotext implementation has been more accurate than mine. This difference could be explained by pdftotext being a stronger tool, but I think the real difference was my colleague’s experience. Regardless, I learned how to access directories and how to convert and create files in Python, useful skills for later research.
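
To give a sense of what this step looks like, here is a minimal sketch of a batch conversion: walk a directory, pull the text out of each PDF, and write a matching .txt file. This is not the exact script from the project. The function names and the one-folder layout are hypothetical, and the PyPDF2 calls assume version 2.0 or later, where PdfReader and extract_text() are available.

```python
from pathlib import Path


def extract_text_pypdf2(pdf_path):
    """Return the text of every page in one PDF, joined by newlines."""
    # Imported here so the rest of the script loads without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.0 or later

    reader = PdfReader(str(pdf_path))
    # extract_text() can come back empty for scanned, image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def convert_directory(src_dir, dst_dir, extract=extract_text_pypdf2):
    """Write one .txt file into dst_dir for each .pdf file found in src_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src.glob("*.pdf")):
        out_path = dst / (pdf_path.stem + ".txt")
        try:
            text = extract(pdf_path)
        except Exception as err:
            # One unreadable report should not stop the whole batch.
            print(f"skipped {pdf_path.name}: {err}")
            continue
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling convert_directory("reports_pdf", "reports_txt") (both folder names are made up) would convert everything in the first folder. The extract argument is there so that another extractor, such as one built on pdftotext, could be swapped in and compared on the same files.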

The next step for this project is the actual analysis of the text files produced in this stage, which will hopefully yield the results and insights we are looking for within the corpus we’ve collected. As for me, I hope to revisit PDF conversion with a different package, pdfminer.

By Sam Millette