Data Extraction from scanned pdf files

Post by **lucasc** » Mon Feb 13, 2017 3:39 am

Hello everyone!

I would like to share my current problem and ask whether Macro Scheduler could handle it. In my office there are tons of documents that need to be converted to digital form. Not only that: the data should then be transferred to a specific Excel spreadsheet to form a database. I've been looking for Optical Character Recognition (OCR) programs that could do the first part, i.e. convert scanned files into searchable files. I believe those can be found quite easily on the Internet. However, I'm clueless about the second part. Does anyone have a hint on how I can transfer information from a form into an Excel spreadsheet using Macro Scheduler?

PS: The scanned documents follow a fixed template, so each piece of information should always be in the same position on the page. My goal is to extract information from this form so that I can create a database containing Client/Type of Expenses/Expense Amount.

PS2: If Macro Scheduler is able to achieve this even without converting scanned files into searchable files (maybe it has built-in OCR?), that would be the best option.

Post by **Dorian (MJT support)** » Wed Feb 15, 2017 11:51 pm

Hi,

I've written a few scripts for customers wishing to extract information from PDFs.

It usually goes something like this:

Use GetFileList to create a list of all the PDFs in a folder.

Open each PDF in turn using ExecuteFile.

Select all and copy, then assign the clipboard contents to a variable using GetClipBoard.

Now we use Regex to extract exactly the data we're looking for.

Once that's done, we use the native Excel functions to write it all to an Excel file.
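To make the Regex and export steps concrete, here is a minimal sketch in Python rather than Macro Scheduler script. It assumes the copied text contains labelled lines such as `Client:`, `Type of Expenses:` and `Expense Amount:`; those labels, and the names `FIELD_PATTERNS`, `extract_fields` and `write_rows`, are hypothetical, since the real form layout isn't shown in this thread. It also writes CSV (which Excel opens directly) instead of using the native Excel functions mentioned above.

```python
import csv
import io
import re

# Hypothetical label patterns; the real form's wording is not given in the thread.
FIELD_PATTERNS = {
    "Client": re.compile(r"Client:\s*(.+)"),
    "Type of Expenses": re.compile(r"Type of Expenses:\s*(.+)"),
    "Expense Amount": re.compile(r"Expense Amount:\s*([\d.,]+)"),
}


def extract_fields(text):
    """Return a dict with one entry per field; missing fields become ''."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else ""
    return row


def write_rows(rows, handle):
    """Write the extracted rows as CSV, which Excel can open directly."""
    writer = csv.DictWriter(handle, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample text standing in for one document's copied contents.
    sample = "Client: ACME Ltd\nType of Expenses: Travel\nExpense Amount: 1,250.00\n"
    buffer = io.StringIO()
    write_rows([extract_fields(sample)], buffer)
    print(buffer.getvalue())
```

Because the template is fixed, one pattern per field is usually enough; a field that fails to match is left blank so a bad scan shows up as an empty cell rather than stopping the run.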

Hopefully this will give you a good starting point once you've got the documents scanned. We're happy to help you, if you need us to. We can do this via our regular support department, or we can help you via our custom scripting service. Whichever works best for you.

Post by **Marcus Tettmar** » Thu Feb 16, 2017 7:56 am

You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.
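As a rough illustration of that loop, here is a sketch in Python rather than Macro Scheduler script. It assumes `pdftotext` is installed and on the PATH; the helper names `build_command` and `convert_folder` are made up for this example, and the `-layout` option is included on the assumption that keeping the page layout helps with a fixed-template form.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout keeps the page's physical layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder, run=subprocess.run):
    """Run pdftotext on every PDF in `folder` and return the text file paths."""
    outputs = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises an error if pdftotext exits with a failure code.
        run(build_command(pdf_path, txt_path), check=True)
        outputs.append(txt_path)
    return outputs
```

Calling `convert_folder` on the folder of scans leaves one `.txt` file next to each PDF. As noted above, the text files will come out empty if the PDFs are flat images that have not been through OCR first.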

Post by **Dorian (MJT support)** » Thu Feb 16, 2017 11:24 am

That sounds like it would streamline things wonderfully. I'm downloading that now so I can have a play with it. I've found command line tools combined with Macro Scheduler to be a very powerful combination in the past.

Post by **CyberCitizen** » Fri Feb 17, 2017 11:13 am

Marcus Tettmar wrote: You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.

I was going to suggest this as well, great little application.