Hello everyone!
I would like to share my current problem and ask whether Macro Scheduler can handle it. In my office there are tons of documents that need to be converted to digital form. Beyond that, the data should be transferred to a specific Excel spreadsheet to form a database. I've been looking for Optical Character Recognition (OCR) programs that could do the first part, i.e. convert scanned files into searchable files. I believe those can be found quite easily on the Internet. However, I'm clueless about the second part. Does anyone have a hint on how I can transfer information from a form into an Excel spreadsheet using Macro Scheduler?
PS: The scanned documents follow a fixed template, so each piece of information should always be in the same position on the page. My goal is to extract information from this form so that I can create a database containing Client/Type of Expenses/Expense Amount.
PS2: If Macro Scheduler is able to achieve this even without converting scanned files into searchable files (maybe it has built-in OCR?), that would be the best option.
Data Extraction from scanned pdf files
- Dorian (MJT support)
- Automation Wizard
- Posts: 1389
- Joined: Sun Nov 03, 2002 3:19 am
- Contact:
Re: Data Extraction from scanned pdf files
Hi,
I've written a few scripts for customers wishing to extract information from PDFs.
It usually goes something like this:
Use GetFileList to create a list of all the PDFs in a folder.
Open each PDF in turn using ExecuteFile.
Copy all, and assign that to a variable using GetClipBoard.
Now we use Regex to extract exactly the data we're looking for.
Once that's done, we use the native Excel functions to write it all to an Excel file.
Hopefully this will give you a good starting point once you've got the documents scanned. We're happy to help you, if you need us to. We can do this via our regular support department, or we can help you via our custom scripting service. Whichever works best for you.
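To make the regex and spreadsheet steps more concrete, here is a minimal sketch written in Python rather than Macro Scheduler script, purely as an illustration of the idea. It assumes the copied text contains labelled lines such as "Client:", "Type of Expenses:" and "Expense Amount:"; those labels, the sample text, and the function names are hypothetical, since the real form layout isn't shown in this thread. It writes CSV (which Excel opens directly) instead of using the native Excel functions.

```python
import csv
import io
import re

# Hypothetical field labels: the real form's wording is not given in the thread,
# so adjust these patterns to match the text your PDFs actually produce.
FIELD_PATTERNS = {
    "Client": re.compile(r"^Client:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    "Type of Expenses": re.compile(r"^Type of Expenses:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
    "Expense Amount": re.compile(r"^Expense Amount:[ \t]*([\d.,]+)[ \t]*$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict with the first match for each field, or '' when it is missing."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialise the extracted rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample standing in for the text copied from one PDF.
sample_text = (
    "Expense Form\n"
    "Client: Acme Ltd\n"
    "Type of Expenses: Travel\n"
    "Expense Amount: 123.45\n"
)

row = extract_fields(sample_text)
print(rows_to_csv([row]))
```

Because the template is fixed, one pattern per field is usually enough; a field that fails to match comes back empty, which makes problem documents easy to spot in the spreadsheet.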
Yes, we have a Custom Scripting Service. Message me or go here
- Marcus Tettmar
- Site Admin
- Posts: 7395
- Joined: Thu Sep 19, 2002 3:00 pm
- Location: Dorset, UK
- Contact:
Re: Data Extraction from scanned pdf files
You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.
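As a rough sketch of that loop, shown in Python rather than Macro Scheduler script and only as an illustration: it assumes pdftotext is installed and on the PATH, and the folder path and function names are hypothetical. The -layout option asks pdftotext to preserve the page layout, which can help with fixed-template forms.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Build the command line: pdftotext [options] <PDF-file> <text-file>."""
    # -layout keeps the original physical layout of the text as far as possible.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Run pdftotext on every .pdf in `folder`; return the .txt paths written."""
    converted = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext exits with an error.
        subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
        converted.append(txt_path)
    return converted


# Example call (hypothetical folder, requires pdftotext to be installed):
# text_files = convert_folder(r"C:\scans")
```

If a resulting text file comes out empty, that PDF most likely contains only a scanned image and would need OCR first.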
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar
Did you know we are now offering affordable monthly subscriptions for Macro Scheduler Standard?
- Dorian (MJT support)
- Automation Wizard
- Posts: 1389
- Joined: Sun Nov 03, 2002 3:19 am
- Contact:
Re: Data Extraction from scanned pdf files
That sounds like it would streamline things wonderfully. I'm downloading that now so I can have a play with it. I've found command line tools combined with Macro Scheduler to be a very powerful combination in the past.
Yes, we have a Custom Scripting Service. Message me or go here
- CyberCitizen
- Automation Wizard
- Posts: 721
- Joined: Sun Jun 20, 2004 7:06 am
- Location: Adelaide, South Australia
Re: Data Extraction from scanned pdf files
Marcus Tettmar wrote: You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.
I was going to suggest this as well, great little application.
FIREFIGHTER