[Bounty Won] Text Blob Line Counter Speed Challenge

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Sat Mar 28, 2009 10:46 am

Hi Everyone,

I need to process a huge text blob (~30000 lines) which will take quite some time with the user looking on. I can use a progressbar to provide some feedback to the user while this is happening... but in order to do that I need to know how many lines are in the text blob so I can update the progressbar as lines are processed.

The problem is, I need that line count really fast... or the user will think something is broken or locked up.

The text blob will be available in a dialog memo field. My first attempt to count the lines involved writing the blob to a file and then reading it back line by line and counting... but that was too slow.

So I tried three other ways of counting the lines without writing to a file:
  1. Separate>
  2. RegEx with an EasyPattern
  3. plain Regex
Using plain RegEx> was the fastest method I found so far... but can anyone do it faster?
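For readers following along outside Macro Scheduler, the count-first-then-report-progress idea can be sketched in Python. This is only an illustration, not Macro Scheduler code: the names `count_crlf_lines` and `process_with_progress` are hypothetical, and it assumes every line ends with CRLF, as in the test blob.

```python
def count_crlf_lines(blob: str) -> int:
    """Count lines in a blob where every line is terminated by CRLF."""
    return blob.count("\r\n")


def process_with_progress(blob: str):
    """Yield (line, percent_done) so a caller can update a progress bar."""
    total = count_crlf_lines(blob)
    if total == 0:
        return
    # split leaves one extra item after the final CRLF; keep only the
    # CRLF-terminated lines so the count and the loop agree
    lines = blob.split("\r\n")[:total]
    for index, line in enumerate(lines, start=1):
        yield line, 100 * index // total
```

The total has to be known before the loop starts, which is why the speed of the count matters so much here.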

10 points to anyone (except Marcus) who can pull this off.

Rules are:

//=========================================
The only code you can change is the code between
these two lines in the posted code below
//=========================================

- Having a faster PC than mine is not a solution

- Compiling the code below is not a solution

Maybe there's a Win32 API call that can beat the RegEx solution?

Marcus, if you have a way, please hold off for a few days... at least until the end of the month.

Thanks all and take care

Code: Select all

/*
Given a very large text blob entered into a dialog memo field
and each "line" is separated by CRLF... what is the absolute
fastest way to determine how many lines it contains?

Performance Data from my home PC:

Method: Separate> command
Elapsed time in seconds:94.11474609375
Lines: 30000

Method: RegEx> with EasyPattern
Elapsed time in seconds: 4.7568359375
Lines: 30000

Method: plain RegEx>
Elapsed time in seconds: 4.7177734375
Lines: 30000

Plain RegEx is the fastest so far... but can you beat it?
*/

VBSTART
VBEND

Dialog>MyDialog
   Caption=Text Blob Line Counter Speed Challenge
   Width=324
   Height=321
   Top=133
   Left=109
   Max=0
   Min=0
   Close=1
   Resize=1
   Memo=msMemo1,10,33,295,200, Still Initializing - Please be patient... 30000 lines will appear here soon
   Label=Memo field below contains 30000 lines,8,8,true
   Button=Count the Lines and Time how long it took,10,247,223,25,3
   Button=Exit,248,247,57,25,2
EndDialog>MyDialog

//Initialize a 30000 line text blob
Let>blob_300_lines=
Let>blob_30000_lines=

//build blob_300_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_300_lines,text line%SPACE%%line_num%%CRLF%
Until>line_num=300

//build blob_30000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_30000_lines,blob_300_lines
  Message>Initializing... we'll be done at 100: %line_num%
Until>line_num=100

//Close Message> box
Press Enter

Show>MyDialog
Let>MyDialog.msMemo1=%blob_30000_lines%
ResetDialogAction>MyDialog

Label>ActionLoop
GetDialogAction>MyDialog,result
  If>result=2,End
  If>result=3,Go
Goto>ActionLoop

Label>Go
VBEval>Timer,startSeconds

//Method: Separate> command
//Separate>%MyDialog.msMemo1%,%CRLF%,returnvar

//Method: RegEx> with EasyPattern
//RegEx>[CRLF],MyDialog.msMemo1,1,matches_array,num_matches,0

//=========================================
//Method: plain RegEx>
RegEx>\r\n,MyDialog.msMemo1,0,matches_array,num_matches,0
//=========================================

VBEval>Timer-%startSeconds%,elapsedSeconds

//MDL>Method: Separate> command%CRLF%%CRLF%Elapsed time in seconds:%elapsedSeconds%%CRLF%%CRLF%Lines: %returnvar_count%

//MDL>Method: RegEx> with EasyPattern%CRLF%%CRLF%Elapsed time in seconds:%elapsedSeconds%%CRLF%%CRLF%Lines: %num_matches%

MDL>Method: plain RegEx>%CRLF%%CRLF%Lines Counted: %num_matches%%CRLF%Elapsed time in seconds:%elapsedSeconds%%CRLF%%CRLF%Can you make it count them any faster?

Label>End
Last edited by jpuziano on Sat Dec 01, 2012 8:57 pm, edited 2 times in total.
jpuziano

Note: If anyone else on the planet would find the following useful...
[Open] PlayWav command that plays from embedded script data
...then please add your thoughts/support at the above post - :-)

JRL
Automation Wizard
Posts: 3526
Joined: Mon Jan 10, 2005 6:22 pm
Location: Iowa

Post by JRL » Sat Mar 28, 2009 6:10 pm

Have you tried the techniques from here?

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Sun Mar 29, 2009 12:49 am

JRL wrote:Have you tried the techniques from here?
Thanks for the link to that post... I re-read it all just now. No, I haven't tried those methods because they're file based.

In my case, the text will be available in a dialog memo field... the user will paste it in there then click a button to process the blob... so I already have all the text available in one variable.

I believe the extra time it would take to write the text blob to a file and then read it back in, either line by line or even all at once... would really slow it down.
JRL wrote:I'm currently working with a file that is 250,000+ lines and the ReadFile/Separate method is taking waaaaaay too long. I'd be happy with anything that took less than a minute. I'm really hoping there might be a libfunc> method that works in the blink of an eye. In this case I won't/can't accept anything that requires a third party program but if you have one you might mention it for future reference. Actually there is one in scripts and tips DOS / Windows Utility - llc - Count lines in a Text (ASCII) that requires a third party executable.
I downloaded llc.exe (Larry's Line Counter) the third party executable mentioned in your post above JRL. Unfortunately "llc returns count of lines piped into command" so while it could be used to count the number of .txt files in a directory... it cannot count the number of lines within a file.

In your original post, you mentioned a "libfunc> method that works in the blink of an eye". That's what I was thinking of as well... did you ever find anything like that?

A third party exe or DLL would be fine... as long as it's faster than what I can already do with the RegEx command.

Not to disparage the RegEx command mind you... it came in at almost 20 times faster than the Separate> command for this sort of thing in my tests so thanks Marcus... RegEx is fantastic and fine for most uses.

This is just a special case where every bit of speed matters. I'm looking for the fastest solution possible... whatever it takes. If anyone has some good ideas, I'd like to hear about them.

Thanks everyone and take care.
jpuziano


JRL
Automation Wizard
Posts: 3526
Joined: Mon Jan 10, 2005 6:22 pm
Location: Iowa

Post by JRL » Sun Mar 29, 2009 5:45 am

me_again's VBScript from the post I mentioned earlier does this in half the time of regex. I get about .53 for regex and about .28 for VBScript.

Should you decide this is the solution worthy of points, they should go to me_again. :)

Code: Select all

/*
Given a very large text blob entered into a dialog memo field
and each "line" is separated by CRLF... what is the absolute
fastest way to determine how many lines it contains?

Performance Data from my home PC:

Method: Separate> command
Elapsed time in seconds:94.11474609375
Lines: 30000

Method: RegEx> with EasyPattern
Elapsed time in seconds: 4.7568359375
Lines: 30000

Method: plain RegEx>
Elapsed time in seconds: 4.7177734375
Lines: 30000

Plain RegEx is the fastest so far... but can you beat it?
*/

VBSTART
VBEND

Dialog>MyDialog
   Caption=Text Blob Line Counter Speed Challenge
   Width=324
   Height=321
   Top=133
   Left=109
   Max=0
   Min=0
   Close=1
   Resize=1
   Memo=msMemo1,10,33,295,200, Still Initializing - Please be patient... 30000 lines will appear here soon
   Label=Memo field below contains 30000 lines,8,8,true
   Button=Count the Lines and Time how long it took,10,247,223,25,3
   Button=Exit,248,247,57,25,2
EndDialog>MyDialog

//Initialize a 30000 line text blob
Let>blob_300_lines=
Let>blob_30000_lines=

//build blob_300_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_300_lines,text line%SPACE%%line_num%%CRLF%
Until>line_num=300

//build blob_30000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_30000_lines,blob_300_lines
  Message>Initializing... we'll be done at 100: %line_num%
Until>line_num=100

//Close Message> box
Press Enter

Show>MyDialog
Let>MyDialog.msMemo1=%blob_30000_lines%
ResetDialogAction>MyDialog

Label>ActionLoop
GetDialogAction>MyDialog,result
  If>result=2,End
  If>result=3,Go
Goto>ActionLoop

Label>Go
VBEval>Timer,startSeconds

//Method: Separate> command
//Separate>%MyDialog.msMemo1%,%CRLF%,returnvar

//Method: RegEx> with EasyPattern
//RegEx>[CRLF],MyDialog.msMemo1,1,matches_array,num_matches,0

//=========================================
//Method: plain RegEx>
RegEx>\r\n,MyDialog.msMemo1,0,matches_array,num_matches,0
VBEval>Timer-%startSeconds%,RegEx_elapsedSeconds

VBEval>Timer,startSeconds

//VBScript from me_again in the post I mentioned above
//Writes the text to a file
//Counts the lines
//Deletes the file
//All in half the time regex does it
Let>filetest=%temp_dir%textblob.txt
WriteLn>filetest,wres,MyDialog.msMemo1
VBStart
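Function linecount (filespec)
' Returns the number of lines in the file at filespec
Dim FSO, f, num
Const ForReading = 1
Set FSO = CreateObject("Scripting.FileSystemObject")
' Suppress the type mismatch that CInt raises on non-numeric lines
On Error Resume Next
Set f = FSO.OpenTextFile(filespec, ForReading, False)
' Read every line so that f.Line advances to the end of the file
Do While f.AtEndOfStream = False
num = num + CInt(f.ReadLine)
Loop
' f.Line is one past the last line read when the file ends with a line break
linecount = f.line - 1
End Function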
VBEnd

VBEval>linecount("%filetest%"),num_matches
DeleteFile>%filetest%

//=========================================

VBEval>Timer-%startSeconds%,elapsedSeconds

//MDL>Method: Separate> command%CRLF%%CRLF%Elapsed time in seconds:%elapsedSeconds%%CRLF%%CRLF%Lines: %returnvar_count%

//MDL>Method: RegEx> with EasyPattern%CRLF%%CRLF%Elapsed time in seconds:%elapsedSeconds%%CRLF%%CRLF%Lines: %num_matches%

MDL>Method: plain RegEx>%CRLF%%CRLF%Lines Counted: %num_matches%%CRLF%Elapsed time in seconds:%RegEx_elapsedSeconds%%CRLF%%CRLF%Can you make it count them any faster?%CRLF%me_again's VBScript time = %elapsedSeconds%

Label>End

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Sun Mar 29, 2009 9:34 am

Hi JRL,

Thanks for that... yes it seems about twice as fast.

I also learned about the unix wc (word count) command.

This link gave a bunch of links to unix tools on windows:
http://stackoverflow.com/questions/2472 ... on-windows

I installed the one available here: http://www.softwareonline.org/winxs42.html

I moved the wc.exe file it installed in the e:\Program Files\WinXs32\bin directory to the root of e:\ just to make things easier.

Then opening up a DOS command line... I type:

cd..
cd..

until my prompt is: E:\>

Then I can type:

wc -l filename.txt

and it comes back in under 1 second with:

220986 filename.txt

The -l switch tells it to count only lines; otherwise you get word and character counts too. I figured just counting lines would be faster, as that's all I need.

I can also type this...

wc -l E:\filename.txt > E:\lines.txt

...and it writes the results to the file "lines.txt".
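As a rough idea of what a `wc -l` style count involves, here is a minimal Python sketch that reads a file in binary chunks and counts newline bytes. It is not the source of any wc.exe mentioned in this thread; the function name `count_newlines_in_file` and the chunk size are assumptions chosen for illustration.

```python
def count_newlines_in_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Count LF bytes in a file, reading fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # A CRLF pair contains exactly one LF, so CRLF files count correctly
            count += chunk.count(b"\n")
    return count
```

Counting a single byte value means a chunk boundary can never split the thing being counted, which keeps the loop simple.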

However the following does not work for me:

Code: Select all

VBStart
VBEnd

VBEval>Timer,startSeconds

Run Program>command.com /c wc -l E:\filename.txt > E:\lines.txt

ReadLn>E:\lines.txt,1,result

VBEval>Timer-%startSeconds%,wc_elapsedSeconds

MDL>wc time = %wc_elapsedSeconds%%CRLF%%CRLF%wc Line count = %result%
It does create the E:\lines.txt file but the file is empty.

What am I doing wrong?
jpuziano


Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK
Contact:

Post by Marcus Tettmar » Sun Mar 29, 2009 11:28 am

The script isn't waiting for the command to complete. You need Let>RP_WAIT=1 before the Run command.
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar

Did you know we are now offering affordable monthly subscriptions for Macro Scheduler Standard?

gdyvig
Automation Wizard
Posts: 447
Joined: Fri Jun 27, 2008 7:57 pm
Location: Seattle, WA

Sampling method

Post by gdyvig » Mon Mar 30, 2009 3:18 am

I believe you will only get very small improvements by using a faster libfunc. To get a big improvement you need a new approach.

Your requirement is to quickly give the user an estimate of how long it will take to process the blob. You don't need to count every CRLF to do that. A small sampling will usually suffice.

Here is my sampling code:

Code: Select all

//=========================================
//Method: 10 Percent Sample plain RegEx>
Length>MyDialog.msMemo1,num_blob_chars
Let>nLength={Trunc(%num_blob_chars%/10)}
MidStr>MyDialog.msMemo1,1,nLength,strSample
RegEx>\r\n,strSample,0,matches_array,num_matches,0
Let>num_matches=num_matches*10
Here are my results:
1 percent sample: 0 seconds
10 percent sample: 0 to 0.015 seconds
100 percent sample: 0.25 to 0.35 seconds

Note that my 100 percent sample takes about 0.3 seconds, compared to your 4.7 seconds for the plain RegEx method. I use a VMware client running on Windows XP with 1 GB of memory. You may be using virtual memory, which will slow you down a lot.
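The sampling idea can also be written as a small Python sketch. This is an illustration only, not a translation of the Macro Scheduler code: `estimate_line_count` and `sample_fraction` are hypothetical names, and the estimate is only as good as the assumption that the sample's line lengths resemble the rest of the blob.

```python
def estimate_line_count(blob: str, sample_fraction: float = 0.10) -> int:
    """Estimate the CRLF count from a leading sample of the blob."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    sample_length = int(len(blob) * sample_fraction)
    if sample_length == 0:
        return 0
    sample = blob[:sample_length]
    # Scale by the real ratio of lengths rather than a fixed factor of 10
    return round(sample.count("\r\n") * len(blob) / sample_length)
```

Scaling by the actual length ratio avoids a small bias when the blob length is not an exact multiple of the sample size.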

Gale

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Re: Sampling method

Post by jpuziano » Mon Mar 30, 2009 10:37 am

Hi Marcus,
mtettmar wrote:The script isn't waiting for the command to complete. You need Let>RP_WAIT=1 before the Run command.
I added that, but it didn't solve the problem; the file was still empty.

It turned out that it couldn't find wc.exe unless I specified the full path. It started working when I used: e:\wc.exe
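Both lessons here (wait for the child process to finish, and make sure the executable can actually be found) apply in most scripting environments. A minimal Python sketch of the same two checks follows; `run_and_capture` is a hypothetical helper name, and nothing in it is specific to wc.exe.

```python
import shutil
import subprocess


def run_and_capture(executable: str, args: list[str]) -> str:
    """Resolve an executable, run it to completion, and return its stdout."""
    resolved = shutil.which(executable)
    if resolved is None:
        raise FileNotFoundError(f"cannot find {executable!r}; try a full path")
    # subprocess.run blocks until the child exits, so the output is complete
    completed = subprocess.run(
        [resolved, *args], capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()
```

Failing loudly when the program cannot be located is easier to debug than an empty output file.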
gdyvig wrote:I believe you will only get very small improvements by using a faster libfunc. To get a big improvement you need a new approach.

Your requirement is to quickly give the user an estimate of how long it will take to process the blob. You don't need to count every CRLF to do that. A small sampling will usually suffice.
Thanks for this idea... taking a sample and only counting the CRLFs within the sample... then extrapolating that up to something fairly close to the real total. If I can't get enough speed by any other method, I could go to that.

However I was able to get some surprising speed increases. Me_again's VBScript method was twice as fast as the RegEx method.

But much faster yet was using wc.exe from WinXs Version 4.2, which was about 10 times faster than the VBScript method and about 20 times faster than the RegEx method.

If you run the script below and you want to see the speed of wc.exe, you'll have to download and install WinXs Version 4.2 and point the script at where wc.exe is on your machine.

20 times faster than the RegEx solution... that surprised me.

I didn't think a file based solution had a chance compared to a command just acting on a variable that's already in memory... but this proves otherwise.

This is only one windows implementation of the unix wc (word count) command. There are links to others here:
http://stackoverflow.com/questions/2472 ... on-windows

Perhaps another might be even faster.

To make it really interesting, I increased the size of the text blob from 30,000 to 100,000 lines. If anyone tries the code below, please post to say what speeds you achieved:

Code: Select all

/*
Compare the speed of three different methods
of counting the number of lines in a large text blob.

Typical performance data from my PC was:

Method: plain RegEx>
Lines Counted: 100000
Elapsed time in seconds: 12.498046875

Method: Me_again's VBScript
Lines Counted: 100000
Elapsed time in seconds: 6.19921875

Method: wc.exe from WinXs Version 4.2
Lines Counted: 100000
Elapsed time in seconds: 0.5810546875
*/

Let>MSG_HEIGHT=290

VBStart
Function linecount (filespec)
Dim FSO, f, num
Const ForReading = 1
Set FSO = CreateObject("Scripting.FileSystemObject")
On Error Resume Next
Set f = FSO.OpenTextFile(filespec, ForReading, False)
Do While f.AtEndOfStream = False
num = num + CInt(f.ReadLine)
Loop
linecount = f.line - 1
End Function
VBEnd

Dialog>MyDialog
   Caption=Text Blob Line Counter Speed Challenge
   Width=324
   Height=321
   Top=133
   Left=109
   Max=0
   Min=0
   Close=1
   Resize=1
   Memo=msMemo1,10,33,295,200, Still Initializing - Please wait until the 100000 lines appear here before clicking the button below to count them
   Label=Memo field below contains 100000 lines,8,8,true
   Button=Count the Lines and Time how long it took,10,247,223,25,3
   Button=Exit,248,247,57,25,2
EndDialog>MyDialog

//Initialize a 100000 line text blob
Let>blob_1000_lines=
Let>blob_100000_lines=

//build blob_1000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_1000_lines,text "line"%SPACE%%line_num%%CRLF%
Until>line_num=1000

//build blob_100000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_100000_lines,blob_1000_lines
  Message>Initializing... we'll be done at 100: %line_num%
Until>line_num=100

//Close Message> box
Press Enter

Show>MyDialog
Let>MyDialog.msMemo1=%blob_100000_lines%
ResetDialogAction>MyDialog

Label>ActionLoop
GetDialogAction>MyDialog,result
  If>result=2,End
  If>result=3,Go
Goto>ActionLoop

Label>Go

//Method: plain RegEx>
VBEval>Timer,startSeconds
RegEx>\r\n,MyDialog.msMemo1,0,matches_array,num_matches,0
VBEval>Timer-%startSeconds%,RegEx_elapsedSeconds
Let>regex_lines_counted=num_matches

//Create file to test file based methods below
Let>filetest=%temp_dir%textblob.txt
//delete file if it exists from a previous run
DeleteFile>filetest
//write the text blob to disk
Let>WLN_NOCRLF=1
WriteLn>filetest,wres,MyDialog.msMemo1

//Method: Me_again's VBScript
VBEval>Timer,startSeconds
VBEval>linecount("%filetest%"),vb_lines_counted
VBEval>Timer-%startSeconds%,VBScript_elapsedSeconds

Let>filelines=%temp_dir%lines.txt
//delete file if it exists from a previous run 
DeleteFile>filelines
Let>RP_WAIT=1
Let>RP_WINDOWMODE=2
//Method: wc.exe from WinXs Version 4.2 
VBEval>Timer,startSeconds
Run>cmd /c e:\wc.exe -l %filetest% > %filelines%
ReadLn>filelines,1,result
RegEx>[0-9]+,result,0,matches_array,num_matches,0
VBEval>Timer-%startSeconds%,wc_elapsedSeconds
Let>wc_lines_counted=matches_array_1

//Show results
MDL>Method: plain RegEx>%CRLF%Lines Counted: %regex_lines_counted%%CRLF%Elapsed time in seconds: %RegEx_elapsedSeconds%%CRLF%%CRLF%Method: Me_again's VBScript%CRLF%Lines Counted: %vb_lines_counted%%CRLF%Elapsed time in seconds: %VBScript_elapsedSeconds%%CRLF%%CRLF%Method: wc.exe%CRLF%Lines Counted: %wc_lines_counted%%CRLF%Elapsed time in seconds: %wc_elapsedSeconds%

Label>End
gdyvig wrote: Here is my sampling code:

Code: Select all

//=========================================
//Method: 10 Percent Sample plain RegEx>
Length>MyDialog.msMemo1,num_blob_chars
Let>nLength={Trunc(%num_blob_chars%/10)}
MidStr>MyDialog.msMemo1,1,nLength,strSample
RegEx>\r\n,strSample,0,matches_array,num_matches,0
Let>num_matches=num_matches*10
Here are my results:

1 percent sample: 0 seconds
10 percent sample: 0 to 0.015 seconds
100 percent sample: 0.25 to 0.35 seconds

Note that my 100 percent sample is 0.3 seconds compared to your 4.7 seconds for the Plain RegEx method. I use a VMware client running on Windows XP with 1GIG memory. You may be using virtual memory which will slow you down a lot.

Gale
This is a 450 MHz Pentium III with 512 MB of RAM - that probably accounts for most of the speed difference... I'm just glad it runs XP. :lol:

Thanks Gale for the ideas and example code... much appreciated. It's great to have you here on the forums.

Marcus... no need to hold back any longer. Do you know of anything faster than using wc.exe above? Possibly the elusive "libfunc> method that works in the blink of an eye"?

If there is no such magical Win32 API call that could be used... could a command be purpose-built to count lines in a variable? Surely something that operates on data "already in memory" could be faster... perhaps much faster than something that has to read the data from disk. Your thoughts?
Last edited by jpuziano on Thu Apr 02, 2009 12:32 am, edited 1 time in total.
jpuziano


Marcus Tettmar
Site Admin
Posts: 7395
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK
Contact:

Post by Marcus Tettmar » Mon Mar 30, 2009 11:50 am

I haven't compared it to the previous suggestions, but here is another method to throw into the mix:

Code: Select all

VBSTART
VBEND

VBEval>Timer,startTime

Let>filename=c:\documents\testfile.txt

Let>RP_WAIT=1
Run>cmd.exe /c findstr /E "$" "%filename%" | find /C ":" > "%TEMP_DIR%\_count.txt"

ReadFile>%TEMP_DIR%\_count.txt,the_count

VBEval>Timer-%startTime%,elapsed

MessageModal>the_count - %elapsed% seconds

DeleteFile>%TEMP_DIR%\_count.txt
Marcus Tettmar

gdyvig
Automation Wizard
Posts: 447
Joined: Fri Jun 27, 2008 7:57 pm
Location: Seattle, WA

Slice and Dice method

Post by gdyvig » Mon Mar 30, 2009 3:43 pm

Here is a fast method based on the sampling method that requires no libfuncs, file transfers, or the like:

Code: Select all

//Method: Slice and Dice plain RegEx>
//RegEx on \n instead of \r\n to prevent boundary problems between slices
Let>num_matches=0
Let>nSlice=0
Length>MyDialog.msMemo1,num_blob_chars
Let>nLength={Trunc(%num_blob_chars%/10)}
Repeat>nSlice
Let>nSlicePos=nSlice*nLength
Let>nSlicePos=nSlicePos+1
Let>nSlice=nSlice+1
MidStr>MyDialog.msMemo1,nSlicePos,nLength,strSample
RegEx>\n,strSample,0,matches_array,num_slice_matches,0
Let>num_matches=num_matches+num_slice_matches
Until>nSlice=10
Let>nLeftOver=num_blob_chars-10*nLength
//=========================================

Note that I am assuming that \n never appears without \r preceding it. If that is a bad assumption, it is necessary to make sure we don't slice a \r\n in half. Also, there may be up to 9 characters left over at the end that still need to be processed.

The idea is to keep the slices small enough to prevent paging, or small enough that RegEx stays efficient, whichever is the issue.
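To make the leftover handling concrete, here is a minimal Python sketch of slice-by-slice counting that also counts the tail. It is an illustration rather than a port of the Macro Scheduler code: `count_lf_in_slices` is a hypothetical name, and it counts \n on the same assumption that \n never appears without \r.

```python
def count_lf_in_slices(blob: str, slices: int = 10) -> int:
    """Count LF characters slice by slice, including any leftover tail."""
    if slices < 1:
        raise ValueError("slices must be at least 1")
    slice_length = len(blob) // slices
    total = 0
    if slice_length:
        for index in range(slices):
            start = index * slice_length
            total += blob.count("\n", start, start + slice_length)
    # Integer division can leave up to slices - 1 characters uncounted
    tail_start = slices * slice_length
    total += blob.count("\n", tail_start)
    return total
```

Because a single character is being counted, no match can straddle a slice boundary, so the slice totals always add up to the whole-blob count.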

My performance for 30,000 lines ranges from 0.06 to 0.11 seconds most of the time.

Gale

gdyvig
Automation Wizard
Posts: 447
Joined: Fri Jun 27, 2008 7:57 pm
Location: Seattle, WA

Please test

Post by gdyvig » Mon Mar 30, 2009 7:52 pm

Hi John,

Could you test the "Slice and Dice" method?

It processed the entire blob except for a few left over characters at the end.

Also check Task Manager to determine if performance problems are due to CPU speed or physical memory limitations.

Thanks,

Gale

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Tue Mar 31, 2009 8:05 am

mtettmar wrote:I haven't compared it to the previous suggestions, but here is another method to throw into the mix:

Code: Select all

VBSTART
VBEND

VBEval>Timer,startTime

Let>filename=c:\documents\testfile.txt

Let>RP_WAIT=1
Run>cmd.exe /c findstr /E "$" "%filename%" | find /C ":" > "%TEMP_DIR%\_count.txt"

ReadFile>%TEMP_DIR%\_count.txt,the_count

VBEval>Timer-%startTime%,elapsed

MessageModal>the_count - %elapsed% seconds

DeleteFile>%TEMP_DIR%\_count.txt
Hi Marcus,

Thanks for that but after placing my file at...

c:\documents\testfile.txt

...I could not get it to run as posted. Well, it would run, but always produced a line count of 0. Are there too many \ chars in there? The value stored in the TEMP_DIR system variable already has a slash \ at the end.
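Whether the doubled backslash caused the zero count is not established in this thread, but the doubling itself is easy to avoid. A minimal Python sketch, using the standard `ntpath` module so Windows-style joining behaves the same on any platform; `temp_file_path` is a hypothetical helper name.

```python
import ntpath


def temp_file_path(temp_dir: str, file_name: str) -> str:
    """Join a Windows directory and file name without doubling the separator."""
    return ntpath.join(temp_dir, file_name)
```

The join gives the same result whether or not the directory already ends with a backslash, so the caller does not need to know which form the variable holds.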

In any case, after a bit of Googling, I found this:

http://brett.batie.com/dosshell/count-n ... using-dos/

And from that I came up with the following which did work. It's very similar to yours except findstr is looking for "^" instead of "$" and uses some different switches:

Code: Select all

VBSTART
VBEND

VBEval>Timer,startTime

Let>filename=e:\documents\testfile.txt

Let>RP_WAIT=1

//Run>cmd.exe /c findstr /E "$" "%filename%" | find /C ":" > "%TEMP_DIR%\_count.txt"
Run>cmd.exe /c findstr /R /N "^" "%filename%" | find /C ":" > "%TEMP_DIR%count.txt"

//ReadFile>%TEMP_DIR%\_count.txt,the_count
ReadFile>%TEMP_DIR%count.txt,the_count

VBEval>Timer-%startTime%,elapsed

//MessageModal>the_count - %elapsed% seconds
MessageModal>Lines Counted: %the_count%%CRLF%Time: %elapsed% seconds

//DeleteFile>%TEMP_DIR%\_count.txt
DeleteFile>%TEMP_DIR%count.txt
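The findstr /N ... | find /C ":" pipeline works because /N prefixes every line with its line number and a colon, and find /C then counts the lines containing a colon. A minimal Python sketch of that logic follows; the function names are hypothetical and the code only mimics the idea, it does not call findstr or find.

```python
def count_by_numbered_prefix(text: str) -> int:
    """Mimic numbering each line as 'n:line' and counting lines with a colon."""
    numbered = [f"{n}:{line}" for n, line in enumerate(text.splitlines(), start=1)]
    return sum(1 for line in numbered if ":" in line)


def count_lines_containing_colon(text: str) -> int:
    """Count only the lines that already contain a colon (no numbering step)."""
    return sum(1 for line in text.splitlines() if ":" in line)
```

Without the numbering step, lines that contain no colon of their own are never counted, which would give 0 for the test blob used in this thread.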
Interesting that DOS can do that... but here's the performance on my old PC:
  • Method: DOS findstr command piped to find command
    Lines Counted: 100000
    Time: 245.26318359375 seconds
That's the slowest method I've seen so it's not looking very promising. Again, the fastest so far is:
  • Method: wc.exe from WinXs Version 4.2
    Lines Counted: 100000
    Elapsed time in seconds: 0.5810546875
Does anyone have anything any faster?

Gale, I can see the "slice and dice" extrapolation being fairly accurate if the lines are of similar length. But if they aren't, the extrapolation could be way off.

For instance, what if the first line was so long that it made up 15% of the entire file... and the rest of the file had thousands of short lines? If I took just the first 10% of the file... I would find there are no CRLFs in there... zero... how would I extrapolate that up to the right number?
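That worst case is easy to reproduce. The Python sketch below builds a made-up blob with one very long first line and shows a first-10-percent extrapolation returning zero; the blob contents and the name `first_tenth_estimate` are illustrative assumptions, not data from this thread.

```python
def first_tenth_estimate(blob: str) -> int:
    """Extrapolate the CRLF count from the first 10 percent of the blob."""
    sample = blob[: len(blob) // 10]
    return sample.count("\r\n") * 10


# One very long first line (about 16% of the blob), then many short lines
long_line = "x" * 1500 + "\r\n"
short_lines = "ab\r\n" * 2000
skewed_blob = long_line + short_lines

actual = skewed_blob.count("\r\n")             # 2001 lines in total
estimated = first_tenth_estimate(skewed_blob)  # 0, the sample has no CRLF
```

Any extrapolation from a leading sample has this weakness when line lengths vary a lot, which is the argument for an exact count.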

That's why I'm looking for the fastest method possible. Once I find it... that may be all the speed I need... and I'll be able to pull the right number of lines so I can properly update the progressbar as the lines are processed.

I do appreciate your concept though, very creative and outside-the-box.


Marcus, the wc.exe I used was not a free solution... but I did find a DLL called grc32v4.dll here: http://basic.wikispaces.com/DLLs
they wrote:Simple Line/Record Counting DLL by Cassio Ferreira. 7KB DLL designed to count lines/records within text files more quickly than in native Liberty BASIC (as of v4.03). The speed improvement becomes significant when the count is in the thousands.
I downloaded it and took a look but can't quite figure out how to call this DLL. Do I have to use LibLoad> first?

Can you please take a peek and let me know how I would call it from a Macro Scheduler script... if it's possible?

Thanks Marcus and take care...
jpuziano


gdyvig
Automation Wizard
Posts: 447
Joined: Fri Jun 27, 2008 7:57 pm
Location: Seattle, WA

Slice and Dice does not extrapolate

Post by gdyvig » Tue Mar 31, 2009 12:30 pm

Hi John,

The Slice and Dice method does not extrapolate. It counts all of the CRLFs. The extrapolation method takes one slice and extrapolates it. The Slice and Dice method counts CRLFs in each slice and then adds up the slice counts for a total count.

Why is this method faster than just counting all of the CRLFs at one time? There are two reasons.

The first reason is that RegEx processes small blobs at a faster line/sec rate than large blobs, even when the entire blob is in physical memory.

The second reason is that large blobs and their associated temporary memory requirements can be larger than the physical memory available to process them. The OS is then forced to use the file system to provide extremely slow virtual memory. This reason is less likely because your blob size is small, but it is still possible if you normally run in virtual memory.

The Slice and Dice method may need some tweaking to make sure it is 100% accurate.

Give it a try with some realistic blobs.

Here are some interesting results on my machine:

100,000-line, 1.5 MB blob
Plain RegEx: 0.9 sec
EasyPatterns: 1.0 sec
JRL's VBScript: 0.9 sec
Slice and Dice, 10, 100, or 1000 slices: 0.3 sec
wc.exe: 0.3 sec

Here are some really interesting results:
1,000,000-line, 15 MB blob
Plain RegEx: 53 sec !!!
EasyPatterns: 59 sec !!!
JRL's VBScript: 7 sec
Slice and Dice, 10 slices of 100,000 lines: 3.78 sec
Slice and Dice, 100 slices of 10,000 lines: 2.41 sec
Slice and Dice, 1000 slices of 1,000 lines: 3.00 sec
wc.exe: 1.65 sec

It would be interesting to see how the third party tool (XS) handles the 15 MB superblob.

No virtual memory was required on my machine.

Gale

Last edited by gdyvig on Wed Apr 01, 2009 4:42 am, edited 1 time in total.

jpuziano
Automation Wizard
Posts: 1085
Joined: Sat Oct 30, 2004 12:00 am

Post by jpuziano » Tue Mar 31, 2009 4:59 pm

Hi Gale,

Sorry, I see now that your regex slice and dice does not extrapolate... but just counts in smaller chunks. Here's the code I used to time the leading methods with your new method included:

Code: Select all

Let>MSG_HEIGHT=350

VBStart
Function linecount (filespec)
Dim FSO, f, num
Const ForReading = 1
Set FSO = CreateObject("Scripting.FileSystemObject")
On Error Resume Next
Set f = FSO.OpenTextFile(filespec, ForReading, False)
Do While f.AtEndOfStream = False
num = num + CInt(f.ReadLine)
Loop
linecount = f.line - 1
End Function
VBEnd

Dialog>MyDialog
   Caption=Text Blob Line Counter Speed Challenge
   Width=324
   Height=321
   Top=133
   Left=109
   Max=0
   Min=0
   Close=1
   Resize=1
   Memo=msMemo1,10,33,295,200, Still Initializing - Please wait until the 100000 lines appear here before clicking the button below to count them
   Label=Memo field below contains 100000 lines,8,8,true
   Button=Count the Lines and Time how long it took,10,247,223,25,3
   Button=Exit,248,247,57,25,2
EndDialog>MyDialog

//Initialize a 100000 line text blob
Let>blob_1000_lines=
Let>blob_100000_lines=

//build blob_1000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_1000_lines,text "line"%SPACE%%line_num%%CRLF%
Until>line_num=1000

//build blob_100000_lines
Let>line_num=0
Repeat>line_num
  Let>line_num=line_num+1
  ConCat>blob_100000_lines,blob_1000_lines
  Message>Initializing... we'll be done at 100: %line_num%
Until>line_num=100

//Close Message> box
Press Enter

Show>MyDialog
Let>MyDialog.msMemo1=%blob_100000_lines%
ResetDialogAction>MyDialog

Label>ActionLoop
GetDialogAction>MyDialog,result
  If>result=2,End
  If>result=3,Go
Goto>ActionLoop

Label>Go

//Method: plain RegEx>
VBEval>Timer,startSeconds
RegEx>\r\n,MyDialog.msMemo1,0,matches_array,num_matches,0
VBEval>Timer-%startSeconds%,RegEx_elapsedSeconds
Let>regex_lines_counted=num_matches

//Create file to test file based methods below
Let>filetest=%temp_dir%textblob.txt
//delete file if it exists from a previous run
DeleteFile>filetest
//write the text blob to disk
Let>WLN_NOCRLF=1
WriteLn>filetest,wres,MyDialog.msMemo1

//Method: Me_again's VBScript
VBEval>Timer,startSeconds
VBEval>linecount("%filetest%"),vb_lines_counted
VBEval>Timer-%startSeconds%,VBScript_elapsedSeconds

Let>filelines=%temp_dir%lines.txt
//delete file if it exists from a previous run 
DeleteFile>filelines
Let>RP_WAIT=1
Let>RP_WINDOWMODE=2
//Method: wc.exe from http://www.highend3d.com 
//the following wc.exe on c: is from dos-unix-x86-v1.0-.zip and available at:
//http://www.highend3d.com/downloads/tools/os_utils/download-76.html?loc=dos-unix-x86-v1.0-.zip
VBEval>Timer,startSeconds
Run>cmd /c c:\wc.exe -l %filetest% > %filelines%
ReadLn>filelines,1,result
RegEx>[0-9]+,result,0,matches_array,num_matches,0
VBEval>Timer-%startSeconds%,wc_elapsedSeconds
Let>wc_lines_counted=matches_array_1

//Method: Gail - Slice and Dice plain RegEx>
//RegEx on \n instead of \r\n to prevent boundary problems between slices
VBEval>Timer,startSeconds
Let>num_matches=0
Let>nSlice=0
Length>MyDialog.msMemo1,num_blob_chars
Let>nLength={Trunc(%num_blob_chars%/10)}
//MidStr> positions start at 1, so each slice starts at (nSlice*nLength)+1
Repeat>nSlice
  Let>nSlicePos=nSlice*nLength
  Let>nSlicePos=nSlicePos+1
  Let>nSlice=nSlice+1
  MidStr>MyDialog.msMemo1,nSlicePos,nLength,strSample
  RegEx>\n,strSample,0,matches_array,num_slice_matches,0
  Let>num_matches=num_matches+num_slice_matches
Until>nSlice=10
//Characters left over after the 10 equal slices (the loop above does not scan them)
//Let>nLeftOver=num_blob_chars-(10*nLength)
Let>new=10*nLength
Let>nLeftOver=num_blob_chars-new
VBEval>Timer-%startSeconds%,Gail_elapsedSeconds

//Show results
MDL>Method: plain RegEx>%CRLF%Lines Counted: %regex_lines_counted%%CRLF%Elapsed time in seconds: %RegEx_elapsedSeconds%%CRLF%%CRLF%Method: Me_again's VBScript%CRLF%Lines Counted: %vb_lines_counted%%CRLF%Elapsed time in seconds: %VBScript_elapsedSeconds%%CRLF%%CRLF%Method: wc.exe from http://www.highend3d.com%CRLF%Lines Counted: %wc_lines_counted%%CRLF%Elapsed time in seconds: %wc_elapsedSeconds%%CRLF%%CRLF%Method: Gail - Slice and Dice plain RegEx>%CRLF%Lines Counted: %num_matches%%CRLF%Elapsed time in seconds: %Gail_elapsedSeconds%%CRLF%nLeftOver = %nLeftOver%

Label>End

timing results from the above wrote:
Method: plain RegEx>
Lines Counted: 100000
Elapsed time in seconds: 13.51171875

Method: Me_again's VBScript
Lines Counted: 100000
Elapsed time in seconds: 6.70703125

Method: wc.exe from http://www.highend3d.com
Lines Counted: 100000
Elapsed time in seconds: 0.421875

Method: Gail - Slice and Dice plain RegEx>
Lines Counted: 100000
Elapsed time in seconds: 2.2734375
nLeftOver = 0

Gail, your method is certainly faster than plain RegEx (2.27 seconds versus 13.51 seconds on my PC), but it is still not as fast as wc.exe (0.42 seconds).

Note that I found a wc.exe that I believe is free; the link is in the code above.

Two issues with slice and dice:

1) What if nLeftOver is greater than zero? There is no code to count the CRLFs within the nLeftOver chunk. I'm not sure how much time it would add to do that... perhaps not much.

2) There is a tuning aspect to this with respect to the number of slices. I tried a few different numbers, but they were all slower, so I left it at 10 slices. Considering that the input text blob may be anything from tiny to enormous... might there be a way for it to pick a proper number of slices by itself? For instance, say the input blob had 4 chars... is dividing by 10 going to work in that case?
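
Both points could be handled along the following lines. This is only a sketch, written in Python rather than Macro Scheduler script purely to show the arithmetic; `count_lines_sliced` and `max_slices` are hypothetical names and not part of Gail's script. It counts `\n` the same way the slice-and-dice code does, caps the number of slices at the blob length so that a tiny blob never gets zero-length slices, and scans the leftover chunk after the last full slice. It does not attempt to tune the slice count for speed.

```python
def count_lines_sliced(blob: str, max_slices: int = 10) -> int:
    """Count newline characters slice by slice, including the leftover tail.

    Hypothetical helper for illustration only; not the Macro Scheduler code.
    """
    if not blob:
        return 0

    # Never use more slices than there are characters, so a 4-char blob
    # is cut into 4 one-char slices instead of 10 zero-length ones.
    slices = min(max_slices, len(blob))
    slice_len = len(blob) // slices

    total = 0
    for i in range(slices):
        start = i * slice_len
        # Searching for the single character "\n" (not "\r\n") means a
        # line ending split across two slices is still counted once.
        total += blob.count("\n", start, start + slice_len)

    # Leftover chunk: whatever remains after the last full slice.
    leftover_start = slices * slice_len
    total += blob.count("\n", leftover_start)
    return total


if __name__ == "__main__":
    # 1003 CRLF-terminated lines in the same shape as the test blob above
    blob = "".join(f'text "line" {n}\r\n' for n in range(1, 1004))
    print(count_lines_sliced(blob))      # matches blob.count("\n")
    print(count_lines_sliced("a\r\nb"))  # a 4-character blob still works
```

In the Macro Scheduler version, the equivalent change would be one more MidStr> and RegEx> pass over the last nLeftOver characters, which should cost no more than one extra slice.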

And a question for Marcus... perhaps a silly question but I'll ask anyway:

The line below from Gail did not work for me; I had to replace it with the following 2 lines instead:

//Let>nLeftOver=num_blob_chars-(10*nLength)
Let>new=10*nLength
Let>nLeftOver=num_blob_chars-new

Is something wrong with the syntax? When I added the following line after the problem line above:

MDL>nLeftOver

The MessageModal> box displayed the following:

num_blob_chars-(10*nLength)

Why wasn't the number displayed?
gdyvig wrote:
It would be interesting to see how the third party tool (XS) handles the 15meg superblob.

It sure would be... why not give it a try and post your results? You can download a free wc.exe here:

http://www.highend3d.com/downloads/tool ... -v1.0-.zip
Last edited by jpuziano on Thu Apr 02, 2009 12:33 am, edited 3 times in total.
jpuziano


gdyvig
Automation Wizard
Posts: 447
Joined: Fri Jun 27, 2008 7:57 pm
Location: Seattle, WA

wc.exe tested on Gale's machine.

Post by gdyvig » Wed Apr 01, 2009 4:48 am

I updated my results in the previous post so you can compare it with the other methods.

wc.exe method:
100,000 lines: 0.3 sec (same as slice and dice)
1,000,000 lines: 1.65 sec (twice as fast as slice and dice)

This includes the time required to write the file and read it.

This method is very scalable for large blobs.
RegEx is not scalable.

Gale
