Removing the first character in a very large file

terencepjf · Post by **terencepjf** » Thu Nov 21, 2013 12:58 pm

Hi; I have an MS script that dumps contents from an Oracle database via sqlplus, and for some reason it puts a funky character at the start of the file that makes the contents unreadable.

-I042,I042:I(A),AMPS,IA,4,07/21/2013 06:00 AM,88.812897
I042,I042:I(B),AMPS,IB,4,07/21/2013 06:00 AM,75.329751
(The - is really an extended ASCII code 254 or the like.)

I subsequently extended the script to read and strip out this char and write to a new file. This works if the file is small, but as the data grows in size it's not efficient!

How can I just delete/replace this ONE char without all the read/write overhead?

Thanks

Rain · Post by **Rain** » Thu Nov 21, 2013 3:18 pm

Have you tried Windows PowerShell? I've tested the script below with 100K and 1 million lines:
100K lines took roughly 18 seconds.
1 Million lines took roughly 174 seconds.

Code:

Timer>StartTimer
Let>InputFile=%DESKTOP_DIR%\temp.txt
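Let>OutputFile=%DESKTOP_DIR%\out.txt
Let>RP_WINDOWMODE=0
Let>RP_WAIT=1
// '^-' removes only a leading '-' on each line; an unanchored '-' would delete every hyphen in the file
Run>powershell.exe Get-Content %InputFile% | ForEach-Object {$_ -replace '^-', ''} | Set-Content %OutputFile%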
Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed

Maybe someone has a faster solution.

terencepjf · Post by **terencepjf** » Thu Nov 21, 2013 4:46 pm

Thanks Rain; I've steered clear of PowerShell as we are still on XP Pro, but it's time to dive right in based on this need.

JRL · Post by **JRL** » Thu Nov 21, 2013 8:23 pm

This took 52 seconds on a million plus (1075076) line file. Could be better or just a computer difference. I can't run Rain's because I can't find powershell.exe.

It reads the first line of the file, uses MidStr> to remove the first character, and writes that line to a new output file. It then uses DOS "type | find" to append the rest of the input file to the output file.

Code:

Timer>StartTimer
Let>InputFile=%DESKTOP_DIR%\temp.txt
Let>OutputFile=%DESKTOP_DIR%\out.txt

ReadLn>InputFile,1,res
MidStr>res,2,999999,res
WriteLn>OutputFile,wres,res

Let>RP_Windowmode=0
Let>RP_Wait=1

RunProgram>cmd /c type "%InputFile%" | find /v "%res%" >> "%OutputFile%"

Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed
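
For comparison, the same idea (drop the leading character, then copy everything else without touching individual lines) can be sketched outside Macro Scheduler. This is only a minimal Python sketch, and it assumes the unwanted character is exactly one byte at the very start of the file; the function name `strip_first_byte` and the 1 MB chunk size are illustrative choices, not anything from the scripts above.

```python
import shutil


def strip_first_byte(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, omitting the first byte (hypothetical helper)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Assumption: the unwanted character occupies exactly one byte.
        src.seek(1)
        # Stream the remainder in large chunks instead of line by line.
        shutil.copyfileobj(src, dst, chunk_size)
```

The file still has to be rewritten once, since a byte cannot simply be cut from the front of a file in place, but the copy is done in large blocks and never holds the whole file in memory.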

terencepjf · Post by **terencepjf** » Tue Nov 26, 2013 7:05 pm

Thanks to all for the suggestions - I got PowerShell to work!

hagchr · Post by **hagchr** » Wed Nov 27, 2013 1:38 pm

Hi, I was curious to see if one could use RegEx to solve it. I'm not sure whether there are any upper limits when the file gets much larger, but for one million lines it completes in around 1 second.

Code:

Let>InputFile=C:\Users\Christer\Documents\testfile.txt
Let>OutputFile=C:\Users\Christer\Documents\resfile.txt

Timer>StartTimer

ReadFile>InputFile,strInput
RegEx>(?s)(?<=-).+,strInput,0,Matches,NumMatches,0,,
WriteLn>OutputFile,nWLNRes,Matches_1

Timer>EndTimer
Let>SecElapsed={(%EndTimer%-%StartTimer%)/1000}
mdl>SecElapsed
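
The whole-file approach above (read everything into one string, apply a single regular expression, write the result) might look roughly like this in Python. It is a sketch under assumptions: the names `LEADING_BYTE`, `strip_first_byte_in_memory` and `strip_file` are made up for illustration, and the pattern anchors on the start of the data rather than looking for a literal `-` as the posted pattern does. Like the script above, it holds the entire file in memory, so available RAM is the practical upper limit.

```python
import re

# \A anchors at the very start of the data; with DOTALL, '.' matches any single byte.
LEADING_BYTE = re.compile(rb"\A.", re.DOTALL)


def strip_first_byte_in_memory(data: bytes) -> bytes:
    """Return data without its first byte (hypothetical helper)."""
    return LEADING_BYTE.sub(b"", data, count=1)


def strip_file(src_path, dst_path):
    """Read the whole file, strip the leading byte, write the result."""
    with open(src_path, "rb") as f:
        data = f.read()  # the entire file is loaded into memory here
    with open(dst_path, "wb") as f:
        f.write(strip_first_byte_in_memory(data))
```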