Replace tab with ',' and search for duplicates

weklica · Post by **weklica** » Thu Jun 03, 2010 12:54 am

This post turned into more of a project than I anticipated. if nothing more, what is the best way to replace tabs with commas in this case and searching/identifying duplicates? Thanks .....

I am scraping several screens from an application and writing it to a text file. The file is essentially a tab delimited file. I figured it would be easier to work with if i replaced all of the tabs with commas, but I can't get it to work. This shows each row and replaces the tab with a '`', but it errors when i try a comma. Is there a way to do this?

Code: Select all

Let>k=1
Label>start
ReadLn>c:\testwrite.txt,k,line
If>line=##EOF##,finish
//Message>line
Let>string=%line%
StringReplace>string,   ,`,vbEscapedString
Message>%vbEscapedString%
wait>.5
Let>k=k+5
Goto>start
Label>finish

Below is an example of one of the thousands of rows of text. I would like to grab the last four letters of WAKLEY, the first three letters of JESSE, the M, AND the whole DATE. I assumed it would somehow be easier to achieve this if it was comma delimited, but perhaps not. Anybody have a thought of how I can read through this file and make a UID in this case (per line) which would be (for this particular line) WAKLJESM01/01/2000? Then, with that UID, I would like it to delete ALL rows which are NOT duplicates. So, if WAKLJESM01/01/2000 only shows up once, then it is gone. Otherwise, all instances of it would remain.

123456 WAKLEY JESSE M 01/01/2000 Active OUT

[/code]

adroege · Post by **adroege** » Thu Jun 03, 2010 1:18 am

but it errors when i try a comma. Is there a way to do this?

Code: Select all

Let>comma=,
StringReplace>string,%TAB%,%comma%,result

adroege · Post by **adroege** » Thu Jun 03, 2010 1:33 am

If your data looks like my sample below, then this would be a simple way to build the UID that you are looking for.

[code]
Let>comma=,
Let>data=123456,WAKLEY,JESSE,M,01/01/2000,Active,OUT
Separate>data,comma,arrItems

MidStr>arrItems_2,1,4,mylast
MidStr>arrItems_3,1,3,myfirst
Let>UID=%mylast%%myfirst%%arrItems_4%%arrItems_5%

[/code]

weklica · Post by **weklica** » Fri Jun 04, 2010 4:23 pm

Actually, I just realized this is the same project but basically a whole new post...I will post it fresh.

TEMP0000005437,wakley,jesse,M,11/09/1956,Active,OUT

What I just realized is that sometimes the line may have a middle initial, and other times it does not. That throws the whole arrItems number out the door. Last name and First name never change, but since there is sometimes a middle initial between first and gender, is there a way to have it search the string for the date. We will use the date, to create part of the UID ... but, it would also allow me to let the script understand that the Gender is ALWAYS one field before the date. Therefore, the constants for creating the UID would be %arrItems_2%, %arrItems_3%, %1 field before date%, %date%.

That would polish this whole project up if it is possible. Above is an edited line item from the actual file. It represents one of tens of thousands.

Thanks!

adroege · Post by **adroege** » Fri Jun 04, 2010 5:20 pm

TEMP0000005437,wakley,jesse,M,11/09/1956,Active,OUT

If the data is following some sort of standard then a missing field should be represented by two commas following each other.

i.e.

LAST,FIRST,MI,GENDER,DATE

So if MI was present we would have:

DOE,JOHN,P,M,11/09/1956

And if MI was missing then we would have:

DOE,JOHN,,M,11/09/1956

So you are saying the data does NOT follow this standard?

Back to what you suggested about finding the field before the date...
Yes it should be possible to find the "/" slash character which makes up the date and then substring what you want using the commas to delimit the data. Can use RegEx (Regular Expressions) or simple MidStr and Position commands to accomplish this.

--Another idea -- Why create a UID to use for matching purposes?
Why not just use the entire fields? If the goal is to remove rows which only occur once in the data?

adroege · Post by **adroege** » Fri Jun 04, 2010 6:07 pm

I think I hit upon a simple solution... Since Gender has only a couple of known values I can just test one of the array elements for these values, and from this information figure out if Middle Initial was included in the data or not.

[code]

Let>comma=,
//Data without Middle Initial
Let>data=123456,WAKLEY,JESSE,M,01/01/2000,Active,OUT
//Data WITH Middle Initial
//Let>data=123456,WAKLEY,JESSE,G,M,01/01/2000,Active,OUT

Separate>data,comma,arrItems

MidStr>arrItems_2,1,4,mylast
MidStr>arrItems_3,1,3,myfirst

If>{(%arrItems_5%="M") OR (%arrItems_5%="F") OR (%arrItems_5%="O")}
//MessageModal>Middle Initial exists!
Let>UID=%mylast%%myfirst%%arrItems_5%%arrItems_6%
Else
//MessageModal>Middle Initial Not exists!
Let>UID=%mylast%%myfirst%%arrItems_4%%arrItems_5%
Endif

[/code]

But please consider the questions I raised in my other post about even needing a UID. Is this is so you can load the data into a relational SQL database and perform queries to remove the rows you want? If so, I believe this might be overkill...... performing a sort on the data and then a simple matching algorithm will find occurances of unique rows allowing you to delete them.

Replace tab with ',' and search for duplicates

Replace tab with ',' and search for duplicates

Additional validation for date