Array From A ReadLn

Post by **Marcus Tettmar** » Mon Mar 16, 2009 11:01 am

Those capture variables are only available to the RegEx command and would be used for doing replacements etc, as you have done. I'm not sure what the implications are of having RegEx return those to MacroScript variables. I'll have to look into it.

You're not going to let this go are you ;-) Parsing CSV with RegEx or EasyPatterns is going to need two parses to be reliable. Period.

Here's the EP version of my original two-parse RegEx solution:

Code: Select all

//Easypatterns:
Let>text=sally,"1,2500",fred
Let>pattern=[CSVField]
RegEx>pattern,text,1,fields,num,0

//Loop through each field
Let>pattern=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>k=0
Repeat>k
  Let>k=k+1
  Let>the_field=fields_%k%

  //RegEx to remove starting and ending quotes if present
  RegEx>pattern,the_field,1,matches,n,1,,the_field

  //Display the field
  MessageModal>the_field
Until>k=num

I think it's small price to pay to put up with the quotes and remove them if needed when that particular field is used. As I showed earlier trying to do it in one go is problematic. I don't believe it's possible to do it 100% reliably without making some assumptions about the format of the data.

jpuziano · Post by **jpuziano** » Wed Mar 18, 2009 6:04 am

Hi Marcus,

The T-shirt arrived today... thank you very much... I always wanted one of these.

mtettmar wrote:Those capture variables are only available to the RegEx command and would be used for doing replacements etc, as you have done. I'm not sure what the implications are of having RegEx return those to MacroScript variables. I'll have to look into it.

When you look into things, good things often happen so yes please do. Here's what prompted me to ask... see Perl example found here:

blog post on 'Regex Capture Groups In Python and Perl' wrote:Perl Example:

$foo = '11/14/2007';

if ($foo =~ m^[0-9]{2}/[0-9]{2}/([0-9]{4})^) {
print $1;
}

output:

2007

* Note the string we captured ended up in the special variable $1

Perl can refer to $1, $2, etc. capture group variables directly later in the code so if its technically possible... I think it would be a nice addition to round out your PCRE implementation within MacroScript.

If you were to add it, I would assume every time the RegEx> command was called:

any and all MS capture group variables created the last time the RegEx> command was called would be cleared (not set to the empty string but removed, no longer visible in the watch list)
new MS capture group variables would be created if any were specified in the regex pattern or EasyPattern

Again, easy to say but perhaps difficult to implement. Please let us know what you decide and thanks for investigating and considering this.

mtettmar wrote:You're not going to let this go are you

Are you saying I'm stubborn?

The difficult problems are the most interesting after all...

mtettmar wrote:Parsing CSV with RegEx or EasyPatterns is going to need two parses to be reliable. Period.

You may be right...

mtettmar wrote:Here's the EP version of my original two-parse RegEx solution:
Code: Select all
//Easypatterns:
Let>text=sally,"1,2500",fred
Let>pattern=[CSVField]
RegEx>pattern,text,1,fields,num,0

//Loop through each field
Let>pattern=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>k=0
Repeat>k
  Let>k=k+1
  Let>the_field=fields_%k%

  //RegEx to remove starting and ending quotes if present
  RegEx>pattern,the_field,1,matches,n,1,,the_field

  //Display the field
  MessageModal>the_field
Until>k=num
I think it's small price to pay to put up with the quotes and remove them if needed when that particular field is used. As I showed earlier trying to do it in one go is problematic. I don't believe it's possible to do it 100% reliably without making some assumptions about the format of the data.

Speaking of the format of the data, the one assumption I would like to make is that the CSV data is "properly formatted". To find out what that means, I checked here on Wikipedia: http://en.wikipedia.org/wiki/Comma-separated_values

A few rules I found there were interesting:

rules wrote:In some CSV implementations, leading and trailing spaces or tabs, adjacent to commas, are trimmed. This practice is contentious and in fact is specifically prohibited by RFC 4180, which states, "Spaces are considered part of a field and should not be ignored."

Then I realized I would be willing to accept as "standard" whatever Excel did to the data when I chose Save as type: CSV (MS-DOS)(*.csv)

- so I opened a new Excel spreadsheet (with Microsoft Office Excel 2003)
- I used "Format Cells" to set all cells to "Text" so it wouldn't mess with the data I entered... but just display what I entered
- I entered the following data values on one line:

more challenging data wrote:element 1:sally
element 2:1,2500
element 3: fred
element 4:111,222,333
element 5:
element 6:""
element 7:red
element 8:embedded"doublequote
element 9:done

Note that element 5 is empty (nothing entered in that spreadsheet cell) and element 6 is two double quote characters.

When I saved it to a CSV (MS-DOS) file, the line appeared as:

Microsoft Office Excel 2003 CSV (MS-DOS) format wrote:sally,"1,2500", fred ,"111,222,333",,"""""",red,"embedded""doublequote",done

So that proved that Excel was leaving the leading and trailing spaces around fred alone... and our existing regex methods also preserved those spaces when we extracted the elements, no problem there.

Then there are these two rules:

rules wrote:Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes.

If a field's value contains a double quote character it is escaped by placing another double quote character next to it.

And the following rule which is just the above two at the same time:

rules wrote:Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the the embedded double-quote characters must be represented by a pair of double-quote characters.

If you look at the line that Excel produced, it followed the above rules exactly.

Element 6 being two double quote characters... first had each of those escaped by doubling them... then surrounded by another pair of double quote characters so it is properly represented in the CSV file as six double quote characters in a row.

Element 8 had the same treatment, the internal " was escaped to be "" and the whole thing surrounded by another pair of double quote characters.

We know exactly what the values were we entered into the cells. We know Excel created a CSV file according to the "rules". Can we parse that line from the CSV file and populate an array with the exact values we had entered?

I found problems with the original regex pattern you supplied when it came to handling escaped double quotes. The [CSVField] EasyPattern must indeed be doing some heavy lifting with a very complicated regex pattern (Simon referred to it as horrendous) because it handles the escaped double quotes... but I found that even this EasyPattern did not convert the data back to its original form as entered into the spreadsheet (and perhaps its designers did not attempt to do that).

Please run the examples below to see what I mean:

Code: Select all

MDL>this regex pattern cannot handle doubleQuotes inside the data elements

//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green

//RegEx to extract each field to an array
Let>pattern="[^"\r\n]*"|[^,\r\n]*
RegEx>pattern,text,0,fields,num,0

//Loop through each field
Let>k=0
Repeat>k
  Let>k=k+1
  Let>the_field=fields_%k%

  //RegEx to remove starting and ending quotes if present
  RegEx>^"|"$,the_field,0,matches,n,1,,the_field

  //Display the field
  MessageModal>the_field
Until>k=num



MDL>EasyPatterns handles it but note that doubleQuote chars that were escaped (doubled) are left alone

//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green

//EasyPattern to extract each field to an array
Let>pattern=[CSVField]
RegEx>pattern,text,1,fields,num,0

//Loop through each field
Let>pattern=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>k=0
Repeat>k
  Let>k=k+1
  Let>the_field=fields_%k%

  //RegEx to remove starting and ending quotes if present
  RegEx>pattern,the_field,1,matches,n,1,,the_field

  //Display the field
  MessageModal>the_field
Until>k=num



MDL>If the goal is to load our array with the data values represented in the CSV file...
MDL>...then we need to convert any escaped (doubled) doubleQuote chars back to a single doubleQuote

//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green

//EasyPattern to extract each field to an array
Let>pattern1=[CSVField]
RegEx>pattern1,text,1,fields,num,0

//Loop through each field
Let>pattern2=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>pattern3=[doubleQuote, doubleQuote]
Let>column=0
Repeat>column
  Let>column=column+1
  Let>the_field=fields_%column%

  //RegEx to remove starting and ending quotes if present
  RegEx>pattern2,the_field,1,matches,n,1,,the_field

  //RegEx to convert any escaped (doubled) doubleQuote chars into a single doubleQuote char
  RegEx>pattern3,the_field,1,matches,n,1,",the_field

  //Display the field
  MessageModal>the_field
Until>column=num

The final modification above allows us to load the array with the exact data values that were entered into the spreadsheet... since I believe this was the goal.

Question to all: What is Excel's problem when I enter the following value in a cell:

'abc

It does not show the leading ' in the cell, though I can see its really still there as it appears in the formula line (by the fx) however, Excel drops the ' when saving to a CSV file and only saves abc as the value.

Me_again · Post by **Me_again** » Wed Mar 18, 2009 1:36 pm

jpuziano wrote:Hi Marcus,Question to all: What is Excel's problem when I enter the following value in a cell:

'abc

It does not show the leading ' in the cell, though I can see its really still there as it appears in the formula line (by the fx) however, Excel drops the ' when saving to a CSV file and only saves abc as the value.

A single quote is Excel the text indicator (going back to Lotus or earlier IIRC), try

'12345

in a field and you will see what I mean.

As someone who has dealt with many thousands of "comma delimited" files from third party sources over the years I think you missed the most important sentences of the wiki article "No general standard specification for CSV exists. Variations between CSV implementations in different programs are quite common and can lead to interoperation difficulties." In other words there is no "properly formatted" and you can't make that assumption.

Frequently programs generate "CSV" files that don't follow the Excel standard leading to confusion when Joe User opens them directly in Excel and finds that the data, particularly dates, are screwed up. The trick is to rename to .txt and open in Excel as a delimited file which gives the user control over field formatting.

jpuziano · Post by **jpuziano** » Wed Mar 18, 2009 5:02 pm

Me_again wrote:A single quote is Excel the text indicator (going back to Lotus or earlier IIRC), try

'12345

in a field and you will see what I mean.

Yes I see now, I open Excel and without changing anything with "Format Cell" I enter the following value:

12,1234

And Excel in its wisdom thinks I made a mistake so it changes what I entered to this:

121,234

Ouch. So if I want to tell Excel to just listen to me and take the values I give it without assuming I made a data entry error, I could enter it like this:

'12,1234

The leading ' is the same as using "Format Cell" and changing it to "Text" mode for that cell... so Excel shows the following in the cell:

12,1234

which is what I am trying to get into the cell... thanks for that.

Me_again wrote:As someone who has dealt with many thousands of "comma delimited" files from third party sources over the years I think you missed the most important sentences of the wiki article "No general standard specification for CSV exists. Variations between CSV implementations in different programs are quite common and can lead to interoperation difficulties." In other words there is no "properly formatted" and you can't make that assumption.

Which is why I tried to focus on "how Excel would do it". If the CSV files were generated by Excel, I would expect the CSV data to be formatted in a consistant way and I could be fairly sure how to read those values. Of course, if an unknown program generated the CSV files then as you say, there may be interoperation difficulties. Perhaps even different versions of Excel might behave differently... unknown until tested.

Me_again wrote:Frequently programs generate "CSV" files that don't follow the Excel standard leading to confusion when Joe User opens them directly in Excel and finds that the data, particularly dates, are screwed up. The trick is to rename to .txt and open in Excel as a delimited file which gives the user control over field formatting.

More good advice, thanks Me_again and take care.