Hi Marcus,
The T-shirt arrived today... thank you very much... I always wanted one of these.
mtettmar wrote:Those capture variables are only available to the RegEx command and would be used for doing replacements etc, as you have done. I'm not sure what the implications are of having RegEx return those to MacroScript variables. I'll have to look into it.
When you look into things, good things often happen so yes please do. Here's what prompted me to ask... see Perl example found
here:
blog post on 'Regex Capture Groups In Python and Perl' wrote:Perl Example:
$foo = '11/14/2007';
if ($foo =~ m^[0-9]{2}/[0-9]{2}/([0-9]{4})^) {
print $1;
}
output:
2007
* Note the string we captured ended up in the special variable $1
Perl can refer to $1, $2, etc. capture group variables directly later in the code so if its technically possible... I think it would be a nice addition to round out your PCRE implementation within MacroScript.
If you were to add it, I would assume every time the RegEx> command was called:
- any and all MS capture group variables created the last time the RegEx> command was called would be cleared (not set to the empty string but removed, no longer visible in the watch list)
- new MS capture group variables would be created if any were specified in the regex pattern or EasyPattern
Again, easy to say but perhaps difficult to implement. Please let us know what you decide and thanks for investigating and considering this.
mtettmar wrote:You're not going to let this go are you
Are you saying I'm stubborn?
The difficult problems are the most interesting after all...
mtettmar wrote:Parsing CSV with RegEx or EasyPatterns is going to need two parses to be reliable. Period.
You may be right...
mtettmar wrote:Here's the EP version of my original two-parse RegEx solution:
Code: Select all
//Easypatterns:
Let>text=sally,"1,2500",fred
Let>pattern=[CSVField]
RegEx>pattern,text,1,fields,num,0
//Loop through each field
Let>pattern=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>k=0
Repeat>k
Let>k=k+1
Let>the_field=fields_%k%
//RegEx to remove starting and ending quotes if present
RegEx>pattern,the_field,1,matches,n,1,,the_field
//Display the field
MessageModal>the_field
Until>k=num
I think it's small price to pay to put up with the quotes and remove them if needed when that particular field is used. As I showed earlier trying to do it in one go is problematic. I don't believe it's possible to do it 100% reliably without making some assumptions about the
format of the data.
Speaking of the format of the data, the one assumption I would like to make is that the CSV data is "properly formatted". To find out what that means, I checked here on Wikipedia:
http://en.wikipedia.org/wiki/Comma-separated_values
A few rules I found there were interesting:
rules wrote:In some CSV implementations, leading and trailing spaces or tabs, adjacent to commas, are trimmed. This practice is contentious and in fact is specifically prohibited by RFC 4180, which states, "Spaces are considered part of a field and should not be ignored."
Then I realized I would be willing to accept as "standard" whatever Excel did to the data when I chose Save as type:
CSV (MS-DOS)(*.csv)
- so I opened a new Excel spreadsheet (with Microsoft Office Excel 2003)
- I used "Format Cells" to set all cells to "Text" so it wouldn't mess with the data I entered... but just display what I entered
- I entered the following data values on one line:
more challenging data wrote:element 1:sally
element 2:1,2500
element 3: fred
element 4:111,222,333
element 5:
element 6:""
element 7:red
element 8:embedded"doublequote
element 9:done
Note that element 5 is empty (nothing entered in that spreadsheet cell) and element 6 is two double quote characters.
When I saved it to a
CSV (MS-DOS) file, the line appeared as:
Microsoft Office Excel 2003 CSV (MS-DOS) format wrote:sally,"1,2500", fred ,"111,222,333",,"""""",red,"embedded""doublequote",done
So that proved that Excel was leaving the leading and trailing spaces around fred alone... and our existing regex methods also preserved those spaces when we extracted the elements, no problem there.
Then there are these two rules:
rules wrote:Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes.
If a field's value contains a double quote character it is escaped by placing another double quote character next to it.
And the following rule which is just the above two at the same time:
rules wrote:Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the the embedded double-quote characters must be represented by a pair of double-quote characters.
If you look at the line that Excel produced, it followed the above rules exactly.
Element 6 being two double quote characters... first had each of those escaped by doubling them... then surrounded by another pair of double quote characters so it is properly represented in the CSV file as six double quote characters in a row.
Element 8 had the same treatment, the internal " was escaped to be "" and the whole thing surrounded by another pair of double quote characters.
We know exactly what the values were we entered into the cells. We know Excel created a CSV file according to the "rules". Can we parse that line from the CSV file and populate an array with the exact values we had entered?
I found problems with the original regex pattern you supplied when it came to handling escaped double quotes. The [CSVField] EasyPattern must indeed be doing some heavy lifting with a very complicated regex pattern (Simon referred to it as horrendous) because it handles the escaped double quotes... but I found that even this EasyPattern did not convert the data back to its original form as entered into the spreadsheet (and perhaps its designers did not attempt to do that).
Please run the examples below to see what I mean:
Code: Select all
MDL>this regex pattern cannot handle doubleQuotes inside the data elements
//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green
//RegEx to extract each field to an array
Let>pattern="[^"\r\n]*"|[^,\r\n]*
RegEx>pattern,text,0,fields,num,0
//Loop through each field
Let>k=0
Repeat>k
Let>k=k+1
Let>the_field=fields_%k%
//RegEx to remove starting and ending quotes if present
RegEx>^"|"$,the_field,0,matches,n,1,,the_field
//Display the field
MessageModal>the_field
Until>k=num
MDL>EasyPatterns handles it but note that doubleQuote chars that were escaped (doubled) are left alone
//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green
//EasyPattern to extract each field to an array
Let>pattern=[CSVField]
RegEx>pattern,text,1,fields,num,0
//Loop through each field
Let>pattern=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>k=0
Repeat>k
Let>k=k+1
Let>the_field=fields_%k%
//RegEx to remove starting and ending quotes if present
RegEx>pattern,the_field,1,matches,n,1,,the_field
//Display the field
MessageModal>the_field
Until>k=num
MDL>If the goal is to load our array with the data values represented in the CSV file...
MDL>...then we need to convert any escaped (doubled) doubleQuote chars back to a single doubleQuote
//More challenging CSV line
Let>text=sally,"1,2500", fred ,"111,222,333",,"""""",red,"inside""doublequote",green
//EasyPattern to extract each field to an array
Let>pattern1=[CSVField]
RegEx>pattern1,text,1,fields,num,0
//Loop through each field
Let>pattern2=[(lineStart, doubleQuote) or (doubleQuote, lineEnd)]
Let>pattern3=[doubleQuote, doubleQuote]
Let>column=0
Repeat>column
Let>column=column+1
Let>the_field=fields_%column%
//RegEx to remove starting and ending quotes if present
RegEx>pattern2,the_field,1,matches,n,1,,the_field
//RegEx to convert any escaped (doubled) doubleQuote chars into a single doubleQuote char
RegEx>pattern3,the_field,1,matches,n,1,",the_field
//Display the field
MessageModal>the_field
Until>column=num
The final modification above allows us to load the array with the exact data values that were entered into the spreadsheet... since I believe this was the goal.
Question to all: What is Excel's problem when I enter the following value in a cell:
'abc
It does not show the leading ' in the cell, though I can see its really still there as it appears in the formula line (by the fx) however, Excel drops the ' when saving to a CSV file and only saves abc as the value.