Strip out specific characters from string

Technical support and scripting issues

RNIB
Macro Veteran
Posts: 186
Joined: Thu Jan 10, 2008 10:25 am
Location: London, UK

Strip out specific characters from string

Post by RNIB » Mon Sep 23, 2024 10:17 am

I'm trying to find a way to extract the title from the metadata of an HTML file. The title always appears within the following tag:

Code: Select all

<meta name="dc:title" content="Warriors:"/>
The tag is always formatted in the same way but the title, in this case Warriors:, is always different. The tag also doesn't always appear in the same place within the HTML file.

So far my code is:

Code: Select all

Let>SourcePath=E:\Data1\Temp Double Disc No Link\704149
Let>DiscNum=0
Let>t=0
Label>TitleLoop
//Search for Kbyte in metadata
If>DiscNum=0
ReadLn>%SourcePath%\ncc.html,%t%,strTitlePos
Else
ReadLn>%SourcePath%\Part%DiscNum%\ncc.html,%t%,strTitlePos
Position>dc:title,strTitlePos,1,nPos,TRUE
If>nPos>0,Finish
Let>t=t+1
Goto>TitleLoop
Label>Finish
RegEx>(?<=content=").*?(?="/>),strTitlePos,0,Matches,nm,0
ExtractFileName>Matches_1,strTitle
However, with this code strTitle is empty, yet when I look at the Matches array I see:
Matches_1=Warriors:

So I don't understand why it's not working. I also want to remove any non-alphanumeric characters and replace & with the word and. I know I can do this with StringReplace, but is there a more elegant way of doing that?

Dorian (MJT support)
Automation Wizard
Posts: 1379
Joined: Sun Nov 03, 2002 3:19 am

Re: Strip out specific characters from string

Post by Dorian (MJT support) » Mon Sep 23, 2024 11:04 am

I think I'd do this in stages: grab everything between content= and />, use StringReplace to replace the &, then use RegEx to remove all non-alphanumeric characters. But I'm no RegEx expert; there may be others here who can do it all in one go.

Code: Select all

Let>TheHTML=<meta name="dc:title" content="Warri??ors & W!arl>>ords:"/>

//New Command 15.0.23
//Between>TheHTML,content=,/>,matches,num

//Prior to 15.0.23
RegEx>(?<=content=).*?(?=/>),TheHTML,0,matches,nm,0

StringReplace>matches_1,&,and,matches_1

Let>pattern=[^a-zA-Z0-9 -]
RegEx>pattern,matches_1,0,matches,num,1,,result

mdl>result
Yes, we have a Custom Scripting Service. Message me or go here

Dorian (MJT support)
Automation Wizard
Posts: 1379
Joined: Sun Nov 03, 2002 3:19 am

Re: Strip out specific characters from string

Post by Dorian (MJT support) » Mon Sep 23, 2024 11:08 am

I'd also be inclined to use ReadFile instead of ReadLn, and then Between or RegEx to get everything between dc:title and />.

hagchr
Automation Wizard
Posts: 331
Joined: Mon Jul 05, 2010 7:53 am
Location: Stockholm, Sweden

Re: Strip out specific characters from string

Post by hagchr » Mon Sep 23, 2024 6:03 pm

Hi, expanding on Dorian's suggestion:

In RegEx, \K means "forget everything matched so far and carry on from here". So in the example below, the pattern matches from dc:title up to content=", discards that part (\K), and then matches the following characters that are not ", i.e. Warriors:.

Hope it helps.

Code: Select all

//Read source file into variable
Let>Source=E:\Data1\Temp Double Disc No Link\704149\ncc.html
ReadFile>Source,strFileContents

//Extract Title
Let>pattern=dc:title.+?content="\K[^"]+
RegEx>pattern,strFileContents,0,Matches,nm,0

//Replace & with and
Let>pattern=&
RegEx>pattern,Matches_1,0,Matches,nm,1,and,result

//Remove non-alpha
Let>pattern=[^a-zA-Z0-9 -]
RegEx>pattern,result,0,Matches,nm,1,,result

MDL>result

Dorian (MJT support)
Automation Wizard
Posts: 1379
Joined: Sun Nov 03, 2002 3:19 am

Re: Strip out specific characters from string

Post by Dorian (MJT support) » Mon Sep 23, 2024 6:21 pm

Many thanks. I'm always very grateful to the Regex gurus in here.

RNIB
Macro Veteran
Posts: 186
Joined: Thu Jan 10, 2008 10:25 am
Location: London, UK

Re: Strip out specific characters from string

Post by RNIB » Mon Sep 23, 2024 6:53 pm

Ahh, okay. That makes more sense. I'm very much a newbie to RegEx, and by newbie I mean someone who copied a bit of code from the blog and amended it to suit their needs. But I am starting to understand it more... slowly.

Thank you both for your suggestions and help. Very much appreciated.

RNIB
Macro Veteran
Posts: 186
Joined: Thu Jan 10, 2008 10:25 am
Location: London, UK

Re: Strip out specific characters from string

Post by RNIB » Tue Sep 24, 2024 4:47 pm

hagchr wrote:
Mon Sep 23, 2024 6:03 pm
Hi, expanding on Dorian's suggestion:

This is working really well, as I've found a few instances where a space had been added before the closing />, which my code missed. However, it does one thing that I don't know how to resolve. If a title has a ' in it, e.g. The Devil's Lair, the ' is replaced with andampapos, so the title becomes The Devilandampaposs Lair. Is there a way to simply remove the '?

Dorian (MJT support)
Automation Wizard
Posts: 1379
Joined: Sun Nov 03, 2002 3:19 am

Re: Strip out specific characters from string

Post by Dorian (MJT support) » Tue Sep 24, 2024 9:13 pm

Not being a Regex guru, my approach would simply be to let that happen and then change it back again with another replace. I do realise that's somewhat of a bodge.

* "it's not wrong if it works" philosophy.
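
A minimal sketch of that "change it back" step, assuming the unwanted residue is always exactly andampapos as in the example above. The sample title is hard-coded purely for illustration; in the real script result would come from the earlier clean-up steps.

```macroscheduler
//Assumed input: a title as it comes out of the earlier clean-up steps
Let>result=The Devilandampaposs Lair

//Remove the residue left behind by the mangled apostrophe
StringReplace>result,andampapos,,result

MDL>result
```

This would also alter any title that genuinely contains that letter sequence, which seems unlikely but is worth keeping in mind.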

Grovkillen
Automation Wizard
Posts: 1131
Joined: Fri Aug 10, 2012 2:38 pm
Location: Bräcke, Sweden

Re: Strip out specific characters from string

Post by Grovkillen » Wed Sep 25, 2024 10:35 am

I suspect the unescaped "-" in the character class isn't correct RegEx. Try escaping it:

Code: Select all

[^a-zA-Z0-9 \-]
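
For what it's worth, in PCRE-style character classes a hyphen placed last is normally treated as a literal hyphen, so the escaped and unescaped forms would be expected to behave the same. A quick way to check on your own version, assuming a made-up sample string (strSample, resultA and resultB are illustrative names only):

```macroscheduler
//Assumed sample text containing a hyphen and some punctuation
Let>strSample=Sea-Wolf: Part 2!

//Hyphen last in the class, unescaped
Let>pattern=[^a-zA-Z0-9 -]
RegEx>pattern,strSample,0,Matches,nm,1,,resultA

//Hyphen escaped
Let>pattern=[^a-zA-Z0-9 \-]
RegEx>pattern,strSample,0,Matches,nm,1,,resultB

//Show both results, one per line, for comparison
MDL>%resultA%%CRLF%%resultB%
```

If the two forms are equivalent, both lines in the message will be identical.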

hagchr
Automation Wizard
Posts: 331
Joined: Mon Jul 05, 2010 7:53 am
Location: Stockholm, Sweden

Re: Strip out specific characters from string

Post by hagchr » Fri Oct 04, 2024 5:07 pm

Hi, sorry for the late reply, but in case you have not resolved it... The code should normally take care of the ' as well, see example below. Is something else happening? You probably have something like "&amp;apos;" that needs to be handled. Maybe you can paste a couple of HTML lines around and including the problem text?

Code: Select all

Let>strFileContents=<meta name="dc:title" content="Devil's Lair & Warlor??ds:"/>

//Read source file into variable
//Let>Source=E:\Data1\Temp Double Disc No Link\704149\ncc.html
//ReadFile>Source,strFileContents

//Extract Title
Let>pattern=dc:title.+?content="\K[^"]+
RegEx>pattern,strFileContents,0,Matches,nm,0

//Replace & with and
Let>pattern=&
RegEx>pattern,Matches_1,0,Matches,nm,1,and,result

//Remove non-alpha
Let>pattern=[^a-zA-Z0-9 -]
RegEx>pattern,result,0,Matches,nm,1,,result

MDL>result
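
If the source does turn out to contain something like &amp;apos;, one possible fix is to deal with the entities before the general & replacement. This is only a rough sketch under that assumption: the sample string and the variable name strTitle are made up for illustration, and StringReplace is used for the literal replacements instead of RegEx.

```macroscheduler
//Assumed sample: the apostrophe is stored as the double-escaped entity &amp;apos;
Let>strTitle=The Devil&amp;apos;s Lair &amp; Other Stories

//Remove apostrophe entities first, longest form first
StringReplace>strTitle,&amp;apos;,,strTitle
StringReplace>strTitle,&apos;,,strTitle

//Turn the ampersand entity, then any remaining bare &, into the word and
StringReplace>strTitle,&amp;,and,strTitle
StringReplace>strTitle,&,and,strTitle

//Remove non-alphanumerics (keeping space and hyphen), as before
Let>pattern=[^a-zA-Z0-9 -]
RegEx>pattern,strTitle,0,Matches,nm,1,,strTitle

MDL>strTitle
```

The order matters here. If & is replaced first, &amp;apos; turns into andamp;apos;, and the later clean-up leaves andampapos behind, which matches the symptom reported above.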
