ISO Text Processing Help
Jan. 6th, 2008 03:16 pm
I have 208 HTML files. I need to find the first occurrence of text between H1 Tags - like so:
<H1 ALIGN=CENTER>
sample text
</H1>
and then drop that text in between the TITLE tags in the HEAD region, replacing whatever is there. Yes, the sample text I need to grab is always on the line after the first H1 tag, and is always the only text on that line. The H1 tag is always early in the BODY region. I would love to automate this - I've got Perl, Python, and the standard Unix command-line text processing tools.
Anyone have any suggestions, magic invocations, or whatever? I know this can be done in Perl, probably fairly easily - but I don't do enough Perl to write it myself, and I can't conceptualize how to make the processing go backwards using the standard Unix tools.
no subject
Date: 2008-01-06 09:39 pm (UTC)
Once you test this on a few things, you can remove the ".rewritten" in the second "open" to just overwrite the original file.
Usage: scriptname *.html
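#!/usr/bin/perl
use strict;
use warnings;

# Slurp mode: read each file as one string instead of line by line.
undef $/;

# give filenames on command line
foreach my $fn (@ARGV)
{
    open(my $in, '<', $fn) or next;
    my $contents = <$in>;
    close($in);

    # Grab the text inside the first H1 element (any case, may span lines).
    unless ($contents =~ m#<H1[^>]*>(.+?)</h1#si)
    {
        print "text not found in file $fn\n";
        next;
    }
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;    # trim the newlines around the heading text

    # Replace whatever is between the TITLE tags with the heading text.
    $contents =~ s#<title>(.*?)</title>#<title>$text</title>#si;

    # Write to a separate file so the original stays untouched.
    open(my $out, '>', "$fn.rewritten") or die "cannot write $fn.rewritten: $!";
    print $out $contents;
    close($out);
}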
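Since Python is also on the list of available tools, here is a rough sketch of the same idea in Python, in case that is easier to maintain. It is not a tested drop-in replacement: the function name `retitle`, the Latin-1 encoding, and the ".rewritten" suffix handling are assumptions made for illustration. Like the Perl version, it assumes there is exactly one TITLE element and that the first H1 holds the wanted text. Unlike the Perl version, it keeps the TITLE tags in whatever case they were already written.

```python
import re
import sys
from typing import Optional

# First H1 element: any attributes, any case, content may span lines.
H1_RE = re.compile(r"<h1[^>]*>(.+?)</h1", re.IGNORECASE | re.DOTALL)
# TITLE element: capture the tags so their original case is kept.
TITLE_RE = re.compile(r"(<title>).*?(</title>)", re.IGNORECASE | re.DOTALL)


def retitle(html: str) -> Optional[str]:
    """Return html with the TITLE text replaced by the first H1 text.

    Returns None when no H1 element is found.
    """
    match = H1_RE.search(html)
    if match is None:
        return None
    heading = match.group(1).strip()  # drop the newlines around the text
    # A function replacement keeps backslashes in the heading from being
    # interpreted as group references.
    return TITLE_RE.sub(lambda m: m.group(1) + heading + m.group(2), html, count=1)


def main(paths):
    for path in paths:
        # Assumption: Latin-1 decodes any byte sequence, so files round-trip
        # unchanged; newline="" leaves line endings alone.
        with open(path, encoding="latin-1", newline="") as fh:
            html = fh.read()
        result = retitle(html)
        if result is None:
            print(f"text not found in file {path}")
            continue
        with open(path + ".rewritten", "w", encoding="latin-1", newline="") as fh:
            fh.write(result)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Usage would be the same as for the Perl script: pass the HTML filenames on the command line and check the ".rewritten" copies before overwriting anything.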
no subject
Date: 2008-01-06 10:35 pm (UTC)
no subject
Date: 2008-01-07 01:21 am (UTC)
Anyway, glad it worked. I kept it short by making some assumptions, but apparently they were acceptable ones.
no subject
Date: 2008-01-07 01:32 pm (UTC)
This makes me very happy!
no subject
Date: 2008-01-07 03:31 am (UTC)