[personal profile] sraun
I have 208 HTML files. In each one, I need to find the first occurrence of text between H1 tags, like so:

<H1 ALIGN=CENTER>
sample text
</H1>


and then put that text between the TITLE tags in the HEAD region. Yes, the sample text I need to grab is always on the line after the first H1 tag, and is always the only text on that line. The H1 tag is always early in the BODY region. I would love to automate this - I've got Perl, Python, and the standard Unix command-line text processing tools.

Anyone have any suggestions, magic invocations, or whatever? I know this can be done in Perl, probably fairly easily - but I don't do enough Perl to write it myself, and I can't conceptualize how to make the processing go backwards using the standard Unix tools.

Date: 2008-01-06 09:39 pm (UTC)
From: [identity profile] johnridley.livejournal.com
This should work. It assumes that the <title>...</title> section already exists. If that's not the case let me know and I'll fix it.

Once you test this on a few things, you can remove the ".rewritten" in the second "open" to just overwrite the original file.

Usage: scriptname *.html

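#!/usr/bin/perl
use strict;
use warnings;

# slurp mode: read each file in one go
undef $/;

# give filenames on command line
foreach my $fn (@ARGV)
{
# skip files that cannot be opened, but say so
unless (open(I, '<', $fn)) { warn "cannot open $fn: $!\n"; next; }
my $contents = <I>;
close(I);

# grab the text of the first H1, minus the surrounding whitespace
unless ($contents =~ m#<h1[^>]*>\s*(.+?)\s*</h1>#si)
{
print "text not found in file $fn\n";
next;
}
my $text = $1;
# replace whatever is between the TITLE tags with that text
$contents =~ s#<title>(.*?)</title>#<title>$text</title>#si;
open(O, '>', "$fn.rewritten") or die "cannot write output for $fn: $!\n";
print O $contents;
close(O);
}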
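
Since Python is also on the list of available tools, here is a rough Python sketch of the same idea as the Perl script above. It is not part of the original reply. The names retitle and main are made up for illustration, and it assumes the files can be read as Latin-1 (any single-byte encoding round-trips that way) and that each file already has a TITLE element.

```python
import re
import sys
from pathlib import Path

# First H1 element, with the whitespace around its text trimmed off.
H1_RE = re.compile(r"<h1[^>]*>\s*(.+?)\s*</h1>", re.IGNORECASE | re.DOTALL)
# The TITLE element; the tags are captured so their original case is kept.
TITLE_RE = re.compile(r"(<title>).*?(</title>)", re.IGNORECASE | re.DOTALL)


def retitle(html):
    """Return html with the first H1 text copied into TITLE, or None if no H1."""
    match = H1_RE.search(html)
    if match is None:
        return None
    heading = match.group(1)
    # A function replacement keeps backslashes in the heading from being
    # treated as regex escapes.
    return TITLE_RE.sub(lambda m: m.group(1) + heading + m.group(2), html, count=1)


def main(paths):
    for name in paths:
        # Assumption: Latin-1 is good enough to read and write these files.
        new_html = retitle(Path(name).read_text(encoding="latin-1"))
        if new_html is None:
            print(f"text not found in file {name}")
            continue
        # Write beside the original, like the Perl version, so nothing is lost.
        Path(name + ".rewritten").write_text(new_html, encoding="latin-1")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Usage would be the same as for the Perl version: scriptname *.html, then check a few of the .rewritten files before trusting the rest.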

Date: 2008-01-07 01:21 am (UTC)
From: [identity profile] johnridley.livejournal.com
Cool. I would have stayed away from I and l and such had I known it needed to be human readable. I'll keep that in mind in the future.

Anyway, glad it worked. I kept it short by making some assumptions, but apparently they were acceptable ones.

Date: 2008-01-07 03:31 am (UTC)
From: [identity profile] backrubbear.livejournal.com
There are two bigger hammers you could get out for this: DOM and XSLT. DOM would let you manipulate the tree and the contents. XSLT does the same thing but has a steeper learning curve. XSLT will also be a lot pickier about the input HTML being well-formed XML.
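
To give a feel for the DOM route, here is a minimal sketch using Python's standard xml.dom.minidom module. It is an illustration under a big assumption: the input has to be well-formed XML (lowercase tags, quoted attributes), which the sample in the post with its unquoted ALIGN=CENTER is not, so real files would need cleaning up first or a more forgiving HTML parser. The function name set_title_from_h1 is made up for this example.

```python
from xml.dom import minidom


def set_title_from_h1(xhtml):
    """Copy the text of the first h1 into title by editing the DOM tree."""
    doc = minidom.parseString(xhtml)
    # minidom matches tag names case-sensitively, so this assumes lowercase tags.
    h1_nodes = doc.getElementsByTagName("h1")
    title_nodes = doc.getElementsByTagName("title")
    if not h1_nodes or not title_nodes:
        raise ValueError("document needs both an h1 and a title element")
    # Join the text nodes directly under the first h1 and trim the whitespace.
    heading = "".join(
        child.data
        for child in h1_nodes[0].childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()
    title = title_nodes[0]
    # Empty the title element, then give it a single new text node.
    while title.firstChild:
        title.removeChild(title.firstChild)
    title.appendChild(doc.createTextNode(heading))
    return doc.documentElement.toxml()


sample = (
    "<html><head><title>old</title></head>"
    '<body><h1 align="center">\nsample text\n</h1></body></html>'
)
result = set_title_from_h1(sample)
```

The point of the tree approach is that the code addresses elements rather than character patterns, so odd spacing or line breaks inside the tags stop mattering. The cost is that the parser rejects anything that is not well formed.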
