sraun | ISO Text Processing Help

sraun

I have 208 HTML files. I need to find the first occurrence of text between H1 Tags - like so:

<H1 ALIGN=CENTER>
sample text
</H1>

and then drop that text between the TITLE tags in the HEAD region. Yes, the sample text I need to grab is always on the line after the first H1 tag, and is always the only text on that line. The H1 tag is always early in the BODY region. I would love to automate this - I've got Perl, Python, and the standard Unix command-line text processing tools.

Anyone have any suggestions, magic invocations, or whatever? I know this can be done in Perl, probably fairly easily - but I don't do enough Perl to write it myself, and I can't conceptualize how to make the processing go backwards using the standard Unix tools.

Current Mood: frustrated
Current Location: work
Current Music: Carmina Burana


From:

johnridley.livejournal.com

This should work. It assumes that the <title>...</title> section already exists. If that's not the case let me know and I'll fix it.

Once you test this on a few things, you can remove the ".rewritten" in the second "open" to just overwrite the original file.

Usage: scriptname *.html

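#!/usr/bin/perl

# slurp mode: read each file as one string
undef $/;

# give filenames on command line
foreach $fn (@ARGV)
{
    open(I,$fn) or next;
    $contents = <I>;
    close(I);

    # capture whatever sits between the first H1 open and close tags
    unless ($contents =~ m#<H1[^>]*>(.+?)</h1#si)
    {
        print "text not found in file $fn\n";
        next;
    }
    $text = $1;
    # trim the line breaks around the captured text
    $text =~ s/^\s+|\s+$//g;
    # replace the contents of the first TITLE element only
    $contents =~ s#<title>(.*?)</title>#<title>$text</title>#si;
    open(O,">$fn.rewritten") or die "cannot write $fn.rewritten: $!\n";
    print O $contents;
    close(O);
}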
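
Since the original post lists Python as available too, here is a rough Python sketch of the same regex approach. It is not part of the script above: the function name `retitle` and the choice of latin-1 (picked only because it round-trips arbitrary bytes) are assumptions for illustration, and it keeps the ".rewritten" suffix so the originals stay untouched.

```python
import re
import sys

# First H1 element: any attributes, text captured non-greedily, case-insensitive.
H1_RE = re.compile(r"<h1[^>]*>(.+?)</h1", re.IGNORECASE | re.DOTALL)
# First TITLE element; the tags are captured so their original case is kept.
TITLE_RE = re.compile(r"(<title>).*?(</title>)", re.IGNORECASE | re.DOTALL)


def retitle(html):
    """Return html with the TITLE text replaced by the first H1 text.

    Returns None when no H1 element is found. Hypothetical helper name.
    """
    match = H1_RE.search(html)
    if match is None:
        return None
    text = match.group(1).strip()
    # A function replacement avoids backslash surprises if text contains "\".
    return TITLE_RE.sub(lambda m: m.group(1) + text + m.group(2), html, count=1)


def main(filenames):
    for filename in filenames:
        # Assumption: latin-1 is used only so every byte survives the round trip.
        with open(filename, encoding="latin-1") as f:
            contents = f.read()
        rewritten = retitle(contents)
        if rewritten is None:
            print(f"text not found in file {filename}")
            continue
        with open(filename + ".rewritten", "w", encoding="latin-1") as f:
            f.write(rewritten)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Usage would be the same as for the Perl version: pass the HTML filenames on the command line. Like the Perl script, it assumes the TITLE tags already exist.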

From:

sraun

Thank you! That did just what I wanted it to do! After I got my ones and ells and text and texy fixed. The trials of not having Internet access right now on the machine that's actually running the script - I had to transcribe it by hand.

From:

johnridley.livejournal.com

Cool. I would have stayed away from I and l and such had I known it needed to be human readable. I'll keep that in mind in the future.

Anyway, glad it worked. I kept it short by making some assumptions, but apparently they were acceptable ones.

From:

sraun

Since this is the last step in turning a .kml e-book into an .html e-book, I know what the files look like - the HEAD section is a skeleton I created, and it has TITLE tags (actually with text between them, but you wrote it such that that didn't matter). And the .kml is a sufficiently consistent format such that the first H1 tag is always the title. Next step is automating the whole process start to finish - I should be able to add your script at the end of my existing shell script.

This makes me very happy!

From:

backrubbear.livejournal.com

There are two bigger hammers you could get out for this: DOM and XSLT. DOM would let you manipulate the tree and the contents. XSLT does the same thing but has a steeper learning curve. XSLT will also be a lot pickier about the input HTML being well-formed, conformant XML.
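
For a feel of the parser-based route, here is a small sketch using Python's standard `html.parser` module. It is an event-driven parser, not a true DOM, and stands in here only because it tolerates markup like `<H1 ALIGN=CENTER>` that a strict XML parser would reject. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class FirstH1(HTMLParser):
    """Collect the text inside the first H1 element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_h1 = False   # currently inside the first H1
        self.done = False    # first H1 has been closed
        self.parts = []      # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "H1" arrives as "h1".
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def first_h1_text(html):
    """Return the first H1's text with whitespace collapsed, or None."""
    parser = FirstH1()
    parser.feed(html)
    parser.close()
    if not parser.done:
        return None
    return " ".join("".join(parser.parts).split())
```

This only extracts the heading text; writing it into the TITLE element would still need a substitution step or a tree-building library.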
