Scripting News for 4/7/2007

Is Google the root of all pain? 

Washington Post: “‘If all of the newspapers in America did not allow Google to steal their content, how profitable would Google be?’ Zell said during the question period after his speech. ‘Not very.’”

Believe it or not, I think I understand what he’s saying, even though what he literally said makes no sense.

He’s thinking that publishing their content on the web is a money-loser for his papers. It has nothing to do with Google, but in his mind there’s no separation between Google and the web (and he has a point: most of the money being made on the web, market cap-wise, is being made by Google).

He’s looking at the balance sheets of the papers he’s bought and wondering “Why the fuck are we on the web?”

Good question, btw. I’m not saying it would be good for us if they pulled out of the web, and ultimately it probably wouldn’t do anything to help the financial condition of his papers.

Look a little closer 

Some people are saying that Day 1 of Scripting News was nothing more than a collection of links to sites I had visited. I have a couple of responses to that.

1. That’s what weblogs were in the early days; it’s only later that the title-link-description model became the norm. Wes Felter, an early blogger, laments that change, and says it has more to do with the software than with the way people blog. I agree; in fact, that’s why sites like del.icio.us came along to reinvent what blogs were in the first place! 🙂

2. Look a little closer: the last link on the page is to a DaveNet piece, which was more like what you’d think of as a blog post today. For people who weren’t around back then, DaveNet was a series of essays I started in 1994; they were shipped via email, published on the web, and often quoted in the press (thanks!). I humbly put the link at the bottom of the page in the beginning because I didn’t want to seem egotistical, but later I got over it and put it at the top, where it belonged. Eventually I stopped writing DaveNets because email became clogged with spam, and I wanted to encourage people to use the web more. Nowadays Scripting News is like DaveNet, so things have come pretty much full circle in the almost thirteen years since all this mishegas started.

Today’s links 

NY Times: “The popular image of a heart attack is all wrong.”

Rafe Needleman: “Jaiku is another nanoblogging service, much like Twitter.”

Question for htaccess experts 

I have some documents in a site with no suffixes, like this:

http://somesite.com/utah
http://somesite.com/georgia
http://somesite.com/arizona

And I want to redirect them to:

http://somesite.com/utah.html
http://somesite.com/georgia.html
http://somesite.com/arizona.html

I tried the obvious thing in an htaccess file:

Redirect /utah http://somesite.com/utah.html

But it generates an infinite series of redirects. Makes sense when you think about it: Redirect doesn’t match exact URLs, it matches prefixes, so the redirected URL keeps matching the rule all over again.
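
Here’s a sketch of the loop, as far as I can tell (assuming Redirect really does match by prefix):

Redirect /utah http://somesite.com/utah.html
# request /utah       -> sent to /utah.html           (prefix match)
# request /utah.html  -> still begins with /utah,
#                        so it's sent to /utah.html.html
# request /utah.html.html -> /utah.html.html.html, and so on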

I thought maybe RedirectMatch would be the right way to go, but it’s not clicking.

How do you handle a situation like this??

Please post a comment if you know. Thanks!!

Why lawyers are special 

In a comment on yesterday’s Lawyers essay, Dan Stoval says lawyers are “no better, no worse than any other profession.”

Of course other professions and trades have unethical and incompetent people, and of course there are lawyers who are principled and competent, and there are lawyers who are good some days and bad others.

But — people are scared of “outing” lawyers who misbehave because when a lawyer gets mad, he or she can destroy you. I suppose a doctor can too, but that’s a little too scary to contemplate — let’s hope that doesn’t actually happen. But lawyers try to destroy people as a matter of everyday business. Non-lawyers just accept it with a shoulder-shrug.

My point is that it’s time to get past that fear, and use our new tools to at least let each other know which lawyers are the good ones.

Is Microsoft dead? Feh. 

Paul Graham posits that Microsoft is dead and the cause of death is:

1. Murder by Google.

2. Oh who cares, it’s all bullshit.

In fact, Microsoft is not dead, because (come on get real) it’s a company, and companies aren’t living, and they don’t die.

In 1983 I wanted to develop for the Mac and I had investors advising me, older guys who had been in the tech business probably about as long as I’ve been in it now. Everyone said that Apple was dead. They asked what Apple’s sales were. About a billion dollars. They said it was safe to develop for them, because billion dollar companies don’t go away. Same with Microsoft today.

What’s happening with MS is not death, but being pulled back to earth by gravity. It’s the cycle of tech companies, and it’s like the cycle of world powers. You have a vast natural resource to exploit, your population grows, the air gets clogged, the resource starts to run out and you’re left with a large population. You go from optimism and huge growth to reality and flat, even negative growth. It’s completely natural and predictable. It’s going to happen to Google too, bet on it.

BTW, Microsoft’s natural resource was people who don’t have personal computers. And that’s what they’re running out of now. So they have to sell people their fifth and sixth PC. They will. And they will suck. Like everything else does. And Microsoft will be a mediocre huge company, again like every other huge company.

Sorry, Graham has no clue about the cycles of technology. You should never fear the incumbent, any more than you fear the IRS. Keep your distance, unless you’re trying to be the next one, in which case good luck to you.

Emailing with Ole Eichorn about this (I think he used to work at Intuit) — he wonders if MS has become irrelevant. I volunteer that of course they are irrelevant. It’s been going on for a long time. My diatribe continues.

Geez, it’s as if he (Graham) discovered something new!

I would say MS jumped the shark right around the time of “write once run anywhere.”

They fought that. Oooops. Mistake.

They also tried to bury the web to protect Office.

Instead the web just routed around them.

Google took advantage. For a while.

BIG FUCKING DEAL.

PS: I could use some help with Apache htaccess files. 🙂

32 responses to this post.

  1. Try this, Dave:

    RedirectMatch /utah$ http://somesite.com/utah.html

  2. Would that work?

    RewriteEngine on
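    # match any request path that contains no dot, capture it as $1,
    # and redirect ([R]) to the same path with ".html" tacked on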
    RewriteRule ^([^.]+)$ /$1.html [R]

    (you need mod_rewrite activated)

  3. Matt, it didn’t work. What’s the theory?

    JY, not sure if mod_rewrite is on, but I don’t own the server, so I can’t do anything like that.

  4. JY, but I’m going to try that mumbo jumbo. Hope that’s literally what’s supposed to be in the file, because this is all nonsense to me! 🙂

  5. It redirects everything that looks like foo (with no . inside) to foo.html, so be careful if there are other files / apps in there.
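
    A sketch of what would and wouldn’t match, with made-up paths (assuming the rule above sits in the site root’s .htaccess):

    # /utah      -> redirected to /utah.html      (no dot in the path)
    # /style.css -> left alone                    (has a dot)
    # /cgi/app   -> redirected to /cgi/app.html   (careful!)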

  6. Or list them all, if you have a finite list you already know:

    RewriteEngine on
    RewriteRule ^foo$ /foo.html [R,L]
    RewriteRule ^bar$ /bar.html [R,L]

    But that’s equivalent to the RedirectMatch :-/

  7. Dave,

    assuming you have mod_rewrite installed (and I’d complain to your host if you don’t — there’s no reason they shouldn’t offer it), what you want to do should be pretty straightforward.

    Here are some mod_rewrite docs to get you up to speed on the theory:

    * mod_rewrite reference
    * mod_rewrite user’s guide

    In a nutshell, all you’re doing is defining regular expressions, and then giving instructions to Apache on how to modify any URLs that match the regexp. So JY’s example, for instance, says “take any URL without a dot in the name, and append ‘.html’ to it at the end.”

  8. This one works with apache2 here:

    RedirectMatch /utah$ /utah.html

    That’s weird if it does not work with your Apache.

  9. JY, your RewriteRule approach worked!!!

    Happy happy happy!

    🙂 🙂 🙂 🙂 🙂 🙂 🙂 🙂 🙂

  10. /utah$ would match URLs that end in /utah. Not sure why that didn’t work for you. JY’s use of mod_rewrite is a better solution, in my opinion. I just suck at it.

  11. JY, I tried the simpler method you said should work, and it did work here.

    http://cyber.law.harvard.edu/rss/.htaccess

    You can see what I’m doing, redirecting the old Manila-based URLs to ones that are more Apache-friendly. We haven’t actually flipped the switch yet for the static site, but we’re getting closer. 🙂

    Thanks soooo much, everyone, for the help.

    And if you have a minute or two to read the spec, and look for broken links, it would be much appreciated. Post any comments here.

  12. Posted by heavyboots on April 7, 2007 at 11:32 am

    Re: What’s the theory

    The Official Apache 1.3 URL Rewriting Guide:
    http://httpd.apache.org/docs/1.3/misc/rewriteguide.html

    There are some decent tutorials to get you started with regex here:
    http://www.regular-expressions.info/

    BareBones Software’s TextWrangler is free and also color-codes the expressions, which is nice for testing before you try to take a regex “live”.

    And I think a specific keyword-only replacement solution would be something like:
    RewriteEngine On
    RewriteRule ^/(utah|georgia|arizona)$ http://somesite.com/$1.html

    But I could be wrong; it’s been about 3 years since I had to write one! 🙂

  13. Posted by Chris Weekly on April 7, 2007 at 11:37 am

    Hi
    Some mod_rewrite (and regex) basics:

    RewriteRule directives say “match the left side against the request URI, and if there’s a match, rewrite the URI to whatever’s on the right side.”

    The reason your first attempt didn’t work is that “/utah” will match on “/utah”, “/utahfoo” and “/utah.html”. What you want (and what others above correctly noted) is to match on [beginning of URI] + /utah + [end of URI].
    The special characters you care about here are thus:
    ^ [match beginning of URI] and $ [match end of URI].
    so a good rule would be

    RewriteRule ^/utah$ /utah.html

    Note that if you don’t specify a fully-qualified URL in the rewrite target (the right side), it will default to the same protocol (http) and host (somedomain) as the original request.

    So
    RewriteRule ^/utah$ /utah.html
    does the exact same thing to http://www.somesite.com/utah
    as
    RewriteRule ^/utah$ http://www.somesite.com/utah.html

    — the difference being, a fully-qualified target will force a redirect — meaning a “302” HTTP response is sent which forces the browser to send another request. This increases latency for the user and server load for your web host (bad) but puts the new URL in the browser location bar, which may be what you want.

    Finally, there are some optional flags you can append at the end of your RewriteRule to modify the behavior.

    [R] will force a redirect
    [NC] will make the pattern match case-insensitive
    [L] will tell apache to send its response immediately and ignore any further rewrite rules later in your htaccess or httpd.conf config file.
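
    For instance, a hypothetical rule combining them:

    # redirect /utah (any capitalization) to /utah.html and stop processing
    RewriteRule ^/utah$ /utah.html [NC,R,L]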

    There’s lots more magic to mod_rewrite but I think that covers your case.

    It’s ok to email me w/ questions if this isn’t clear.

    HTH someone
    /Chris

  14. I have another question.

    I’ve now got all the redirects programmed for the pages that have names (the ones being transitioned). Now I’d like to program a redirect for all other pages to the About page, which explains the transition that took place.

    In other words, I want to have a local 404 page, one just for this folder on the server.

    Is there a way to do that??

  15. Take a look at this. And then put

    ErrorDocument 404 about.html

    in your .htaccess file.

  16. ErrorDocument 404 /about.html in the .htaccess

  17. Posted by Chris Weekly on April 7, 2007 at 12:04 pm

    If I understood you correctly, the requirement is: “After handling the specific URLs (e.g. /utah -> /utah.html, etc.) which have new pages set up, redirect all other URLs in a specific directory [what directory?] to an about page.” Is that right?

    If so, this should work:

    # Enable mod_rewrite directives
    RewriteEngine On

    # forward the updated pages to their new .html URLs,
    # redirecting with a “moved permanently” (301) response code:
    RewriteRule ^/(utah|georgia|arizona)$ /$1.html [R=301,L]

    # forward any pages without new .html versions to a custom message pg:
    # (but forward silently, internal to apache, w/out redirection)
    RewriteCond %{REQUEST_URI} !\.html$
    RewriteRule ^/.+ /about.html [L]

    This last rule (with the condition above it) says, “for any URL that does NOT end in .html, silently forward it to the /about.html page”.

    I didn’t test this last rule so it’s possible there’s a syntax error — if it doesn’t work let me know and I’ll fire up local apache and tweak it.

  18. Posted by Chris Weekly on April 7, 2007 at 12:09 pm

    Sorry, that last rule, if it’s only for a certain directory, would be:

    RewriteCond %{REQUEST_URI} !\.html$
    RewriteRule ^/somedirectory/.+ /about.html [L]

  19. Posted by Chris Weekly on April 7, 2007 at 12:12 pm

    JY is right and his approach is simpler (if you don’t mind adding per-directory .htaccess files).

    My apache config experience comes from editing httpd.conf directly (or config files included directly in it) and not from managing multiple .htaccess files in various directories, so I’m biased towards centralizing all the rules in one place… but in typical hosting situations that’s not an option so yeah, follow JY’s advice for the custom error pg for that directory. =)

  20. Thanks Chris, that’s incredibly helpful, and not just for me: there’s a real lack of good docs on the web about Apache, and I’ve had to hunt around for every little bit of knowledge (as I’m sure many others have). These threads are little gems when you find them, and I’ve tried to use all the right keywords in my description of the problem, so it shows up in searches for the people who come later. I now understand a hundred percent more about redirecting in Apache. Thanks again!!

  21. It’s amazing what you can do with the Apache redirect directives and rewrite engine, but Apache ends up opening an .htaccess file on every access, parsing the directives each time, and then applying each test to every request.

    The straightforward solution is not only a lot easier to understand, but it’s a lot less work for the server. Let the file system do what the file system is good at doing:

    mv utah utah.html
    mkdir utah
    ln -s ../utah.html utah/index.html

    That way, requests for http://somesite.com/utah and requests for http://somesite.com/utah.html are quietly, efficiently, and correctly served!

    You can set up a shell script or Perl script to handle this, if you have hundreds of filenames missing the .html suffix.
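
    Hypothetically, something like this (untested; the three names are just examples):

    for f in utah georgia arizona; do
        mv "$f" "$f.html"                    # the document gets the suffix
        mkdir "$f"                           # a directory takes over the old name
        ln -s "../$f.html" "$f/index.html"   # and its index points back at it
    done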

  22. Dave, the Apache option MultiViews makes files accessible without having to know the extensions. When you ask the server for /utah, it will try to find a file with that name plus any extension, so it will match utah.html.

    To enable:

    Options +MultiViews

    This adds MultiViews to the options already enabled.

  23. Paul, you’re assuming that the HTML file doesn’t have any relative URLs in it. mod_rewrite routes around this problem by separating URLs from physical paths.

  24. Posted by Alek on April 7, 2007 at 7:17 pm

    Easiest way:

    Options +MultiViews

    It will work for jpegs, gifs, php, html, everything.

  25. Posted by Jacob Levy on April 8, 2007 at 11:59 am

    An important addition:

    If you’re going to use mod_rewrite, you need to add a rule disposition clause at the end of every rule.

    There are a few possibilities:

    * [L] means this is the last rule and no other matches will be tried. Obviously this is very important for getting good performance on high traffic sites.
    * [C] means chain this rule to the next one, I think (case-insensitive is [NC])

    There are a few others…

    HTH

  26. Posted by Steve Wake on April 8, 2007 at 12:38 pm

    If you’re also going to be passing things through on the query string (e.g. ..&UserName=Fred), then don’t forget to put [QSA] on the end of your rewrite rules so they get passed through too.

    As for the possible load imposition of per directory .htaccess files, the same directives can instead be put into the main httpd.conf file for the Directory/Virtual Host which ensures the rules are only loaded when the server starts, rather than when a directory is accessed.

    This means you have to be able to edit the main httpd.conf, or in some cases you will have a config file you can edit which only affects your sites on the shared server, depending upon your hosting service.
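
    For instance, a hypothetical sketch with invented paths:

    <Directory "/var/www/somesite">
        RewriteEngine On
        # same effect as the .htaccess rule, but parsed once at startup
        RewriteRule ^utah$ /utah.html [R=301,L]
    </Directory>

    (Note there’s no leading slash in the pattern: in per-directory context Apache strips the directory prefix before matching.)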

    Cheers,
    Steve

  27. “As for the possible load imposition of per directory .htaccess files, the same directives can instead be put into the main httpd.conf file for the Directory/Virtual Host which ensures the rules are only loaded when the server starts, rather than when a directory is accessed.”

    Dave indicated that he didn’t own the server, so I assumed any httpd.conf solution was unavailable.

    “Paul, you’re assuming that the HTML file doesn’t have any relative URLs in it. mod_rewrite routes around this problem by separating URLs from physical paths.”

    If that’s an issue, using sed to add a BASE HREF to each of the files solves it without the overhead of .htaccess, much less the rewrite engine.
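
    Something like this, hypothetically (untested; assumes GNU sed and a literal <head> tag in each file):

    sed -i 's|<head>|<head><base href="http://somesite.com/">|' *.html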

    I don’t mean to imply that opening an .htaccess file is a BIG load. It isn’t. On the other hand every little bit adds up. Apache docs, for instance, recommend against opening files and parsing them for SSI directives unless there are actually SSI directives to be found there.

  28. It’s negligible, here’s some benchmark testing:

    http://simon.net.nz/articles/benchmarking-htaccess-performance/

  29. MultiViews is great… I use it a lot. (e.g. for coffee.gen.nz – /join is actually /join.php, but Apache figures it out).

    Another useful thing to do with mod_rewrite and RewriteRule is put in some exclusions. WordPress has this all worked out – the .htaccess file in the root of any WP install has some RewriteCond statements that exclude any files that actually exist from being rewritten. This helps prevent loops, and lets you write very general rules without fear of rewriting ‘too far’. 🙂
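
    For reference, the stock WP stanza looks something like this (from memory, so check an actual install):

    RewriteEngine On
    RewriteBase /
    # skip rewriting for any file or directory that actually exists
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]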

  30. Posted by Chris Weekly on April 11, 2007 at 6:51 pm

    Correction about mod_rewrite rule disposition flags: they are *not* required, and in some cases it’s very useful to omit them, which allows you to daisy-chain multiple rules. The way to do this is simply not to use any flags that would otherwise end rule processing. You can use this to create pseudo-variables, e.g.:

    RewriteRule ^/oldfoo$ /NEWFOO
    RewriteRule ^/oldbar$ /NEWFOO
    RewriteRule ^/oldbaz$ /NEWFOO
    RewriteRule ^/NEWFOO /finaltarget [R,L]

    that way if the target “/finaltarget” needs to change, you can change it in just one place.

  31. Posted by Chris Weekly on April 11, 2007 at 7:05 pm

    Another note about RewriteRule flags:

    [R] = “Redirect”. Defaults to a 302 response header, but you can override the response code via e.g. [R=301]. Fully-qualified rewriting targets like this one:
    RewriteRule ^/foo$ http://somedomain.com/bar
    issue a 302 redirect by default even if no [R] flag is appended, even if the target URL domain is the same as the original request. With local targets like:
    RewriteRule ^/foo$ /bar
    if no [R] is specified, apache will internally forward (sans redirect) the request.

    [L] = “Last”. By default apache will continue processing directives (thus potentially transforming the URL multiple times per multiple rules) until it reaches the end of the config file, OR until an [L] flag tells it to stop and simply return the URL in its current state.

    [NC] = “No Case”. Make the pattern match case-insensitive.

    [QSA] = “Query String Append”. IF the target URL has NO query string, the default is to append any query string from the original URL. If there is a query string in the target, by default it will REPLACE the entire original query string. If you wish to concatenate the original query string with the target’s, use [QSA]. (It gets the delimiters right automatically.) Remember, in case this didn’t sink in: your regexp patterns match against the base URI only, NOT against any query string. (You have to use a RewriteCond condition explicitly checking the query string to do that.)

    [NE] = “No Encoding”. By default during rewriting, mod_rewrite will URL-encode any special (non-URL-safe) chars it finds in the target.

    [PT] = “Pass Through”. If you’re using apache to front e.g. a JBoss app server (which uses the mod_jk plugin) and you rewrite a URL to a target URL that should be processed by a module other than mod_rewrite, you need to tell mod_rewrite to “pass through” the request to the other handler instead of trying to find a file on the webserver that matches the request. So e.g.
    RewriteRule ^/vanityURL$ /myStrutsAction.do
    would return a 404 if apache can’t find a webserver file named “/myStrutsAction.do” — but this rule will rewrite the URL and pass it to JBoss:
    RewriteRule ^/vanityURL$ /myStrutsAction.do [PT]

  32. Also a note about the special chars in rewrite directives:
    on the LEFT side (the pattern-matching side) you can group parts of your regexp by putting them in parentheses. This creates an apache variable named $1 for the first group, $2 for the second, etc. for substitution in the RIGHT side so e.g.:
    RewriteRule ^/(foo|bar)/baz/(.*) /newpath/$1/$2 [L]
    would rewrite
    /foo/baz/thisiseasy => /newpath/foo/thisiseasy
    and would rewrite
    /bar/baz/okimdonenow => /newpath/bar/okimdonenow

    HTH someone =)
    /Chris
