Google has a realtime feed of Twitter

Posted 2009-07-04 in WWW by Johann.

I’ve just discovered that Google has a realtime feed of Twitter.

I twittered this:

Testing something highly interesting http://invx.com/a

…and seconds later, Googlebot showed up to grab the URL at invx.com:

66.249.65.83 invx.com [04/Jul/2009:21:49:39 +0200] "GET /a HTTP/1.1" 404 136 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"

You don’t need to check: the IP address does belong to Google. I had twittered another URL before, and that one was crawled, too, so it’s not simply that Google crawls the public timeline.
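
If you want to verify a crawler IP yourself, a reverse DNS lookup should return a googlebot.com host name, and a forward lookup of that name should return the original IP. A quick sketch (the crawl-… host name is the pattern Google’s crawlers typically use, not captured output):

$ host 66.249.65.83
$ host crawl-66-249-65-83.googlebot.com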

Some of my thoughts:

  • One more sign that “nofollow” doesn’t mean anything.
  • I wonder if Google pays Twitter for the feed.
  • Maybe twittering URLs can even get you indexed?


Google Analytics Search Engine Extension

Posted 2009-06-23 in JavaScript by Johann.

Google Analytics comes with its own list of search engines.

Unfortunately, this list is missing many smaller search engines, as well as local versions of certain search engines.

By default, Google Analytics’ JavaScript supports the following search engines: Daum, Eniro, Naver, images.google.com, Google, Yahoo, MSN, Bing, AOL, Lycos, Ask, Altavista, Netscape, CNN, About.com, Mamma, Alltheweb, Voila, Virgilio, Live, Baidu, Alice, Yandex, Najdi, Mama, Seznam, Search.com, Szukaj, Onet, Szukacz, Yam, PCHome, Kvasir, Sesam, Ozu, Terra, MyNet, Ekolay and Rambler.

The following JavaScript adds support for 70 additional sites to Google Analytics, among them caching domains (bingj.com), social web search engines (IceRocket, Twitter, Technorati) and many ISP portals.

// Extends Google Analytics with a ton of smaller search engines for more accurate search tracking.
// Written by Johann Burkard mailto:johann@johannburkard.de <https://johannburkard.de>
// MIT license.

// The engine list is "|"-separated. Each group consists of one or more referrer
// domains separated by "," followed by "+" and the name of the search query
// parameter, e.g. "soso+w" or "baidu,niuhu+word".
for (var e = "abcsok,alot,aolsvc,auone,babylon,bigseekpro,bingj,blingo,blogsearch.google,charter,comcast,conduit,cuil,com.com,daemon-search,dir.mobi,duckduckgo,earthlink,friendfeed,home.nl,icerocket,icq,incredimail,info.co.uk,mail.ru,metager2,mycricket,mytelus,nifty,ninemsn,oneview,orange.co,peoplepc,reddit,reference,semager,startpagina,sweetim,t-online,twitter,uluble,vinden,virginmedia,wibeez+q|ixquick,netzero,qip.ru,tiscali+query|att.net+string|orange.es+buscar|aliceadsl,rr.com+qs|alicesuche,gougou,technorati+search|avantfind+Keywords|mister-wong+keywords|delicious+p|info.com+qkw|sky+term|myspace+qry|gmx,web.de+su|soso+w|ukr.net+search_query|opendns+url|baidu,niuhu+bs|baidu,niuhu+word".split("|"), l = e.length; l--;)
    for (var p = e[l].split(/[,+]/), k = p.length - 1, m = k; m--;)
        pageTracker._addOrganic(p[m], p[k]);
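
Each _addOrganic(domain, parameter) call registers one more referrer domain together with its search query parameter, so visits coming from these sites show up as organic search traffic with keywords instead of plain referrals.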

To use the Google Analytics search engine extension, insert the loop after the call to _gat._getTracker and before pageTracker._trackPageview, as in the following snippet:

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
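// "XXX" is a placeholder for your Google Analytics account ID.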
var pageTracker = _gat._getTracker("XXX");

for(var e="abcsok,alot,aolsvc,auone,babylon,bigseekpro,bingj,blingo,blogsearch.google,charter,comcast,conduit,cuil,com.com,daemon-search,dir.mobi,duckduckgo,earthlink,friendfeed,home.nl,icerocket,icq,incredimail,info.co.uk,mail.ru,metager2,mycricket,mytelus,nifty,ninemsn,oneview,orange.co,peoplepc,reddit,reference,semager,startpagina,sweetim,t-online,twitter,uluble,vinden,virginmedia,wibeez+q|ixquick,netzero,qip.ru,tiscali+query|att.net+string|orange.es+buscar|aliceadsl,rr.com+qs|alicesuche,gougou,technorati+search|avantfind+Keywords|mister-wong+keywords|delicious+p|info.com+qkw|sky+term|myspace+qry|gmx,web.de+su|soso+w|ukr.net+search_query|opendns+url|baidu,niuhu+bs|baidu,niuhu+word".split("|"), l=e.length;l--;)for(var p=e[l].split(/[,+]/),k=p.length-1,m=k;m--;)pageTracker._addOrganic(p[m],p[k]);

pageTracker._trackPageview();
} catch(err) {}</script>

If you are serious about keyword statistics and data mining, give it a try. Please email me if you have additions or comments.

lighttpd In-Memory gzip Compression

Posted 2009-04-20 in WWW by Johann.

gzip compression improves website performance by decreasing the size of the files a server transfers. With lighttpd, the cost of compressing can be reduced further by caching the compressed files in an in-memory file system.

Web Application Performance Tuning Targets

There are three goals when tuning the performance of websites:

  1. Making fewer requests. Requests can be saved by inlining CSS and by combining images into CSS sprites.
  2. Avoiding duplicate requests. This is what caching headers are used for.
  3. Reducing the amount of data, either by removing whitespace and unnecessary code or by compressing the data.

Configuring lighttpd

lighttpd supports transparent gzip compression out of the box with the mod_compress module. Enable it in /etc/lighttpd/lighttpd.conf like this:

server.modules = (
    "mod_accesslog",
    "mod_access",
    "mod_redirect",
    "mod_rewrite",
    "mod_evhost",
    "mod_proxy",
    "mod_compress",
    "mod_expire"
)

The next step is to set up a directory in an in-memory file system that caches the compressed versions of the files. /dev/shm is a tmpfs on most Linux distributions, so anything below it lives in memory.

# mkdir -p /dev/shm/lighttpd/compress
# chown -R www-data:www-data /dev/shm/lighttpd

Configuring mod_compress consists of specifying the cache directory and the MIME types of the files to compress. In my example, I’m compressing plain text files, static web pages, CSS style sheets and JavaScript files.

compress.cache-dir = "/dev/shm/lighttpd/compress"
compress.filetype  = ( "text/plain", "text/html", "text/css", "text/javascript" )

After restarting lighttpd, gzip compression is active. Of course, only user agents that ask for gzip-compressed content using an Accept-Encoding HTTP header will be served the compressed files.
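
A quick way to check that compression works is to request a page with an Accept-Encoding header and look for Content-Encoding: gzip in the response headers (the URL is a placeholder, use one of your own pages):

$ curl -s -D - -o /dev/null -H "Accept-Encoding: gzip" http://www.example.com/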

To make sure the in-memory directory exists after reboot, add the commands above to /etc/rc.local.
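
On Debian-style systems, for example, you could add the two commands to /etc/rc.local before the final exit 0:

mkdir -p /dev/shm/lighttpd/compress
chown -R www-data:www-data /dev/shm/lighttpd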

Real-Life Examples

Here are two real-life examples of how much data can be saved using gzip compression:

nebel.org

NEBEL consists of very few elements.

  1. /index.html was compressed from 15.6 KB to 2.6 KB.
  2. https://johannburkard.de/resources/css/r5.css was compressed from 8.3 KB to 2.4 KB.

The total bandwidth saving is 18.9 KB.


Garbage Collector Pitfalls

Posted 2009-03-04 in Java by Johann.

Image: Garbage by Editor B, some rights reserved.

If your Java VM crashes, your application might be leaking memory, or the memory settings might simply not be optimal.

Whatever the cause, the following tips will help you find and fix the problem quickly.

Don’t increase Maximum Memory

Many developers’ first reaction when the Java VM isn’t stable is to increase the maximum memory with the -Xmx switch. The problems with this are:

  • Garbage collections take longer since more memory needs to be cleaned, leading to lower application performance.
  • Memory leaks take longer to crash the VM: with a larger Java heap, it takes longer until a leak brings the VM down, which makes testing more difficult.

Enable GC logging

Keeping an eye on what the garbage collector is doing is important. In my experience, the runtime overhead is small, so I have logging enabled at all times. If logging is not enabled and a crash happens, it might take a long time to reproduce it.

For the Sun VM, add the following switches:

-Xloggc:<file> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
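
Put together, a launch command might look like this (app.jar is a hypothetical application):

java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar app.jar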

You can analyze the garbage collector log file using GCViewer or a variety of other tools.

For the IBM VM, add the following switches:

-Xverbosegclog:<file>

The output of the IBM VM is written in an XML format. Pay attention to the nursery and tenured elements and their freebytes attribute. A complete description is available in the Diagnosis documentation.

Adapt the ratio of “New” to “Old” space

Without diving too deep into the Java memory model, let me just say there is a “new” space for newly created objects and an “old” space for objects that have been used for a while.

Web applications usually create many objects that are only used for a short time during a request. If the ratio of “new” to “old” space is too low, many garbage collections will occur, decreasing application performance.

An easy way to test this is to watch the garbage collector log file (tail -F <log file>) and click a button in your application. If you see more than one garbage collection per click, there might not be enough “new” space. In this case, increasing the ratio of “new” to “old” might help a lot.

For the Sun VM, the switch is:

-XX:NewRatio=<ratio>

I have this set on this server to -XX:NewRatio=2, which makes “old” twice the size of “new”: “new” gets one third of the total Java heap. With a 900 MB heap, for example, “old” would get 600 MB and “new” 300 MB.

For the IBM VM, the switches are:

-Xmnx<maximum new space size> -Xmx<maximum Java heap size>

Have you had problems with garbage collection? Post a comment about how you solved them!

