Thursday, 04. March 2010

Japanese TWIMPACT beta

[Figure: masason vs. 555hamako impact]

We are now running a beta site for TWIMPACT for Japan only. It only works with Japanese tweets and works quite well. What is interesting is the battle between a former politician, >>555hamako, and >>masason, the president of >>Softbank, a large telecommunications company in Japan.

First, >>masason started out in December 2009 with a quick rise to the top. Then >>555hamako followed at the beginning of 2010 (the year of elections) with an even steeper rise to take the crown. It also looks like masason has not managed to attract the same size of audience as before; his TWIMPACT stalls a little at the end. Maybe he was just watching the Winter Olympics.

I guess we will start adapting the >>TWIMPACT rating to degrade over time to provide a better view of a user's current impact. At the moment a user keeps their TWIMPACT, even though it is hard to keep on rising. This effect is much more visible on the >>global site, where you can find a lot of not-so-spammy-spam twitterers that rise quickly and should fall again over time after they hit the ceiling.
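A minimal sketch of what "degrading over time" could look like, assuming plain exponential decay with a half-life. This is not the TWIMPACT formula (which is not described here); the class name, the method and the 30-day half-life are made up for illustration.

```java
public class DecayedRating {
    /**
     * Returns the rating after exponential decay.
     * halfLifeDays is a hypothetical tuning parameter: after that many
     * days without new impact, the rating has dropped to half.
     */
    public static double decayed(double rating, double ageDays, double halfLifeDays) {
        return rating * Math.pow(0.5, ageDays / halfLifeDays);
    }

    public static void main(String[] args) {
        // A rating of 1000 earned 30 days ago, with an assumed 30-day half-life.
        System.out.println(decayed(1000.0, 30.0, 30.0)); // prints 500.0
    }
}
```

With a scheme like this, a user who stops being retweeted would slowly sink in the ranking instead of keeping the peak value forever.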


Thursday, 01. October 2009

NoSQL: MongoDB performance testing (part 2: counting)…

After my insert tests last time I decided to look at some count queries, as we count a lot at >>twimpact.com. As a first result I can say that without an index, count makes no sense with a database of this size.

I used the database left over from my last insert test and added a few indexes, which takes around 30-40 minutes per index. I did not measure the time in more detail, as we tend to create indexes while working on the database anyway.

Now for today's results. The queries are quite simple but, in our case, practical. I get a cursor for 1.000.000 documents as the result of a simple query and, for each document, count the number of documents that share its value for one property:

def cursor = db.find().limit(1000000)
// alternative: query one of the indexed properties
// def cursor = db.find(new BasicDBObject("property", new BasicDBObject("\$ne", null))).limit(1000000)

cursor.each { doc ->
  def value = doc.get("property")
  def count = db.getCount(new BasicDBObject("property", value))
}

[Figure: MongoDB query test]

The time was taken for each of the "db.getCount()" calls, and it turns out that around 40-50% of all queries take negligible time (< 1ms), which is the smallest time frame I can measure right now. This needs to be taken into account when evaluating the graphs, as they only show the queries with a duration of at least 1ms (log scale plot).
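Durations below 1ms could be captured with a nanosecond clock. The following is only a sketch of such a timing helper, not the code used for the measurements above; QueryTimer and Timed are made-up names, and the lambda stands in for a real count query.

```java
import java.util.function.Supplier;

public class QueryTimer {
    /** Holds a result together with the elapsed time in nanoseconds. */
    public static final class Timed<T> {
        public final T result;
        public final long nanos;

        Timed(T result, long nanos) {
            this.result = result;
            this.nanos = nanos;
        }

        /** Elapsed time converted to (fractional) milliseconds. */
        public double millis() {
            return nanos / 1_000_000.0;
        }
    }

    /** Runs the call once and measures it with the monotonic nanosecond clock. */
    public static <T> Timed<T> time(Supplier<T> call) {
        long start = System.nanoTime();
        T result = call.get();
        return new Timed<>(result, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in for a count query; a real test would call the database here.
        Timed<Long> t = time(() -> 42L);
        System.out.println("count=" + t.result + " took " + t.millis() + " ms");
    }
}
```

System.nanoTime() only promises nanosecond precision, not nanosecond accuracy, so very short queries would still be noisy, but they would no longer all collapse to zero.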

In the plot you see query time versus the result of getCount(). As expected, higher counts may take longer.

Some explanation is necessary for the plots. "random" means that I fetch some documents and count one of the properties (the same for all documents). I do not know the order in which the documents arrive, so they are unrelated to the property I am counting. "correlated" is the counting when I query the documents using an index and then count the property that was indexed. The assumption here was that it might be easier for the database to count all documents having a certain property value if I had previously queried all documents having a non-null property value.

This holds true for the long index but not for the string index. The latter behaves about the same as my random counts.

The results show that count queries are very fast, but only if indexed.

What we also need for >>twimpact.com are some more advanced queries. I assume that the results for those also depend on how we design our documents to fit our needs. The design will take some time, and I will get back with the results of the design and the advanced queries at a later date.


Friday, 25. September 2009

NoSQL: MongoDB performance testing (part 1: insert)…

The >>twimpact.com project currently uses a >>PostgreSQL database. This is all well and good, except that it does not scale too well in our environment. Removing some indexes actually improved performance, but I can foresee that the amount of data coming in will slow the application down again.

That is one reason I am looking at non-SQL alternatives. The list includes >>redis, the >>Cassandra Project and >>MongoDB.

I do admit I only looked briefly at redis, but that is because it is a very simple key/value store and we do need some query functionality. Playing with Cassandra and its Java driver was awkward, and in the end I had MongoDB up and running in no time.

The setup is as follows:

  • 4GB MacBook, 2.4GHz Intel Core 2 Duo, slow disk
  • MongoDB: mongodb-osx-x86_64-2009-09-19
  • (I had to work in parallel, so there might be some swapping)

Currently the database on a remote server has about 38.000.000 tweets stored. At the start of my testing it contained about 35.000.000. The procedure for the insert test was to copy over batches of 10.000 tweets, as the following pseudocode shows:

// initialize MongoDB (started with a complete new one for each test)
def db = new Mongo("twimpact")
DBCollection coll = db.getCollection("twimpact");
// coll.createIndex(new BasicDBObject("retweet_id", 1)) // long index
// coll.createIndex(new BasicDBObject("from_user", 1))  // short string index

def offset = 0
def limit = 10000
def rowCount = sql.count("tweets")

while (offset < rowCount) {
  // get batch of tweets from PostgreSQL server
  def data = sql.rows("SELECT * FROM tweets OFFSET ${offset} LIMIT ${limit}")
  // convert each row into a document and insert
  data.each { row ->
    BasicDBObject info = new BasicDBObject();
    row.each { key, value -> info.put(key, value); }
    coll.insert(info);
  }
  offset += data.size()
}

The time was taken for requesting the data from the SQL database (not shown in the graphs) and for the row loop. In the case of the bulk insert test, the row loop first stored 5000 new documents in a pre-allocated array and then inserted them:

…
  DBObject[] bulk = new DBObject[5000]
  … loop …
  // done twice per batch, as 10000 documents were too big for the driver
  coll.insert(bulk)
…
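
Splitting a batch into bulks that the driver accepts can be sketched as follows. This is a generic helper in plain Java and does not touch MongoDB at all; Batches and chunks are made-up names, and the integers stand in for the tweet documents.

```java
import java.util.ArrayList;
import java.util.List;

public class Batches {
    /** Splits items into consecutive sublists of at most chunkSize elements. */
    public static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            result.add(new ArrayList<>(items.subList(from, to)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> batch = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            batch.add(i);
        }
        // Each chunk would be handed to one bulk insert call.
        for (List<Integer> chunk : chunks(batch, 5000)) {
            System.out.println("would insert " + chunk.size() + " documents");
        }
    }
}
```

Keeping the chunk size a parameter makes it easy to try other bulk sizes once the limit of the driver is known more precisely.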

The documents we created were not that big, but their structure has some real-world importance to us. They might be changed to adapt to the schema-less world though. Here is a good example:

{
  "id": 3551935825,
  "user_id": 1657468,
  "retweet_id": 15965974,
  "from_user": "thinkberg",
  "from_user_id": 6190551,
  "to_user": null,
  "to_user_id": null,
  "text": "RT @Neurotechnology interesting post, RT @chris23 Augmented Reality Meets Brain-Computer Interface http://bit.ly/3fg9OG",
  "iso_language_code": "en",
  "source": "<a href=\"http://adium.im\" rel=\"nofollow\">Adium</a>",
  "created_at": "Wed Aug 26 2009 06:49:09 GMT+0200 (CEST)",
  "updated_at": "Wed Aug 26 2009 06:50:11 GMT+0200 (CEST)",
  "version": 0,
  "retweet_user_id": null
}

And now for the results. Just as expected, performance degrades as soon as the database reaches a certain size. MongoDB took about 2.8GB of my RAM and had to create new data files during the process.

[Figure: mongo.stat.small]

The first insert test did not create or update any index, so performance is sustained over the whole time. There are noticeable dips, which probably happened whenever I unlocked the laptop or switched from one application to another.

Looking at the insert with a number (long) index, it appears that the performance degrades slightly and stabilizes shortly after about 20.000.000 inserts. I guess this might be the point where the shortage of RAM comes into play, as you can see similar behavior in the string and bulk/string index tests.

Bulk inserting gave a dramatic performance boost. Unfortunately I had to insert each batch in two bulks of 5.000 tweets each, as the driver reported that the object was "too big" when using an array of 10.000 tweets. While single inserts stabilize around 1000 tweets/s at the end, the bulk insert still reached about 1500-2000 tweets/s.

Looking at where the insert performance started and where it ended might lead you to conclude that this is going to be slow, but given my experience with a much smaller PostgreSQL database (~4.000.000 tweets) on this laptop, I am impressed. Being able to insert around 1000 tweets/s is way faster than what we experience with the current system at >>twimpact.com, where we accumulate an analyzer backlog. Given that this test was performed on my laptop and not on a production system, I expect reality to look much better :-)

But inserting is not everything, even though it is what we do a lot. Next I am going to take the database and do some query testing to see whether it fits our needs.


Wednesday, 29. July 2009

twimpact.com - trends by citation

It feels good to code a little again. Again, social software but this time from the analysis point of view. Check out >>twimpact.com to see the trends of the last hour bubble up.

All done in >>grails, which I love.


Friday, 01. May 2009

Re-use replaced backup harddisks

Now you have a RAID system. It runs perfectly, but it also fills up, as all storage does over time. You buy new 1.5TB hard disks to replace the old 500GB ones. Now what do you do with those old ones? They are still perfectly healthy disks.

Well, you buy an >>external SATA dock!

Then you can do off-RAID backups to the disks. Those disks will probably last longer than your DVD backups.


Saturday, 25. April 2009

The next Backup iteration

Finally I have a backup strategy for my server too. It is not perfect, but it works for me. I even added backup of some data from my home RAID system to it, and vice versa. The data is backed up to two different locations (>>rsync.net and >>Amazon S3) and additionally to the RAID. Some data, like photos, is transferred from the RAID to the server and from there to Amazon S3. All laptops back up to the RAID; that is too much data to store at either offsite location price-wise.

All data transfer is encrypted. The data files are encrypted at both offsite backup locations but not on the RAID, for easy access.

[Figure: backup]


Friday, 27. March 2009

Twitter - what?

In contrast to my last post, I am using twitter now, more for telling the world what we do than what I personally do in my leisure time. It is the only valid way for me: giving an idea of what is happening in research.

Monday, 23. March 2009

New Job - Industry Liaison Manager

I have changed jobs and moved away from the >>Fraunhofer Society to take a post as >>Industry Liaison Manager for a >>Machine Learning and >>Neurotechnology Research group at the Berlin Institute of Technology.

My main focus now will be to manage our industry relations, organize talks and seminars and work on technology transfer. The research project works on >>non-invasive neurotechnology to improve sensors and data analysis and to apply the results in neuro-usability and other applications related to man-machine interaction.

This is going to be a challenging and most interesting job.


Friday, 23. January 2009

Amazon S3 / WebDAV proxy updated

I took the liberty of checking out my old code and working on it to finally fix some of the problems. It now correctly uses the last-modified time, and cache handling as well as lazy download from S3 are implemented. To really work with the server it will need better cache handling. After many tests, the basic and copymove tests finally run through repeatedly without failure.

Still a long way to go.

Update (2009-01-28): In the meantime I implemented the property handling, which only fails for some strange UTF-8 property values. Now the litmus test runs 99% through. Testing with the MacOS X Finder looks promising.


Sunday, 18. January 2009

twitter: the public chat

I have been following a few friends' >>twitter messages via >>Google Reader and I get the impression that it works much like a group chat system. The conversations are similar to cross-linked comments in weblogs and have a similar publicity.

Unlike these friends I never really started to use twitter and even deleted my account there, as well as in a few other social networking systems. I give away so much already, so I don't want to make the harvesting too easy. What strikes me, though, is why a service like twitter has taken the public chat room away from classic instant messaging systems. It works much like >>IRC (Internet Relay Chat), where you can just join an open chat. However, it looks crude that you have to read the others' chat to actually communicate.

I guess the real advantage of twitter is the simple user interfaces on loads of different systems, which heavyweight instant messaging systems have failed to provide until now.


Thursday, 08. January 2009

Amazon EU

Well, it is one big continent (plus the little island). I ordered at >>amazon.co.uk, and when my package arrived by "Deutsche Post" it turned out it was even sent from Bad Hersfeld, Germany. Actually, Amazon should get real and drop the delivery fee from the UK to Germany. It seems to be a penalty for ordering in their UK shop.

Saturday, 20. December 2008

Nothing to read …

I don't know why. There are about a thousand books in my little library, but I cannot find one to read.

Wednesday, 29. October 2008

Logitech Harmony Support: Excellent

I have had the best telephone support experience ever. To make it easier for my family to operate all the gadgets crammed underneath our TV I have a Logitech Harmony Universal Remote (>>model 885). This device works quite well, unless ..., unless you leave it uncharged for about one year. Then the battery is gone and reviving it is almost impossible.

Anyway, when I came back from Korea I called the free hotline and even though I did not expect it, they immediately tested the device online and filed an exchange for me. That was the first good thing.

Now, last week I bought a >>Dreambox, which is a nice little TV receiver with a built-in disk. It took me a few hours to get my smartcard running, as German cable TV is encrypted. The Harmony remote works okay if you use the default device settings found in the database. However, some keys react slowly and some seem to emit double signals, so it always skips. Reading some forums, I found that I should get a copy of a special Dreambox profile into my user account.

Calling the hotline again, I had a helpful and very friendly person on the phone within less than 30 seconds and after presenting serial numbers was escalated to second level support in Canada where another friendly technician copied the profile into my account in no time.

That is service! No fuss, friendliness and, last but not least, speed.


Sunday, 28. September 2008

The End and The Beginning

In three days my assignment to Korea ends. It was a good time, a stressful time, and we met new friends. I am grateful.

In three days I will be back in Germany. It will be a good time; great changes show their signs on the horizon. I am looking forward to it.


Monday, 15. September 2008

The quiet Tokyo



This is a map of all Buddhist temples I have visited during my short holiday trip to Tokyo. Out of the 22 temples I visited I can prove 16 through my pilgrimage book, stamped and nicely signed by the priests. At first I chose the temples randomly but then decided to read a bit more. Here is a nice >>explanation of Buddhism in Japan. From >>this site I then selected temples by pilgrimage to give my walks more of a purpose. However, since the >>Six Amida Pilgrimage (see bottom) can take a while on foot, I decided on a bicycle. You can rent one just outside Kamakura station.

The best trip though was my trip to the Izu peninsula. Here you can either just visit the Shuzenji onsen or visit the similarly named temple and then hike along the Hiragana path to Okonuin temple. To get there I took the Shinkansen to Mishima and then the local railroad to Shuzenji. From there it only takes ten minutes to the temple by bus.

It is an excellent way to experience this city and its surroundings.



Saturday, 19. July 2008

New Technology, Old Technology

Modern technology is a funny thing somehow. It promises to make complicated things simpler. But lately I feel like an engineer a hundred years ago: when something did not work, you added a bit more oil, gave it a proper kick, and then it usually worked.

I have been through this a few times now. For one, Outlook flatly refused to work with my IMAP server, and to this day I do not know why. A few virtual kicks, some more grease in the configuration, and suddenly it worked. I just had the same thing with a WLAN configuration, although the setup felt more like a crank start: a few turns and off it went. Here again it is the same: in theory I know how it works, and in practice I have no idea why it only worked after reconfiguring and resetting it x times.

Well, don't panic: a little nudge and everything works :-)


Saturday, 05. July 2008

Secure Online Banking?

I think I wrote about this before, but it annoys me every week. Internet banking here in Korea is only possible using a PC with Microsoft Windows. Not only that, it is only possible using Internet Explorer. And that is not all: I can only do it by installing ActiveX plugins that employ >>trojan horse like technology to protect me from other trojans and >>key loggers. Actually, I had to install about 3-4 different plugins before I even saw the login, which is itself a plugin that manages the certificates.

Fortunately I have a Parallels virtual machine to protect me, but it does not protect my online account. I wonder why on earth it is only here in Korea that I have to do this. I have had accounts in different parts of the world and I was always able to use standard web browsers of different kinds to do the online banking.


Sunday, 08. June 2008

i600 = M6200

Looks like the Samsung Blackjack sold in Korea is identical in most parts to the European version, the i600. When I tried to update the phone using Samsung's MITs Upgrade Wizard, the process stopped at 89% for some reason and I was left with a non-functional phone. Fortunately there are lots of adventurous people around who write about their experience flashing phones. While trying to get it back I decided to give it a try and flash WM6.

First I had to find the USB flash mode, which can be enabled by pressing and holding the green "Receiver" button and the power button. This is quite different from what you find on the net elsewhere. But then it all works as >>advertised.

Important, though, is to run the MITs wizard first in emergency mode to get the original flash files, just in case. Now the phone has a working WM6 with all its pros and cons. One drawback, however, is that the phone's button layout changes. In the European version, the number buttons are located in the middle and not on the left side of the keyboard. However, that is something I can live with.


Monday, 31. March 2008

The Random Pick

How do I get music like >>Teranoid? Whenever I am in Japan, I enter one of the big book, video and music stores and look randomly through the shelves. I usually end up in the "Hardcore" section, where you find lots of fun stuff to listen to. If you cannot read what you're about to buy, this is the way to go for me :-)

Rauschkapsel

After months without a car I started driving in Seoul at the end of last year. The traffic is terrible, just as most drivers experience it. When I enter the Gangbyeon Expressway near Hannam-dong I usually put myself into a sound capsule. Right now this is >>Teranoid Overground Edition. Some Japanese techno stuff that simply drives you through.

I tried The Prodigy, but it does not work as well.


    thinkberg
    subconscious opinions
    Copyright © 2005-2008 Matthias L. Jugel | SnipSnap 1.0b3-uttoxeter