In December 2011, the social network EPOS4 announced that it would shut down its services on January 1, 2012. At the time, I was very interested in web crawling, so I decided to try crawling some data from this community.
In three weeks, I developed three working web crawlers:
- Status-Message Crawler
- Profile-Statistics Crawler
- Profile-Image Crawler
I used NodeJS with node-htmlparser and node-soupselect to parse the response data, and DateJS to parse badly formatted date values. It was very helpful that EPOS4 gave its users ascending numbers as IDs and numbered the pages of the status-message overview the same way. After parsing the data, I saved it to a MongoDB database with node-mongodb-native. Many users on EPOS4 didn't have a profile image, so I checked whether each image was the default image before saving it to my hard drive.
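The two ideas above (walking ascending IDs and skipping the default image) can be sketched roughly as follows. This is a minimal sketch and not the original crawler code: the URL pattern is a placeholder, the function names are made up for illustration, and comparing images by SHA-1 hash is an assumption about how such a check could be done.

```javascript
const crypto = require("crypto");

// Hypothetical URL pattern; the real EPOS4 URLs are not given in this post.
const PROFILE_URL = "http://example.invalid/profile/";

// Build the list of URLs to crawl from a range of ascending numeric IDs.
function buildUrls(baseUrl, firstId, lastId) {
  const urls = [];
  for (let id = firstId; id <= lastId; id++) {
    urls.push(baseUrl + id);
  }
  return urls;
}

// Hash the raw image bytes so two images can be compared cheaply.
function hashImage(buffer) {
  return crypto.createHash("sha1").update(buffer).digest("hex");
}

// True if the downloaded image is byte-identical to the known default image.
// Assumption: the default image is always served as the exact same file.
function isDefaultImage(buffer, defaultHash) {
  return hashImage(buffer) === defaultHash;
}
```

With something like this, `buildUrls(PROFILE_URL, 1, 50000)` would give the crawl queue, and each downloaded image would only be written to disk when `isDefaultImage` returns `false`.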
After many hours of crawling, I had about 150 MB of status messages (~650k), about 25 MB of profile statistics (~50k) and a lot of profile images.
Two months later, I built a small interface with NodeJS, Socket.IO and jQuery Templates to visualize the data from the MongoDB. It works very well, and the guys from Uberspace helped me set it up on their servers.
This prototype is not public because of EPOS4's copyright, so I can only provide a screenshot:
I also ran some analytics on the status messages, e.g. which words were used most often. For this, I built a MongoDB-JSON to “real JSON” converter and an analytics script, both with NodeJS. More about the script in this blog entry.
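To give an idea of what such a converter and word count could look like, here is a small sketch. It is not the script from the blog entry. It assumes that "MongoDB-JSON" means a line-delimited export (one document per line) with `$oid` and `$date` wrapper objects, and all function names are made up for illustration.

```javascript
// Recursively unwrap {"$oid": ...} and {"$date": ...} wrapper objects.
function unwrap(value) {
  if (Array.isArray(value)) return value.map(unwrap);
  if (value !== null && typeof value === "object") {
    if ("$oid" in value) return value.$oid;
    if ("$date" in value) return value.$date;
    const out = {};
    for (const key of Object.keys(value)) out[key] = unwrap(value[key]);
    return out;
  }
  return value;
}

// Convert line-delimited export text into one array of plain objects.
function toPlainJson(exportText) {
  return exportText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => unwrap(JSON.parse(line)));
}

// Count how often each word occurs across all messages (case-insensitive).
function wordFrequencies(messages) {
  const counts = new Map();
  for (const message of messages) {
    // Split on anything that is not a Unicode letter or digit.
    const words = message.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    for (const word of words) counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Return the n most frequent words as [word, count] pairs.
function topWords(messages, n) {
  return [...wordFrequencies(messages).entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}
```

The converted documents could then be mapped to their message text (the field name depends on how the crawler stored them) and passed to `topWords`.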
