Tuesday, February 28, 2012

Gantter - Free Gantt Charting Tool

So I had to build a Gantt chart to plan the orchestration of a monthly data load. I scoured the web for tools and found a couple of options.

One is to use the "Gantt Chart Gadget" inside Google Spreadsheets (Google Docs). The other is to use a cloud-based tool built specifically for Gantt charts. There are several out there, and I found Gantter to be the best free tool that gets the job done.

The Google gadget option was too limiting. For example, you'd have to re-write all predecessors for your tasks when you insert a new task in the middle of the task list. Also, the updates to the Gantt chart aren't real-time. Overall not the best solution.

Gantter, on the other hand, worked really well and also gave me an option to export to PDF. I'm really happy with it. It also supports WBS (Work Breakdown Structure). Really liking this free tool.


Talend tReplaceList is order insensitive

For example, if you want to replace cat with hat and then later replace all hat with mat, tReplaceList does not guarantee that the replacements happen in that order. The reason is that it iterates over a Java HashMap, and a HashMap does not return its keys in the order in which they were inserted.

To work around this problem you have to write your own implementation of tReplaceList using an ordered map (for example a LinkedHashMap, which preserves insertion order).
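
Here is a minimal sketch of the idea in plain Java, not the Talend component itself. The class and method names (OrderedReplace, replaceInOrder) are made up for illustration; the only thing it relies on is that LinkedHashMap iterates in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OrderedReplace {

    // Applies each search/replace pair in the order the map iterates them.
    // With a LinkedHashMap that is the order in which the pairs were inserted.
    static String replaceInOrder(String input, Map<String, String> rules) {
        String result = input;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // String.replace performs a literal (non-regex) substitution.
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("cat", "hat"); // applied first
        rules.put("hat", "mat"); // applied second
        System.out.println(replaceInOrder("the cat sat", rules)); // prints "the mat sat"
    }
}
```

Swapping the LinkedHashMap for a plain HashMap brings back the original problem, since the iteration order is then unspecified.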

Monday, December 5, 2011

Machine Learning Stanford Class

A few months back my friend Vasanth asked me if I'd like to enroll in the machine learning class offered by Stanford University (http://www.ml-class.org/). I'm in my 9th week into it now and it's been a great experience so far. I was always interested in machine learning and never had a chance to take a course in Data Mining.

This course explains the core concepts of machine learning. Andrew Ng, the instructor, is absolutely phenomenal. I was impressed with his teaching style and how effective it was.

Some of the topics we covered in the class are:
1. Linear Regression
2. Logistic Regression
3. Regularization
4. Neural Networks
5. Support Vector Machines
6. Clustering
7. Dimensionality Reduction
8. Anomaly Detection

The exercises took 6-8 hours each week. Most of the exercises are in Octave and the code is heavily commented to help understand the core concepts. Overall it was an amazing class and a great introduction to machine learning.

So I was keen to know how often statisticians and machine learning practitioners take these core algorithms and tweak them all the way down at the matrix level. I spoke to a friend who was working at an internet startup using R for customer/market segmentation, and he said he never went down to the matrix level. Seems like R is pretty sophisticated in abstracting the stuff under the hood. Maybe one day I'll get to try it.


Thursday, October 6, 2011

Been a while

It's been a while. A long time. Four months to be precise. I haven't written a single blog entry! As some of you may already know, I moved to San Francisco. I took an offer from Demandbase, a B2B marketing company. I'm working as a Data Engineer. I'm getting my hands dirty with Linux (shell scripting is always fun!) and Ruby.

I've been using Ruby for a little over a week now and it is fantabulous! Very easy to learn, and the code is concise and to the point. Look out for more posts on Ruby and data handling using Ruby.


Wednesday, June 8, 2011

Integrating Unstructured Data in a Data Warehouse

I've been meaning to write about this topic for a while. Here's a succinct excerpt of my thoughts.


Traditional data warehouses are generally relational and fed by back-end systems that contain structured data. More often than not, this data is generated by internal source systems or arrives as external data from partners. But what about the web? There's tons of data out there, possibly about your company, a competitor or a business trend in your industry, and the list goes on. As we share more data on the web, this list is expanding every day. The traditional warehouse is not designed to handle such unstructured data, which limits the reach of your decision support system.

Search engines are good at handling both structured and unstructured information in various formats (e.g. database tables, XML, PDF, DOC, etc.). Case in point - We helped a client index almost 75GB of unstructured data stored in .pdf, .doc and text files going as far back as 70 years. The oldest files were scanned PDFs which were later OCRed. This is massively helpful not just from a pure enterprise search standpoint but also in terms of opening up the data to other parts of the organization in an easily accessible fashion. So how does this tie to a warehouse? Well, the search index in itself is a warehouse.


So how to access it? - With your existing BI systems. If you tweak your BI tool, you can make REST-based HTTP calls to a web server. In goes your query and within a second out comes your data. Search engines are inherently fast! You can use this data for discovering relationships you never thought existed. Obviously this works better with certain types of data than others. I foresee a large interest in this area in the future as enterprises explore more potentially crawlable, publicly available data sources.
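
To make the "in goes your query, out comes your data" part concrete, here is a rough sketch in Java. It assumes a Solr-style search server that answers GET requests on a /select endpoint and returns JSON when asked with wt=json; the base URL, the query and the class name SearchQueryUrl are all placeholders, not something from the project described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchQueryUrl {

    // Builds the request URL; the query text must be URL-encoded.
    static String build(String baseUrl, String query) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8") + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Sends the GET request and returns the raw response body as text.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical server and query, for illustration only.
        String url = build("http://localhost:8080/solr", "type:document AND name:Acme");
        System.out.println(fetch(url));
    }
}
```

A BI tool would do the equivalent of fetch() and then parse the JSON into rows.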







Tuesday, May 24, 2011

Using Audit Logs for Data Integration

Recently we integrated two applications with Solr in near real-time: Siebel and Documentum. We achieved this by monitoring the audit logs of the applications. Audit logs work great for data integration where you have to push a change in a business object to another system. Audit logs for mature applications like Siebel and Documentum can be configured to write update/delete events. Usually, the integration reads from the audit log table (without locking, to prevent performance issues) to sniff out interesting events/changes to business objects.

Risks

  • The auditing mechanism must be bug-free and consistently record events - which is mostly the case with mature COTS software systems.
  • The audit log is generated by the application. So every time you need to test your integration solution, you have to make changes in the application to see if they come through. Sometimes developers don't have access to the front-end apps, which may cause problems.
  • Also, thorough testing is required from the application front-end perspective. Care must be taken to capture all the event log signatures that a particular business action in the application can generate.
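
The polling idea can be sketched roughly as below. This is not the Siebel or Documentum integration itself: the AuditEvent fields, the action names and the in-memory list standing in for the audit log table are all assumptions made for illustration. The point is the high-water mark, which makes sure each event is picked up only once.

```java
import java.util.ArrayList;
import java.util.List;

class AuditLogPoller {

    // One row of a hypothetical audit log table.
    static final class AuditEvent {
        final long id;
        final String objectType;
        final String action;

        AuditEvent(long id, String objectType, String action) {
            this.id = id;
            this.objectType = objectType;
            this.action = action;
        }
    }

    // Highest audit row id processed so far (the high-water mark).
    private long lastSeenId = 0;

    // Returns update/delete events newer than the high-water mark,
    // then advances the mark past every row that was scanned.
    List<AuditEvent> poll(List<AuditEvent> auditLog) {
        List<AuditEvent> fresh = new ArrayList<>();
        long maxId = lastSeenId;
        for (AuditEvent event : auditLog) {
            if (event.id <= lastSeenId) {
                continue; // already handled in an earlier poll
            }
            maxId = Math.max(maxId, event.id);
            if (event.action.equals("UPDATE") || event.action.equals("DELETE")) {
                fresh.add(event);
            }
        }
        lastSeenId = maxId;
        return fresh;
    }
}
```

In a real setup the list would be a query against the audit table filtered by id, and each returned event would be turned into an update or delete against the target system.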

Solr Java Service Wrapper restarts after commit

We noticed our Solr instance was restarting every few days after optimizing. We use the Java Service Wrapper from Tanuki Software to run our instance as a service. The wrapper has a property for Ping Timeout. Basically, the wrapper pings the JVM at regular intervals to see if it is still alive and waits up to 30 seconds (the default) for a response. If it does not get a response from the JVM in that time, the wrapper assumes the JVM is hung and restarts it.

Solr optimization is a resource-intensive process. I've noticed it using 100% of the CPU for over a minute, which left the JVM unable to answer the ping in time and caused the wrapper to restart it. To work around this issue, I increased the wrapper ping timeout to 120 seconds. This is a quick fix, but it will not work once the index gets large enough that it takes more than a couple of minutes to optimize. In that case you are better off looking at other strategies to speed up optimization altogether.
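
For reference, the change is a single line in the wrapper's configuration file. The snippet below assumes the standard wrapper.conf layout; the value is in seconds.

```
# wrapper.conf (Java Service Wrapper)
# Seconds the wrapper waits for a ping response before treating the JVM as hung.
wrapper.ping.timeout=120
```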

Link Extraction From Tweets using Java Function

One of the common problems in analyzing tweets is extracting the links / URLs out of them. You can do all sorts of analytics on the links, such as determining the most popular links for a given day. Extraction can be done with a regex (see RegexParser below), but most of the links shared on Twitter are shortened using services like bit.ly, fb.me, etc. To get a better understanding of the links shared in a tweet, you need to resolve them and get the actual link they point to. In short, reverse-shorten or expand it! The following code gives two functions that allow you to extract a URL and resolve it to its final destination URL in Java.


 
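package routines;

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parsers {

    // Default pattern used to match a URL inside free text.
    private static final String URL_PATTERN =
        "((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?((/)?([-\\w/_\\.]*(\\?\\S+)?)?)*)";

    private static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16";

    // Stop after this many hops so a redirect loop cannot hang the caller.
    private static final int MAX_REDIRECTS = 10;

    // Returns the first match of regexPattern in stringToParse, or "" if there is none.
    // Falls back to the default URL pattern when regexPattern is null or empty.
    public static String RegexParser(String stringToParse, String regexPattern) {
        String pattern = (regexPattern == null || regexPattern.isEmpty()) ? URL_PATTERN : regexPattern;
        Matcher m = Pattern.compile(pattern).matcher(stringToParse);
        return m.find() ? m.group() : "";
    }

    // Follows redirects one hop at a time and returns the last URL reached.
    public static String ExpandURL(String urlString) {
        String resolvedURL = urlString;
        try {
            for (int hop = 0; hop < MAX_REDIRECTS; hop++) {
                HttpURLConnection connection =
                    (HttpURLConnection) new URL(resolvedURL).openConnection();
                // Handle redirects ourselves on every hop, not just the first one.
                connection.setInstanceFollowRedirects(false);
                connection.setRequestProperty("User-Agent", USER_AGENT);
                int status = connection.getResponseCode();
                String location = connection.getHeaderField("Location");
                connection.disconnect();
                if (status / 100 != 3 || location == null) {
                    break; // not a redirect, so this is the final destination
                }
                // Location may be relative, so resolve it against the current URL.
                resolvedURL = new URL(new URL(resolvedURL), location).toString();
            }
        } catch (Exception e) {
            // On any failure, fall back to the last URL resolved so far.
        }
        return resolvedURL;
    }
}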

Wednesday, April 20, 2011

SolrMeter - Benchmarking and Monitoring for Solr

Recently we started using SolrMeter for simulating production queries on our test Solr servers. We quickly found out that the tool can be used for more than just simulating queries. It gives a very interesting view of your Solr health, including graphs of caches, query performance, etc. In order to feed it a good set of queries, you need to cleanse your production Solr request logs to get the select statements only. SolrMeter expects the queries in a certain format. More precisely, if your query were
http://localhost:8080/solr/select?q=type:document AND name:SomeName
SolrMeter expects type:document AND name:SomeName in the text file.

To grab and cleanse the production logs, I built a simple utility in .NET that used a regex to extract the required text from the log file. It is usually the value of the q parameter, which sits between the ? and the next & in the Solr request log. I then fed this newly generated cleansed log file to SolrMeter, set the location of the Solr instance, set the query throughput rate (in queries per minute) and voila! You have a stress test tool ready!
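
The original utility was written in .NET and is not shown here. As a stand-in, this is a small Java sketch of the same cleansing step. It assumes each log line contains the request URL with a URL-encoded q parameter; the class name SolrLogCleaner and the sample log line are hypothetical, and real log formats will differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SolrLogCleaner {

    // Captures the value of the q parameter up to the next '&' or whitespace.
    private static final Pattern Q_PARAM = Pattern.compile("[?&]q=([^&\\s]*)");

    // Returns the decoded query text from one log line, or "" if the line has no q parameter.
    static String extractQuery(String logLine) {
        Matcher m = Q_PARAM.matcher(logLine);
        if (!m.find()) {
            return "";
        }
        try {
            return URLDecoder.decode(m.group(1), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical log line, for illustration only.
        String line = "GET /solr/select?q=type%3Adocument+AND+name%3ASomeName&rows=10 HTTP/1.1";
        System.out.println(extractQuery(line)); // prints "type:document AND name:SomeName"
    }
}
```

Writing one extracted query per line gives the text file format SolrMeter expects.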

Tuesday, April 19, 2011

Problems with Solr 1.4.1 Highlighting Query - Running Slow

I noticed that queries were running really slowly on our Solr 1.4.1 instance, which we use as the search backend for Drupal. Some queries would take as long as 20 seconds!

So I started taking off parameters from the slow queries one by one until I saw a noticeable difference in query time. I noticed removing the hit highlighting was doing the trick. After a lot of digging around on the internet I found this article: http://www.mail-archive.com/solr-user@lucene.apache.org/msg28731.html

The problem is the algorithm 1.4 uses for hit highlighting. It is particularly exacerbated when the field you're trying to highlight has large amounts of data. You can work around this issue by creating a copy of the field and restricting the number of characters to 20,000, so that you get a 40K-odd field over which Solr will hit highlight. The highlighting algorithm performs fine on this trimmed field. As soon as I made the change by adding a highlight-specific field, the query performance improved from 20-something seconds to less than a second!
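
In schema.xml terms the workaround looks roughly like the fragment below. The field names body and body_hl and the field type text are assumptions for illustration; maxChars is the copyField attribute that caps how many characters get copied.

```
<!-- schema.xml: "body" and "body_hl" are hypothetical field names -->
<field name="body_hl" type="text" indexed="true" stored="true"/>
<copyField source="body" dest="body_hl" maxChars="20000"/>
```

The query then asks for highlighting on the trimmed field only, for example with hl=true&hl.fl=body_hl.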

This doesn't seem to be a problem in Solr 4.x. We have a 4.x instance in production and it seems to be working fine even for large fields.