By Fernando Doglio

When using Pig to query and transform the information stored on our HDFS, we might need functions that are not part of the default arsenal of Pig Latin (the language used by Pig). But the cool thing about this tool is that we can actually extend the language using UDFs (User Defined Functions).

So what’s a UDF?

A UDF is a function that you write in order to extend the language you’re using (in this case, Pig Latin).
Even though Pig is written in Java, you can use different languages to create your own UDFs. For this specific example, we’ll discuss creating such functions using the Python programming language.

By default, Pig allows you to define your UDFs in the following languages: Java, Python, and JavaScript. Currently, Java UDFs are the ones with the most extensive support, especially since they have access to additional interfaces (such as the Algebraic interface and the Accumulator interface), but you can achieve a lot using Python and JavaScript as well.

If you want more details on how to write UDFs in one of these languages, you can visit the official documentation at http://pig.apache.org/docs/r0.9.1/udf.html

The Python code

For our example, we’ll write a UDF that transforms a human-readable date into a number that can be sorted easily.

Something that will take Sun Sep 13 00:00:00 UYT 2009 and transform it into 20090913.

Let’s get coding then!

The code of a UDF is simply the code of the function, with some optional annotations:
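A minimal sketch of what ours looks like, assuming the function is named simplifyDate and uses a small regular expression plus a month lookup table (the exact regex and the lookup table are assumptions; the integer “date” output and the bytearray conversion are the parts discussed below):

```python
import re

# Month abbreviations mapped to their two-digit equivalents.
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

# The outputSchema decorator is provided by Pig's Jython engine when the
# script is registered with "using jython"; no import is needed for it.
@outputSchema("date:int")
def simplifyDate(raw_date):
    # Pig hands the value in as a bytearray, so convert it back to a string.
    raw_date = str(raw_date)
    # Pull the month name, day and year out of a date like
    # "Sun Sep 13 00:00:00 UYT 2009".
    match = re.search(r'(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\w+\s+(\d{4})', raw_date)
    month, day, year = match.groups()
    # Join the pieces as YYYYMMDD and return them as an integer.
    return int(year + MONTHS[month] + day.zfill(2))
```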

Let’s go over the important lines of the code above:

Pig uses Jython (a Python interpreter that runs on the JVM) to interpret our Python code. Make sure you install the correct version of Jython (it should be the same one that your installed version of Pig uses).

The @outputSchema annotation specifies the output of our function. In our case, we are returning an integer called “date”.

There are other annotations that can be used (from the official documentation):

  • outputSchema – Defines schema for a script UDF in a format that Pig understands and is able to parse.
  • outputFunctionSchema – Defines a script delegate function that defines schema for this function depending upon the input type. This is needed for functions that can accept generic types and perform generic operations on these types. A simple example is square which can accept multiple types. SchemaFunction for this type is a simple identity function (same schema as input).
  • schemaFunction – Defines delegate function and is not registered to Pig.

If we don’t specify any decorator, Pig will assume that our output is a bytearray.
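As a quick illustration of the outputSchemaFunction/schemaFunction pair, the square example mentioned above could look roughly like this (a sketch along the lines of the official docs, not an exact copy):

```python
@outputSchemaFunction("squareSchema")
def square(num):
    # Works for ints, longs, floats and doubles alike.
    return num * num

@schemaFunction("squareSchema")
def squareSchema(input):
    # The output schema is simply the same as the input schema.
    return input
```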

In our case, the date to be transformed is a string, but it is passed to our function as a bytearray, which is why we need to convert it back to a string.

Other than that, the code is pretty straightforward: we use a small regular expression to get the parts of the date that we want, join them up, and return them as an integer.

Easy, right?

Registering your UDFs in Pig Latin

So, we now have the UDF; how do we use it in our Pig Latin script? Simple! Add the following line to your script:
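Assuming the Python code above was saved as udfs.py (the file name is an assumption; the myfuncs alias is the one used in the rest of the post):

```pig
REGISTER 'udfs.py' USING jython AS myfuncs;
```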

Some considerations:

  • Your Python file should be in the same folder as your Pig script; otherwise, use the correct path to it.
  • All of the functions you define in your file will be accessible inside the “myfuncs” namespace. So in our case, in order to access our simplifyDate function, we would call it like this: myfuncs.simplifyDate (see the sketch right after this list).
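
For example, a hypothetical end-to-end script could look like this (dates.txt and the aliases are made-up placeholders):

```pig
REGISTER 'udfs.py' USING jython AS myfuncs;

-- No schema is declared for the input, so the field arrives as a bytearray,
-- which is exactly what our UDF expects.
dates = LOAD 'dates.txt';

-- Call the UDF through the myfuncs namespace.
clean_dates = FOREACH dates GENERATE myfuncs.simplifyDate($0) AS date;

DUMP clean_dates;
```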


That’s it! You have your UDF working in your Pig Latin script. Congrats!

Conclusion

Pig is a very powerful data-flow language, and the ability to extend it makes it even more powerful.

There is also a place where users contribute their Java UDFs to the community, called PiggyBank, which is available to all. According to the official documentation, right now only Java UDFs can be contributed to the bank, but support for Python and JavaScript UDFs is on its way.

If you want to access the bank, you can check out the repo with the source code like so (from the official documentation page):

To build a jar file that contains all available UDFs, follow these steps:

  • Checkout UDF code: svn co http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank
  • Add pig.jar to your ClassPath: export CLASSPATH=$CLASSPATH:/path/to/pig.jar
  • Build the jar file: from directory trunk/contrib/piggybank/java run ant. This will generate piggybank.jar in the same directory.

To obtain a javadoc description of the functions, run ant javadoc from directory trunk/contrib/piggybank/java. The documentation is generated in directory trunk/contrib/piggybank/java/build/javadoc.
To use a function, you need to determine which package it belongs to. The top level packages correspond to the function type and currently are:
To use a function, you need to determine which package it belongs to. The top level packages correspond to the function type and currently are:

  • org.apache.pig.piggybank.comparison – for custom comparator used by ORDER operator
  • org.apache.pig.piggybank.evaluation – for eval functions like aggregates and column transformations
  • org.apache.pig.piggybank.filtering – for functions used in FILTER operator
  • org.apache.pig.piggybank.grouping – for grouping functions
  • org.apache.pig.piggybank.storage – for load/store functions

(The exact package of the function can be seen in the javadocs or by navigating the source tree.)
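To give an idea of what that looks like in a script, a hypothetical snippet that registers the jar and calls the UPPER function from the evaluation.string package might be (users.txt and the name field are placeholders):

```pig
REGISTER piggybank.jar;

users = LOAD 'users.txt' AS (name:chararray);

-- PiggyBank UDFs are called by their fully qualified package name.
upper_names = FOREACH users GENERATE org.apache.pig.piggybank.evaluation.string.UPPER(name);
```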

That’s it for now, but if you want to check out the official docs, go to this URL: http://pig.apache.org/docs/r0.9.1/udf.html