How to Reverse an ArrayList in Java Using Collections and Recursion?

To reverse a List in Java, e.g. an ArrayList or a LinkedList, you should always use the Collections.reverse() method; it's safe, tested, and will probably perform better than the first version of the method you write yourself to reverse an ArrayList in Java.
In this tutorial, I'll also show you how to reverse an ArrayList of String using recursion.

In a recursive algorithm, a function calls itself to do the job. After each pass, the problem becomes smaller and smaller until it reaches the base case.

In order to reverse a List using recursion, our base case is a list of one element. If your list contains one element, then the reverse of that list is the list itself, so just return it. On each pass we add the last element of the list to a new list called reversed. Once the program reaches the base case and starts winding down, we end up with all the elements in reverse order. To accumulate the elements, we use the addAll() method of the java.util.Collection interface.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TestSolution {

    public static void main(String args[]) {
        List<String> books = new ArrayList<>();
        books.add("Beautiful Code");
        books.add("Clean Code");
        books.add("Working Effectively with Legacy Code");

        System.out.println("Original order of List: " + books);

        // reverse the list in place using the library method
        Collections.reverse(books);
        System.out.println("The reversed List: " + books);

        // Now, let's try to reverse a List using recursion
        List<String> output = reverseListRecursively(books);
        System.out.println("Reversed list reversed again: " + output);
    }

    /**
     * A recursive algorithm to reverse a List in Java
     *
     * @param list the list to reverse
     * @return a new List containing the elements in reverse order
     */
    private static List<String> reverseListRecursively(List<String> list) {
        if (list.size() <= 1) {
            return list; // base case: a list of zero or one element is its own reverse
        }
        List<String> reversed = new ArrayList<>();
        reversed.add(list.get(list.size() - 1)); // last element first
        reversed.addAll(reverseListRecursively(list.subList(0, list.size() - 1)));
        return reversed;
    }
}

Output:
Original order of List: [Beautiful Code, Clean Code, Working Effectively with Legacy Code]
The reversed List: [Working Effectively with Legacy Code, Clean Code, Beautiful Code]
Reversed list reversed again: [Beautiful Code, Clean Code, Working Effectively with Legacy Code]

Built-in Functions In Python - Letter E

enumerate(sequence, start=0)

Return an enumerate object. sequence must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over sequence:

>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
>>> list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
>>> list(enumerate(seasons, start=1))
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]
Equivalent to:

def enumerate(sequence, start=0):
    n = start
    for elem in sequence:
        yield n, elem
        n += 1
New in version 2.3.

eval(expression[, globals[, locals]])

The arguments are a Unicode or Latin-1 encoded string and optional globals and locals. If provided, globals must be a dictionary. If provided, locals can be any mapping object.

The expression argument is parsed and evaluated as a Python expression (technically speaking, a condition list) using the globals and locals dictionaries as global and local namespace. If the globals dictionary is present and lacks ‘__builtins__’, the current globals are copied into globals before expression is parsed. This means that expression normally has full access to the standard __builtin__ module and restricted environments are propagated. If the locals dictionary is omitted it defaults to the globals dictionary. If both dictionaries are omitted, the expression is executed in the environment where eval() is called. The return value is the result of the evaluated expression. Syntax errors are reported as exceptions. Example:

>>> x = 1
>>> print eval('x+1')
2
This function can also be used to execute arbitrary code objects (such as those created by compile()). In this case pass a code object instead of a string. If the code object has been compiled with 'exec' as the mode argument, eval()‘s return value will be None.
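A quick illustrative sketch, continuing with x = 1 from the example above and compiling in 'eval' mode so that a value is returned:

>>> code = compile('x + 1', '<string>', 'eval')
>>> eval(code)
2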

Hints: dynamic execution of statements is supported by the exec statement. Execution of statements from a file is supported by the execfile() function. The globals() and locals() functions return the current global and local dictionary, respectively, which may be useful to pass around for use by eval() or execfile().

See ast.literal_eval() for a function that can safely evaluate strings with expressions containing only literals.
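A brief illustrative sketch of the difference (the input strings are arbitrary):

>>> import ast
>>> ast.literal_eval("[1, 2, 3]")
[1, 2, 3]
>>> ast.literal_eval("{'a': 1}")
{'a': 1}
>>> # unlike eval(), something like ast.literal_eval("x + 1") raises ValueError,
>>> # because names and arbitrary expressions are not literals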

execfile(filename[, globals[, locals]])

This function is similar to the exec statement, but parses a file instead of a string. It is different from the import statement in that it does not use the module administration — it reads the file unconditionally and does not create a new module. [1]

The arguments are a file name and two optional dictionaries. The file is parsed and evaluated as a sequence of Python statements (similarly to a module) using the globals and locals dictionaries as global and local namespace. If provided, locals can be any mapping object. Remember that at module level, globals and locals are the same dictionary. If two separate objects are passed as globals and locals, the code will be executed as if it were embedded in a class definition.

If the locals dictionary is omitted it defaults to the globals dictionary. If both dictionaries are omitted, the expression is executed in the environment where execfile() is called. The return value is None.

Note The default locals act as described for function locals() below: modifications to the default locals dictionary should not be attempted. Pass an explicit locals dictionary if you need to see effects of the code on locals after function execfile() returns. execfile() cannot be used reliably to modify a function’s locals.
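A minimal illustrative sketch (the file name and its contents are made up):

>>> with open('settings.py', 'w') as f:
...     f.write('WIDTH = 80\n')
...
>>> namespace = {}
>>> execfile('settings.py', namespace)   # runs the statements in the file
>>> namespace['WIDTH']
80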

Built-in Functions In Python - Letter D

delattr(object, name)
The arguments are an object and a string. The string must be the name of one of the object’s attributes. The function deletes the named attribute, provided the object allows it. For example, delattr(x, 'foobar') is equivalent to del x.foobar.
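For example (the class and attribute names here are arbitrary):

>>> class Point(object):
...     pass
...
>>> p = Point()
>>> p.label = 'origin'
>>> delattr(p, 'label')      # equivalent to: del p.label
>>> hasattr(p, 'label')
False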
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
Create a new dictionary. The dict object is the dictionary class. See dict and Mapping Types — dict for documentation about this class.
For other containers see the built-in list, set, and tuple classes, as well as the collections module.                                                                                                                                                                   
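As a small illustrative sketch (keys and values chosen arbitrarily), all three constructor forms above build the same dictionary:

>>> dict(one=1, two=2) == dict({'one': 1, 'two': 2}) == dict([('one', 1), ('two', 2)])
True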
dir([object])
Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.
If the object has a method named __dir__(), this method will be called and must return the list of attributes. This allows objects that implement a custom __getattr__() or __getattribute__() function to customize the way dir() reports their attributes.
If the object does not provide __dir__(), the function tries its best to gather information from the object’s __dict__ attribute, if defined, and from its type object. The resulting list is not necessarily complete, and may be inaccurate when the object has a custom __getattr__().
The default dir() mechanism behaves differently with different types of objects, as it attempts to produce the most relevant, rather than complete, information:
If the object is a module object, the list contains the names of the module’s attributes.
If the object is a type or class object, the list contains the names of its attributes, and recursively of the attributes of its bases.
Otherwise, the list contains the object’s attributes’ names, the names of its class’s attributes, and recursively of the attributes of its class’s base classes.
The resulting list is sorted alphabetically. For example:
>>> import struct
>>> dir()   # show the names in the module namespace
['__builtins__', '__doc__', '__name__', 'struct']
>>> dir(struct)   # show the names in the struct module
['Struct', '__builtins__', '__doc__', '__file__', '__name__',
 '__package__', '_clearcache', 'calcsize', 'error', 'pack', 'pack_into',
 'unpack', 'unpack_from']
>>> class Shape(object):
        def __dir__(self):
            return ['area', 'perimeter', 'location']
>>> s = Shape()
>>> dir(s)
['area', 'perimeter', 'location']
Note Because dir() is supplied primarily as a convenience for use at an interactive prompt, it tries to supply an interesting set of names more than it tries to supply a rigorously or consistently defined set of names, and its detailed behavior may change across releases. For example, metaclass attributes are not in the result list when the argument is a class.                   
divmod(a, b)
Take two (non complex) numbers as arguments and return a pair of numbers consisting of their quotient and remainder when using long division. With mixed operand types, the rules for binary arithmetic operators apply. For plain and long integers, the result is the same as (a // b, a % b). For floating point numbers the result is (q, a % b), where q is usually math.floor(a / b) but may be 1 less than that. In any case q * b + a % b is very close to a, if a % b is non-zero it has the same sign as b, and 0 <= abs(a % b) < abs(b).
Changed in version 2.3: Using divmod() with complex numbers is deprecated.
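A few illustrative values (chosen arbitrarily):

>>> divmod(7, 2)
(3, 1)
>>> divmod(7.5, 2)
(3.0, 1.5)
>>> divmod(-7, 2)    # the remainder takes the sign of the second argument
(-4, 1)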

Difference between method Overriding and Overloading in Java

1. Method overloading is used to increase the readability of the program; method overriding is used to provide a specific implementation of a method that is already provided by its superclass.
2. Method overloading is performed within a single class; method overriding occurs in two classes that have an IS-A (inheritance) relationship.
3. In method overloading, the parameters must be different; in method overriding, the parameters must be the same.
4. Method overloading is an example of compile-time polymorphism; method overriding is an example of run-time polymorphism.
5. In Java, method overloading can't be performed by changing only the return type of a method: the return type can be the same or different, but the parameters must change. In method overriding, the return type must be the same or covariant.
Java Method Overloading example

class OverloadingExample{
    static int add(int a, int b){ return a + b; }
    static int add(int a, int b, int c){ return a + b + c; }
}
Java Method Overriding example

class Animal{
    void eat(){ System.out.println("eating..."); }
}
class Dog extends Animal{
    void eat(){ System.out.println("eating bread..."); }
}

How to Set the ADB Path in a System Variable? (Android, Mobile Automation Testing)

Check the installation path; the default installation location is C:\Program Files (x86)\Android.
Then update the PATH variable with the following entries:
C:\Program Files (x86)\Android\android-sdk\tools\;C:\Program Files (x86)\Android\android-sdk\platform-tools\

Now you can start the ADB server from CMD regardless of where the prompt is.

[Screenshot: the Android SDK ADB server running in a CMD window]

How to edit a system variable

Here's a short how-to for the newbies. What you need is the Environment Variables dialog.
  1. Click Start (Orb) menu button.
  2. Right click on Computer icon.
  3. Click on Properties. This will bring up System window in Control Panel.
  4. Click on Advanced System Settings on the left. This will bring up the System Properties window with Advanced tab selected.
  5. Click on Environment Variables button on the bottom of the dialog. This brings up the Environment Variables dialog.
  6. In the System Variables section, scroll down till you see Path.
  7. Click on Path to select it, then the Edit button. This will bring up the Edit System Variable dialog.
  8. While the Variable value field is selected, press the End key on your keyboard to go to the right end of the line, or use the arrow keys to move the marker to the end.
  9. Type in ;C:\Program Files (x86)\Android\android-sdk\tools\;C:\Program Files (x86)\Android\android-sdk\platform-tools\ and click OK.
  10. Click OK again, then OK once more to save and exit out of the dialogs.
That's it! You can now start any Android SDK tool, e.g. ADB or Fastboot, regardless of what your current directory is in CMD. For good measure here's what the dialog looks like. This is where you edit the Path variable.

[Screenshot: the Environment Variables dialog where the Path variable is edited]
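If you prefer to verify the PATH change from a script rather than by eye, here is a rough Python sketch; the adb executable name is standard, but the script itself is just illustrative:

import subprocess

try:
    # "adb version" should succeed from any directory once platform-tools is on PATH
    print(subprocess.check_output(["adb", "version"]))
except OSError:
    print("adb was not found - double-check the PATH entries above")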

List of tools used for automated cross-browser testing of Ajax websites

Sahi can also be added to this list. Some good points about Sahi:

1) It handles AJAX/page load delays automatically. No need for explicit wait statements in code.

2) It can handle applications with dynamically generated ids. It has easy identification mechanisms to relate one element to another (for example, click the delete button near user "Ram"). ExtJS, ZkOSS, GWT, SmartGWT etc. have been handled via Sahi.
Sample link:

3) It traverses frames/iframes automatically.

4) It can record or spy on objects on all browsers. Particularly useful if Internet Explorer is a target browser.

For standalone tools, you are pretty much limited to the following two choices:

- Webdriver / Selenium 2.0 :
- Watir:

Note that these are two low-level tools that let you control browsers; they are not complete automation frameworks on their own. To complete the framework, check out wrappers including:
- Capybara
- Webrat
- Cucumber

*Note that the above frameworks are not mutually exclusive; it is possible, for example, to create a Cucumber/Capybara framework.

There is also at least one cloud based service which seems to offer a complete solution:

So out of all that, what is the best?  It depends on your needs and skills.
In my opinion, Webdriver offers the best support for AJAX-heavy websites and the most flexibility overall to do any kind of testing you might need (checking emails, interrogating or changing the state of a DB). That flexibility comes at the price of some complexity, however.
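As a rough sketch of what an explicit AJAX wait looks like with WebDriver's Python bindings (the URL, element id and timeout below are made-up placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com/ajax-page")              # placeholder URL

# wait up to 10 seconds for an element that only appears after the AJAX call finishes
result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))  # placeholder locator
)
print(result.text)
driver.quit()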

I don't have experience with Saucelabs, but a lot of people like it, so for a team with limited skills and/or testing infrastructure it might make a lot of sense to use a service-based solution; Saucelabs is definitely the leader here.

Another good cloud-based service which offers cross-browser testing using Selenium is

You can basically test in all major browsers and even in an Android emulator. You create one test with Selenium IDE, upload it to the website and set it up to run on a daily basis. When a test fails, they can alert you.

I know a good testing tool, TestingWhiz. It's quite an intuitive and affordable tool compared to Selenium and QTP, but it is limited to functional testing. The main thing about it is that it requires little or no coding, which is not the case with other tools.

In my opinion, working with Ajax websites can be quite a bit easier with TestingWhiz, which also supports cross-browser testing. All you need to do is record the test cases, and you can run them in any browser within a few clicks.

In my personal experience, working with AJAX-based websites is a bit tricky, but WatiN works like a charm.

The framework is written in C#, and it's easier to learn C# than another framework that needs scripting skills in a language like Ruby.

The flow has been seamless, with no jerky moments. Since it also has a screen-recording facility, you can capture a good number of flows.

Why is everyone Obsessed with Big Data?

What is Big Data?
Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured or time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.

This is indeed the era of the Big Data revolution. Whether it is healthcare, IT, industrial, manufacturing, food, agriculture or any other large-scale or small-scale industry, terabytes and petabytes of data are generated each day. The daily functioning of companies in every sector relies on extracting meaningful information from structured and unstructured data.

With this veritable explosion, Big Data is going to affect every business. Data is expanding faster than ever before, and it is predicted that within five years around 1.7 megabytes of new information will be generated every second for every human being on the planet.

Simply put, we need experts to analyze and handle these immense volumes of data. Keeping up with the trend, most top-notch technology brands have developed platforms and software to convert raw bytes into structured, readable information. Most MNCs using Big Data technology in their operations are on a recruitment spree, looking for a skilled workforce across platforms like Hadoop, NoSQL, Cassandra, MongoDB, HBase, Data Science, Spark, Storm, Scala and others.

Individuals aiming to excel in their technology careers can't ignore this data eruption and need to prepare now. These platforms are hard to self-learn and benefit from skilled trainers. The current trend is integrated learning of Big Data and data science courses, as it helps individuals improve their chances of being noticed by top-paying companies.

Present and Future Outlook
A recent report by Gartner reveals that more than 75% of the world's companies are preparing to invest considerable capital in Big Data and related platforms in the next two years. According to the survey, the organizations aim at improving customer service, rationalizing current business processes, acquiring more traffic and optimizing costs using big data. In 2015, most big data projects were initiated by CIOs (32%) and unit heads (31%).

A recent study conducted by Forbes in association with McKinsey and Teradata highlighted the urgent need for data-literate professionals. Forbes questioned around 300 global executives and found that 59% of the respondents rated big data among the top five ways to gain a competitive edge; many, in fact, ranked it number one.

The Teradata-Forbes Insights’ survey of top-decision makers further announced that big data analytics enterprises have had a considerable impact on ROI.

Jobs in Big Data
Hadoop, MapReduce, Cassandra, HBase, MongoDB, Spark, Storm and Scala are the most in-demand platforms for processing Big Data, and thousands of jobs are generated every month. Companies across the globe are seeking specialists and professionals who can be productive from day one and are proficient in managing high data volumes.

It was correctly estimated last year by Gartner's Senior Vice President and Global Head of Research that there would be 4.4 million IT jobs worldwide to support Big Data, creating around 1.9 million open IT positions in the US.

The average salary for big data-related skills is over $120,000 per annum. According to Payscale, the corresponding figure is Rs. 607,193 per year.

Interestingly, top companies that use big data and related platforms on a frequent basis include Google, Hortonworks, Cloudera, LinkedIn, Facebook, Twitter, IBM, PWC, SAS, Oracle, Teradata, SAP, Dell, HP.

Prerequisites for learning Big data skills
Experience with, and an aptitude for, any object-oriented programming language will help learners grasp the curriculum faster and more easily. Basic knowledge of UNIX commands and SQL scripting is an added advantage.

Why do data people get excited about large data? Typically the larger the data the harder it becomes to do even basic analysis and processing. For example much more sophisticated things can be done very simply in matlab or numpy or R than are practical with Hadoop. Furthermore large data tends to be at heart just a collection of small data sets repeated millions of times (for each user, or web page, or whatever).

I understand why this interests infrastructure people, but I would think data people would be more turned on by the analytical sophistication than the number of bytes.

The accuracy & nature of answers you get on large data sets can be completely different from what you see on small samples.  Big data provides a competitive advantage.  For the web data sets you describe, it turns out that having 10x the amount of data allows you to automatically discover patterns that would be impossible with smaller samples (think Signal to Noise).  The deeper into demographic slices you want to dive, the more data you will need to get the same accuracy.

Take note of recent innovations in artificial intelligence like IBM Watson and Google's self driving cars.  These advancements were made possible by leveraging large diverse data sets and computational horsepower.

I think the main reasons for the obsession with Big Data are the following :

  1. Information Asymmetry: Storage gets cheaper every day, and if you have more data you can make better decisions than competitors.
  2. The whole is greater than the sum of its parts - many data sets combined tell you more than they do separately
  3. Game changing advancements in machine learning are coming from leveraging Big Data
The importance of large datasets for allowing more sophisticated statistical models that capture more juice, yield more predictive power, has already been discussed and I agree. But additional elements are worth mentioning:

* Many internet-related datasets involve very sparse variables and rarely occurring events: if the things you care most about happen only once in a million user visits, then you need that much more data to capture that phenomenon. For example, in the computational advertising area, with the kinds of big data we work with, the events of interest are clicks, or even worse, when someone buys something in response to seeing an ad. These are extremely rare.

* If you are going to use your machine learning device to take decisions that can sometime work for you and sometimes against you (you lose money), you really need to estimate and minimize your risk. Applying your device in a context involving a huge number of decisions is the easiest way to reduce your risk (and this is especially important when there are rare events involved). Of course you need big data to validate that and meaningfully compare strategies.

* One should be very careful in assessing the effect of dataset size; it depends on the type of machine learning or statistical model used. The performance of a logistic regression with a reasonably small input size will quickly saturate as you increase the amount of data. Other more sophisticated models (in general going towards the non-parametric, or allowing the capacity of the model increase with the amount of data) will gradually become more relatively advantageous as the amount of data is increased.

* The effect of dataset size also depends on the task, of course. Easier tasks will be solved with smaller datasets. However, as mentioned in the above posts, for many of the more interesting, AI-related tasks, we seem to never have enough data. This is connected to the so-called curse of dimensionality: a "stupid" non-parametric statistical model will 'want' an amount of data that grows with the number of ups and downs of the function we want to estimate, that can easily grow exponentially with the number of variables involved (because of the number of configurations of factors of interest can grow that fast).

* Advanced machine learning research is trying to go beyond the limitations of "stupid" non-parametric learning algorithms, to be able to generalize to zillions of configurations of the input variables never seen, or even close to, any of those seen in the training set. We know that it must be possible to do that, because brains do that. Humans, mammals and birds learn very sophisticated things from a number of examples that is actually much much smaller than what Google needs to answer our queries or get the sense that two images talk about the same thing. A general way to achieve this is through what is called "sharing of statistical strength", and this comes up in many guises.

The "current obsession" with Big Data is not new. During the last 25 years there have been numerous periods of great interest in storing and analysing large data sets. In 1983 Teradata installed brought on Wells Fargo as their first beta site. In 1986 this software was Fortune Magazine's "Product of the Year" - it was exciting because it pioneered the ability to analyse terabyte-sized data sets. By the early 90's most big banks had all their data in a data warehouse of some sort, and there was a lot of work going on in trying to work out how to actually use that data.

Next was the big OLAP craze. Cognos, Holos, Microsoft OLAP Services (as it was then called), etc. were what all the cool kids were talking about. It was still expensive to store very large data sets, so through much of the 90's Big Data was still restricted to bigger companies - especially in financial services, where lots of data was being collected. (These companies had to store complete transactional records for operational and legal reasons, so they already were collecting and storing the data - that's another reason they were amongst the first to leverage these approaches.)

Also important in the 90's was the development of neural networks. For the first time companies were able to use flexible models, without being bound by the constraints of parametric models such as GLMs. Because standard CPUs weren't able to process data fast enough to train neural nets on large data sets, companies such as HNC produced plugin boards which used custom silicon to greatly speed up processing. Decision trees such as CHAID were also big at this time.

So by the time the new millennium rolled around, many of the bigger companies had been doing a lot of work with summarising (OLAP) and modelling (neural nets / decision trees) data. The skills to do these things were still not widely available, so getting help cost lots of money, and the software was still largely proprietary and expensive.

During the 2000's came the next Big Data craze - for the first time, everyone was on the web, and everyone was putting their processes online, which meant now everyone had lots of data to analyse. It wasn't just the financial services companies any more. Much of the interest during this time was in analysing web logs, and people looked enviously at the ability of companies like Google and Amazon who were using predictive modelling algorithms to surge ahead. It was during this time that Big Data became accessible - more people were learning the skills to store and analyse large data sets, because they could see the benefits, and the resources to do it were coming down in price. Open source software (both for storing and extracting - e.g. MySQL, and for analysing - e.g. R) on home PCs could now do what before required million-dollar infrastructure.

The most recent Big Data craze really kicked off with Google's paper about their Map/Reduce algorithm, and the follow-up work from many folks in trying to replicate their success. Today, much of this activity is centred around the Apache Foundation (Hadoop, Cassandra, etc.) Less trendy but equally important development has been happening in programming languages which now support lazy list evaluation, and therefore are no longer constrained by memory when running models (e.g. Parallel LINQ in .Net, List comprehensions in Python, the rise of functional languages like Haskell and F#).

I've been involved in analysing large data sets throughout this time, and it has always been an exciting and challenging business. Much was written about the Data Warehouse craze, the Neural Net craze, the Decision Tree craze, the OLAP craze, the Log Analysis craze, and the many other Big Data crazes over the last 25 years.

Today, the ability to store, extract, summarise, and model large data sets is more widely accessible than it has ever been. The hardest parts of a problem will always attract the most interest, so right now that's where the focus is - for instance, mining web-scale link graphs, or analysing high-speed streams in algorithmic trading. Just because these are the issues that get the most words written about them doesn't mean they're the most important - it just means that's where the biggest development challenges are right now.

A large dataset in which all the data has the same bias (systematic error) will not give you better insight into a question. Instead, it will give you a very precise measurement of your flawed answer.  For example, it doesn't really matter if you ask 100 teens or a million teens about the best movie of all time.  You'll still get an answer that discounts older movies no matter how many you ask.

Larger datasets will tend to include more diversity so you can control for this error.  Netflix's data includes all different kinds of people, so their algorithms can account for the bias that may come from potentially having more teens in their database.  But they can't say anything about the entertainment preferences of people who don't like to use the internet (or mail order dvds) and no amount of their data will help.  Twitter data has an even worse problem as the kind of people who tweet are even more unrepresentative.  I'd much rather have a smaller dataset of more representative users to answer questions that people often use Twitter for. 

It's not just sampling that causes error.  Consider Facebook likes.  I believe that Texas A&M is the most liked college on Facebook.  What does that mean?  Since there is no dislike button, it's hard to even say that they are the most popular.  We also can't say whether it has a good alumni group, has popular sports, or is good academically, and the volume of data won't help us.  Instead, we need more detail.

The gist here is that larger datasets are generally better datasets.  But there are lots of things that are more important than the size of a dataset, among them being sampling, diversity of measurement, and detail of measurement.

It is a consequence of the progress in hard disk and network development. We are finally able to affordably store and process petabytes of data in a 19" rack -- something that many people have dreamed about for a long time.

One reason why people dreamed about storing big amounts of data is that they want to infer knowledge from that data. So, even if something is repeated over and over again within that collection, maybe this repetition is the signal we are looking for?

Big Data has become a hot topic because people are collecting more data than ever before (of course not only, but especially on the Web). This data screams for being mined, to get valuable information out of it.

Actually, when we say "Big Data" in the context of statistical analysis, data mining or information extraction and retrieval, most of the time we mean "Representative Data".

"Representative" is a fuzzy term, but I would define it as "having roughly the same properties as the whole thing that we are interested in".

Representativeness is very important for drawing any conclusion from data, because if your data is not representative, you might conclude anything.

For example: you count frequencies of some event in your tiny dataset, relate them to the size of the collection, and claim this is the "true" probability of that event. Take another small dataset (from the same problem domain), and you may well find a different probability. It turns out that your data was not representative.

Thus, size is key, but so is the retrieval method. If you induce a bias (source bias, topic bias...) in collecting your data, you might run into the same problems as with little data. So it is indeed a good strategy to collect as much data as possible, from as many diverse data sources as possible. Especially if you do not know in the beginning what you might be looking for.

Now, the cool thing with big data is: It's not really difficult to work with it! In many, if not all cases, at various levels the data is governed by power laws, i.e. the absolute number/frequency of some aspect in the data is actually less relevant than the accompanied order of magnitude. It turns out that only a few things stand out in orders of magnitude (and the more data you have, the slower that number grows).

If your whole dataset abides by power laws, a random sample will do so as well. I call such datasets "Zipf-representative", in remembrance of George Kingsley Zipf, who spent a lifetime in finding and formalizing power law phenomena.

That is, even though you are now able to retrieve and store enormous amounts of data, in most cases you actually do not have to go the hard way and run your tasks (analyses, human assessments, whatever) over each and everything. Instead, you random-sample a fraction from it, and there you go. Run these small bites sequentially or in parallel and you approach the complete dataset.
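A tiny, made-up Python sketch of that sampling idea (the data and the 1% sample rate are arbitrary): estimate the relative frequency of one event from a random sample instead of scanning everything.

import random

# made-up "big" dataset: 1% clicks, 99% plain views
events = ['click'] * 1000 + ['view'] * 99000

# draw a 1% random sample and estimate the click rate from it
sample = random.sample(events, len(events) // 100)
estimate = sample.count('click') / float(len(sample))

print(estimate)  # close to the true rate of 0.01, without touching the full dataset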

In fact, the overall frequent things are not so interesting, and they appear already in smaller datasets (like the "stop-words" in text).  What's more interesting are the "unexpectedly" frequent things, i.e., you want to know what is characteristic of a particular subset of your data. In many cases, these subsets need to be constructed on-demand (e.g., for search), and so you just need big data in order to ensure that your dataset is representative for many different scenarios.

The "big data" strategy has its limitations, of course. First, we never get enough data to statistically mine "all possible" relevant information. Second, there are always scenarios where field experts may get sufficiently good results with less data. Third, one might always be intrigued in finding super-surprising properties in the data and neglect statistical significance (avoid over-fitting your models).

Nevertheless, the strategy works well in many scenarios, and this is why so many people like it.

"Big Data" is a very subjective term. While it can mean management and analysis of large, static data sets, to me it also means handling real-time data streams at high speed and the analytics and decision-making tools necessary to alter behavior. These techniques are not ends in themselves; they are merely vehicles for handling the inexorable rise in the quantity of data being collected. One of the key Big Data questions from my perspective is: how can you transform Big Data into Small Data? In any given data set it is likely that only a small portion has true information value; how does one decide which elements of the data set on which to focus in order to reduce the costs of storage and compute while generating better, more actionable decisions. I believe Big Data is a hot topic because so few firms manage their data stack well, from core database architecture to processing to predictive analytics, all in real-time. There is no silver bullet to solving a given company's data problem; at this point the issue is less about tools and more about culture.

There are several things that are causing this interest:

1. More users on the Internet than ever before, especially due to mobile computing.

2. Our word-of-mouth systems, like Twitter and Facebook, are more efficient than ever before, which is causing companies to see faster and bigger spikes in usage.

3. Our need for more data than ever before. Look at Quora. In the old days this would be a simple forum with maybe four rows in a database. Today? We're seeing many more pieces of info being captured (related articles, votes, thanks, comments, who is following, traffic, etc.).

4. Lots more machine generated data. Logs, etc.

The reason "big data" is so exciting is because it is poorly defined, mysterious, has lots of spy-like implications, and high-tech marketing teams have glommed onto and now they are stuck with it. If you are in software and you don't do "big data" you might as well just hang it up. Such bullshit.

Forget about big data. Just know that we have to deal with more data than we've ever been used to before, no matter the size of the data sets themselves. As you seem to be implying, we need to figure out ways to make that data accessible without exposing all the gory details, so that we can uncover new and interesting ways to analyze it, hopefully for the benefit of humans, the economy, nature, and the built environment.

Ultimately, maybe the way to think about 'big data' is really as a confluence of three forces (some names and examples are included here for context):

An explosion of diverse data that exists inside and outside an enterprise

  • Internal Data (e.g. customer service interactions, financial transactions/payments, sales channel)
  • External Data (e.g. transaction histories, social media activity)

Advanced technologies built to aggregate, organize and analyze that data

  • Next-Generation Data Layers (e.g. Hadoop, MongoDB)
  • Advanced Analytics Tools (e.g. DataRobot, Drill, Palantir, R, Tableau)
  • High-Performance Hardware (e.g. Calxeda, FusionIO, PernixData)

A new way of making decisions and interacting with customers
  • 360 Degree Views of a Customer & Household (e.g. client portals, customized product recommendations)
  • Data-Heavy Decision Making (e.g. data science, predictive modeling)
  • Simplified Access to and Use of Data (e.g. data procurement, cleansing, preparation)




Would you like to see your post published here?

Want to reach out to millions of readers out there? Send in your posts to
We will help you reach out to the vast community of testers and let the world notice you.