Thursday, May 26, 2016

Big Data

I recently gave a friend who wants to career-change into data science some advice for starting to learn about big data, and thought I'd share it here.

Firstly, decide whether you want to specialize in "data science" / "analysis" or whether you want to specialize in making the databases such people work with, well, work.

  • For the former, get good at statistics and program something at least once.
  • For the latter, get good at programming and expose yourself to a few statistics & algebra fundamentals if you didn't as a youth.
  • For either, learn what I suggest below about big data itself.

Okay, so you've decided you are willing to learn stats or programming - or you already know them and you want to talk your way into a job that will let you learn the "big data" details on the job. What do you need to know about "big data" to show that you have the fundamental knowledge to pick up the details after you start work?

Your homework, for learning about "big data," is not just to read, but to study all 21 articles in this series by Pinal Dave. And by that, I mean be able to answer the questions I suggest below about anything you learn from it (and from various Google tangents I hope you follow as you run into unfamiliar terms).

  • #1: Pay special attention to "the 3 v's" (velocity, variety, & volume - the 3 data problems that make people call data "big"). Always ask yourself:
    • "Which 3-v's problems does this approach to handling big data attempt to solve?"
    • "How?"
    • "How well is/isn't it doing so?"
      • "In what situations is it strong/weak at solving those problems?"
      • "Why?"

Do not skip those questions. Seriously - study this like it's school. Grab a blank notebook, take notes on things you learn, and in those notes, answer these questions.

Being able to answer these questions about the things you learn is what will make you understand them so well that "you could explain it to a 6-year-old." Which also means you can explain it to the non-technical department heads interviewing you for a job, and explain how well-suited this deep understanding makes you to solve their particular business problems. See what I'm doing here?


If you want to go deeper, I suggest that you especially learn about the 5 main types of database involved in "big data" (again, do the taking-notes-with-the-above-questions thing as you learn about these).

  1. "Relational"
    (This is your classic FileMaker Pro / Microsoft Access / Oracle database. The one you think of as a database. It's a bunch of Excel spreadsheets that cross-reference each other, and each record in a "spreadsheet" has a defined set of values you're allowed to fill in (think of the column headers).)
  2. "Key-Value"
  3. "Columnar"
    (A specialized form of key-value database. Learn how, plus why it's different enough to get its own name.)
  4. "Document"
    (A specialized form of key-value database. Learn how, plus why it's different enough to get its own name.)
  5. "Graph"
    (Some versions are a specialized form of key-value database. Learn how, plus why it's different enough to get its own name.)
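
To make the "shapes" a little more concrete while you take notes, here is a rough sketch of the same tiny data set laid out five ways, using nothing but plain Python lists and dicts. It is a toy, not how any real product stores things: every name in it (the customers, the "cust:1" style keys, the PLACED / REFERRED labels, the two helper functions) is made up for illustration, and "columnar" is read here in the "column-family" sense that matches the key-value description above.

```python
import json

# 1. Relational: every row has the same fixed columns, and tables
#    cross-reference each other by id (orders.customer_id -> customers.id).
customers = [
    {"id": 1, "name": "Ana", "city": "Lima"},
    {"id": 2, "name": "Bo", "city": "Oslo"},
]
orders = [{"id": 10, "customer_id": 1, "total": 25.0}]

# 2. Key-value: one key, one opaque value. The store does not look inside
#    the value; here it is just a string of JSON.
key_value = {"cust:1": json.dumps({"name": "Ana", "city": "Lima"})}

# 3. Columnar (column-family flavor): row key -> column name -> value.
#    Rows may have different columns (Bo has no "city" stored).
column_family = {
    "cust:1": {"name": "Ana", "city": "Lima"},
    "cust:2": {"name": "Bo"},
}

# 4. Document: key -> a nested document the store can look inside.
#    The order lives inside the customer instead of in a second table.
documents = {
    "cust:1": {
        "name": "Ana",
        "city": "Lima",
        "orders": [{"id": 10, "total": 25.0}],
    },
}

# 5. Graph: things (nodes) plus labeled relationships (edges).
nodes = {
    "cust:1": {"name": "Ana"},
    "cust:2": {"name": "Bo"},
    "order:10": {"total": 25.0},
}
edges = [
    ("cust:1", "PLACED", "order:10"),
    ("cust:1", "REFERRED", "cust:2"),
]


def relational_order_totals(name):
    """Hand-written 'join': match orders to customers by id."""
    ids = {c["id"] for c in customers if c["name"] == name}
    return [o["total"] for o in orders if o["customer_id"] in ids]


def neighbors(node, label):
    """Follow edges with the given label out of one node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


print(relational_order_totals("Ana"))
print(documents["cust:1"]["orders"])
print(neighbors("cust:1", "REFERRED"))
```

Notice which question is a one-liner in which shape: "Ana's orders" needs a join in the relational version but is already sitting inside her document, while "who did Ana refer?" is what the graph version is built for. That is the kind of compare & contrast to aim for in your notes.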

About each of these 5 database types (deeply enough to do "compare & contrast" between them), learn:

  • "Which '3 Vs' problems + other problems does it try to address, how, and how well / for what types of data storage-retrieval needs?"
  • "What types of data storage & retrieval is it optimized for, speed-wise?"
  • "What types of data storage & retrieval is it optimized for, coding-wise? (What operations do & don't give programmers a nervous breakdown?)"
  • "How well can it be 'distributed' so that multiple computers can break up & simultaneously work on sub-pieces of a 'store' or 'retrieve' or 'retrieve-and-aggregate' (min, max, avg, etc.) request?"
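
If the "distributed" question feels abstract, here is a minimal sketch of the idea behind a distributed "retrieve-and-aggregate" request. It assumes the data has already been split into non-empty pieces ("shards"), and it fakes the multiple computers with threads on one machine. The function names and the sample numbers are invented for this example; real systems add networking, failure handling, and much more.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(shard):
    """Work done by one 'computer' on its own piece of the data.

    Assumes the shard is a non-empty list of numbers.
    """
    return {
        "count": len(shard),
        "sum": sum(shard),
        "min": min(shard),
        "max": max(shard),
    }


def merge(a, b):
    """Combine two partial results into one."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def distributed_stats(shards):
    """Compute min, max, and avg over all shards without pooling the raw data."""
    # Each shard is summarized independently; threads stand in for machines.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_stats, shards))

    # Only the small summaries travel back to be merged.
    total = partials[0]
    for part in partials[1:]:
        total = merge(total, part)

    # The average is derived at the end from the merged sum and count.
    total["avg"] = total["sum"] / total["count"]
    return total


shards = [[3, 9, 4], [10, 1], [7, 7, 7, 2]]
print(distributed_stats(shards))
```

The design point worth writing in your notebook: min, max, sum, and count can be merged piece by piece, but an average cannot be built by averaging the shards' averages when the shards differ in size, so each piece sends back its sum and count instead. Asking "can this request be broken up and merged like that?" is a good way to judge how well a given database distributes.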

Here's why:

If you want to start a successful catering business out of your home, you need to have some sense of how a kitchen's layout and tools impact what's easy to cook in it.

No point adding ice cream to the menu if you can't easily freeze things.

Same idea with "big data." It's important to be able to recognize the pros & cons of a given environment + set of tools for solving a given problem.

Again, if an interviewer says a company is having trouble with their [insert brand here] database and that they're trying to solve [insert problem here], how important to the company would you be if you're the person who can see that there's a fundamental mismatch between what they're trying to do and how they're organizing the data they need to do it with? (Or if there isn't a mismatch and they just need a good analyst/programmer on board, yay, you will be able to recognize that you could have interesting tasks ahead of you.)

For this deeper dive, I highly recommend a book called "Making Sense of NoSQL."


Finally, be aware that all of this is just new ways of thinking about applying very old mathematics and logic to data storage/retrieval/analysis/visualization problems, now that those problems have gotten "bigger" (see "3-V's").

Which is to say that you can't go wrong studying the old stuff.

  • For the analysts, the old stuff is particularly heavy in statistics and visualization from around the 1800s.
  • For the programmers, the old stuff is particularly heavy in "information retrieval" principles from throughout the 1900s.

Good luck!