Top 10 Linux Commands for Data Scientists

Linux/Unix is one of the most popular platforms for development and analytics. I have seen many developers and data scientists struggle with basic Linux commands. The commands are actually very easy, but out of a little laziness we neglect to document them. It takes just five minutes to explore them. In this article, "Top 10 Linux Commands for Data Scientists", I have shortlisted only the 10 most popular commands out of a very long list. I always believe in small steps for big success. I am a data scientist, and this was my biggest pain point, so I have documented the commands here. You may bookmark this article if you tend to forget them easily.

Top 10 Linux Commands for Data Scientists:

  1. find –

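This command searches for files in a directory. It searches recursively, descending into every subdirectory. The name pattern is a shell wildcard (glob) rather than a regular expression, and -type f limits the results to regular files. Here is the syntax –

find [directory] -name 'name_pattern' -type [file_type]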

➜  etc find . -name '*trans*' -type f
./filetransfer.txt
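
To make this concrete, here is a minimal sketch that builds a throwaway directory and runs two variations on it. The file and directory names are made up for illustration, and -iname (case-insensitive name match) is assumed to be available, as it is in GNU and BSD find.

```shell
# Create a scratch directory with a few hypothetical files to search through.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/raw"
touch "$workdir/filetransfer.txt" "$workdir/data/Transactions.csv" "$workdir/data/raw/notes.md"

# -iname matches the file name ignoring case; -type f keeps regular files only.
matches=$(cd "$workdir" && find . -iname '*trans*' -type f | sort)
echo "$matches"

# -type d restricts the search to directories.
dirs=$(cd "$workdir" && find . -type d -name 'raw')
echo "$dirs"

# Clean up the scratch directory.
rm -rf "$workdir"
```

Quoting the pattern matters here: without the quotes, the shell may expand *trans* itself before find ever sees it.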

2. grep –

Once you have found the file, you often need to search for a pattern inside it. You can use the grep command for that. It has many options that make the search more effective. Let us go through them one by one –

Syntax:
grep "WhatToSearch" filename

Variations and notes –

  1. You may use a regular expression in place of the string (WhatToSearch), and a shell wildcard such as *.csv in place of the filename.
  2. By default the grep command is case sensitive. To make the search case insensitive, use "grep -i". For example –

grep -i "whatToSearch" filename
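
A few more options tend to be handy in day-to-day work. The sketch below is only an illustration on a made-up four-line file (log.csv and its contents are assumptions, not real data). It combines -i with -n (show line numbers), -c (count matching lines) and -v (invert the match).

```shell
# Build a small hypothetical file to search.
workdir=$(mktemp -d)
printf 'id,status\n1,OK\n2,error\n3,Error\n' > "$workdir/log.csv"

# -i ignores case, -n prefixes each matching line with its line number.
numbered=$(grep -in "error" "$workdir/log.csv")
echo "$numbered"

# -c prints only the number of matching lines.
count=$(grep -ic "error" "$workdir/log.csv")
echo "$count"

# -v inverts the match: print the lines that do NOT contain the pattern.
others=$(grep -iv "error" "$workdir/log.csv")
echo "$others"

rm -rf "$workdir"
```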

For more details on the grep command, see its manual page (man grep).

3. cut –

This is very useful for quick filtering. It works best with column-oriented (delimited) data. Here is the syntax, followed by an example that prints the fifth column of a CSV file –

cut -d 'separator' -f column_no filename
cut -d ',' -f 5 filename.csv
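
The -f option also accepts lists and ranges of columns. Here is a small sketch on a made-up CSV file (the file name and its rows are assumptions for illustration only).

```shell
# Build a small hypothetical CSV file.
workdir=$(mktemp -d)
printf 'id,name,city,age,score\n1,Asha,Pune,31,88\n2,Ben,Oslo,27,92\n' > "$workdir/sample.csv"

# A comma-separated list picks individual columns: here columns 2 and 5.
two_cols=$(cut -d ',' -f 2,5 "$workdir/sample.csv")
echo "$two_cols"

# A range picks consecutive columns: here columns 1 through 3.
first_three=$(cut -d ',' -f 1-3 "$workdir/sample.csv")
echo "$first_three"

rm -rf "$workdir"
```

Keep in mind that cut splits on every separator character, so it does not understand quoted CSV fields that contain commas.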

4. wget –

In case you need to download something from a remote location, use this command. Here is the simple syntax –

~$ wget target_link

5. history –

We have all faced this situation: we worked out a command, but it has since scrolled off the screen. When we need it again, we have to hunt for it. The smart solution is to use the history command, which lists the commands you have run previously –

~$ history

6. head –

Often we need to see the structure of a file. We need not open the whole file for that; we can just print the first few lines. This is usually needed to see the header of a CSV or similar delimited text file, because most analytics software requires the column names to be mapped to the file. Next time, use this command in that scenario. Here is the syntax for the head command –

~$ head -n 5 filename

Here the value of n is the number of lines to print from the top of the file.
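
For the column-mapping scenario described above, one possible approach is to print only the header line and then split it into one column name per line. This is a sketch on a made-up file (sample.csv and its columns are assumptions), using tr to turn commas into newlines.

```shell
# Build a small hypothetical CSV file.
workdir=$(mktemp -d)
printf 'id,name,score\n1,Asha,88\n2,Ben,92\n3,Chen,75\n' > "$workdir/sample.csv"

# -n 1 prints just the first line, which is the header here.
header=$(head -n 1 "$workdir/sample.csv")
echo "$header"

# Replace each comma with a newline to list one column name per line.
columns=$(head -n 1 "$workdir/sample.csv" | tr ',' '\n')
echo "$columns"

rm -rf "$workdir"
```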

7. tail –

It is quite similar to the head command but works from the opposite end: it prints lines from the end of the file. Refer to the syntax below –

tail -n 15 filename
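
One variation I find worth knowing: a plus sign before the number makes tail start at that line instead of counting from the end, which is a simple way to skip a header row. The sketch below uses a made-up file (scores.csv is an assumption for illustration).

```shell
# Build a small hypothetical CSV file with a header row.
workdir=$(mktemp -d)
printf 'id,score\n1,88\n2,92\n3,75\n' > "$workdir/scores.csv"

# -n 2 prints the last two lines of the file.
last_two=$(tail -n 2 "$workdir/scores.csv")
echo "$last_two"

# -n +2 prints everything from line 2 onward, that is, the file without its header.
no_header=$(tail -n +2 "$workdir/scores.csv")
echo "$no_header"

rm -rf "$workdir"
```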

8. awk –

awk is a complete topic in its own right, and covering it inline here would do it an injustice. I have included it because I really want you to look into it. awk is specially designed to process and filter text files. I recommend reading a detailed guide on awk.
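
Just to give a taste, here is a minimal sketch of two typical awk one-liners on a made-up CSV file (the file name, columns and values are assumptions for illustration). The first filters rows by a column value, the second computes a column average.

```shell
# Build a small hypothetical CSV file.
workdir=$(mktemp -d)
printf 'id,name,score\n1,Asha,88\n2,Ben,92\n3,Chen,75\n' > "$workdir/sample.csv"

# -F ',' sets the field separator. NR > 1 skips the header line.
# Print the name (field 2) of every row whose score (field 3) is above 80.
high=$(awk -F ',' 'NR > 1 && $3 > 80 { print $2 }' "$workdir/sample.csv")
echo "$high"

# Sum field 3 over the data rows, then print the average in the END block.
mean=$(awk -F ',' 'NR > 1 { total += $3; n++ } END { print total / n }' "$workdir/sample.csv")
echo "$mean"

rm -rf "$workdir"
```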

9. wc –

This Linux shell command helps data scientists find the number of lines and words in a file.

For example –

$ wc -l filename.txt

Here wc -l gives the number of lines in the file. If you want the number of words in the file instead, here is the way –

$ wc -w filename.txt
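
When the count is needed inside a script, it helps to feed the file through standard input so that wc prints only the number and not the file name. The sketch below uses a made-up file (people.csv is an assumption for illustration). Some wc implementations pad the number with spaces, so the spaces are stripped with tr.

```shell
# Build a small hypothetical CSV file: one header line and two data rows.
workdir=$(mktemp -d)
printf 'id,name\n1,Asha Rao\n2,Ben Lee\n' > "$workdir/people.csv"

# Reading from standard input makes wc print the count without the file name.
line_count=$(wc -l < "$workdir/people.csv" | tr -d ' ')
echo "$line_count"

# Words are separated by whitespace, so "1,Asha" counts as a single word.
word_count=$(wc -w < "$workdir/people.csv" | tr -d ' ')
echo "$word_count"

# Number of data rows, excluding the header line.
data_rows=$((line_count - 1))
echo "$data_rows"

rm -rf "$workdir"
```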

10. cat –

It comes last on the list, but it is not the least. In fact, it is one of the most popular commands among us. We use the cat command to print the contents of a file. We can also merge (concatenate) two files into one with it. Here is the syntax for the cat command –

cat input1.csv input2.data > output.csv

This is one of the commands I need most as a data scientist, and I hope it will be the same for you. It covers almost 80 percent of my everyday Linux work.
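
One thing to watch when merging CSV files with cat: if both files carry a header row, the header ends up in the output twice. The sketch below shows one possible workaround on two made-up files (part1.csv and part2.csv are assumptions for illustration), keeping the first file whole and skipping the header of the second with tail.

```shell
# Build two small hypothetical CSV files, each with its own header row.
workdir=$(mktemp -d)
printf 'id,score\n1,88\n' > "$workdir/part1.csv"
printf 'id,score\n2,92\n' > "$workdir/part2.csv"

# Plain concatenation: the header line appears twice in the result.
cat "$workdir/part1.csv" "$workdir/part2.csv" > "$workdir/naive.csv"
naive_lines=$(wc -l < "$workdir/naive.csv" | tr -d ' ')
echo "$naive_lines"

# Keep the first file as is, and append the second file from line 2 onward.
{ cat "$workdir/part1.csv"; tail -n +2 "$workdir/part2.csv"; } > "$workdir/merged.csv"
merged=$(cat "$workdir/merged.csv")
echo "$merged"

rm -rf "$workdir"
```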

Conclusion –

Sometimes these little lessons help a lot. What usually happens when we decide to learn something? We invest time in finding the best tutorial around. We usually find a detailed one, but we do not start. Sometimes we start but stop early because it seems too big. This article is not a tutorial; it is really about the mindset of taking small steps. Let me know your views on this mindset. Does this article affect your performance in any way? Please let us know. Again, if you have any doubt related to the commands mentioned above, please write back to us.

Thanks

Data Science Learner Team