Top 10 Linux Commands for Data Scientists


Linux/Unix is one of the most popular platforms for development and analytics. I have seen many developers and data scientists struggle with basic Linux commands. The commands are actually easy, but out of a little laziness we never document them, even though exploring them takes only about five minutes. In this article, "Top 10 Linux Commands for Data Scientists", I have shortlisted only the 10 most popular commands out of a very long list, because I believe in small steps toward big success. As a data scientist, this was my biggest pain point, so I documented the commands here. Bookmark this page if you tend to forget them.

Top 10 Linux Commands for Data Scientists:

1. find –

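This command searches for files in a directory, and it searches recursively through all subdirectories. Here is the syntax:

find [directory] -name 'pattern' -type [file_type]

The -name option takes a shell wildcard pattern (not a regular expression), and -type f limits the results to regular files. For example, to find every file under the current directory whose name contains "trans":

$ find . -name '*trans*' -type f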
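To make the effect of -name and -type concrete, here is a minimal sketch that builds a throwaway directory tree and searches it. The directory and file names (demo, transactions.csv, translations) are invented for illustration and are not from any real project.

```shell
# Build a small scratch tree; every name below is made up for this demo.
demo_dir="$(mktemp -d)/demo"
mkdir -p "$demo_dir/data/raw" "$demo_dir/translations"
touch "$demo_dir/data/raw/transactions.csv" "$demo_dir/data/notes.txt"

# -type f keeps regular files only, so the "translations" directory is skipped.
files_found=$(find "$demo_dir" -name '*trans*' -type f)
echo "$files_found"

# -type d does the opposite and keeps directories only.
dirs_found=$(find "$demo_dir" -name '*trans*' -type d)
echo "$dirs_found"
```

Quoting the pattern matters: without the quotes, the shell may expand *trans* itself before find ever sees it.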

2. grep –

Once you have found the file, you often need to search for a pattern inside it. The grep command does this, and it has many options that make the search more effective. Let's go through them one by one:

grep "WhatToSearch" filename

Variations and notes:

  1. You may use a regular expression in place of the string (WhatToSearch). The filename may contain shell wildcards, such as *.csv, to search several files at once.
  2. By default, grep is case sensitive. To make it case insensitive, use "grep -i". For example:

grep -i "whatToSearch" filename
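A few of the other commonly used grep options can be sketched on a small sample file. The log contents below are invented for the demo; -i, -c and -n are standard grep options.

```shell
# Create a tiny sample log file (contents are made up for this demo).
log_file=$(mktemp)
printf '%s\n' \
  'INFO model loaded' \
  'Error: missing column' \
  'info epoch 1 done' \
  'ERROR out of memory' > "$log_file"

# -i ignores case, so this matches both "Error" and "ERROR".
matches=$(grep -i "error" "$log_file")
echo "$matches"

# -c prints the number of matching lines instead of the lines themselves.
error_count=$(grep -ic "error" "$log_file")
echo "$error_count"

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -n "ERROR" "$log_file")
echo "$numbered"
```

Combining -i with -c is a quick way to check how often something occurs in a large file without printing every hit.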

See the grep manual page (man grep) for more details on the grep command.

3. cut –

cut is very useful for quick filtering, and it works best with column-oriented (delimited) data. Here is the syntax, followed by an example that prints the fifth column of a comma-separated file:

cut -d 'separator' -f column_no filename
cut -d ',' -f 5 filename.csv
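The -f option also accepts a comma-separated list of fields or a range. The following sketch runs cut on a small CSV whose columns and values are invented for illustration.

```shell
# Sample CSV; the column names and rows are made up for this demo.
csv_file=$(mktemp)
printf '%s\n' \
  'id,name,age,city,salary' \
  '1,Asha,31,Pune,52000' \
  '2,Ravi,45,Delhi,61000' > "$csv_file"

# A single field: the fifth column.
salary_column=$(cut -d ',' -f 5 "$csv_file")
echo "$salary_column"

# A list of fields: columns 2 and 4.
name_and_city=$(cut -d ',' -f 2,4 "$csv_file")
echo "$name_and_city"

# A range of fields: columns 1 through 3.
first_three=$(cut -d ',' -f 1-3 "$csv_file")
echo "$first_three"
```

Keep in mind that cut splits on every separator character, so it will miscount columns in CSV files where a quoted value contains a comma.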

4. wget –

In case you need to download something from a remote location, use this command. Here is the basic syntax:

~$ wget target_link

5. history –

We have all faced this situation: we worked out a command, but it has since scrolled off the screen, and when we need it again we have to hunt for it. The smart solution is the history command, which lists the commands you ran earlier:

~$ history

6. head –

Often we need to see the structure of a file. There is no need to open the whole file for that; just print a few lines from the top. This is usually needed to see the header of a CSV or similar tabular text file, because most analytics software requires the column names to be mapped to the file. Next time, use this command in that scenario. Here is the syntax for the head command:

~$ head -n 5 filename

Here the value of n is the number of lines to print from the top of the file.
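For the header-inspection case described above, a minimal sketch might look like this. The CSV columns are invented for the demo, and piping through tr is just one convenient way to list the column names one per line.

```shell
# Sample CSV; the column names and rows are made up for this demo.
csv_file=$(mktemp)
printf '%s\n' \
  'id,feature_a,label' \
  '1,0.25,yes' \
  '2,0.75,no' \
  '3,0.50,yes' > "$csv_file"

# The first line of a CSV is usually the header row.
header=$(head -n 1 "$csv_file")
echo "$header"

# Turn the commas into newlines to list one column name per line.
columns=$(head -n 1 "$csv_file" | tr ',' '\n')
echo "$columns"

# Counting those lines gives the number of columns.
column_count=$(head -n 1 "$csv_file" | tr ',' '\n' | wc -l)
echo "$column_count"
```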

7. tail –

tail is quite similar to the head command but works in the opposite direction: it prints lines from the end of the file. Here is the syntax:

tail -n 15 filename
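Besides printing the last few lines, tail can start from a given line when the number is written with a plus sign, which is handy for dropping a CSV header. A small sketch, using invented sample data:

```shell
# Sample CSV; the column names and rows are made up for this demo.
csv_file=$(mktemp)
printf '%s\n' \
  'id,score' \
  '1,0.91' \
  '2,0.78' \
  '3,0.66' > "$csv_file"

# The last two lines of the file.
last_two=$(tail -n 2 "$csv_file")
echo "$last_two"

# "+2" means "start at line 2", so this prints everything except the header.
without_header=$(tail -n +2 "$csv_file")
echo "$without_header"
```

For log files that are still being written, tail -f filename keeps printing new lines as they are appended.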

8. awk –

awk is a complete topic in its own right, and covering it inline here would do it an injustice. I have included it only because I really want you to look it up. awk is a language built specifically for processing and filtering text files. I recommend reading a detailed guide on awk.
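Just to give a taste of what awk can do, here is a minimal sketch that filters rows and sums a column. The sample data is invented; -F, NR, the $N field variables and the END block are standard awk features.

```shell
# Sample CSV; the column names and rows are made up for this demo.
csv_file=$(mktemp)
printf '%s\n' \
  'id,name,age,city,salary' \
  '1,Asha,31,Pune,52000' \
  '2,Ravi,45,Delhi,61000' \
  '3,Meera,52,Pune,70000' > "$csv_file"

# -F ',' sets the field separator. NR > 1 skips the header line.
# Print the name ($2) of every row whose age ($3) is above 40.
over_40=$(awk -F ',' 'NR > 1 && $3 > 40 { print $2 }' "$csv_file")
echo "$over_40"

# Add up the salary column ($5) and print the total after the last line.
total_salary=$(awk -F ',' 'NR > 1 { total += $5 } END { print total }' "$csv_file")
echo "$total_salary"
```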

9. wc –

This shell command helps data scientists find the number of lines and words in a file.

For example –

$ wc -l filename.txt

Here wc -l gives the number of lines in the file. If you want the number of words in the file instead, use:

$ wc -w filename.txt
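When the count is needed inside a script, reading the file through a redirect makes wc print only the number, without the filename. A small sketch on invented sample data:

```shell
# Sample CSV; the column names and rows are made up for this demo.
csv_file=$(mktemp)
printf '%s\n' \
  'id,score' \
  '1,0.91' \
  '2,0.78' \
  '3,0.66' > "$csv_file"

# With "< file", wc reads standard input and prints just the count.
total_lines=$(wc -l < "$csv_file")

# Subtract the header line to get the number of data rows.
data_rows=$((total_lines - 1))
echo "$data_rows"

# wc also works at the end of a pipeline; -w counts words.
word_count=$(echo "the quick brown fox" | wc -w)
echo "$word_count"
```

Counting rows this way is a cheap sanity check before loading a large file into an analytics tool.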

10. cat –

Last on the list, but not the least. In fact, it is one of the most popular commands among us. We use the cat command to print the contents of a file. We can also merge (concatenate) two files into one with it. Here is the syntax for the cat command:

cat input1.csv input2.csv > output.csv
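One thing to watch when merging CSV files is that plain concatenation repeats the header row of every file. The sketch below shows one possible way around that by combining cat with tail. The file names and contents are invented for the demo.

```shell
# Two small CSV files with the same header; all values are made up.
work_dir=$(mktemp -d)
printf '%s\n' 'id,score' '1,0.91' '2,0.78' > "$work_dir/input1.csv"
printf '%s\n' 'id,score' '3,0.66' > "$work_dir/input2.csv"

# Plain concatenation: the header of input2.csv ends up in the middle.
cat "$work_dir/input1.csv" "$work_dir/input2.csv" > "$work_dir/naive.csv"
naive_lines=$(wc -l < "$work_dir/naive.csv")

# Keep the first file whole, then append the second file from line 2 onward.
cat "$work_dir/input1.csv" > "$work_dir/merged.csv"
tail -n +2 "$work_dir/input2.csv" >> "$work_dir/merged.csv"

# With a single file argument, cat simply prints the contents.
merged=$(cat "$work_dir/merged.csv")
echo "$merged"
```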

The cat command is one of the commands I need most as a data scientist, and I hope it will be the same for you. It handles almost 80 percent of my everyday Linux work.

Conclusion –

Sometimes these little lessons help a lot. What usually happens when we decide to learn something? We invest time in finding the best tutorial around. We usually find a detailed one, but we do not start. Sometimes we start but stop early because it seems too big. This article is not a tutorial; it is really about a mindset of taking small steps. Let me know your views on this mindset. Did this article affect your performance in any way? Please let us know. And if you have any doubts about the commands mentioned above, please write back to us.


Data Science Learner Team 


Meet Abhishek (Chief Editor), a data scientist with major expertise in NLP and text analytics. He has worked on various projects involving text data and has achieved great results. He currently manages this site, where he and his team share knowledge and help others learn more about data science.
