recent

Titulo

File Formatting using Awk

As a Database Administrator or database developer, you  may move data from file to database or from one database to another. The data coming to you from another source in the form or excel or text file may not be in the format that your destination database supports. Sometime, the file has more columns than your table,  the phone or date format is not supported by the destination database. Before loading a file, data needs proper formatting to your need and then load it. How do you scrub or massage data and format it to your need? You can write a code in Java, Perl or C++ to read from a file, format data to your need and spit out to a file. Writing a program just to format to our need may take hours in these language. To avoid expensive and  long programming hours, someone came with AWK programming which is famously known for one-line code. You can almost do anything with a single line code using AWK.

What is AWK? It is a programming language for text processing, extracting and a reporting tool that comes standard for most Linux-like operating systems. AWK was developed in Bell Labs in 1970's by Alfred Aho, Peter Weingberger, and Brian Kernighan which got the name from their surname.

We are going to run through some examples to understand how AWK works? We have two files with data that are not in the correct format. We will work from these examples to slice and dice these files to format or extract the only data we need. We have two file statecapital.txt and employee.txt

pbaniya@dbarepublic.com:~ $cat statecapital.txt
S/N STATE  CAPITAL
1) Alabama  Montgomery
2) Alaska   Juneau
3) Arizona  Phoenix
4) Arkansas  Little Rock
5) California  Sacramento
6) Colorado  Denver
7) Connecticut  Hartford
8) Delaware  Dover
9) Florida  Tallahassee
10) Georgia  Atlanta


Printing all Field using AWK
awk '{print}' statecapital.txt
pbaniya@dbarepublic.com:~ $ awk '{printf}' state_capital.txt
S/N     STATE           CAPITAL
1)      Alabama         Montgomery
2)      Alaska          Juneau
3)      Arizona         Phoenix
4)      Arkansas        Little Rock
5)      California      Sacramento
6)      Colorado        Denver
7)      Connecticut     Hartford
8)      Delaware        Dover
9)      Florida         Tallahassee
10)     Georgia         Atlanta


Printing STATE and CAPITAL only
pbaniya@dbarepublic.com:~ $ awk '{print $2, $3}' state_capital.txt
STATE CAPITAL
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
Arkansas Little
California Sacramento
Colorado Denver
Connecticut Hartford
Delaware Dover
Florida Tallahassee
Georgia Atlan


Printing Searched Column: Print row that contains Alabama
pbaniya@dbarepublic.com:~ $ awk '/Alabama/' state_capital.txt
1)      Alabama         Montgomery


Adding Double Quote on State and Capital
pbaniya@dbarepublic.com:~ $ awk '{printf "\""$2"\"" "\t" "\""$3"\""  "\n"}' state_capital.txt
"STATE"          "CAPITAL"
"Alabama"       "Montgomery"
"Alaska"        "Juneau"
"Arizona"       "Phoenix"
"Arkansas"      "Little"
"California"    "Sacramento"
"Colorado"      "Denver"
"Connecticut"   "Hartford"
"Delaware"      "Dover"
"Florida"       "Tallahassee"
"Georgia"       "Atlanta"


AWK Escape character: \
new line: \n
tab : \t

Redirection Operator to write result to a file.
pbaniya@dbarepublic.com:~ $ awk '{printf "\""$2"\"" "\t" "\""$3"\""  "\n"}' state_capital.txt > stat_capital_formatted.txt
pbaniya@dbarepublic.com:~ $ cat state_capital_formatted.txt

"STATE"         "CAPITAL"
"Alabama"       "Montgomery"
"Alaska"        "Juneau"
"Arizona"       "Phoenix"
"Arkansas"      "Little"
"California"    "Sacramento"
"Colorado"      "Denver"
"Connecticut"   "Hartford"
"Delaware"      "Dover"
"Florida"       "Tallahassee"
"Georgia"       "Atlanta"


Remove Double Quote: awk '{print $1, $2 }' state_capital_formatted.txt | sed 's/\"//g'
pbaniya@dbarepublic.com:~ $ awk '{print $1 "\t" $2 }' state_capital_formatted.txt | sed 's/\"//g' STATE            CAPITAL
Alabama          Montgomery
Alaska           Juneau
Arizona          Phoenix
Arkansas         Little
California       Sacramento
Colorado         Denver
Connecticut      Hartford
Delaware         Dover
Florida          Tallahassee
Georgia          Atlanta


pbaniya@dbarepublic.com:~/sql_loader $ awk '{print }' employee_data.txt EMPLOYEE_ID","FIRST_NAME","LAST_NAME","EMAIL","PHONE_NUMBER","HIRE_DATE","JOB_ID","SALARY","COMMISSION_PCT","MANAGER_ID","DEPARTMENT_ID"
100,"Steven","King","SKING","515.123.4567",17-JUN-87,"AD_PRES",24000,,,90
101,"Neena","Kochhar","NKOCHHAR","515.123.4568",21-SEP-89,"AD_VP",17000,,100,90
102,"Lex","De Haan","LDEHAAN","515.123.4569",13-JAN-93,"AD_VP",17000,,100,90
103,"Alexander","Hunold","AHUNOLD","590.423.4567",03-JAN-90,"IT_PROG",9000,,102,60
104,"Bruce","Ernst","BERNST","590.423.4568",21-MAY-91,"IT_PROG",6000,,103,60
105,"David","Austin","DAUSTIN","590.423.4569",25-JUN-97,"IT_PROG",4800,,103,60
106,"Valli","Pataballa","VPATABAL","590.423.4560",05-FEB-98,"IT_PROG",4800,,103,60
107,"Diana","Lorentz","DLORENTZ","590.423.5567",07-FEB-99,"IT_PROG",4200,,103,60
108,"Nancy","Greenberg","NGREENBE","515.124.4569",17-AUG-94,"FI_MGR",12000,,101,100
109,"Daniel","Faviet","DFAVIET","515.124.4169",16-AUG-94,"ACCOUNT",9000,,108,100
110,"John","Chen","JCHEN","515.124.4269",28-SEP-97,"ACCOUNT",8200,,108,100
111,"Ismael","Sciarra","ISCIARRA","515.124.4369",30-SEP-97,"ACCOUNT",7700,,108,100
112,"Jose Manuel","Urman","JMURMAN","515.124.4469",07-MAR-98,"ACCOUNT",7800,,108,100
113,"Luis","Popp","LPOPP","515.124.4567",07-DEC-99,"ACCOUNT",6900,,108,100
114,"Den","Raphaely","DRAPHEAL","515.127.4561",07-DEC-94,"PU_MAN",11000,,100,30
115,"Alexander","Khoo","AKHOO","515.127.4562",18-MAY-95,"PU_CLERK",3100,,114,30


Removing Double Quote from employee.txt file
pbaniya@dbarepublic.com:~/sql_loader $ awk '{print }' employee_data.txt | sed 's/\"//g' EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
100,Steven,King,SKING,515.123.4567,17-JUN-87,AD_PRES,24000,,,90
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-89,AD_VP,17000,,100,90
102,Lex,De Haan,LDEHAAN,515.123.4569,13-JAN-93,AD_VP,17000,,100,90
103,Alexander,Hunold,AHUNOLD,590.423.4567,03-JAN-90,IT_PROG,9000,,102,60
104,Bruce,Ernst,BERNST,590.423.4568,21-MAY-91,IT_PROG,6000,,103,60
105,David,Austin,DAUSTIN,590.423.4569,25-JUN-97,IT_PROG,4800,,103,60
106,Valli,Pataballa,VPATABAL,590.423.4560,05-FEB-98,IT_PROG,4800,,103,60
107,Diana,Lorentz,DLORENTZ,590.423.5567,07-FEB-99,IT_PROG,4200,,103,60
108,Nancy,Greenberg,NGREENBE,515.124.4569,17-AUG-94,FI_MGR,12000,,101,100
109,Daniel,Faviet,DFAVIET,515.124.4169,16-AUG-94,ACCOUNT,9000,,108,100
110,John,Chen,JCHEN,515.124.4269,28-SEP-97,ACCOUNT,8200,,108,100
111,Ismael,Sciarra,ISCIARRA,515.124.4369,30-SEP-97,ACCOUNT,7700,,108,100
112,Jose Manuel,Urman,JMURMAN,515.124.4469,07-MAR-98,ACCOUNT,7800,,108,100
113,Luis,Popp,LPOPP,515.124.4567,07-DEC-99,ACCOUNT,6900,,108,100
114,Den,Raphaely,DRAPHEAL,515.127.4561,07-DEC-94,PU_MAN,11000,,100,30
115,Alexander,Khoo,AKHOO,515.127.4562,18-MAY-95,PU_CLERK,3100,,114,30


AWK field seperator: -F
pbaniya@dbarepublic.com:$ awk -F "," '{print $1, $2 }' employee_data.txt 100 Steven
101 Neena
102 Lex
103 Alexander
104 Bruce
105 David
106 Valli
107 Diana
108 Nancy
109 Daniel
110 John
111 Ismael
112 Jose Manuel
113 Luis
114 Den
115 Alexander


To everyone, picture talks thousands words; for database developer, developer, and dba, these examples talk million words. No explanation is needed. I hope these examples are useful to you as they were for me.

Interested in working with me? I can be reached at pbaniya04[at]gmail.com for any questions, consulting opportunities or you may drop a line to say HELLO. Thank your again for visiting my blog and looking forward to serving you more.

Have a Database-ious Day!

No comments

Powered by Blogger.