
Big Data & Data Structure

Introduction to Data and Data Files

1. Introduction to Data

Data is information in many forms, including numerical values, that is produced by human activity or by systems that people control.

There are many types of data in the world: phone call history, payment records at a convenience store, saved files from an online game, your pulse during a workout, the YouTube videos you watched last night, the marked location of a new restaurant, and so on. Text, video, records, and numbers read by sensors are all data.

This data is gathered into a database and analyzed to produce useful information. We find patterns, inspect the highest and lowest values, calculate averages, and look for relationships. Through these processes, we can learn specific information about a person, a group, a city, a nation, or even global trends.
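As a rough sketch of that kind of analysis, the snippet below finds the highest, lowest, and average value in a small list. The `pulse` readings are made-up numbers for illustration, not real measurements.

```python
from statistics import mean

# Hypothetical pulse readings (beats per minute), one per minute of a workout.
pulse = [92, 105, 118, 131, 140, 136, 124]

highest = max(pulse)    # the peak reading
lowest = min(pulse)     # the lowest reading
average = mean(pulse)   # arithmetic mean of all readings

print("highest:", highest)
print("lowest:", lowest)
print("average:", round(average, 1))
```

A real analysis would read the values from a database or file instead of a hard-coded list.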

 

2. Data Formats 

 

The most basic form of data is text. Here is an example:

The HL Road Frame is our lightest and best quality aluminum frame made from the newest alloy; it is welded and heat-treated for strength. Our innovative design results in maximum comfort and performance. Its list price is $1431.50

This text contains information about a frame, such as its name and price. It is called "unstructured data": just a block of text that contains information. We can understand the content, but it is hard for a program to use because it has no structure.

So, one of the most common ways to structure data is to split it into rows and fields such as name, description, and price, with the fields separated by commas (,). The previous example can be arranged as below.

Product, Description, Price
HL Road Frame, Our lightest..., 1431.50
Sport Helmet, Lightweight vented..., 34.99
...
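To see why the structured form is easier to use, here is a minimal sketch that reads rows like these with Python's built-in `csv` module. The `csv_text` sample is an assumption modeled on the table above, and the shortened descriptions are kept as they appear there.

```python
import csv
import io

# Sample data modeled on the table above (descriptions left shortened).
csv_text = """Product,Description,Price
HL Road Frame,Our lightest...,1431.50
Sport Helmet,Lightweight vented...,34.99
"""

# DictReader uses the first row as field names for every following row.
reader = csv.DictReader(io.StringIO(csv_text))
products = [
    {
        "Product": row["Product"],
        "Description": row["Description"],
        "Price": float(row["Price"]),  # convert the price text to a number
    }
    for row in reader
]

most_expensive = max(products, key=lambda p: p["Price"])
print(most_expensive["Product"])
```

Because every row has the same fields, a question like "which product costs the most?" becomes a one-line lookup.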

Text files come in several formats, and different programs can read them.

The first is CSV*, which Microsoft Excel can open. Excel reads a text or CSV file and converts it into a spreadsheet.

* CSV (Comma-Separated Values) is a text file whose fields are separated by commas.

 

The second is XML (Extensible Markup Language). Here is an example.

<?xml version="1.0" encoding="utf-8"?>
<order invoicenumber="1234">
    <item id="123" price="1.99" quantity="2"/>
    <item id="321" price="2.45" quantity="1"/>
</order>

XML can be opened in Visual Studio, which recognizes the elements of the XML code, so you can add, delete, and revise them. However, XML is used less often for this kind of data these days because it is verbose: the many tags and attribute text make files larger and harder to read and parse. So a lighter format became popular.
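Outside Visual Studio, the same order can also be read by a program. The sketch below uses Python's standard `xml.etree.ElementTree` module. It assumes the attribute is spelled `invoicenumber`, and the order total is an extra calculation added only for illustration.

```python
import xml.etree.ElementTree as ET

# The order from the example above, with the attribute spelled "invoicenumber".
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<order invoicenumber="1234">
    <item id="123" price="1.99" quantity="2"/>
    <item id="321" price="2.45" quantity="1"/>
</order>"""

# Parse from bytes so the declared encoding is honored.
root = ET.fromstring(xml_text.encode("utf-8"))

invoice_number = root.get("invoicenumber")  # attribute values are always text
items = root.findall("item")

# Every value must be converted from text before doing arithmetic.
total = sum(float(i.get("price")) * int(i.get("quantity")) for i in items)

print(invoice_number, len(items), round(total, 2))
```

Note that XML attributes are all text, so the code has to convert prices and quantities to numbers itself.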

 

The third is JSON (JavaScript Object Notation). Here is an example.

{
 "invoicenumber":1234,
 "item": [
    {"id":123, "price":1.99, "quantity":2},
    {"id":321, "price":2.45, "quantity":1}
 ]
}

Compared to XML, there are fewer quotation marks ("): numbers are written without quotes, yet they are still read as numbers. In addition, the list of items is enclosed in square brackets, so it is easy to recognize. In Visual Studio, you can expand or collapse parts of the data.
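A short sketch with Python's standard `json` module shows both points: unquoted numbers arrive as numbers, and the square brackets become a list. The `json_text` string is a well-formed version of the order shown above.

```python
import json

# A well-formed version of the order from the example above.
json_text = """
{
  "invoicenumber": 1234,
  "item": [
    {"id": 123, "price": 1.99, "quantity": 2},
    {"id": 321, "price": 2.45, "quantity": 1}
  ]
}
"""

order = json.loads(json_text)

# Unquoted numbers are parsed as numbers, not as text.
print(type(order["invoicenumber"]).__name__)
# The square brackets become a Python list of items.
print(type(order["item"]).__name__, len(order["item"]))
```

No manual conversion is needed here, which is one reason JSON is less work to parse than the XML version.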

 

3. Encoding

Encoding means converting characters and symbols into numeric codes that computers can store and read. In the beginning, people used ASCII (American Standard Code for Information Interchange), which contains the English alphabet, digits, and some symbols. However, as computers spread around the world, people needed to write characters from their own languages, and the 128 codes of ASCII were far too few to represent them all.

 

So Unicode was created. Unicode aims to represent the characters of all the world's writing systems and adds symbols that ASCII didn't have. In the end, though, every character still has to be "encoded" as a number.

Hello!
72      101      108      108       111      33
48       65        6C       6C        6F        21
01001000 01100101 01101100 01101100 01101111 00100001

In ASCII, each English letter, and the exclamation mark too, is encoded in 1 byte (8 bits). The rows above show the decimal, hexadecimal, and binary code of each character.
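The three rows of numbers for "Hello!" can be reproduced with a few lines of Python. This is only a sketch to check the table; the variable names are arbitrary.

```python
text = "Hello!"

# Encode the text as ASCII: one byte per character.
ascii_bytes = text.encode("ascii")

decimal = list(ascii_bytes)                       # code of each character in base 10
hexadecimal = [f"{b:02X}" for b in ascii_bytes]   # the same codes in base 16
binary = [f"{b:08b}" for b in ascii_bytes]        # the same codes as 8 bits

print(decimal)
print(hexadecimal)
print(binary)
print(len(ascii_bytes), "bytes")
```

Six characters produce six bytes, which is the "1 byte per character" rule described above.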

Γεια σας!
915 949 953 945  32  963 945 962 33
393 3B5 3B9  3B1 20 3C3 3B1 3C2 21
0000001110010011 0000001110110101 0000001110111001
0000001110110001 0000000000100000 0000001111000011
0000001110110001 0000001111000010 0000000000100001

Other languages are different (the example above means "hello!" in Greek). With Unicode, some languages need 2 bytes (16 bits) per character. But the space ( ) and the exclamation mark (!) would fit in 1 byte, and here they also take 2 bytes, so some memory is wasted. Still, this is a valid way to encode.

This is called "UTF-16", which uses at least 2 bytes (16 bits) for every character. Some characters even need 4 bytes.
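The sketch below checks this in Python. It uses the Greek greeting "Γεια σας!", whose code points are the ones listed in the table above, and the big-endian form `utf-16-be` so that no byte-order mark is added. The emoji is an extra example, chosen here to show a character that needs 4 bytes.

```python
text = "Γεια σας!"

# Unicode code point of each character, in base 10.
code_points = [ord(ch) for ch in text]

# UTF-16, big-endian, without a byte-order mark.
utf16 = text.encode("utf-16-be")

# Split the bytes into 2-byte units and show each one in hexadecimal.
units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]

print(code_points)
print(units)
print(len(text), "characters ->", len(utf16), "bytes")

# A character outside the basic range takes two 2-byte units (4 bytes).
emoji_bytes = "😀".encode("utf-16-be")
print(len(emoji_bytes), "bytes for one emoji")
```

Even the space and the exclamation mark come out as `0020` and `0021`, two bytes each.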

Γεια σας!
206+147 206+181 206+185 206+177 32 207+131 206+177 207+130 33
CE+93    CE+B5    CE+B9  CE+B1   20  CF+83    CE+B1    CF+82    21
11001110 10010011 11001110 10110101 11001110 10111001
11001110 10110001 00100000 11001111 10000011 11001110 
10110001 11001111 10000010 00100001                            

Each Greek letter takes a pair of bytes. However, the space ( ) and the exclamation mark (!) use only 1 byte (8 bits) each, so no memory is wasted on them. This is called "UTF-8".
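The same check for UTF-8, again as a small sketch using the greeting "Γεια σας!" that matches the byte values listed above:

```python
text = "Γεια σας!"

utf8 = text.encode("utf-8")

# How many bytes each distinct character needs in UTF-8.
sizes = {ch: len(ch.encode("utf-8")) for ch in text}

print(utf8.hex(" ").upper())
print(sizes)
print(len(text), "characters ->", len(utf8), "bytes")
```

The seven Greek letters take 2 bytes each, and the space and exclamation mark take 1 byte each, so the whole string is 16 bytes instead of the 18 that UTF-16 needs.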

 

The most important thing about encoding is that text must be read with the same encoding it was written in. If the encodings don't match, characters can be garbled or lost.
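A small sketch of what a mismatch looks like, assuming UTF-8 bytes are wrongly read as Latin-1 and that Greek text is forced into ASCII. The choice of Latin-1 and ASCII is only for illustration; any mismatched pair can cause similar problems.

```python
text = "Γεια σας!"
data = text.encode("utf-8")

# Wrong decoder: every byte is turned into its own character, so the text is garbled.
garbled = data.decode("latin-1")

# Right decoder: the original text comes back unchanged.
restored = data.decode("utf-8")

# ASCII has no Greek letters, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Forcing the encode replaces every Greek letter with "?", which loses the data.
lossy = text.encode("ascii", errors="replace")

print(restored == text)
print(garbled == text)
print(ascii_ok)
print(lossy)
```

The garbled version can still be repaired by decoding the original bytes correctly, but the `lossy` bytes cannot, because the Greek letters are gone.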

 

 

 

 
