1. Introduction to Data
Data comes in many forms and from many sources. Much of it is numerical values generated by human actions or by systems that people control.
There are many types of data in the world: phone call history, a payment record at a convenience store, a saved file from an online game, your pulse during a workout, the YouTube video you watched last night, the marked location of a new restaurant, and so on. Text, video, records, and numbers read by sensors are all data.
These data are gathered into a database and analyzed to produce useful information: we find patterns, inspect the highest and lowest values, calculate averages, and look for relationships. Through these processes, we can learn specific information about a person, a group, a city, a nation, or even global trends.
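As a small illustration of the analysis steps above, here is a minimal Python sketch that finds the lowest, highest, and average value in a list of readings. The pulse values are hypothetical and invented purely for the example.

```python
from statistics import mean

# Hypothetical pulse readings (beats per minute) recorded during a workout.
pulse_readings = [92, 110, 128, 141, 135, 120]

lowest = min(pulse_readings)     # smallest reading
highest = max(pulse_readings)    # largest reading
average = mean(pulse_readings)   # arithmetic mean of all readings

print(f"lowest={lowest}, highest={highest}, average={average}")
```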
2. Data Formats
The most basic data format is "text". Here is an example of text:
| The HL Road Frame is our lightest and best quality aluminum frame made from the newest alloy; it is welded and heat-treated for strength. Our innovative design results in maximum comfort and performance. Its list price is $1431.50 |
This text contains information about the frame, such as its name and price. It is called "unstructured data": just a block of text that contains information. A person can understand the content, but a program cannot easily use it because it has no structure.
So one of the most common approaches is to structure the data into rows and fields such as name, description, and price, with the fields separated by commas (,). The previous example can be organized as shown below.
| Product, Description, Price |
| HL Road Frame, Our lightest..., 1431.50 |
| Sport Helmet, Lightweight vented..., 34.99 |
| ... |
Structured text comes in several formats, and different programs can read them.
The first is CSV*. Microsoft Excel can open a plain text file or a CSV file and convert it into an Excel spreadsheet.
* CSV (Comma-Separated Values) is a text file whose fields are separated by commas.
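Excel is not the only way to read such a file. As a rough sketch, Python's standard `csv` module can split the same rows into fields. The variable names here are illustrative, and the descriptions are kept in their shortened "..." form from the example.

```python
import csv
import io

# Sample rows mirroring the product example; descriptions stay abbreviated.
csv_text = """Product,Description,Price
HL Road Frame,Our lightest...,1431.50
Sport Helmet,Lightweight vented...,34.99
"""

# DictReader treats the first row as the field names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every CSV value is read as a string, so convert prices explicitly.
prices = [float(row["Price"]) for row in rows]

print(rows[0]["Product"], prices)
```

Note that CSV carries no type information, which is why the price has to be converted to a number by hand.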
The second format is XML (Extensible Markup Language). Here is an example.
| <?xml version="1.0" encoding="utf-8"?> <order invoicenumber="1234"> <item id="123" price="1.99" quantity="2"/> <item id="321" price="2.45" quantity="1"/> </order> |
XML can be opened in Visual Studio, which recognizes the elements of the XML code, so you can add, delete, and revise them. However, XML is used less often for data exchange these days because it is verbose: the many tags add a lot of text, which makes documents larger and more tedious to parse. A more compact format became a popular alternative.
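To show what parsing an XML document looks like in practice, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module on a copy of the order document. The total calculation is only an illustration of reading attributes and is not part of the original example.

```python
import xml.etree.ElementTree as ET

# The order document as bytes, so the encoding declaration is honored.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<order invoicenumber="1234">
  <item id="123" price="1.99" quantity="2"/>
  <item id="321" price="2.45" quantity="1"/>
</order>"""

root = ET.fromstring(xml_bytes)

# Attribute values are always strings in XML.
invoice_number = root.get("invoicenumber")
items = root.findall("item")

# Convert the attributes to numbers before doing arithmetic.
total = sum(float(i.get("price")) * int(i.get("quantity")) for i in items)

print(invoice_number, len(items), round(total, 2))
```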
The third is JSON (JavaScript Object Notation). Here is an example.
| {"invoicenumber":1234, "item": [ {"id":123, "price":1.99, "quantity":2}, {"id":321, "price":2.45, "quantity":1} ] } |
Compared to XML, there are fewer quotation marks ("): numeric values are written without quotes and are still read as numbers. In addition, the list of items is enclosed in square brackets, so it is easy to recognize. In Visual Studio, you can expand or collapse parts of the data.
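The point about numbers can be checked with a short sketch using Python's standard `json` module. The order data is the same as in the example; the variable names are illustrative.

```python
import json

json_text = """{"invoicenumber": 1234,
 "item": [
   {"id": 123, "price": 1.99, "quantity": 2},
   {"id": 321, "price": 2.45, "quantity": 1}
 ]}"""

order = json.loads(json_text)

# Unquoted values arrive as numbers, not strings.
invoice_type = type(order["invoicenumber"])

# The square brackets become a Python list of items.
item_count = len(order["item"])

print(invoice_type.__name__, item_count, order["item"][0]["price"])
```

Unlike CSV or XML attributes, no manual conversion from strings is needed here.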
3. Encoding
Encoding means converting characters and symbols into numeric codes that computers can store and read. In the beginning, people used ASCII (American Standard Code for Information Interchange), which contains the English alphabet, digits, and some symbols. However, as computers spread around the world, people needed to write the characters of their own languages, and ASCII's 128 codes were far too few to represent them all.
Unicode solved this. It assigns a number to the characters of virtually all the world's writing systems and adds symbols that ASCII did not have. In the end, though, every character must still be "encoded" as numbers.
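| Character | H | e | l | l | o | ! |
|---|---|---|---|---|---|---|
| Decimal | 72 | 101 | 108 | 108 | 111 | 33 |
| Hex | 48 | 65 | 6C | 6C | 6F | 21 |
| Binary | 1001000 | 1100101 | 1101100 | 1101100 | 1101111 | 0100001 |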
In ASCII, each English letter is encoded in 1 byte (8 bits), and so is the exclamation mark.
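The decimal, hexadecimal, and binary values for "Hello!" can be reproduced with a few lines of Python. This is only a sketch; the binary strings are padded to a full 8 bits to show the whole byte.

```python
text = "Hello!"

# Encode the text as ASCII: one byte per character.
data = text.encode("ascii")

decimal_values = list(data)
hex_values = [format(b, "02X") for b in data]
binary_values = [format(b, "08b") for b in data]  # padded to 8 bits

print(decimal_values)
print(hex_values)
print(binary_values)
```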
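| Character | Γ | ε | ι | α | (space) | σ | α | ς | ! |
|---|---|---|---|---|---|---|---|---|---|
| Decimal | 915 | 949 | 953 | 945 | 32 | 963 | 945 | 962 | 33 |
| Hex | 393 | 3B5 | 3B9 | 3B1 | 20 | 3C3 | 3B1 | 3C2 | 21 |
| Binary | 0000001110010011 | 0000001110110101 | 0000001110111001 | 0000001110110001 | 0000000000100000 | 0000001111000011 | 0000001110110001 | 0000001111000010 | 0000000000100001 |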
Other languages are different. (The example means "hello!" in Greek.) The Unicode numbers for Greek letters are too large to fit in 1 byte, so each character here is encoded in 2 bytes (16 bits). But the space ( ) and the exclamation mark (!) have values that fit in less than 1 byte, so their extra byte is wasted memory. Still, this is one way to encode text.
This is called "UTF-16", which encodes each character in at least 2 bytes (16 bits). Some characters even need 4 bytes.
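A short Python sketch can confirm the 2-byte and 4-byte cases. The Greek string below is the one spelled out by the code values in the table (915 949 953 945 32 963 945 962 33). The emoji is an added assumption, chosen only as a common example of a character that needs 4 bytes in UTF-16.

```python
text = "Γεια σας!"

# "utf-16-be" is big-endian UTF-16 without a byte order mark.
utf16 = text.encode("utf-16-be")

# Split the bytes into 2-byte code units and show them as hex.
units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]

# A character outside the basic range takes two code units (4 bytes).
emoji_bytes = "😀".encode("utf-16-be")

print(len(text), len(utf16), units)
print(len(emoji_bytes))
```

Here the 9 characters take 18 bytes, and the space is stored as `0020`, with one byte of zeros.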
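| Character | Γ | ε | ι | α | (space) | σ | α | ς | ! |
|---|---|---|---|---|---|---|---|---|---|
| Decimal | 206+147 | 206+181 | 206+185 | 206+177 | 32 | 207+131 | 206+177 | 207+130 | 33 |
| Hex | CE+93 | CE+B5 | CE+B9 | CE+B1 | 20 | CF+83 | CE+B1 | CF+82 | 21 |
| Binary | 11001110 10010011 | 11001110 10110101 | 11001110 10111001 | 11001110 10110001 | 00100000 | 11001111 10000011 | 11001110 10110001 | 11001111 10000010 | 00100001 |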
Here, each Greek letter takes a pair of bytes, but the space ( ) and the exclamation mark (!) use only 1 byte (8 bits) each, so no memory is wasted on them. This is called "UTF-8".
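The variable-length behavior can be seen by encoding each character separately. In this minimal Python sketch, the Greek string is the one that matches the byte values in the table (CE 93, CE B5, and so on), and the variable names are illustrative.

```python
text = "Γεια σας!"

utf8 = text.encode("utf-8")

# Number of bytes each character needs in UTF-8.
bytes_per_char = [(ch, len(ch.encode("utf-8"))) for ch in text]

# Compare the total size with UTF-16 (big-endian, no byte order mark).
utf16_size = len(text.encode("utf-16-be"))

print(bytes_per_char)
print(len(utf8), utf16_size)
```

For this text UTF-8 needs 16 bytes against 18 for UTF-16, because the space and the exclamation mark take one byte each.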
The most important thing about encoding is that text must be read with the same encoding that was used to write it. When the encodings do not match, the text can come out garbled, and you can lose data.
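The following Python sketch shows both failure modes under simple assumptions: bytes written as UTF-8 are read back with the wrong encoding (Latin-1 is chosen here only as an example), and Greek text is forced into ASCII, which cannot represent it.

```python
text = "Γεια σας!"
data = text.encode("utf-8")

# Reading with the right encoding restores the text.
correct = data.decode("utf-8")

# Reading with the wrong encoding raises no error but gives garbled text.
garbled = data.decode("latin-1")

# Forcing the text into ASCII replaces every Greek letter with "?",
# so the original characters are lost for good.
lossy = text.encode("ascii", errors="replace")

print(correct == text)
print(garbled != text)
print(lossy)
```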