DataFrames
Aggregating Data
Slicing and Indexing Data
Creating and Visualizing Data

I often use Pandas when I get new data to analyze. There are still parts I can't quite remember from time to time, so I wrote this post to review them and keep everything in one place.

The dataset is the Titanic data from Kaggle.
Since this post only covers basic Pandas usage, I loaded train.csv into a DataFrame named df.
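Loading the file can be sketched roughly as below. This is a minimal sketch, not the exact cell I ran: it assumes the real file would be read with pd.read_csv("train.csv"), and to stay runnable without the file it reads three stand-in rows (in the same column layout) from an in-memory string instead.

```python
import io

import pandas as pd

# Three stand-in rows in the same column layout as Kaggle's train.csv.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
"""

# With the real file this would be: df = pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Empty fields (here, two Cabin entries) are read as NaN.
print(df.shape)
print(df["Cabin"].isna().sum())
```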

1. DataFrames

pandas is built on top of NumPy, and its plotting methods rely on Matplotlib. NumPy provides the multidimensional array object, and Matplotlib handles visualization.

df.head()
Shows the first few rows so you can see how the data is structured.

df.head()

df.info()
Shows the data type of each column and whether it has missing values.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
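In the output above, the missing values have to be read off by subtracting the non-null count from 891 (for example, 891 - 714 = 177 for Age). A small sketch of getting the counts directly with isna().sum(); df_small is a made-up miniature frame, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps in Age and Cabin, like the Titanic data.
df_small = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# isna() marks every missing cell as True; sum() counts them per column.
missing = df_small.isna().sum()
print(missing)
```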

df.shape
Returns the shape of the data as a (rows, columns) tuple. The Titanic training set has 891 rows and 12 columns.

df.shape

(891, 12)

df.describe()
Shows summary statistics for the numeric columns of the dataset.

df.describe()

df.values
Returns the values as a 2-D array.

df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
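The dtype=object in the output above comes from mixing numeric and text columns in one array. A small sketch of that behavior on a made-up two-row frame, using DataFrame.to_numpy(), which the pandas documentation recommends over .values:

```python
import pandas as pd

# Hypothetical two-row frame with an int, a float, and a text column.
df_small = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare": [7.25, 71.2833],
    "Embarked": ["S", "C"],
})

# Mixed column types can only share the generic object dtype.
mixed = df_small.to_numpy()

# Selecting just the numeric columns first gives a float64 array.
numeric = df_small[["Pclass", "Fare"]].to_numpy()

print(mixed.dtype, numeric.dtype)
```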

df.columns
Returns the column names.

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

df.index
Contains the row numbers or row labels.

df.index

RangeIndex(start=0, stop=891, step=1)

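df.sort_values(column_name, ascending=True or False)
Sorts by the given column in ascending order (the default).
For descending order, pass ascending=False. To sort by several columns, pass a list of column names, optionally with a matching list for ascending.

df.sort_values("Age")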

df.sort_values(["Pclass", 'Age'])

df.sort_values(["Pclass", 'Age'], ascending=[True, False])
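One thing that is easy to miss in these sorts is where the rows with a missing Age end up. A minimal sketch on a made-up four-row frame: by default the NaN rows go last in either direction, and the na_position parameter of sort_values moves them to the top.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; row 0 has a missing Age.
df_small = pd.DataFrame({
    "Pclass": [3, 1, 1, 3],
    "Age": [np.nan, 38.0, 80.0, 22.0],
})

# Class ascending, then age descending within each class.
by_class = df_small.sort_values(["Pclass", "Age"], ascending=[True, False])

# na_position="first" puts the missing ages at the top instead of the bottom.
nan_first = df_small.sort_values("Age", na_position="first")

print(by_class)
print(nan_first)
```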

df['column_name']
To select two or more columns, you need two pairs of square brackets. The outer pair subsets the DataFrame, and the inner pair holds the list of column names to select.

df['Embarked']

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

df[['Pclass', 'Embarked']]
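The difference between one and two pairs of brackets also shows up in the type that comes back. A quick sketch on a made-up frame (df_small is only for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "C"]})

one = df_small["Embarked"]              # single brackets -> Series
one_as_frame = df_small[["Embarked"]]   # list with one name -> DataFrame
two = df_small[["Pclass", "Embarked"]]  # list of names -> DataFrame

print(type(one).__name__, type(one_as_frame).__name__, type(two).__name__)
```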

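Comparison operators >, <, ==
You can also put a condition inside the square brackets to select a subset of rows, for example the passengers older than 18. This works with text and date data too, not only numbers.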

df['Age'] > 30

0      False
1       True
2      False
3       True
4       True
       ...  
886    False
887    False
888    False
889    False
890     True
Name: Age, Length: 891, dtype: bool

df[df['Age'] > 18]

df[df['Embarked'] == 'S']

is_male = df['Sex'] == 'male'
is_adult = df['Age'] > 19
# Rows where Sex is male and Age is over 19
df[is_male & is_adult][:10]

df[(df['Sex'] == 'male') & (df['Age'] > 19)][:10]
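Two details of the combined filter are worth spelling out: each comparison needs its own parentheses, and rows with a missing Age simply fail the condition. A small sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; row 2 has a missing Age.
df_small = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [22.0, 38.0, np.nan, 15.0],
})

# The parentheses are required because & binds more tightly than == and >.
adult_men = df_small[(df_small["Sex"] == "male") & (df_small["Age"] > 19)]

# NaN > 19 evaluates to False, so row 2 is filtered out rather than raising an error.
print(adult_men)
```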

df['column_name'].isin()
Used to filter a categorical column on several values at once.
As before, wrap the values in square brackets (a list).

is_C_or_S = df['Embarked'].isin(['C', 'S'])
df[is_C_or_S].head()
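The same mask can be flipped with ~ to get the rows that are not in the list. A minimal sketch on a made-up column:

```python
import pandas as pd

df_small = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

is_c_or_s = df_small["Embarked"].isin(["C", "S"])

in_c_or_s = df_small[is_c_or_s]     # rows where Embarked is C or S
not_c_or_s = df_small[~is_c_or_s]   # ~ inverts the mask: every other row

print(in_c_or_s)
print(not_c_or_s)
```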

Adding a column
df['column_name'] = ___
For example, let's add the SibSp and Parch features together, plus 1 for the passenger, to create a variable called Family_member.

df['Family_member'] = df['SibSp'] + df['Parch'] + 1
df.head()
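The bracket assignment above changes df in place. If you would rather keep the original untouched, DataFrame.assign() returns a new DataFrame with the extra column. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical SibSp / Parch values.
df_small = pd.DataFrame({"SibSp": [1, 0, 8], "Parch": [0, 0, 2]})

# assign() returns a copy with the new column; df_small itself is not modified.
with_family = df_small.assign(
    Family_member=df_small["SibSp"] + df_small["Parch"] + 1
)

print(with_family)
```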

Output of df.head():
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of df.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Output of df.sort_values("Age"):
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
803	804	1	3	Thomas, Master. Assad Alexander	male	0.42	0	1	2625	8.5167	NaN	C
755	756	1	2	Hamalainen, Master. Viljo	male	0.67	1	1	250649	14.5000	NaN	S
644	645	1	3	Baclini, Miss. Eugenie	female	0.75	2	1	2666	19.2583	NaN	C
469	470	1	3	Baclini, Miss. Helene Barbara	female	0.75	2	1	2666	19.2583	NaN	C
78	79	1	2	Caldwell, Master. Alden Gates	male	0.83	0	2	248738	29.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S

Output of df.sort_values(["Pclass", 'Age']):
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
305	306	1	1	Allison, Master. Hudson Trevor	male	0.92	1	2	113781	151.5500	C22 C26	S
297	298	0	1	Allison, Miss. Helen Loraine	female	2.00	1	2	113781	151.5500	C22 C26	S
445	446	1	1	Dodge, Master. Washington	male	4.00	0	2	33638	81.8583	A34	S
802	803	1	1	Carter, Master. William Thornton II	male	11.00	1	2	113760	120.0000	B96 B98	S
435	436	1	1	Carter, Miss. Lucile Polk	female	14.00	1	2	113760	120.0000	B96 B98	S
...	...	...	...	...	...	...	...	...	...	...	...	...
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S

Output of df.sort_values(["Pclass", 'Age'], ascending=[True, False]):
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
630	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	0	27042	30.0000	A23	S
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	0	0	PC 17754	34.6542	A5	C
493	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	0	0	PC 17609	49.5042	NaN	C
745	746	0	1	Crosby, Capt. Edward Gifford	male	70.0	1	1	WE/P 5735	71.0000	B22	S
54	55	0	1	Ostby, Mr. Engelhart Cornelius	male	65.0	0	1	113509	61.9792	B30	C
...	...	...	...	...	...	...	...	...	...	...	...	...
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S