Data Transformation
openclean provides its own set of operations to transform datasets whilst keeping row indices intact.
We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.
import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset
ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))
ds.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Selecting
One can select columns from a dataset using the select operation:
from openclean.operator.transform.select import select
selected = select(ds, columns=['Job #', 'GIS_NTA_NAME'], names=['job_id','neighborhood'])
selected.head()
job_id | neighborhood | |
---|---|---|
0 | 140915936 | Upper East Side-Carnegie Hill |
1 | 340737929 | Bensonhurst West |
2 | 440601733 | Kew Gardens |
3 | 340737901 | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | University Heights-Morris Heights |
Inserting
To insert a new column into a dataset, use the inscol operation. We use a Const eval function to define values for the new ‘City’ column.
from openclean.operator.transform.insert import inscol
from openclean.function.eval.base import Const
new_col = inscol(ds, names=['City'], pos=4, values=Const('New York'))
new_col.head()
Job # | Borough | Street Name | GIS_NTA_NAME | City | |
---|---|---|---|---|---|
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill | New York |
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West | New York |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens | New York |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook | New York |
4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights | New York |
To insert a new row, use the insrow operation. Let’s add a dummy row at the start with all zeros.
from openclean.operator.transform.insert import insrow
new_row = insrow(ds, pos=0, values=[0,0,0,0])
new_row.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
-1 | 0 | 0 | 0 | 0 |
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
Updating
Updating a preexisting column is straightforward. The update operator takes the column name and a func argument which can be a callable or an Eval function. The following snippet updates the ‘Borough’ column to Title case. The func can be a dictionary, a scalar, or a function.
from openclean.operator.transform.update import update
title_case = update(ds, columns='Borough', func=str.title)
title_case.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
0 | 140915936 | Manhattan | EAST 93RD STREET | Upper East Side-Carnegie Hill |
1 | 340737929 | Brooklyn | 16TH AVE | Bensonhurst West |
2 | 440601733 | Queens | AUSTIN STREET | Kew Gardens |
3 | 340737901 | Brooklyn | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | Bronx | CEDAR AVENUE | University Heights-Morris Heights |
Filtering
openclean filters records from a dataset using the filter operation, which requires a predicate. The predicate is a list or dataframe of Booleans. Here, we use the Col eval function to create the predicate that translates to; show all rows that have the value ‘BROOKLYN’ in the ‘Borough’ column.
from openclean.operator.transform.filter import filter
from openclean.function.eval.base import Col
filtered = filter(ds, predicate=Col('Borough')=='BROOKLYN')
filtered.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
8 | 322050748 | BROOKLYN | WILLIAMS AVENUE | Canarsie |
12 | 340737938 | BROOKLYN | SHORE ROAD | Bay Ridge |
21 | 310123381 | BROOKLYN | JEFFERSON AVENUE | Bedford |
Moving
Changing the column order is efficiently straight forward too. Let’s move Job # to a different position.
from openclean.operator.transform.move import movecols
moved_col = movecols(ds, 'Job #', 2)
moved_col.head()
Borough | Street Name | Job # | GIS_NTA_NAME | |
---|---|---|---|---|
0 | MANHATTAN | EAST 93RD STREET | 140915936 | Upper East Side-Carnegie Hill |
1 | BROOKLYN | 16TH AVE | 340737929 | Bensonhurst West |
2 | QUEENS | AUSTIN STREET | 440601733 | Kew Gardens |
3 | BROOKLYN | 3RD STREET | 340737901 | Carroll Gardens-Columbia Street-Red Hook |
4 | BRONX | CEDAR AVENUE | 240275910 | University Heights-Morris Heights |
To move an existing row to a different position, use the moverows operation. Here is an example:
from openclean.operator.transform.move import move_rows
moved_row = move_rows(ds, 0, 2)
moved_row.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
1 | 340737929 | BROOKLYN | 16TH AVE | Bensonhurst West |
2 | 440601733 | QUEENS | AUSTIN STREET | Kew Gardens |
0 | 140915936 | MANHATTAN | EAST 93RD STREET | Upper East Side-Carnegie Hill |
3 | 340737901 | BROOKLYN | 3RD STREET | Carroll Gardens-Columbia Street-Red Hook |
4 | 240275910 | BRONX | CEDAR AVENUE | University Heights-Morris Heights |
Sorting
To sort values in a column, openclean provides a sort operation. Let’s try to sort the dataset in descending Job #s.
from openclean.operator.transform.sort import order_by
sorted = order_by(ds, columns='Job #', reversed=True)
sorted.head()
Job # | Borough | Street Name | GIS_NTA_NAME | |
---|---|---|---|---|
8563 | 540182354 | STATEN ISLAND | CLOVE ROAD | West New Brighton-New Brighton-St. George |
7176 | 540182345 | STATEN ISLAND | ARDEN AVENUE | Arden Heights |
9986 | 540178555 | STATEN ISLAND | LINWOOD AVENUE | Old Town-Dongan Hills-South Beach |
9381 | 540178537 | STATEN ISLAND | HASBROUCK ROAD | Todt Hill-Emerson Hill-Heartland Village-Light... |
9348 | 540178494 | STATEN ISLAND | VICTORY BOULEVARD | New Springville-Bloomfield-Travis |