Data Transformation

openclean provides its own set of operations to transform datasets whilst keeping row indices intact.

We use a sample of NYC open data with completed job codes at various locations in New York City to demonstrate some examples.

import os
path_to_file = os.path.join(os.getcwd(), 'source', 'data')
from openclean.data.load import dataset

ds = dataset(os.path.join(path_to_file, 'job_locations.csv'))

ds.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Selecting

One can select columns from a dataset using the select operation:

from openclean.operator.transform.select import select

selected = select(ds, columns=['Job #', 'GIS_NTA_NAME'], names=['job_id','neighborhood'])

selected.head()
job_id neighborhood
0 140915936 Upper East Side-Carnegie Hill
1 340737929 Bensonhurst West
2 440601733 Kew Gardens
3 340737901 Carroll Gardens-Columbia Street-Red Hook
4 240275910 University Heights-Morris Heights

Inserting

To insert a new column into a dataset, use the inscol operation. We use a Const eval function to define values for the new ‘City’ column.

from openclean.operator.transform.insert import inscol
from openclean.function.eval.base import Const

new_col = inscol(ds, names=['City'], pos=4, values=Const('New York'))

new_col.head()
Job # Borough Street Name GIS_NTA_NAME City
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill New York
1 340737929 BROOKLYN 16TH AVE Bensonhurst West New York
2 440601733 QUEENS AUSTIN STREET Kew Gardens New York
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook New York
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights New York

To insert a new row, use the insrow operation. Let’s add a dummy row at the start with all zeros.

from openclean.operator.transform.insert import insrow

new_row = insrow(ds, pos=0, values=[0,0,0,0])

new_row.head()
Job # Borough Street Name GIS_NTA_NAME
-1 0 0 0 0
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook

Updating

Updating a preexisting column is straightforward. The update operator takes the column name and a func argument which can be a callable or an Eval function. The following snippet updates the ‘Borough’ column to Title case. The func can be a dictionary, a scalar, or a function.

from openclean.operator.transform.update import update

title_case = update(ds, columns='Borough', func=str.title)

title_case.head()
Job # Borough Street Name GIS_NTA_NAME
0 140915936 Manhattan EAST 93RD STREET Upper East Side-Carnegie Hill
1 340737929 Brooklyn 16TH AVE Bensonhurst West
2 440601733 Queens AUSTIN STREET Kew Gardens
3 340737901 Brooklyn 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 Bronx CEDAR AVENUE University Heights-Morris Heights

Filtering

openclean filters records from a dataset using the filter operation, which requires a predicate. The predicate is a list or dataframe of Booleans. Here, we use the Col eval function to create the predicate that translates to; show all rows that have the value ‘BROOKLYN’ in the ‘Borough’ column.

from openclean.operator.transform.filter import filter
from openclean.function.eval.base import Col

filtered = filter(ds, predicate=Col('Borough')=='BROOKLYN')

filtered.head()
Job # Borough Street Name GIS_NTA_NAME
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
8 322050748 BROOKLYN WILLIAMS AVENUE Canarsie
12 340737938 BROOKLYN SHORE ROAD Bay Ridge
21 310123381 BROOKLYN JEFFERSON AVENUE Bedford

Moving

Changing the column order is efficiently straight forward too. Let’s move Job # to a different position.

from openclean.operator.transform.move import movecols

moved_col = movecols(ds, 'Job #', 2)

moved_col.head()
Borough Street Name Job # GIS_NTA_NAME
0 MANHATTAN EAST 93RD STREET 140915936 Upper East Side-Carnegie Hill
1 BROOKLYN 16TH AVE 340737929 Bensonhurst West
2 QUEENS AUSTIN STREET 440601733 Kew Gardens
3 BROOKLYN 3RD STREET 340737901 Carroll Gardens-Columbia Street-Red Hook
4 BRONX CEDAR AVENUE 240275910 University Heights-Morris Heights

To move an existing row to a different position, use the moverows operation. Here is an example:

from openclean.operator.transform.move import move_rows

moved_row = move_rows(ds, 0, 2)

moved_row.head()
Job # Borough Street Name GIS_NTA_NAME
1 340737929 BROOKLYN 16TH AVE Bensonhurst West
2 440601733 QUEENS AUSTIN STREET Kew Gardens
0 140915936 MANHATTAN EAST 93RD STREET Upper East Side-Carnegie Hill
3 340737901 BROOKLYN 3RD STREET Carroll Gardens-Columbia Street-Red Hook
4 240275910 BRONX CEDAR AVENUE University Heights-Morris Heights

Sorting

To sort values in a column, openclean provides a sort operation. Let’s try to sort the dataset in descending Job #s.

from openclean.operator.transform.sort import order_by

sorted = order_by(ds, columns='Job #', reversed=True)

sorted.head()
Job # Borough Street Name GIS_NTA_NAME
8563 540182354 STATEN ISLAND CLOVE ROAD West New Brighton-New Brighton-St. George
7176 540182345 STATEN ISLAND ARDEN AVENUE Arden Heights
9986 540178555 STATEN ISLAND LINWOOD AVENUE Old Town-Dongan Hills-South Beach
9381 540178537 STATEN ISLAND HASBROUCK ROAD Todt Hill-Emerson Hill-Heartland Village-Light...
9348 540178494 STATEN ISLAND VICTORY BOULEVARD New Springville-Bloomfield-Travis