I transform data every day, and I usually do two kinds of transformations: changing the data format so I can use it in a tool (CSV to Parquet), or changing the shape, like running an aggregation so I can understand it. I'm using LLMs more and more for this because it saves me a lot of time (and it's cool).
I see two approaches for using LLMs to transform data:
Asking the model to transform the data and output it:
cat file | llm --system "transform the data into <format>"
Asking the model to generate code that transforms the data:
cat file | llm --system "generate a python code to transform this data into <format>" > transform.py && python transform.py file > transformed
If you don't know what the llm command is, go check out the fantastic llm CLI tool.
The second one has many benefits:
The code will run way faster; LLMs are still slow compared to running code on a regular CPU
The transformation can be audited and fixed
Let's test it. I have a file with NMEA records from a GPS. NMEA, according to Wikipedia, "is a combined electrical and data specification for communication between marine electronics such as echo sounder, sonars, anemometer, gyrocompass, autopilot, GPS receivers and many other types of instruments". If NMEA were invented today it would probably be NDJSON, but at that time machines were sending data through a 9600 baud comm line, so they needed to optimize. Parsing is also super easy (they probably couldn't afford to spend a lot of code on parsing), but let's get back to the transformation thing.
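To give an idea of how little there is to parse, here is a minimal sketch in Python. It assumes the usual NMEA 0183 framing (`$`, comma-separated fields, `*`, then a two-digit hex checksum that is the XOR of every character between `$` and `*`). The function names are mine, and the checksum is computed on the fly rather than taken from my file.

```python
from functools import reduce


def nmea_checksum(payload: str) -> str:
    """XOR every character between '$' and '*' and return two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), payload, 0):02X}"


def split_sentence(sentence: str) -> list[str]:
    """Validate the checksum of one NMEA sentence and return its comma-separated fields."""
    body, _, received = sentence.strip().lstrip("$").partition("*")
    if received.upper() != nmea_checksum(body):
        raise ValueError(f"bad checksum: expected {nmea_checksum(body)}, got {received!r}")
    return body.split(",")


# Build a sentence with a freshly computed checksum, then parse it back.
payload = "GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,"
sentence = f"${payload}*{nmea_checksum(payload)}"
fields = split_sentence(sentence)
print(fields[0], fields[3], fields[4])  # sentence type, latitude field, hemisphere
```

Everything after that is just picking fields by position, which is why a regex or a `split(",")` is usually enough.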
I have some data from my car's GPS (which still sends its information as NMEA these days) in a file. I grep the GPRMC sentences (the ones that carry the coordinates) and pipe them into the llm command (using Gemini 2.0 with code execution). This is the command (I shortened it for clarity):
$ head -n 40 cheste_fixed.nmea | grep RMC | llm -m gemini-2.0-flash-exp -o code_execution 1 --system "write and EXECUTE a program to parse that data and generate ndjson with lat, lon, time and speed attributes. No explanations, just write and execute"
```python
import json
import re
from datetime import datetime, timedelta

data = """
$GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,*22
$GPRMC,172634.00,A,3929.0331,N,00037.8280,W,84.0,55.5,160923,,*24
$GPRMC,172635.00,A,3929.0495,N,00037.7985,W,93.5,54.8,160923,,*22
...
"""

def parse_nmea(line):
    match = re.match(r'\$GPRMC,([\d.]+),A,([\d.]+),([NS]),([\d.]+),([EW]),([\d.]+),([\d.]+),(\d+),,', line)
    if match:
        time_str, lat_str, lat_dir, lon_str, lon_dir, speed_str, _, date_str = match.groups()
        time_obj = datetime.strptime(date_str+ time_str.split(".")[0], "%d%m%y%H%M%S")
        lat = float(lat_str[:2]) + float(lat_str[2:]) / 60.0
        if lat_dir == 'S':
            lat = -lat
        lon = float(lon_str[:3]) + float(lon_str[3:]) / 60.0
        if lon_dir == 'W':
            lon = -lon
        speed = float(speed_str) * 1.852
        return {
            "lat": lat,
            "lon": lon,
            "time": time_obj.isoformat(),
            "speed": speed
        }
    return None

parsed_data = []
for line in data.strip().splitlines():
    parsed_line = parse_nmea(line)
    if parsed_line:
        parsed_data.append(parsed_line)

for item in parsed_data:
    print(json.dumps(item))
```
```
{"lat": 39.48364, "lon": -0.630955, "time": "2023-09-16T17:26:33", "speed": 149.4564}
{"lat": 39.483885, "lon": -0.6304666666666667, "time": "2023-09-16T17:26:34", "speed": 155.568}
{"lat": 39.48415833333333, "lon": -0.629975, "time": "2023-09-16T17:26:35", "speed": 173.162}
{"lat": 39.48441833333333, "lon": -0.6294983333333334, "time": "2023-09-16T17:26:36", "speed": 177.9772}
{"lat": 39.48466666666667, "lon": -0.6290183333333333, "time": "2023-09-16T17:26:37", "speed": 177.9772}
{"lat": 39.48491666666666, "lon": -0.628515, "time": "2023-09-16T17:26:38", "speed": 179.27360000000002}
```
It looks like it did the right transformation (indeed, checking the data, it's accurate). In case you are checking the data carefully: the speed attribute looks too high, but it's a car on a race track, so it's expected.
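If you want to check the first record by hand, this is roughly the arithmetic involved. It is a small sketch, assuming the standard RMC layout: latitude as ddmm.mmmm, longitude as dddmm.mmmm, and speed over ground in knots. The helper name `to_decimal_degrees` is mine.

```python
# First sentence from the sample: latitude, longitude and speed fields.
lat_field, lat_dir = "3929.0184", "N"
lon_field, lon_dir = "00037.8573", "W"
speed_knots = 80.7


def to_decimal_degrees(field: str, hemisphere: str, degree_digits: int) -> float:
    """Convert NMEA ddmm.mmmm / dddmm.mmmm to signed decimal degrees."""
    degrees = float(field[:degree_digits])
    minutes = float(field[degree_digits:])
    value = degrees + minutes / 60.0
    return -value if hemisphere in ("S", "W") else value


lat = to_decimal_degrees(lat_field, lat_dir, 2)
lon = to_decimal_degrees(lon_field, lon_dir, 3)
speed_kmh = speed_knots * 1.852  # 1 knot = 1.852 km/h

print(round(lat, 6), round(lon, 6), round(speed_kmh, 4))
```

The three values line up with the first NDJSON line above: the generated script converted knots to km/h on its own, which is where the 149 comes from.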
But how can we make sure it's doing it right? I wouldn't trust the transformed data right away, but I can use what we have been using in software development for years: tests. So let's ask the LLM to generate not just the transform, but also a test that runs the transformation backwards:
$ head -n 10 ~/Downloads/lap_timer/cheste_fixed.nmea | grep RMC | llm -m gemini-2.0-flash-exp -o code_execution 1 --system "write and EXECUTE a program to parse that data and generate ndjson with lat, lon, time and speed attributes. Also write the code that transforms the output back to original format and check the transformation back. No explanations, just write and execute"
```python
import json
import pynmea2
from datetime import datetime, timedelta

nmea_data = """$GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,*22
$GPRMC,172634.00,A,3929.0331,N,00037.8280,W,84.0,55.5,160923,,*24
$GPRMC,172635.00,A,3929.0495,N,00037.7985,W,93.5,54.8,160923,,*22
$GPRMC,172636.00,A,3929.0651,N,00037.7699,W,96.1,54.8,160923,,*28
$GPRMC,172637.00,A,3929.0800,N,00037.7411,W,96.1,55.5,160923,,*2D"""

def parse_nmea_to_ndjson(nmea_str):
    ndjson_output = []
    for line in nmea_str.strip().splitlines():
        try:
            msg = pynmea2.parse(line)
            if isinstance(msg, pynmea2.RMC):
                time_obj = datetime.combine(msg.datestamp, msg.timestamp)
                ndjson_output.append(json.dumps({
                    "lat": msg.latitude,
                    "lon": msg.longitude,
                    "time": time_obj.isoformat(),
                    "speed": msg.spd_over_grnd
                }))
        except pynmea2.ParseError:
            print(f"Could not parse: {line}")
            continue
    return "\n".join(ndjson_output)

def transform_ndjson_to_nmea(ndjson_str):
    nmea_output = []
    for line in ndjson_str.strip().splitlines():
        try:
            data = json.loads(line)
            time_obj = datetime.fromisoformat(data['time'])
            nmea_time = time_obj.strftime("%H%M%S.%f")[:-3]
            nmea_date = time_obj.strftime("%d%m%y")
            lat_deg = int(abs(data['lat']))
            lat_min = (abs(data['lat']) - lat_deg) * 60
            lat_dir = "N" if data['lat'] >= 0 else "S"
            lon_deg = int(abs(data['lon']))
            lon_min = (abs(data['lon']) - lon_deg) * 60
            lon_dir = "E" if data['lon'] >= 0 else "W"
            nmea_speed = data['speed']
            nmea_str = f"$GPRMC,{nmea_time},A,{lat_deg}{lat_min:.4f},{lat_dir},{lon_deg}{lon_min:.4f},{lon_dir},{nmea_speed:.1f},{nmea_speed:.1f},{nmea_date},,"
            checksum = 0
            for char in nmea_str[1:]:
                checksum ^= ord(char)
            nmea_str += f"*{hex(checksum)[2:].upper()}"
            nmea_output.append(nmea_str)
        except json.JSONDecodeError:
            print(f"Could not decode json: {line}")
            continue
    return "\n".join(nmea_output)

ndjson_data = parse_nmea_to_ndjson(nmea_data)
print("NDJSON output:")
print(ndjson_data)
nmea_transformed = transform_ndjson_to_nmea(ndjson_data)
print("\nTransformed back to NMEA:")
print(nmea_transformed)
print("\nOriginal NMEA:")
print(nmea_data)

def check_transformation(original, transformed):
    original_lines = original.strip().splitlines()
    transformed_lines = transformed.strip().splitlines()
    if len(original_lines) != len(transformed_lines):
        print("Number of lines do not match.")
        return False
    for orig_line, trans_line in zip(original_lines, transformed_lines):
        try:
            orig_msg = pynmea2.parse(orig_line)
            trans_msg = pynmea2.parse(trans_line)
            if not isinstance(orig_msg, pynmea2.RMC) or not isinstance(trans_msg, pynmea2.RMC):
                print("Message type is incorrect.")
                return False
            if not (abs(orig_msg.latitude - trans_msg.latitude) < 0.00001 and
                    abs(orig_msg.longitude - trans_msg.longitude) < 0.00001 and
                    abs(orig_msg.spd_over_grnd - trans_msg.spd_over_grnd) < 0.0001 and
                    orig_msg.timestamp == trans_msg.timestamp and
                    orig_msg.datestamp == trans_msg.datestamp
                    ):
                print(f"Original: {orig_msg}")
                print(f"Transformed:{trans_msg}")
                return False
        except pynmea2.ParseError:
            print(f"Could not parse: {orig_line} or {trans_line}")
            return False
    print("Transformation check passed!")
    return True

check_transformation(nmea_data, nmea_transformed)
```
```
ExecuteCode failed: MODULE_NOT_FOUND.
msg_err: Traceback (most recent call last):
  File "<string>", line 3, in <module>
ModuleNotFoundError: No module named 'pynmea2'
```
It fails to run because of the pynmea2 dependency, but if you run it locally it works. Running that self-test gives me *some* confidence in the transformation function, and I'd trust it enough to put it in a pull request.
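For reference, the same round-trip idea can be sketched with only the standard library, so it would also run in a sandbox without pynmea2. This is my own illustration, not what the model produced: all function names are hypothetical, the checksum of the input is not verified, and since the record only carries lat, lon, time and speed, the course field cannot be restored. That is why the check compares parsed values (parse, emit, parse again) instead of raw strings.

```python
from datetime import datetime

KNOT_TO_KMH = 1.852


def checksum(payload: str) -> str:
    """XOR of all characters between '$' and '*', as two uppercase hex digits."""
    value = 0
    for ch in payload:
        value ^= ord(ch)
    return f"{value:02X}"


def to_degrees(field: str, hemisphere: str, degree_digits: int) -> float:
    """(d)ddmm.mmmm plus hemisphere letter -> signed decimal degrees."""
    value = float(field[:degree_digits]) + float(field[degree_digits:]) / 60.0
    return -value if hemisphere in ("S", "W") else value


def to_field(value: float, degree_digits: int) -> str:
    """Signed decimal degrees -> zero-padded (d)ddmm.mmmm (sign handled by the caller)."""
    degrees = int(abs(value))
    minutes = (abs(value) - degrees) * 60.0
    return f"{degrees:0{degree_digits}d}{minutes:07.4f}"


def parse_rmc(sentence: str) -> dict:
    """Parse one GPRMC sentence into the lat/lon/time/speed record (speed in km/h)."""
    f = sentence.strip().lstrip("$").split("*")[0].split(",")
    when = datetime.strptime(f[9] + f[1].split(".")[0], "%d%m%y%H%M%S")
    return {
        "lat": to_degrees(f[3], f[4], 2),
        "lon": to_degrees(f[5], f[6], 3),
        "time": when.isoformat(),
        "speed": float(f[7]) * KNOT_TO_KMH,
    }


def emit_rmc(record: dict) -> str:
    """Rebuild a GPRMC sentence from a record; the course field is left empty."""
    when = datetime.fromisoformat(record["time"])
    payload = ",".join([
        "GPRMC",
        when.strftime("%H%M%S.00"),
        "A",
        to_field(record["lat"], 2), "N" if record["lat"] >= 0 else "S",
        to_field(record["lon"], 3), "E" if record["lon"] >= 0 else "W",
        f"{record['speed'] / KNOT_TO_KMH:.1f}",
        "",  # course is not part of the record, so it cannot be restored
        when.strftime("%d%m%y"),
        "", "",
    ])
    return f"${payload}*{checksum(payload)}"


def same_record(a: dict, b: dict) -> bool:
    """Compare two records with tolerances matching the NMEA field precision."""
    return (a["time"] == b["time"]
            and abs(a["lat"] - b["lat"]) < 1e-5
            and abs(a["lon"] - b["lon"]) < 1e-5
            and abs(a["speed"] - b["speed"]) < 0.1)


samples = [
    "$GPRMC,172633.00,A,3929.0184,N,00037.8573,W,80.7,54.1,160923,,*22",
    "$GPRMC,172634.00,A,3929.0331,N,00037.8280,W,84.0,55.5,160923,,*24",
]
for original in samples:
    record = parse_rmc(original)
    rebuilt = emit_rmc(record)
    assert same_record(record, parse_rmc(rebuilt)), (original, rebuilt)
    print(rebuilt)
```

A passing round trip does not prove the parser is right (both directions could share the same mistake), but it catches the common ones, like a dropped leading zero or a swapped field.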
BTW, the first approach also works: at least Gemini 2.0 manages to transform NMEA to JSON directly, but it takes way more time than generating the script (as expected).


