[Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly#192
[Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly#192jinroq wants to merge 7 commits intoruby:masterfrom
Conversation
This avoids intermediate String allocation when only the integer value of a captured group is needed (e.g. date parsing). The method converts the matched byte range to an Integer at the C level using rb_cstr2inum, which is semantically equivalent to `scanner[n].to_i` but without allocating a temporary String object.
kou
left a comment
There was a problem hiding this comment.
Can we use better name than get_int?
We already have #get_byte that returns the next byte not the data from the matched group.
ext/strscan/strscan.c
Outdated
| ptr = S_PBEG(p) + beg; | ||
|
|
||
| { | ||
| char *buffer = RB_ALLOCV_N(char, buffer_v, len + 1); |
There was a problem hiding this comment.
Thank you for your feedback!
The original implementation did allocate a C string via RB_ALLOCV_N, which undermines the purpose of avoiding intermediate allocations. While RB_ALLOCV_N uses alloca (stack allocation) for small sizes and is much cheaper than a Ruby String, it is still an unnecessary allocation for the common case.
Updated in 4600b0a. For values up to 18 digits on 64-bit (9 on 32-bit), digits are now parsed directly from the source string's byte buffer with no allocation at all — neither Ruby String nor C string. The RB_ALLOCV_N + rb_cstr2inum path is only used as a fallback for bignum-range values (more than 18 digits).
In the primary use case of date component parsing (1-4 digit values), this means zero allocation of any kind.
There was a problem hiding this comment.
Sorry, forgotten that is declared just in an internal header.
Maybe we should move them to a public header?
There was a problem hiding this comment.
If we need to use rb_int_parse_cstr() in out side of Ruby (strscan), we should make it public.
Parse digits directly from the source string's byte buffer for values that fit in a long (up to 18 digits on 64-bit, 9 on 32-bit), avoiding the RB_ALLOCV_N + rb_cstr2inum overhead entirely. Fall back to rb_cstr2inum with a temporary buffer only for bignum-range values. This eliminates all allocation (both Ruby String and C buffer) in the primary use case of date component parsing (1-4 digit values).
Here are some alternatives to
I recommend |
kou
left a comment
There was a problem hiding this comment.
OK. Let's use integer_at(index).
Alternative idea: scanner[index, :integer]
ext/strscan/strscan.c
Outdated
| { | ||
| long j = 0; | ||
| int negative = 0; | ||
| long digit_count; | ||
|
|
||
| if (ptr[0] == '-') { negative = 1; j = 1; } | ||
| else if (ptr[0] == '+') { j = 1; } | ||
|
|
||
| digit_count = len - j; | ||
|
|
||
| if (digit_count <= GET_INT_MAX_DIGITS) { | ||
| long result = 0; | ||
| for (; j < len; j++) { | ||
| result = result * 10 + (ptr[j] - '0'); | ||
| } | ||
| return LONG2NUM(negative ? -result : result); | ||
| } | ||
| } |
There was a problem hiding this comment.
What does happen when matched string contains non numbers such as .?
There was a problem hiding this comment.
It's a bug because it returns an unexpected value. 82b93ad fixes it.
The name "get_int" conflicts with the existing "get_byte" convention, which advances the scanner position. "integer_at" better reflects that this method reads from a capture group (like "values_at") rather than advancing the scanner.
Previously, non-digit characters in the captured substring produced silently wrong results (e.g. "1.5" returned 85). Now both the fast path and bignum path validate that all characters after an optional sign are digits, raising ArgumentError otherwise.
Replace the hand-rolled fast path / slow path integer parsing with a single call to rb_int_parse_cstr, which accepts a length parameter and parses directly from the source buffer without NUL-terminated copy. This removes the INTEGER_AT_MAX_DIGITS constant, the manual digit validation loops, and the RB_ALLOCV_N buffer allocation.
ext/strscan/strscan.c
Outdated
| end = adjust_register_position(p, p->regs.end[i]); | ||
| len = end - beg; | ||
|
|
||
| if (len <= 0) return INT2FIX(0); |
There was a problem hiding this comment.
Looks like this integer_at parses more strict than Integer(string) that accepts "1_234".
So I think empty string case should also raise error like Integer("") raises ArgumentError.
In any case, I think it's worth adding empty string matched test case
Consistent with Integer("") raising ArgumentError, empty captures now also raise instead of silently returning 0.
Cover sign-only captures (+/-), signed numbers (-42/+42), leading zeros (007), and full match containing non-digit separators (e.g. "2024-06-15" at index 0).
You can close this PR and re-open a new PR with a topic branch. |
| /* rb_int_parse_cstr is declared in internal/bignum.h which is not | ||
| * available to extensions. Declare it here since the symbol is | ||
| * exported from libruby. */ | ||
| VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp, | ||
| size_t *ndigits, int base, int flags); | ||
| #define RB_INT_PARSE_SIGN 0x01 | ||
|
|
There was a problem hiding this comment.
Could you move this from between the call-seq comment and strscan_integer_at() definition for RDoc?
|
BTW, could you enable GitHub Actions on your fork to run CI in your side too. |
* The problem is otherwise it would include PR workflows from a fork branch named "master", like ruby/strscan#192.
see: https://bugs.ruby-lang.org/issues/21943
relation: https://bugs.ruby-lang.org/issues/21932